Search / Comparison

Levenshtein

Basic Usage

Calculates the Levenshtein distance between a and b.

Usage:

>>> from web2py_utils import search

>>> leve = search.Levenshtein()

>>> leve.levenshtein('hello', 'hellp')
1

Suggestion

Compares a string to a list of strings and returns list of closest matches. Sorted by lowest distance first.

Usage:

>>> leve.suggestion('hello', ['hello', 'world', 'mello', 'hi', 'hell'],
                    number_of_matches=3)
['hello', 'hell', 'mello']

NCD

Performs an NCD (Normalized Compression Distance) comparison between two keys.

You can read some documentation on the algorithm here: http://www.sophos.com/blogs/sophoslabs/?p=188

These can either be the contents of a file or search terms.

This will be a float value between 0.0 and 1.1, where the lower the value the closer in similarity.

Basic Usage

Calculates the similarity between key1 and key2

Usage:

>>> from web2py_utils.search import ncd
>>> key1 = "h3ll0"
>>> key2 = "hello"
>>> key3 = "lehlo"
>>> key4 = "aeiou"

>>> ncd(key1, key2)
0.071428571428571425
>>> ncd(key2, key3)
0.024390243902439025
>>> ncd(key3, key4)
0.11904761904761904
>>> ncd(key1, key3)
0.095238095238095233
>>> ncd(key1, key4)
0.16666666666666666
>>> ncd(key2, key4)
0.095238095238095233

NGram

Usage:

# A simple example
>>> base = ['sdafaf', 'asfwef', 'asdfawe', 'adfwe', 'askfjwehiuasdfji']
>>> tg = search.Ngram(base, min_sim = 0.0)

>>> pprint.pprint(tg.getSimilarStrings('askfjwehiuasdfji'))
>>> print
>>> pprint.pprint(Ngram.compare('sdfeff', 'sdfeff'))
>>> print
>>> pprint.pprint(tg.getBestMatch('afadfwe', 2))

Constructor

PARAMETERS:

haystack - String or list of strings. This is where we will look for
           matches

OPTIONAL PARAMETERS:

min_sim     - minimum similarity score for a string to be considered
              worthy. Default = 0.0
warp        - If warp > 1 short strings are getting away better, if
              warp < 1 they are getting away worse. Default = 1.0
ic          - Ignore case? Default = False
only_alnum  - Only consider alphanumeric characters? Default = False
ngram_len   - n-gram size. Default = 3 (trigram)
padding     - padding size. Default = ngram_len - 1
noise       - noise characters that should be ignored when comparing
class web2py_utils.search.Ngram(haystack, **kwargs)

Used to compute similarities between strings. Can easily be used to compare one string to a whole batch of strings or pick the best string out of a list

This code was largely inspired by String::Trigram by Tarek Ahmed. Actually I just translated it to python ;)

You can find the original Perl-Module at http://search.cpan.org/dist/String-Trigram/

classmethod compare(s1, s2)

Simply compares two strings and returns the similarity score.

This is a class method. It can be called without instantiating the ngram class first. Example:

>> ngram = Ngram() >> ngram.compare(‘sfewefsf’, ‘sdfafwgah’) >> 0.050000000000000003

PARAMETERS:
s1 - First string s2 - Last string
computeActSimOld(samegrams, allgrams, warp=None)

Computes the similarity score between two sets of n-grams according to the following formula:

(a = all trigrams, d = different trigrams, e = warp)

(a**e - d**e)/a**e

PARAMETERS:
samegrams - n-grams that were found in the string allgrams - All n-grams in the search space warp - the warp factor. See __init__ for explanation
RETURNS:
Similarity score as float
computeSimilarity(string, siminfo)

Calculates the similarity of the given some information about n-grams. PARAMETERS:

string - This is what we want to get the score of siminfo - A dictionary containing info about n-gram distribution.

(see getSimilarStrings)
RETURNS:
the score as float
compute_act_sim_old(samegrams, allgrams, warp=None)

Computes the similarity score between two sets of n-grams according to the following formula:

(a = all trigrams, d = different trigrams, e = warp)

(a**e - d**e)/a**e

PARAMETERS:
samegrams - n-grams that were found in the string allgrams - All n-grams in the search space warp - the warp factor. See __init__ for explanation
RETURNS:
Similarity score as float
compute_similarity(string, siminfo)

Calculates the similarity of the given some information about n-grams. PARAMETERS:

string - This is what we want to get the score of siminfo - A dictionary containing info about n-gram distribution.

(see getSimilarStrings)
RETURNS:
the score as float
getBestMatch(needle, count=1)

Returns the best match for the given string

PARAMETERS:
needle: The string to search for count: How many results to return
RETURNS:
String that best matched the supplied parameter and the score with which it matched.
getSimilarStrings(string)

Retrieves all strings that have a similarity higher than min_sim with “string”

PARAMETERS:
  • string: The string to compare with the search space
RETURNS:

Dictionary of scoring strings. Example output:

{‘askfjwehiuasdfji’: 1.0, ‘asdfawe’: 0.17391304347826086}

get_best_match(needle, count=1)

Returns the best match for the given string

PARAMETERS:
needle: The string to search for count: How many results to return
RETURNS:
String that best matched the supplied parameter and the score with which it matched.
get_similar_strings(string)

Retrieves all strings that have a similarity higher than min_sim with “string”

PARAMETERS:
  • string: The string to compare with the search space
RETURNS:

Dictionary of scoring strings. Example output:

{‘askfjwehiuasdfji’: 1.0, ‘asdfawe’: 0.17391304347826086}

ic(*args)
min_sim(*args)
ngram_len(*args)
ngramify(haystack, ic=None, only_alnum=None, padding=None, noise=None)

Takes list of strings and puts them into an index of ngrams. KEY is a ngram, VALUE a list of strings containing that ngram. VALUE has two KEYS:

grams: count of ngram occurring in string len: length of string with padding (if ignoring non alphanumerics,

this gets determined after those characters have been removed)
PARAMETERS:
haystack - List of strings to be indexed ic - Ignore Case? only_alnum - Only alphanumeric charactes? padding - Size of the padding noise - Noise characters that should be removed
RETURNS:

N-Gram index

Structure:
{
‘abc’: ‘string1’:{‘grams’:0, ‘len’:11}, ‘bcd’: ‘string2’:{‘grams’:0, ‘len’:11}, ‘ing’: ‘string3’:{‘grams’:1, ‘len’:11}, ...

}

noise(*args)
only_alnum(*args)
padding(*args)
reInit(base)

Reinitialises the search space.

PARAMETERS:
base - The new search space
re_init(base)

Reinitialises the search space.

PARAMETERS:
base - The new search space
warp(*args)

Table Of Contents

Previous topic

Pagination

Next topic

States and Years

This Page