Soft TF/IDF
- class py_stringmatching.similarity_measure.soft_tfidf.SoftTfIdf(corpus_list=None, sim_func=jaro_function, threshold=0.5)
Computes soft TF/IDF measure.
Note
Currently, this measure is implemented without dampening. This is similar to setting the dampen flag to False in TF/IDF. We plan to add the dampen flag in the next release.
- Parameters
corpus_list (list of lists) – Corpus list of strings (defaults to None). If set to None, the input lists are considered the only corpus.
sim_func (function) – Secondary similarity function (optional). It should return a similarity score between two strings. Defaults to the Jaro similarity measure.
threshold (float) – Threshold value for the secondary similarity function (defaults to 0.5). If the similarity of a token pair exceeds the threshold, then the token pair is considered a match.
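To illustrate how the threshold gates token pairs, here is a minimal sketch that does not use the library. The helper names `ratio` and `close_pairs` are hypothetical, and `difflib.SequenceMatcher` is only a stand-in secondary similarity function; the class itself defaults to Jaro.

```python
from difflib import SequenceMatcher


def ratio(s1, s2):
    # Stand-in secondary similarity in [0, 1]; NOT the library's Jaro measure.
    return SequenceMatcher(None, s1, s2).ratio()


def close_pairs(bag1, bag2, sim_func=ratio, threshold=0.5):
    """For each distinct token in bag1, keep its best-scoring partner in bag2,
    but only if the score strictly exceeds the threshold (hypothetical helper)."""
    pairs = {}
    for t1 in set(bag1):
        best = None
        for t2 in set(bag2):
            score = sim_func(t1, t2)
            # A pair counts as a match only when its score is above the threshold.
            if score > threshold and (best is None or score > best[1]):
                best = (t2, score)
        if best is not None:
            pairs[t1] = best
    return pairs
```

Raising the threshold makes matching stricter: a pair such as 'apple' and 'appel' that passes at 0.5 under this stand-in function may no longer be treated as a match at 0.9.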
- sim_func
An attribute to store the secondary similarity function.
- Type
function
- threshold
An attribute to store the threshold value for the secondary similarity function.
- Type
float
- get_raw_score(bag1, bag2)
Computes the raw soft TF/IDF score between two lists given the corpus information.
- Parameters
bag1 (list) – First input list.
bag2 (list) – Second input list.
- Returns
Soft TF/IDF score between the input lists (float).
- Raises
TypeError – If the inputs are not lists or if one of the inputs is None.
Examples
>>> soft_tfidf = SoftTfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']], sim_func=Jaro().get_raw_score, threshold=0.8)
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a', 'c'])
0.17541160386140586
>>> soft_tfidf = SoftTfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']], threshold=0.9)
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.5547001962252291
>>> soft_tfidf = SoftTfIdf([['x', 'y'], ['w'], ['q']])
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> soft_tfidf = SoftTfIdf(sim_func=Affine().get_raw_score, threshold=0.6)
>>> soft_tfidf.get_raw_score(['aa', 'bb', 'a'], ['ab', 'ba'])
0.81649658092772592
References
The string matching chapter of the “Principles of Data Integration” book.
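As a rough illustration of how an undampened soft TF/IDF score can be computed, the following self-contained sketch may help. It is not the library's code. The names `exact` and `soft_tfidf_sketch` are hypothetical, the exact-match function is a stand-in for the secondary similarity function, the corpus must be passed explicitly, and ignoring tokens that never occur in the corpus is an assumption of this sketch.

```python
from collections import Counter
from math import sqrt


def exact(s1, s2):
    # Hypothetical stand-in secondary similarity: 1.0 for equal tokens, else 0.0.
    return 1.0 if s1 == s2 else 0.0


def soft_tfidf_sketch(bag1, bag2, corpus, sim_func=exact, threshold=0.5):
    """Undampened soft TF/IDF between two token lists (illustrative only)."""
    # Document frequency: number of corpus documents containing each token.
    df = Counter(tok for doc in corpus for tok in set(doc))
    n_docs = len(corpus)
    tf1, tf2 = Counter(bag1), Counter(bag2)

    def weight(tf, tok):
        # Undampened weight: raw term frequency times raw inverse document frequency.
        return tf[tok] * n_docs / df[tok]

    # Numerator: pair each bag1 token with its closest bag2 token above the threshold.
    numerator = 0.0
    for t1 in tf1:
        if t1 not in df:
            continue  # assumption: tokens unseen in the corpus are ignored
        best_tok, best_score = None, 0.0
        for t2 in tf2:
            score = sim_func(t1, t2)
            if score > threshold and score > best_score:
                best_tok, best_score = t2, score
        if best_tok is not None and best_tok in df:
            numerator += weight(tf1, t1) * weight(tf2, best_tok) * best_score

    # Denominator: product of the Euclidean norms of the two weight vectors.
    norm1 = sqrt(sum(weight(tf1, t) ** 2 for t in tf1 if t in df))
    norm2 = sqrt(sum(weight(tf2, t) ** 2 for t in tf2 if t in df))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return numerator / (norm1 * norm2)
```

With the corpus `[['a', 'b', 'a'], ['a', 'c'], ['a']]` from the examples above, `soft_tfidf_sketch(['a', 'b', 'a'], ['a'], corpus)` evaluates to 2/sqrt(13), roughly 0.5547, in line with the documented score. Single-character tokens score either 1.0 or 0.0 under both Jaro and the exact-match stand-in, which is why the two agree here.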
- get_sim_func()
Get secondary similarity function.
- Returns
Secondary similarity function (function).
- get_threshold()
Get threshold used for the secondary similarity function.
- Returns
Threshold (float).
- set_corpus_list(corpus_list)
Set corpus list.
- Parameters
corpus_list (list of lists) – Corpus list.