abydos.fingerprint package
abydos.fingerprint.
The fingerprint package implements string fingerprints such as:
- Basic fingerprinters originating in OpenRefine <http://openrefine.org>:
  - String (String)
  - Phonetic (Phonetic)
  - QGram (QGram)
- Fingerprints developed by Pollock & Zamora:
  - Skeleton key (SkeletonKey)
  - Omission key (OmissionKey)
- Fingerprints developed by Cisłak & Grabowski:
  - Occurrence (Occurrence)
  - Occurrence halved (OccurrenceHalved)
  - Count (Count)
  - Position (Position)
- The Synoname toolcode (SynonameToolcode)
- Taft's codings:
  - Consonant coding (Consonant)
  - Extract - letter list (Extract)
  - Extract - position & frequency (ExtractPositionFrequency)
  - L.A. County Sheriff's System (LACSS)
- Library of Congress Cutter table encoding (LCCutter)
- Burrows-Wheeler transform (BWTF) and run-length encoded Burrows-Wheeler transform (BWTRLEF)
Each fingerprint class has a fingerprint method that takes a string and
returns the string's fingerprint:
>>> sk = SkeletonKey()
>>> sk.fingerprint('orange')
'ORNGAE'
>>> sk.fingerprint('strange')
'STRNGAE'
- class abydos.fingerprint.BWTF(terminator: str = '\x00')
Bases: _Fingerprint
Burrows-Wheeler transform fingerprint.
This is a wrapper of the BWT class in abydos.compression, which provides the same interface as other descendants of _Fingerprint.
New in version 0.4.1.
Initialize BWTF instance.
- Parameters:
terminator (str) -- A character added to signal the end of the string
New in version 0.4.1.
- fingerprint(word: str) -> str
Return the Burrows-Wheeler transform of a word.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The Burrows-Wheeler transform of a word
- Return type:
str
Examples
>>> fp = BWTF()
>>> fp.fingerprint('hat')
'th\x00a'
>>> fp.fingerprint('niall')
'linla\x00'
>>> fp.fingerprint('colin')
'n\x00loic'
>>> fp.fingerprint('atcg')
'g\x00tca'
>>> fp.fingerprint('entreatment')
'term\x00teetnan'
New in version 0.4.1.
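The transform itself can be illustrated without the library. The following is a minimal sketch of a naive rotation-sort Burrows-Wheeler transform, not the implementation used by abydos.compression; the function name bwt_sketch is hypothetical, and the quadratic rotation table is chosen for clarity rather than speed.

```python
def bwt_sketch(word: str, terminator: str = "\x00") -> str:
    """Return the Burrows-Wheeler transform of word (naive rotation sort)."""
    if terminator in word:
        raise ValueError("terminator must not occur in the input word")
    text = word + terminator
    # Build every rotation of the terminated string and sort them.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


print(repr(bwt_sketch("hat")))
```

For the inputs 'hat', 'niall', 'colin', and 'atcg', this sketch reproduces the values shown in the examples above.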
- class abydos.fingerprint.BWTRLEF(terminator: str = '\x00')
Bases: _Fingerprint
Burrows-Wheeler transform plus run-length encoding fingerprint.
This is a wrapper of the BWT and RLE classes in abydos.compression, which provides the same interface as other descendants of _Fingerprint.
New in version 0.4.1.
Initialize BWTRLEF instance.
- Parameters:
terminator (str) -- A character added to signal the end of the string
New in version 0.4.1.
- fingerprint(word: str) -> str
Return the run-length encoded Burrows-Wheeler transform of a word.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The run-length encoded Burrows-Wheeler transform of a word
- Return type:
str
Examples
>>> fp = BWTRLEF()
>>> fp.fingerprint('hat')
'th\x00a'
>>> fp.fingerprint('niall')
'linla\x00'
>>> fp.fingerprint('colin')
'n\x00loic'
>>> fp.fingerprint('atcg')
'g\x00tca'
>>> fp.fingerprint('entreatment')
'term\x00teetnan'
New in version 0.4.1.
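The examples above are identical to the plain BWTF output, which suggests that short runs are left untouched. The sketch below is one run-length scheme consistent with that observation. It assumes that only runs of three or more characters are abbreviated as a count followed by the character; this is an assumption, not a statement of how the RLE class in abydos.compression behaves, and rle_sketch is a hypothetical name.

```python
from itertools import groupby


def rle_sketch(text: str) -> str:
    """Run-length encode text, abbreviating only runs of three or more."""
    parts = []
    for char, run in groupby(text):
        length = len(list(run))
        # Assumption: runs of length 1 or 2 are copied through unchanged.
        parts.append(f"{length}{char}" if length > 2 else char * length)
    return "".join(parts)


print(rle_sketch("aaabaab"))
```

Under this assumption, input that already contains digits would be ambiguous to decode, so a real encoder needs some way to handle them.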
- class abydos.fingerprint.Consonant(variant: int = 1, doubles: bool = True, vowels: Iterable[str] | str | None = None)
Bases: _Fingerprint
Consonant Coding Fingerprint.
Based on the consonant coding from [Taf70], variants 1, 2, 3, 1-D, 2-D, and 3-D.
New in version 0.4.1.
Initialize Consonant instance.
- Parameters:
variant (int) -- Selects between Taft's 3 variants, which assign to the vowel set one of:
1. A, E, I, O, & U
2. A, E, I, O, U, W, & Y
3. A, E, I, O, U, W, H, & Y
doubles (bool) -- If set to False, multiple consonants in a row are conflated to a single instance.
vowels (list, set, or str) -- Setting vowels to a non-None value overrides the variant setting and defines the set of letters to be removed from the input.
New in version 0.4.1.
- fingerprint(word: str) -> str
Return the consonant coding.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The consonant coding
- Return type:
str
Examples
>>> cf = Consonant()
>>> cf.fingerprint('hat')
'HT'
>>> cf.fingerprint('niall')
'NLL'
>>> cf.fingerprint('colin')
'CLN'
>>> cf.fingerprint('atcg')
'ATCG'
>>> cf.fingerprint('entreatment')
'ENTRTMNT'
New in version 0.4.1.
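The default behaviour shown in the examples (variant 1, doubles kept) can be approximated in a few lines. This is a minimal sketch inferred from the example outputs, assuming the first letter is always retained and vowels are removed from the remainder; consonant_sketch is a hypothetical name, and the doubles option is not modelled.

```python
def consonant_sketch(word: str, vowels: str = "AEIOU") -> str:
    """Keep the first letter, then drop every vowel from the rest."""
    word = word.upper()
    if not word:
        return ""
    # Assumption: the leading letter survives even when it is a vowel.
    return word[0] + "".join(ch for ch in word[1:] if ch not in vowels)


print(consonant_sketch("entreatment"))
```

Passing a different vowels string corresponds loosely to choosing another variant or to overriding the vowel set.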
- class abydos.fingerprint.Count(n_bits: int = 16, most_common: Tuple[str, ...] = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))
Bases: _Fingerprint
Count Fingerprint.
Based on the count fingerprint from [CislakG17].
New in version 0.3.6.
Initialize Count instance.
- Parameters:
n_bits (int) -- Number of bits in the fingerprint returned
most_common (list) -- The most common tokens in the target language, ordered by frequency
New in version 0.4.0.
- fingerprint(word: str) -> str
Return the count fingerprint.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The count fingerprint
- Return type:
str
Examples
>>> cf = Count()
>>> cf.fingerprint('hat')
'0001010000000001'
>>> cf.fingerprint('niall')
'0000010001010000'
>>> cf.fingerprint('colin')
'0000000101010000'
>>> cf.fingerprint('atcg')
'0001010000000000'
>>> cf.fingerprint('entreatment')
'1111010000100000'
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.6.0: Changed to return a str and added fingerprint_int method
- fingerprint_int(word: str) -> int
Return the count fingerprint as an int.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The count fingerprint as an int
- Return type:
int
Examples
>>> cf = Count()
>>> cf.fingerprint_int('hat')
5121
>>> cf.fingerprint_int('niall')
1104
>>> cf.fingerprint_int('colin')
336
>>> cf.fingerprint_int('atcg')
5120
>>> cf.fingerprint_int('entreatment')
62496
New in version 0.6.0.
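The bit layout behind these values can be reconstructed from the examples: each of the most common letters receives two bits holding its count. The following is a minimal sketch under that reading, assuming an even n_bits and that counts are truncated to their low two bits; count_sketch and MOST_COMMON are hypothetical names, not part of the library.

```python
from collections import Counter

MOST_COMMON = ("e", "t", "a", "o", "i", "n", "s", "h",
               "r", "d", "l", "c", "u", "m", "w", "f")


def count_sketch(word: str, n_bits: int = 16, most_common=MOST_COMMON) -> str:
    """Two bits per letter, for as many letters as fit in n_bits."""
    counts = Counter(word)
    bits = []
    for letter in most_common[: n_bits // 2]:
        # Assumption: only the low two bits of each count are kept.
        bits.append(format(counts[letter] & 3, "02b"))
    # Pad with zeros if most_common is too short to fill n_bits.
    return "".join(bits).ljust(n_bits, "0")


print(count_sketch("entreatment"))
print(int(count_sketch("entreatment"), 2))
```

Interpreting the bit string as a base-2 number gives the integer form, which is how the fingerprint and fingerprint_int examples above relate to each other.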
- class abydos.fingerprint.Extract(letter_list: int | Iterable[str] = 1)
Bases: _Fingerprint
Extract Letter List fingerprint.
Based on the extract letter list coding from [Taf70], for lists 1, 2, 3, & 4.
New in version 0.4.1.
Initialize Extract instance.
- Parameters:
letter_list (int or iterable) -- If an integer (1-4) is supplied, Taft's specified letter lists are used. If an iterable is supplied, its values will be used as the list of letters to remove (in order).
New in version 0.4.1.
- fingerprint(word: str) -> str
Return the extract letter list coding.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The extract letter list coding
- Return type:
str
Examples
>>> fp = Extract()
>>> fp.fingerprint('hat')
'HAT'
>>> fp.fingerprint('niall')
'NILL'
>>> fp.fingerprint('colin')
'CLIN'
>>> fp.fingerprint('atcg')
'ATCG'
>>> fp.fingerprint('entreatment')
'NRMN'
New in version 0.4.1.
- class abydos.fingerprint.ExtractPositionFrequency
Bases: _Fingerprint
Extract - Position & Frequency fingerprint.
Based on the extract - position & frequency coding from [Taf70].
New in version 0.4.1.
- fingerprint(word: str) -> str
Return the extract - position & frequency coding.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The extract - position & frequency coding
- Return type:
str
Examples
>>> fp = ExtractPositionFrequency()
>>> fp.fingerprint('hat')
'HAT'
>>> fp.fingerprint('niall')
'NILL'
>>> fp.fingerprint('colin')
'COLN'
>>> fp.fingerprint('atcg')
'ATCG'
>>> fp.fingerprint('entreatment')
'NMNT'
New in version 0.4.1.
- class abydos.fingerprint.LACSS
Bases: _Fingerprint
L.A. County Sheriff's System fingerprint.
Based on the description from [Taf70].
New in version 0.4.1.
- fingerprint(word: str) -> str
Return the LACSS coding.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The L.A. County Sheriff's System fingerprint
- Return type:
str
Examples
>>> cf = LACSS()
>>> cf.fingerprint('hat')
'4911211'
>>> cf.fingerprint('niall')
'6488374'
>>> cf.fingerprint('colin')
'3015957'
>>> cf.fingerprint('atcg')
'1772371'
>>> cf.fingerprint('entreatment')
'3882324'
New in version 0.4.1.
Changed in version 0.6.0: Changed to return a str and added fingerprint_int method
- fingerprint_int(word: str) -> int
Return the LACSS coding as an int.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The L.A. County Sheriff's System fingerprint as an int
- Return type:
int
Examples
>>> cf = LACSS()
>>> cf.fingerprint_int('hat')
4911211
>>> cf.fingerprint_int('niall')
6488374
>>> cf.fingerprint_int('colin')
3015957
>>> cf.fingerprint_int('atcg')
1772371
>>> cf.fingerprint_int('entreatment')
3882324
New in version 0.6.0.
- class abydos.fingerprint.LCCutter(max_length: int = 64)
Bases: _Fingerprint
Library of Congress Cutter table encoding.
This is based on the Library of Congress Cutter table encoding scheme, as described at https://www.loc.gov/aba/pcc/053/table.html [oC13]. Handling for numerals is not included.
New in version 0.4.1.
Initialize LCCutter instance.
- Parameters:
max_length (int) -- The length of the code returned (defaults to 64)
New in version 0.4.1.
- fingerprint(word: str) -> str
Return the Library of Congress Cutter table encoding of a word.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The Library of Congress Cutter table encoding
- Return type:
str
Examples
>>> cf = LCCutter()
>>> cf.fingerprint('hat')
'H38'
>>> cf.fingerprint('niall')
'N5355'
>>> cf.fingerprint('colin')
'C6556'
>>> cf.fingerprint('atcg')
'A834'
>>> cf.fingerprint('entreatment')
'E5874386468'
New in version 0.4.1.
- class abydos.fingerprint.Occurrence(n_bits: int = 16, most_common: Tuple[str, ...] = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))
Bases: _Fingerprint
Occurrence Fingerprint.
Based on the occurrence fingerprint from [CislakG17].
New in version 0.3.6.
Initialize Occurrence instance.
- Parameters:
n_bits (int) -- Number of bits in the fingerprint returned
most_common (list) -- The most common tokens in the target language, ordered by frequency
New in version 0.4.0.
- fingerprint(word: str) -> str
Return the occurrence fingerprint.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The occurrence fingerprint
- Return type:
str
Examples
>>> of = Occurrence()
>>> of.fingerprint('hat')
'0110000100000000'
>>> of.fingerprint('niall')
'0010110000100000'
>>> of.fingerprint('colin')
'0001110000110000'
>>> of.fingerprint('atcg')
'0110000000010000'
>>> of.fingerprint('entreatment')
'1110010010000100'
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.6.0: Changed to return a str and added fingerprint_int method
- fingerprint_int(word: str) -> int
Return the occurrence fingerprint as an int.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The occurrence fingerprint as an int
- Return type:
int
Examples
>>> of = Occurrence()
>>> of.fingerprint_int('hat')
24832
>>> of.fingerprint_int('niall')
11296
>>> of.fingerprint_int('colin')
7216
>>> of.fingerprint_int('atcg')
24592
>>> of.fingerprint_int('entreatment')
58500
New in version 0.6.0.
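The occurrence fingerprint shown above assigns one bit per letter: the bit is set when the letter appears anywhere in the word. The following is a minimal sketch of that idea, inferred from the example values rather than taken from the library source; occurrence_sketch and MOST_COMMON are hypothetical names.

```python
MOST_COMMON = ("e", "t", "a", "o", "i", "n", "s", "h",
               "r", "d", "l", "c", "u", "m", "w", "f")


def occurrence_sketch(word: str, n_bits: int = 16, most_common=MOST_COMMON) -> str:
    """One presence bit per letter, most common letter first."""
    present = set(word)
    bits = "".join("1" if letter in present else "0"
                   for letter in most_common[:n_bits])
    # Pad with zeros if most_common is too short to fill n_bits.
    return bits.ljust(n_bits, "0")


print(occurrence_sketch("hat"))
```

Because only membership matters, anagrams and words with repeated letters can collide, which is the intended trade-off of a compact filter.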
- class abydos.fingerprint.OccurrenceHalved(n_bits: int = 16, most_common: Tuple[str, ...] = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))
Bases: _Fingerprint
Occurrence Halved Fingerprint.
Based on the occurrence halved fingerprint from [CislakG17].
New in version 0.3.6.
Initialize OccurrenceHalved instance.
- Parameters:
n_bits (int) -- Number of bits in the fingerprint returned
most_common (list) -- The most common tokens in the target language, ordered by frequency
New in version 0.4.0.
- fingerprint(word: str) -> str
Return the occurrence halved fingerprint.
Based on the occurrence halved fingerprint from [CislakG17].
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The occurrence halved fingerprint
- Return type:
str
Examples
>>> ohf = OccurrenceHalved()
>>> ohf.fingerprint('hat')
'0001010000000010'
>>> ohf.fingerprint('niall')
'0000010010100000'
>>> ohf.fingerprint('colin')
'0000001001010000'
>>> ohf.fingerprint('atcg')
'0010100000000000'
>>> ohf.fingerprint('entreatment')
'1111010000110000'
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.6.0: Changed to return a str and added fingerprint_int method
- fingerprint_int(word: str) -> int
Return the occurrence halved fingerprint as an int.
Based on the occurrence halved fingerprint from [CislakG17].
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The occurrence halved fingerprint as an int
- Return type:
int
Examples
>>> ohf = OccurrenceHalved()
>>> ohf.fingerprint_int('hat')
5122
>>> ohf.fingerprint_int('niall')
1184
>>> ohf.fingerprint_int('colin')
592
>>> ohf.fingerprint_int('atcg')
10240
>>> ohf.fingerprint_int('entreatment')
62512
New in version 0.6.0.
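The halved variant can be read off the examples as two presence bits per letter: one for the first half of the word and one for the second. The sketch below assumes the split point is len(word) // 2, which is consistent with the values above but is an inference, not a confirmed detail of the library; occurrence_halved_sketch and MOST_COMMON are hypothetical names.

```python
MOST_COMMON = ("e", "t", "a", "o", "i", "n", "s", "h",
               "r", "d", "l", "c", "u", "m", "w", "f")


def occurrence_halved_sketch(word: str, n_bits: int = 16,
                             most_common=MOST_COMMON) -> str:
    """Two presence bits per letter: first half, then second half."""
    half = len(word) // 2  # assumption: the shorter half comes first
    first, second = set(word[:half]), set(word[half:])
    bits = []
    for letter in most_common[: n_bits // 2]:
        bits.append("1" if letter in first else "0")
        bits.append("1" if letter in second else "0")
    # Pad with zeros if most_common is too short to fill n_bits.
    return "".join(bits).ljust(n_bits, "0")


print(occurrence_halved_sketch("niall"))
```

Compared with the plain occurrence fingerprint, this trades coverage (8 letters instead of 16 for the same width) for a coarse notion of where each letter occurs.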
- class abydos.fingerprint.OmissionKey
Bases: _Fingerprint
Omission Key.
The omission key of a word is defined in [PZ84].
New in version 0.3.6.
- fingerprint(word: str) -> str
Return the omission key.
- Parameters:
word (str) -- The word to transform into its omission key
- Returns:
The omission key
- Return type:
str
Examples
>>> ok = OmissionKey()
>>> ok.fingerprint('The quick brown fox jumped over the lazy dog.')
'JKQXZVWYBFMGPDHCLNTREUIOA'
>>> ok.fingerprint('Christopher')
'PHCTSRIOE'
>>> ok.fingerprint('Niall')
'LNIA'
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
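The examples above suggest how the key is assembled: the word's consonants are emitted in a fixed priority order, followed by its vowels in order of first appearance. The following is a minimal sketch of that construction. The CONSONANT_ORDER constant is an assumption chosen to be consistent with the documented outputs, and omission_key_sketch is a hypothetical name.

```python
# Assumed consonant priority order; not taken from the library source.
CONSONANT_ORDER = "JKQXZVWYBFMGPDHCLNTSR"
VOWELS = "AEIOU"


def omission_key_sketch(word: str) -> str:
    """Consonants in priority order, then vowels in order of appearance."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    present = set(letters)
    consonants = "".join(ch for ch in CONSONANT_ORDER if ch in present)
    # dict.fromkeys removes duplicates while preserving first-seen order.
    vowels = "".join(dict.fromkeys(ch for ch in letters if ch in VOWELS))
    return consonants + vowels


print(omission_key_sketch("Christopher"))
```

Letters outside the assumed consonant order and vowel set, such as accented characters, are simply dropped by this sketch.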
- class abydos.fingerprint.Phonetic(phonetic_algorithm: Callable[[str], str] | _Phonetic | None = None, joiner: str = ' ')
Bases: String
Phonetic Fingerprint.
A phonetic fingerprint is identical to a standard string fingerprint, as implemented in String, but performs the fingerprinting function after converting the string to its phonetic form, as determined by some phonetic algorithm. This fingerprint is described at [Ope12].
New in version 0.3.6.
Initialize Phonetic instance.
- Parameters:
phonetic_algorithm (function) -- A phonetic algorithm that takes a string and returns a string (presumably a phonetic representation of the original string). By default, this function uses double_metaphone().
joiner (str) -- The string that will be placed between each word
New in version 0.4.0.
- fingerprint(phrase: str) -> str
Return the phonetic fingerprint of a phrase.
- Parameters:
phrase (str) -- The string from which to calculate the phonetic fingerprint
- Returns:
The phonetic fingerprint of the phrase
- Return type:
str
Examples
>>> pf = Phonetic()
>>> pf.fingerprint('The quick brown fox jumped over the lazy dog.')
'0 afr fks jmpt kk ls prn tk'

>>> from abydos.phonetic import Soundex
>>> pf = Phonetic(Soundex())
>>> pf.fingerprint('The quick brown fox jumped over the lazy dog.')
'b650 d200 f200 j513 l200 o160 q200 t000'
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.fingerprint.Position(n_bits: int = 16, most_common: Tuple[str, ...] = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'), bits_per_letter: int = 3)
Bases: _Fingerprint
Position Fingerprint.
Based on the position fingerprint from [CislakG17].
New in version 0.3.6.
Initialize Position instance.
- Parameters:
n_bits (int) -- Number of bits in the fingerprint returned
most_common (list) -- The most common tokens in the target language, ordered by frequency
bits_per_letter (int) -- The number of bits used to encode each letter's position
New in version 0.4.0.
- fingerprint(word: str) -> str
Return the position fingerprint.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The position fingerprint
- Return type:
str
Examples
>>> pf = Position()
>>> pf.fingerprint('hat')
'1110100011111111'
>>> pf.fingerprint('niall')
'1111110101110010'
>>> pf.fingerprint('colin')
'1111111110010111'
>>> pf.fingerprint('atcg')
'1110010001111111'
>>> pf.fingerprint('entreatment')
'0000101011111111'
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.6.0: Changed to return a str and added fingerprint_int method
- fingerprint_int(word: str) -> int
Return the position fingerprint as an int.
- Parameters:
word (str) -- The word to fingerprint
- Returns:
The position fingerprint as an int
- Return type:
int
Examples
>>> pf = Position()
>>> pf.fingerprint_int('hat')
59647
>>> pf.fingerprint_int('niall')
64882
>>> pf.fingerprint_int('colin')
65431
>>> pf.fingerprint_int('atcg')
58495
>>> pf.fingerprint_int('entreatment')
2815
New in version 0.6.0.
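The example values decode as a sequence of bits_per_letter-wide fields, each holding the index of a letter's first occurrence, with an all-ones field for an absent letter. The following is a minimal sketch built on that reading. The handling of the final, narrower field is an assumption that matches the examples above, and position_sketch and MOST_COMMON are hypothetical names.

```python
MOST_COMMON = ("e", "t", "a", "o", "i", "n", "s", "h",
               "r", "d", "l", "c", "u", "m", "w", "f")


def position_sketch(word: str, n_bits: int = 16, most_common=MOST_COMMON,
                    bits_per_letter: int = 3) -> str:
    """Encode the first position of each letter in fixed-width fields."""
    cap = 2 ** bits_per_letter - 1  # all ones: absent, or position too large
    bits = []
    remaining = n_bits
    for letter in most_common:
        if remaining <= 0:
            break
        # The last field may be narrower than bits_per_letter.
        width = min(bits_per_letter, remaining)
        index = word.find(letter)
        value = cap if index < 0 else min(index, cap)
        # Assumption: values are clamped to what fits in a narrow field.
        value = min(value, 2 ** width - 1)
        bits.append(format(value, f"0{width}b"))
        remaining -= width
    # Assumption: pad with zeros if most_common is too short.
    return "".join(bits).ljust(n_bits, "0")


print(position_sketch("entreatment"))
```

With the defaults, 16 bits hold five full 3-bit fields (e, t, a, o, i) plus a single leftover bit for n.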
- class abydos.fingerprint.QGram(qval: int = 2, start_stop: str = '', joiner: str = '', skip: int = 0)
Bases: _Fingerprint
Q-Gram Fingerprint.
A q-gram fingerprint is a string consisting of all of the unique q-grams in a string, alphabetized & concatenated. This fingerprint is described at [Ope12].
New in version 0.3.6.
Initialize Q-Gram fingerprinter.
- Parameters:
qval (int) -- The length of each q-gram (by default 2)
start_stop (str) -- The start & stop symbol(s) to concatenate on either end of the phrase, as defined in tokenizer.QGrams
joiner (str) -- The string that will be placed between each word
skip (int or Iterable) -- The number of characters to skip, can be an integer, range object, or list
New in version 0.4.0.
- fingerprint(phrase: str) -> str
Return Q-Gram fingerprint.
- Parameters:
phrase (str) -- The string from which to calculate the q-gram fingerprint
- Returns:
The q-gram fingerprint of the phrase
- Return type:
str
Examples
>>> qf = QGram()
>>> qf.fingerprint('The quick brown fox jumped over the lazy dog.')
'azbrckdoedeleqerfoheicjukblampnfogovowoxpequrortthuiumvewnxjydzy'
>>> qf.fingerprint('Christopher')
'cherhehrisopphristto'
>>> qf.fingerprint('Niall')
'aliallni'
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
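The description above (unique q-grams, alphabetized and concatenated) can be sketched directly. This minimal version assumes the phrase is lowercased and stripped of everything except letters and digits before tokenizing, which matches the examples; it ignores the start_stop and skip options, and qgram_fingerprint_sketch is a hypothetical name.

```python
def qgram_fingerprint_sketch(phrase: str, qval: int = 2, joiner: str = "") -> str:
    """Join the sorted set of q-grams of the normalized phrase."""
    # Assumption: normalization is lowercasing plus removal of non-alphanumerics.
    cleaned = "".join(ch for ch in phrase.lower() if ch.isalnum())
    qgrams = {cleaned[i:i + qval] for i in range(len(cleaned) - qval + 1)}
    return joiner.join(sorted(qgrams))


print(qgram_fingerprint_sketch("Christopher"))
```

Because whitespace is removed before tokenizing, q-grams can span word boundaries, as 'kb' and 'nf' do in the pangram example.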
- class abydos.fingerprint.SkeletonKey
Bases: _Fingerprint
Skeleton Key.
The skeleton key of a word is defined in [PZ84].
New in version 0.3.6.
- fingerprint(word: str) -> str
Return the skeleton key.
- Parameters:
word (str) -- The word to transform into its skeleton key
- Returns:
The skeleton key
- Return type:
str
Examples
>>> sk = SkeletonKey()
>>> sk.fingerprint('The quick brown fox jumped over the lazy dog.')
'THQCKBRWNFXJMPDVLZYGEUIOA'
>>> sk.fingerprint('Christopher')
'CHRSTPIOE'
>>> sk.fingerprint('Niall')
'NLIA'
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
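The structure of the key is visible in the examples: the first letter, then the remaining unique consonants in order of appearance, then the unique vowels in order of appearance. The following is a minimal sketch of that construction, assuming Y counts as a consonant and that only ASCII letters are considered; skeleton_key_sketch is a hypothetical name.

```python
import string

VOWELS = "AEIOU"


def skeleton_key_sketch(word: str) -> str:
    """First letter, then unique consonants, then unique vowels."""
    letters = [ch for ch in word.upper() if ch in string.ascii_uppercase]
    if not letters:
        return ""
    first = letters[0]
    # dict.fromkeys removes duplicates while preserving first-seen order.
    rest = [ch for ch in dict.fromkeys(letters[1:]) if ch != first]
    consonants = "".join(ch for ch in rest if ch not in VOWELS)
    vowels = "".join(ch for ch in rest if ch in VOWELS)
    return first + consonants + vowels


print(skeleton_key_sketch("Christopher"))
```

Placing vowels last means that misspellings differing only in a late vowel tend to share a long common prefix.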
- class abydos.fingerprint.String(joiner: str = ' ')
Bases: _Fingerprint
String Fingerprint.
The fingerprint of a string is a string consisting of all of the unique words in a string, alphabetized & concatenated with intervening joiners. This fingerprint is described at [Ope12].
New in version 0.3.6.
Initialize String instance.
- Parameters:
joiner (str) -- The string that will be placed between each word
New in version 0.4.0.
- fingerprint(phrase: str) -> str
Return string fingerprint.
- Parameters:
phrase (str) -- The string from which to calculate the fingerprint
- Returns:
The fingerprint of the phrase
- Return type:
str
Example
>>> sf = String()
>>> sf.fingerprint('The quick brown fox jumped over the lazy dog.')
'brown dog fox jumped lazy over quick the'
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
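The definition above translates almost directly into code. This is a minimal sketch, assuming normalization consists of lowercasing and dropping punctuation; the library may normalize further (for example, Unicode folding), and string_fingerprint_sketch is a hypothetical name.

```python
def string_fingerprint_sketch(phrase: str, joiner: str = " ") -> str:
    """Join the sorted set of normalized words in the phrase."""
    # Assumption: keep only letters, digits, and whitespace after lowercasing.
    cleaned = "".join(ch for ch in phrase.lower()
                      if ch.isalnum() or ch.isspace())
    return joiner.join(sorted(set(cleaned.split())))


print(string_fingerprint_sketch("The quick brown fox jumped over the lazy dog."))
```

Since word order, case, punctuation, and repetition are all discarded, variants such as 'Dog, the lazy' and 'the lazy dog' map to the same key, which is what makes this fingerprint useful for clustering near-duplicate records.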
- class abydos.fingerprint.SynonameToolcode
Bases: _Fingerprint
Synoname Toolcode.
Cf. [Gro91, JPGTrust91].
New in version 0.3.6.
- fingerprint(lname: str, fname: str = '', qual: str = '', normalize: int = 0) -> str
Build the Synoname toolcode.
- Parameters:
lname (str) -- Last name
fname (str) -- First name (can be blank)
qual (str) -- Qualifier
normalize (int) -- Normalization mode (0, 1, or 2)
- Returns:
The transformed names and the synoname toolcode, separated by commas
- Return type:
str
Examples
>>> st = SynonameToolcode()
>>> st.fingerprint('hat')
'hat,,0000000003$$h'
>>> st.fingerprint('niall')
'niall,,0000000005$$n'
>>> st.fingerprint('colin')
'colin,,0000000005$$c'
>>> st.fingerprint('atcg')
'atcg,,0000000004$$a'
>>> st.fingerprint('entreatment')
'entreatment,,0000000011$$e'

>>> st.fingerprint('Ste.-Marie', 'Count John II', normalize=2)
'ste.-marie ii,count john,0200491310$015b049a127c$smcji'
>>> st.fingerprint('Michelangelo IV', '', 'Workshop of')
'michelangelo iv,,3000550015$055b$mi'
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.6.0: Changed to return a comma-separated string instead of 3-tuple of strs
- fingerprint_tuple(lname: str, fname: str = '', qual: str = '', normalize: int = 0) -> Tuple[str, str, str]
Build the Synoname toolcode as a tuple.
- Parameters:
lname (str) -- Last name
fname (str) -- First name (can be blank)
qual (str) -- Qualifier
normalize (int) -- Normalization mode (0, 1, or 2)
- Returns:
The transformed names and the synoname toolcode
- Return type:
tuple
Examples
>>> st = SynonameToolcode()
>>> st.fingerprint_tuple('hat')
('hat', '', '0000000003$$h')
>>> st.fingerprint_tuple('niall')
('niall', '', '0000000005$$n')
>>> st.fingerprint_tuple('colin')
('colin', '', '0000000005$$c')
>>> st.fingerprint_tuple('atcg')
('atcg', '', '0000000004$$a')
>>> st.fingerprint_tuple('entreatment')
('entreatment', '', '0000000011$$e')

>>> st.fingerprint_tuple('Ste.-Marie', 'Count John II', normalize=2)
('ste.-marie ii', 'count john', '0200491310$015b049a127c$smcji')
>>> st.fingerprint_tuple('Michelangelo IV', '', 'Workshop of')
('michelangelo iv', '', '3000550015$055b$mi')
New in version 0.6.0.