ASCII normalization is a custom, OurBigBook-defined normalization that converts many characters that look like Latin characters into Latin characters.
For now, we are using the deburr method of Lodash: lodash.com/docs/4.17.15#deburr, which only affects Latin-like characters.
In addition to deburr, we also convert:
- en-dash and em-dash to the simple ASCII dash (-). Wikipedia loves en-dashes in its article titles!
- Greek letters to their standard Latin names, e.g. α to alpha
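The conversions listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the OurBigBook implementation: the names asciiNormalize, stripDiacritics and GREEK_NAMES are hypothetical, the Greek table is deliberately partial, and Lodash's deburr is swapped for a rough stand-in (Unicode NFD decomposition followed by removal of combining marks) so that the snippet runs without dependencies.

```javascript
// Hypothetical, partial table: a real one would cover the whole Greek alphabet.
const GREEK_NAMES = { 'α': 'alpha', 'β': 'beta', 'γ': 'gamma' };

// Rough stand-in for Lodash deburr (an assumption made to avoid a dependency):
// decompose characters, then drop combining marks in U+0300..U+036F.
function stripDiacritics(s) {
  return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

function asciiNormalize(s) {
  return stripDiacritics(s)
    // En-dash (U+2013) and em-dash (U+2014) become the ASCII dash.
    .replace(/[\u2013\u2014]/g, '-')
    // Greek letters found in the table become their Latin names;
    // anything in the Greek block that is not in the table is left as-is.
    .replace(/[\u0370-\u03ff]/g, (c) => GREEK_NAMES[c] ?? c);
}

console.log(asciiNormalize('Schrödinger–Newton')); // Schrodinger-Newton
console.log(asciiNormalize('α decay'));            // alpha decay
```

Stripping diacritics first means that an accented Greek letter is reduced to its base letter before the name lookup, so the table only needs the unaccented forms.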
One notable effect is that it converts variants of ASCII letters to ASCII letters, e.g. é to e, removing the accent.
This operation is kind of a superset of Unicode normalization acting only on Latin-like characters, where Unicode basically only removes things like diacritics.
OurBigBook normalization, on the other hand, also does other natural transformations that Unicode does not do, e.g. æ to ae, as encoded by deburr and further custom replacements.
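The difference from plain Unicode normalization can be seen in a small experiment. The sketch below assumes that "Unicode normalization" here means NFD decomposition followed by removal of combining marks. The helper names and the EXTRA table are hypothetical and only illustrate the kind of replacement that a table-based approach such as deburr adds on top.

```javascript
// Unicode-only approach: NFD decomposition, then drop combining marks.
function stripCombiningMarks(s) {
  return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Hypothetical extra table for letters that have no Unicode decomposition.
const EXTRA = { 'æ': 'ae', 'Æ': 'Ae', 'ø': 'o' };

function withExtras(s) {
  return stripCombiningMarks(s).replace(/[æÆø]/g, (c) => EXTRA[c]);
}

console.log(stripCombiningMarks('é'));  // e
console.log(stripCombiningMarks('æ'));  // æ (unchanged: there is no accent to strip)
console.log(withExtras('Encyclopædia')); // Encyclopaedia
```

The letter æ is a single code point with no decomposition, so decomposition alone never turns it into ae. An explicit replacement table is what covers such cases.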