ASCII normalization is a custom, OurBigBook-defined normalization that converts many characters that look like Latin characters into Latin characters.
For now, we are using the `deburr` method of Lodash: lodash.com/docs/4.17.15#deburr, which only affects Latin-like characters.
In addition to `deburr` we also convert:
- en-dash and em-dash to the simple ASCII dash `-`. Wikipedia loves en-dashes in their article titles!
- Greek letters to their standard Latin names, e.g. `α` to `alpha`
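The custom conversions listed above can be sketched as a small lookup table applied character by character. This is only an illustration under stated assumptions: the names `CUSTOM_REPLACEMENTS` and `applyCustomReplacements` are hypothetical, only three Greek letters are included, and the actual OurBigBook replacement table is not reproduced here.

```javascript
// Hypothetical replacement table (an assumption for illustration only).
const CUSTOM_REPLACEMENTS = new Map([
  ['\u2013', '-'], // EN DASH -> ASCII hyphen-minus
  ['\u2014', '-'], // EM DASH -> ASCII hyphen-minus
  ['α', 'alpha'],
  ['β', 'beta'],
  ['γ', 'gamma'],
]);

// Hypothetical helper: replaces every code point found in the table and
// leaves all other characters untouched.
function applyCustomReplacements(s) {
  return Array.from(s, (c) =>
    CUSTOM_REPLACEMENTS.has(c) ? CUSTOM_REPLACEMENTS.get(c) : c
  ).join('');
}

console.log(applyCustomReplacements('Bose\u2013Einstein statistics')); // Bose-Einstein statistics
console.log(applyCustomReplacements('α decay')); // alpha decay
```

In a full pipeline, a step like this would presumably be combined with `deburr`; the order in which the two are applied is not specified above.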
One notable effect is that it converts variants of ASCII letters to ASCII letters, e.g. `é` to `e`, removing the accent.
This operation is kind of a superset of Unicode normalization acting only on Latin-like characters, where Unicode normalization basically only removes things like diacritics.
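To make the comparison with Unicode normalization concrete, here is a minimal sketch of diacritic removal using only the standard `String.prototype.normalize`. It is a stand-in for the Unicode side of the comparison, not Lodash's `deburr` and not the OurBigBook implementation; the helper name `stripDiacritics` is hypothetical.

```javascript
// Decompose to NFD so that accented letters become base letter + combining
// mark, then drop every combining mark (Unicode general category M).
function stripDiacritics(s) {
  return s.normalize('NFD').replace(/\p{M}/gu, '');
}

console.log(stripDiacritics('é')); // e
console.log(stripDiacritics('café')); // cafe
// æ has no Unicode decomposition, so this approach leaves it unchanged.
console.log(stripDiacritics('æ')); // æ
```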
OurBigBook normalization, on the other hand, also does other natural transformations that Unicode normalization does not do, e.g. `æ` to `ae`, as encoded by `deburr` and further custom replacements.
TODO `lodash.deburr`:
- only deals with Unicode blocks "Latin-1 Supplement" and "Latin Extended-A", notably missing Latin Extended-B, C and D, which contain some important characters. Pull requests have been ignored, so maybe we should just code our own on top.
- misses some candidates in letterlike symbols
- mathematical operators block
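As a rough aid for the TODO items above, the following sketch classifies a character by Unicode block and reports whether it falls in the two blocks that, according to the text above, `deburr` deals with. The block ranges come from the Unicode block definitions; the helper names `blockOf` and `inDeburrBlocks` are hypothetical, and the coverage claim itself is taken from this section rather than checked against Lodash's source.

```javascript
// Unicode block ranges relevant to the TODO list.
const BLOCKS = [
  { name: 'Basic Latin', start: 0x0000, end: 0x007f },
  { name: 'Latin-1 Supplement', start: 0x0080, end: 0x00ff },
  { name: 'Latin Extended-A', start: 0x0100, end: 0x017f },
  { name: 'Latin Extended-B', start: 0x0180, end: 0x024f },
  { name: 'Letterlike Symbols', start: 0x2100, end: 0x214f },
  { name: 'Mathematical Operators', start: 0x2200, end: 0x22ff },
  { name: 'Latin Extended-C', start: 0x2c60, end: 0x2c7f },
  { name: 'Latin Extended-D', start: 0xa720, end: 0xa7ff },
];

// Returns the block name of the first code point of ch, or null if the
// code point is outside the blocks listed above.
function blockOf(ch) {
  const cp = ch.codePointAt(0);
  const block = BLOCKS.find((b) => cp >= b.start && cp <= b.end);
  return block ? block.name : null;
}

// Blocks that deburr deals with, per the TODO above (an assumption here).
const DEBURR_BLOCKS = new Set(['Latin-1 Supplement', 'Latin Extended-A']);

function inDeburrBlocks(ch) {
  return DEBURR_BLOCKS.has(blockOf(ch));
}

for (const ch of ['é', 'ő', 'ș', 'ℝ', '∑']) {
  console.log(ch, blockOf(ch), inDeburrBlocks(ch));
}
```

Characters such as `ș` (Latin Extended-B), `ℝ` (Letterlike Symbols) and `∑` (Mathematical Operators) land outside the two covered blocks, which is where a custom layer on top of `deburr` would have to step in.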
Bibliography: