Appendix:Unicode normalization

From Wiktionary, the free dictionary
See also Unicode normalization considerations on the MediaWiki website.

Wikimedia, along with most servers on the internet, stores Unicode strings in the form called NFC, or Normalization Form C (Canonical Composition). This means that several different but canonically equivalent Unicode strings are often mapped to the same normalized form. When you enter a Unicode string and save the page, it is automatically converted to the normalized form, so non-normalized strings cannot be saved in a Wiktionary page.
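
The mapping of several strings to one normalized form can be illustrated with a minimal sketch using Python's standard `unicodedata` module. This is not the code MediaWiki runs; the helper name `to_nfc` is an assumption made for illustration.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the NFC form of text (hypothetical helper, not MediaWiki code)."""
    return unicodedata.normalize("NFC", text)


decomposed = "C\u0327"   # LATIN CAPITAL LETTER C + COMBINING CEDILLA
precomposed = "\u00C7"   # LATIN CAPITAL LETTER C WITH CEDILLA

# The two strings differ code point by code point...
assert decomposed != precomposed
# ...but both map to the same NFC string.
assert to_nfc(decomposed) == to_nfc(precomposed) == precomposed

# unicodedata.is_normalized requires Python 3.8 or later.
print(unicodedata.is_normalized("NFC", decomposed))   # False
print(unicodedata.is_normalized("NFC", precomposed))  # True
```

A page-saving pipeline that behaves as described above would apply something like `to_nfc` to every submitted string before storing it.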

Equivalence

Type of canonical equivalence | Alternate representation | NFC
Combining sequence | C + ◌̧ | Ç
Ordering of combining marks | q + ◌̇ (U+0307) + ◌̣ (U+0323) | q + ◌̣ (U+0323) + ◌̇ (U+0307)
Hangul | ᄀ + ᅡ | 가
Singleton | Å (U+212B) | Å (U+00C5)
Hebrew | ל ָ ֽ ִ | ל ִ ָ ֽ
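
The rows of the table can be checked mechanically. The sketch below assumes the code points listed in the comments, which are a reading of the table rather than something the page states explicitly; `ROWS` and `check_rows` are hypothetical names.

```python
import unicodedata

# (alternate representation, expected NFC) for each kind of equivalence.
ROWS = {
    # C + COMBINING CEDILLA -> C WITH CEDILLA
    "combining sequence": ("C\u0327", "\u00C7"),
    # dot above (ccc 230) before dot below (ccc 220) gets reordered
    "ordering of combining marks": ("q\u0307\u0323", "q\u0323\u0307"),
    # leading consonant KIYEOK + vowel A -> precomposed syllable GA
    "Hangul": ("\u1100\u1161", "\uAC00"),
    # ANGSTROM SIGN -> A WITH RING ABOVE
    "singleton": ("\u212B", "\u00C5"),
    # lamed + qamats + meteg + hiriq -> lamed + hiriq + qamats + meteg
    "Hebrew": ("\u05DC\u05B8\u05BD\u05B4", "\u05DC\u05B4\u05B8\u05BD"),
}


def check_rows(rows):
    """Map each row name to True if NFC(alternate) equals the expected form."""
    return {
        name: unicodedata.normalize("NFC", alternate) == expected
        for name, (alternate, expected) in rows.items()
    }


results = check_rows(ROWS)
for name, ok in results.items():
    print(f"{name}: {'matches' if ok else 'differs'}")
```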

Issues

Most of the time NFC makes processing text easier, but some oddities, both semantic and non-semantic, do appear. There are four cases in which a single character is not its own NFC form.

  1. Sometimes an alternative single character is the canonical composed form.
    Example: U+212B ( Å - ANGSTROM SIGN) is converted to U+00C5 ( Å - LATIN CAPITAL LETTER A WITH RING ABOVE)
  2. For some scripts, precomposed characters are not preferred.
    Example: U+0958 ( क़ - DEVANAGARI LETTER QA) is converted to the decomposed क़ which is U+0915 ( क - DEVANAGARI LETTER KA) + U+093C ( ◌़ - DEVANAGARI SIGN NUKTA).
  3. Where a decomposition exists in pre-Unicode 3.0 for a precomposed character added afterwards, the decomposition is preferred.
    Example: U+2ADC ( ⫝̸ - FORKING) is converted to ⫝̸ which is U+2ADD ( ⫝ - NONFORKING) + U+0338 ( ̸ - COMBINING LONG SOLIDUS OVERLAY).
  4. A decomposition is preferred to precomposed characters where the decomposition begins with a non-starter.
    Example: U+0344 ( ̈́ - COMBINING GREEK DIALYTIKA TONOS) is converted to U+0308 ( ̈ - COMBINING DIAERESIS (DIALYTIKA)) + U+0301 ( ́ - COMBINING ACUTE ACCENT (OXIA, TONOS)).
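
The four examples in the list above can be reproduced with a short sketch. It relies only on Python's standard `unicodedata.normalize`; the names `CASES` and `nfc_code_points` are illustrative assumptions.

```python
import unicodedata

# (single character, NFC form given in the list above)
CASES = [
    ("\u212B", "\u00C5"),        # 1. ANGSTROM SIGN: another single character
    ("\u0958", "\u0915\u093C"),  # 2. DEVANAGARI LETTER QA: KA + NUKTA
    ("\u2ADC", "\u2ADD\u0338"),  # 3. FORKING: NONFORKING + LONG SOLIDUS OVERLAY
    ("\u0344", "\u0308\u0301"),  # 4. DIALYTIKA TONOS: DIAERESIS + ACUTE ACCENT
]


def nfc_code_points(text):
    """Return the NFC form of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]


for original, expected in CASES:
    # In every case the single character is not its own NFC form.
    assert unicodedata.normalize("NFC", original) == expected
    print(f"U+{ord(original):04X} ->", " + ".join(nfc_code_points(original)))
```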

In a number of common cases, Unicode's canonical ordering of two diacritics is counterintuitive, and/or interoperates poorly with certain existing software. In other, less common cases, the problem is that the diacritics should not have a canonical ordering at all, because the two orderings are not actually equivalent (that is, the two diacritics should have the same value for the Canonical_Combining_Class (ccc) property, but instead they have different ones). For example, Hebrew "lai" (lamed U+05DC + patah U+05B7 + hiriq U+05B4) is mistakenly normalized to "lia" (lamed U+05DC + hiriq U+05B4 + patah U+05B7), because hiriq has a lower ccc value than patah.
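
The reordering can be observed directly by inspecting the ccc values. The sketch below assumes that "lai" is spelled lamed + patah + hiriq (U+05DC, U+05B7, U+05B4); `ccc_trace` is a hypothetical helper built on the standard `unicodedata.combining` function, which returns a character's canonical combining class.

```python
import unicodedata


def ccc_trace(text):
    """List each code point of text with its Canonical_Combining_Class."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


# Assumed spelling of "lai": lamed, then patah (a), then hiriq (i).
lai = "\u05DC\u05B7\u05B4"
normalized = unicodedata.normalize("NFC", lai)

print(ccc_trace(lai))         # patah (ccc 17) precedes hiriq (ccc 14)
print(ccc_trace(normalized))  # marks sorted by ascending ccc: hiriq, then patah
```

Because the two marks have different non-zero ccc values, canonical ordering sorts them, and the original order cannot be recovered from the normalized string.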

As the conversion is automatic, pages cannot exist for a non-NFC form. Attempting to explicitly link to a non-NFC form, such as U+212B (Å, ANGSTROM SIGN), will display the non-NFC form, but clicking the link takes the user to the NFC page Å.

Display

One can display non-NFC characters on a page using {{HTML char}} (for example, {{HTML char|212B}} will show Å). To note canonical equivalence between two single characters, use {{normalization}} in the caption field of the appropriate {{character info}} template on the NFC character (see Å for an example). To note that the NFC of a precomposed character is a decomposition, use {{decomposed}} in the caption field of the appropriate {{character info}} template on the NFC decomposition (see क़ for an example).

Notes

Wikimedia does not enforce Compatibility Equivalence which combines even more forms together (such as N and ).
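
The difference between canonical and compatibility equivalence can be sketched as follows. The character chosen here, U+FF2E (FULLWIDTH LATIN CAPITAL LETTER N), is an assumption made for illustration and is not necessarily the character the note above refers to.

```python
import unicodedata

# Illustrative assumption: a compatibility variant of the letter N.
fullwidth_n = "\uFF2E"  # FULLWIDTH LATIN CAPITAL LETTER N

nfc = unicodedata.normalize("NFC", fullwidth_n)
nfkc = unicodedata.normalize("NFKC", fullwidth_n)

# NFC leaves the character alone, so it stays distinct from plain "N".
print(f"NFC:  U+{ord(nfc):04X}")
# NFKC applies compatibility mappings and folds it into plain "N".
print(f"NFKC: U+{ord(nfkc):04X}")
```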

See also
