Wiktionary talk:Votes/2011-07/Redirecting single-character digraphs
Affected characters
I wrote a Perl script to go through UnicodeData.txt and NamesList.txt (from http://www.unicode.org/Public/UNIDATA/) and find each character that has a direct compatibility mapping to a sequence of multiple non-modifier letters, without any compatibility formatting tag. That turns out to include these 56 characters:
- U+0132 Ĳ LATIN CAPITAL LIGATURE IJ → U+0049 I U+004A J
- U+0133 ĳ LATIN SMALL LIGATURE IJ → U+0069 i U+006A j
- U+01C4 DŽ LATIN CAPITAL LETTER DZ WITH CARON → U+0044 D U+017D Ž
- U+01C5 Dž LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON → U+0044 D U+017E ž
- U+01C6 dž LATIN SMALL LETTER DZ WITH CARON → U+0064 d U+017E ž
- U+01C7 LJ LATIN CAPITAL LETTER LJ → U+004C L U+004A J
- U+01C8 Lj LATIN CAPITAL LETTER L WITH SMALL LETTER J → U+004C L U+006A j
- U+01C9 lj LATIN SMALL LETTER LJ → U+006C l U+006A j
- U+01CA NJ LATIN CAPITAL LETTER NJ → U+004E N U+004A J
- U+01CB Nj LATIN CAPITAL LETTER N WITH SMALL LETTER J → U+004E N U+006A j
- U+01CC nj LATIN SMALL LETTER NJ → U+006E n U+006A j
- U+01F1 Ǳ LATIN CAPITAL LETTER DZ → U+0044 D U+005A Z
- U+01F2 ǲ LATIN CAPITAL LETTER D WITH SMALL LETTER Z → U+0044 D U+007A z
- U+01F3 ǳ LATIN SMALL LETTER DZ → U+0064 d U+007A z
- U+0587 և ARMENIAN SMALL LIGATURE ECH YIWN → U+0565 ե U+0582 ւ
- U+0675 ٵ ARABIC LETTER HIGH HAMZA ALEF → U+0627 ا U+0674 ٴ
- U+0676 ٶ ARABIC LETTER HIGH HAMZA WAW → U+0648 و U+0674 ٴ
- U+0677 ٷ ARABIC LETTER U WITH HAMZA ABOVE → U+06C7 ۇ U+0674 ٴ
- U+0678 ٸ ARABIC LETTER HIGH HAMZA YEH → U+064A ي U+0674 ٴ
- U+0EDC ໜ LAO HO NO → U+0EAB ຫ U+0E99 ນ
- U+0EDD ໝ LAO HO MO → U+0EAB ຫ U+0EA1 ມ
- U+20A8 ₨ RUPEE SIGN → U+0052 R U+0073 s
- U+2116 № NUMERO SIGN → U+004E N U+006F o
- U+2121 ℡ TELEPHONE SIGN → U+0054 T U+0045 E U+004C L
- U+213B ℻ FACSIMILE SIGN → U+0046 F U+0041 A U+0058 X
- U+2161 Ⅱ ROMAN NUMERAL TWO → U+0049 I U+0049 I
- U+2162 Ⅲ ROMAN NUMERAL THREE → U+0049 I U+0049 I U+0049 I
- U+2163 Ⅳ ROMAN NUMERAL FOUR → U+0049 I U+0056 V
- U+2165 Ⅵ ROMAN NUMERAL SIX → U+0056 V U+0049 I
- U+2166 Ⅶ ROMAN NUMERAL SEVEN → U+0056 V U+0049 I U+0049 I
- U+2167 Ⅷ ROMAN NUMERAL EIGHT → U+0056 V U+0049 I U+0049 I U+0049 I
- U+2168 Ⅸ ROMAN NUMERAL NINE → U+0049 I U+0058 X
- U+216A Ⅺ ROMAN NUMERAL ELEVEN → U+0058 X U+0049 I
- U+216B Ⅻ ROMAN NUMERAL TWELVE → U+0058 X U+0049 I U+0049 I
- U+2171 ⅱ SMALL ROMAN NUMERAL TWO → U+0069 i U+0069 i
- U+2172 ⅲ SMALL ROMAN NUMERAL THREE → U+0069 i U+0069 i U+0069 i
- U+2173 ⅳ SMALL ROMAN NUMERAL FOUR → U+0069 i U+0076 v
- U+2175 ⅵ SMALL ROMAN NUMERAL SIX → U+0076 v U+0069 i
- U+2176 ⅶ SMALL ROMAN NUMERAL SEVEN → U+0076 v U+0069 i U+0069 i
- U+2177 ⅷ SMALL ROMAN NUMERAL EIGHT → U+0076 v U+0069 i U+0069 i U+0069 i
- U+2178 ⅸ SMALL ROMAN NUMERAL NINE → U+0069 i U+0078 x
- U+217A ⅺ SMALL ROMAN NUMERAL ELEVEN → U+0078 x U+0069 i
- U+217B ⅻ SMALL ROMAN NUMERAL TWELVE → U+0078 x U+0069 i U+0069 i
- U+FB00 ﬀ LATIN SMALL LIGATURE FF → U+0066 f U+0066 f
- U+FB01 ﬁ LATIN SMALL LIGATURE FI → U+0066 f U+0069 i
- U+FB02 ﬂ LATIN SMALL LIGATURE FL → U+0066 f U+006C l
- U+FB03 ﬃ LATIN SMALL LIGATURE FFI → U+0066 f U+0066 f U+0069 i
- U+FB04 ﬄ LATIN SMALL LIGATURE FFL → U+0066 f U+0066 f U+006C l
- U+FB05 ﬅ LATIN SMALL LIGATURE LONG S T → U+017F ſ U+0074 t
- U+FB06 ﬆ LATIN SMALL LIGATURE ST → U+0073 s U+0074 t
- U+FB13 ﬓ ARMENIAN SMALL LIGATURE MEN NOW → U+0574 մ U+0576 ն
- U+FB14 ﬔ ARMENIAN SMALL LIGATURE MEN ECH → U+0574 մ U+0565 ե
- U+FB15 ﬕ ARMENIAN SMALL LIGATURE MEN INI → U+0574 մ U+056B ի
- U+FB16 ﬖ ARMENIAN SMALL LIGATURE VEW NOW → U+057E վ U+0576 ն
- U+FB17 ﬗ ARMENIAN SMALL LIGATURE MEN XEH → U+0574 մ U+056D խ
- U+FB4F ﭏ HEBREW LIGATURE ALEF LAMED → U+05D0 א U+05DC ל
Of course, we may not want this vote to include all of the above; <№>, for example, is not exactly a "digraph". Conversely, we may want it to include some things that aren't listed above; the above-mentioned search criteria are just a first pass, and I welcome other thoughts. Still, it's hopefully a starting point for discussion.
(I realize some of the above is probably gobbledegook to anyone who's not familiar with the guts of Unicode . . . if you have any questions, ask. Though I have to admit that I'm not terribly familiar with the guts of Unicode, either!)
—RuakhTALK 00:04, 5 July 2011 (UTC)
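For anyone who wants to reproduce or adjust the search, here is a rough Python sketch of the criteria described above. It is not the original Perl script. It uses Python's bundled `unicodedata` module rather than reading UnicodeData.txt directly, so it reflects whatever Unicode version your Python ships with and may not return the same 56 characters. It also assumes that "without any compatibility formatting tag" means the mapping carries only the generic `<compat>` tag (not `<super>`, `<square>`, `<font>`, etc.). The names `compat_letter_sequence` and `find_candidates` are made up for this illustration.

```python
import sys
import unicodedata

# General categories counted as "non-modifier letters" (assumption: Lm is excluded).
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lo"}


def compat_letter_sequence(ch):
    """Return the list of mapped letters if ch meets the criteria, else None."""
    fields = unicodedata.decomposition(ch).split()
    # Need the generic <compat> tag followed by at least two code points.
    if len(fields) < 3 or fields[0] != "<compat>":
        return None
    targets = [chr(int(field, 16)) for field in fields[1:]]
    # Every mapped character must be a letter other than a modifier letter.
    if all(unicodedata.category(t) in LETTER_CATEGORIES for t in targets):
        return targets
    return None


def find_candidates():
    """Yield (character, mapped letters) for every qualifying code point."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        targets = compat_letter_sequence(ch)
        if targets:
            yield ch, targets


if __name__ == "__main__":
    for ch, targets in find_candidates():
        mapping = " ".join(f"U+{ord(t):04X} {t}" for t in targets)
        name = unicodedata.name(ch, "<unnamed>")
        print(f"U+{ord(ch):04X} {ch} {name} -> {mapping}")
```

Only the direct mapping is inspected (no recursive decomposition), which matches the "direct compatibility mapping" wording above. Loosening the tag check or the category set is the obvious place to experiment with different criteria.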
- Thanks, but how complete is that list? You mentioned some ligatures, but not æ... --Daniel 16:44, 5 July 2011 (UTC)
- It is a perfectly complete list . . . of characters meeting the above-mentioned criteria. Unicode does not give a nontrivial compatibility decomposition for <æ>, so it didn't qualify. But as I mentioned, we may want to consider different criteria. Incidentally, Unicode names <æ> "LATIN SMALL LETTER AE", not "LATIN SMALL LIGATURE AE", though the latter is indicated to be an alias for it. Here is its full entry in NamesList.txt:
 00E6	LATIN SMALL LETTER AE
 	= latin small ligature ae (1.0)
 	= ash (from Old English æsc)
 	* Danish, Norwegian, Icelandic, Faroese, Old English, French, IPA
 	x (latin small ligature oe - 0153)
 	x (cyrillic small ligature a ie - 04D5)
—RuakhTALK 18:07, 6 July 2011 (UTC)
- IMO some of these, at least, are of interest in their own right and should not redirect. U+FB4F ﭏ HEBREW LIGATURE ALEF LAMED, for example, can have an interesting etymology: when it was first used, why and when it's used, etc. This is independent of the page אל. The same can be said for the (now former) Rupee sign, the ﬃ and ﬆ families of ligatures, and perhaps more.—msh210℠ (talk) 17:22, 6 July 2011 (UTC)
- I agree about alef-lamed and the rupee sign, but I think the ﬃ and ﬆ ligatures are exactly what this vote should be about. They're exactly the kind of "character" that is no longer being added to Unicode. —RuakhTALK 18:07, 6 July 2011 (UTC)
- I suggest restricting this vote to entries written in Latin script only, regardless of whether characters in Hebrew, Lao, Armenian and Arabic would follow suit. --Daniel 18:24, 6 July 2011 (UTC)
- Redirecting ﬃ (U+FB03) to ffi (the three-letter sequence) is a bad idea, because the latter does not exist. We can create it, but I don't see how it would be justifiable. --Daniel 18:24, 6 July 2011 (UTC)
- Oh, right. Good point. —RuakhTALK 20:52, 10 July 2011 (UTC)
Redirecting trigraphs
This vote should probably extend to trigraphs, to cover ℻ and ℡ as well. --Daniel 15:58, 5 July 2011 (UTC)
Redirecting Roman numerals
Apparently, should this vote pass, ⅺ will redirect to ⅹⅰ, and not to xi, because we would still keep the distinction between "generic" Latin letters and Roman numerals. --Daniel 16:56, 5 July 2011 (UTC)
Specific digraphs
I restricted the list to 14 specific redirects. --Daniel 02:33, 10 July 2011 (UTC)