Module talk:scripts/data
Avoiding having to keep some things synchronized
Would it be possible to use this notation to store the common Hani+Jpan+Kore info under only one of those codes, so they no longer have to be kept synchronized by hand? - -sche (discuss) 09:28, 28 December 2013 (UTC)
- We can't do that because we have used square brackets inside the strings. --Z 09:32, 28 December 2013 (UTC)
- That can be changed, though. Maybe we should. —CodeCat 16:41, 28 December 2013 (UTC)
- Maybe we should just consider merging "polytonic" with "Grek". Are there really that many fonts that can't display polytonic symbols? --WikiTiki89 17:23, 28 December 2013 (UTC)
Romaji
[edit]Should "Romaji" and "Rōmaji" be added as alternative names of Latn, or Jpan? (I'm guessing "Latn"...) - -sche (discuss) 19:14, 10 June 2014 (UTC)
- I guess so. — Keφr 21:18, 10 June 2014 (UTC)
Font support for Latin Extended-D
Quivira 4.0 provides complete support for the Latin Extended-D block, AFAIK, and is a free font. Is there a way of having en.wiktionary.org use that font to display the characters in this range (U+A720–U+A7FF)? — I.S.M.E.T.A. 03:49, 29 June 2014 (UTC)
- MediaWiki talk:Common.css would be the place to ask it. — Keφr 08:22, 29 June 2014 (UTC)
- Thanks, Keφr. I've done just that. — I.S.M.E.T.A. 17:12, 30 June 2014 (UTC)
Georgian script
- copied here from Wiktionary:Grease pit/2013/December
ISO 15924 gave two codes to the three scripts used by Georgian: Nuskhuri and Asomtavruli are Geok, Mkhedruli (the usual Georgian script) is Geor. Separately, the ISO gave the scripts 2–3 blocks of codes: the characters from Ⴀ-Ⴭ are Asomtavruli, and the characters right after them, from ა-ჿ, are Mkhedruli ([1]); ⴀ-ⴭ is Nuskhuri. Currently, Module:scripts/data defines Geor (Georgian, i.e. Mkhedruli) as including the characters of Asomtavruli, but I recently added Geok (defined as including these same characters). FYI. (Note: because several of the aforementioned characters just look like boxes on my computer, I may have made some typos.) - -sche (discuss) 23:45, 21 December 2013 (UTC)
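For reference, the split described above could look roughly like this in the entry format this module uses elsewhere; the canonical names and exact field values here are only illustrative, not the live data:

m["Geor"] = {
    canonicalName = "Georgian",
    characters = "ა-ჿ",      -- Mkhedruli
}
m["Geok"] = {
    canonicalName = "Khutsuri",
    characters = "Ⴀ-Ⴭⴀ-ⴭ",  -- Asomtavruli + Nuskhuri
}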
- @-sche: So why does Geor include Asomtavruli?
- Also, Asomtavruli and Khutsuri are not other names of Nuskhuri. Nuskhuri does not have other names. Khutsuri is a combination of Asomtavruli and Nuskhuri, and thus not an independent script. Therefore having a script code for it is awkward.
- Anyways, why do we need the code Geok at all? We do not and will not use any of the older scripts. --Dixtosa (talk) 15:38, 18 April 2015 (UTC)
- @Dixtosa: Re "why does Geor include Asomtavruli?": I don't know; it was like that when I got here. I wrote what I wrote above in the hope that someone would clarify whether or not it was problematic for Asomtavruli to be listed under two different codes.
- Re "why do we need the code Geok?": even if we don't have entries for terms written in the older scripts, we'll have entries for the individual letters that make up those scripts, and it's possible that words written in the older scripts will be mentioned somewhere, e.g. in citations. It's helpful to have the characters mapped to a script code which is, in turn, mapped by MediaWiki talk:Common.css to fonts that display the characters correctly. But we could use Geor to cover all Georgian scripts, if the fonts we use for it (DejaVu Sans, Arial Unicode MS, Sylfaen) handle Asomtavruli and Nuskhuri reasonably well.
- Re the names: I've switched Geor to have Khutsuri as the canonical name and Asomtavruli and Nuskhuri as otherNames. The otherNames field (here and in Module:languages) is not just for synonyms of the canonical name, but also for the names of parts that are subsumed into the code in question. Hence if we subsumed Khutsuri / Asomtavruli / Nuskhuri into Geor, I would think they should be listed in the otherNames field. - -sche (discuss) 03:37, 19 April 2015 (UTC)
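If everything were subsumed into Geor, the entry might look something like this sketch (the character ranges come from the description earlier in this section; the names are only illustrative):

m["Geor"] = {
    canonicalName = "Georgian",
    otherNames = {"Khutsuri", "Asomtavruli", "Nuskhuri"},
    characters = "Ⴀ-Ⴭა-ჿⴀ-ⴭ",
}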
Font support for Latin Extended-E
Since the recent version 4.1 update, Quivira now completely supports both the Latin Extended-D and Latin Extended-E Unicode character blocks. Quivira is already the default font for Latinx. Currently, Latinx does not include any characters in the Latin Extended-E block. Accordingly, could someone change the character statement for Latinx in Module:scripts/data from "0-z¡-ɏḀ-ỿⱠ-Ɀ꜠-ꟿ" to "0-z¡-ɏḀ-ỿⱠ-Ɀ꜠-ꟿꬰ-ꭥ", please? — I.S.M.E.T.A. 15:24, 2 September 2014 (UTC)
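In entry form, the requested edit amounts to something like the following; only the characters line changes, and the other field values are placeholders rather than the module's actual contents:

m["Latinx"] = {
    canonicalName = "Latin",
    characters = "0-z¡-ɏḀ-ỿⱠ-Ɀ꜠-ꟿꬰ-ꭥ",  -- ꬰ-ꭥ is the Latin Extended-E addition
}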
The following discussion has been moved from Wiktionary:Requests for moves, mergers and splits (permalink).
This discussion is no longer live and is left here as an archive. Please do not modify this conversation, but feel free to discuss its conclusions.
The first option brings it in line with other script codes like Template:fa-Arab and Template:nv-Latn. The language code in the name doesn't have any meaning in itself, it's just to give a meaningful distinction.
The second makes it more compliant with the official language subtag registry, which recognises "polyton" as a subtag to indicate polytonic Greek (not as a script code in itself!). See [2]. This option is probably the more "correct" of the two. —CodeCat 13:06, 23 July 2013 (UTC)
- Grek-polyton is preferable because not all Greek written in polytonic is Ancient Greek. Modern Greek is officially defined as starting in 1453 and it was written in polytonic for over 500 years after that. —Angr 09:12, 27 July 2013 (UTC)
- Like I said, the language code doesn't mean it's used only for Ancient Greek, it just means "the Ancient Greek script variety" or "the variety of the script associated with Ancient Greek". In the same way, "fa-Arab" is used by many other languages beside Persian. —CodeCat 11:14, 27 July 2013 (UTC)
Not renamed. The script code in Module:scripts/data was not changed either. — Keφr 14:04, 3 January 2015 (UTC)
Please update Thai, Laoo, Lana
For Thai:
characters = "ก-๛",
For Laoo:
characters = "ກ-ໟ",
For Lana:
characters = "ᨠ-᪭",
(I have given the correct characters above.) --Octahedron80 (talk) 03:15, 13 August 2015 (UTC)
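For clarity, these are the requested lines as they would sit in the respective entries (other fields omitted; the canonical names are shown only for orientation and are assumptions):

m["Thai"] = { canonicalName = "Thai", characters = "ก-๛" }
m["Laoo"] = { canonicalName = "Lao", characters = "ກ-ໟ" }
m["Lana"] = { canonicalName = "Tai Tham", characters = "ᨠ-᪭" }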
- All done. (no change for Lana) Wyang (talk) 05:02, 25 May 2016 (UTC)
Formatting error with Thai
See Wiktionary:List of scripts. Instead of Thai, the Thai flag is displayed and the columns are off by one, shifted to the right. —Justin (koavf)❤T☮C☺M☯ 14:14, 3 January 2016 (UTC)
Burmese
[edit]Please update "Mymr" to support all extensions, since Mon and Shan (and more) use characters beyond current range. Also add an alias.
m["Mymr"] = { canonicalName = "Burmese", otherNames = {"Myanmar"}, characters = "က-႟ꩠ-ꩿꧠ-ꧾ", systems = {"abugida"}, }
--Octahedron80 (talk) 04:45, 25 May 2016 (UTC)
- Done. Wyang (talk) 05:01, 25 May 2016 (UTC)
Abugida
What is currently the use of systems = {"abugida"}? And is there another possible value? --Octahedron80 (talk) 01:41, 2 November 2016 (UTC)
Updating to clear out Category:Unspecified script languages
Several of these are known (e.g. Bandi/Gbandi or bza is Latn). Can someone either unprotect this to allow it to be updated or make a bot that can mass-import these values? —Justin (koavf)❤T☮C☺M☯ 01:22, 26 May 2017 (UTC)
- Some of them are known. Did you have a specific source you wanted to import the data from? DTLHS (talk) 06:13, 26 May 2017 (UTC)
- I unprotected this now. (to be edited by autoconfirmed users) Note that once Wikidata access is implemented here, I'd expect all this script information to be moved there somehow if people don't mind, but I guess it can't hurt to have this module updated before moving all the information there. --Daniel Carrero (talk) 06:18, 26 May 2017 (UTC)
- Actually, I don't think this is the module that has to be updated to clear out the unspecified script languages category; it's the language data modules that give the scripts for a particular language. — Eru·tuon 07:19, 26 May 2017 (UTC)
- If that's the case, I suppose we can give template editor status to Koavf? --Daniel Carrero (talk) 07:34, 26 May 2017 (UTC)
- Good point. Koavf had also posted about Category:Unspecified script characters, which is controlled by this module (AFAICT), but the languages are a different matter. - -sche (discuss) 08:18, 26 May 2017 (UTC)
No characters for Zyyy ("undetermined")?
Not adding characters = m["Latn"].characters for Zyyy is problematic, as it causes the templates to ask for transliteration for reconstructed words which are written in Latin script, e.g. in *ganzabara-. --Z 11:37, 6 June 2017 (UTC)
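A sketch of the suggested addition, in the format used by the other entries in this module (it assumes m["Latn"] is defined earlier in the file than m["Zyyy"], so its characters field can be reused):

m["Zyyy"] = {
    canonicalName = "Undetermined",
    characters = m["Latn"].characters,  -- reuse the Latin ranges
}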
- Hmm, an alternative is to add "Latn" as xme's script, as the script our entries are written in (regardless of whether or not it was used by the Medians themselves); we also do that with languages which are so far only attested in scholarly works that use Latin script (even when they are spoken in e.g. India and China and would possibly switch to using Deva or Hani if they were eventually written natively); OTOH, we don't list Latn as a script of Gothic or Proto-Runic. Another idea is to add a special code for "reconstructed-language Latin", maybe "und-Latn", and add it as the script of xme etc. However, what you propose is probably the best idea. (I suppose there will be no new side effects to Latin-script characters being listed as belonging not just to Latin but to Zyyy, since they are already also listed as belonging to a number of other scripts.) - -sche (discuss) 13:37, 6 June 2017 (UTC)
I have added this. --Z 11:15, 10 June 2017 (UTC)
Oh, the change doesn't fix the problem, but I should add Latn to xme in the language data modules. --Z 11:24, 10 June 2017 (UTC)
- Hmm, adding characters = m["Latn"].characters to Zyyy did not fix the problem? I wonder if that is because xme explicitly listed Zyyy as its script? Only two languages do that, out of the hundreds with no known native script (and of those two, Sentinelese has no content), so it appears to be non-normal (the norm seems to be to list no script), and maybe modules/templates don't handle it well. Purely unattested (proto-)languages often have Latinx as their script, so setting Median to Grek, Latn seems appropriate. - -sche (discuss) 16:38, 10 June 2017 (UTC)
- I see why now: after checking the relevant module that adds "transliteration needed", I think it would also be fixed by adding Latn before Zyyy, not the other way around as I did in my edit to the data module. --Z 20:52, 10 June 2017 (UTC)
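A hypothetical sketch of the ordering being suggested for the language data module; the field names follow the general shape of those modules and are not verified here:

m["xme"] = {
    canonicalName = "Median",
    scripts = {"Latn", "Zyyy"},  -- Latn listed first, so "transliteration needed" is not requested
}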
Alphabetic Thai and Lao
Please add codes for alphabetic Thai and provide two codes for abugidic and alphabetic Lao, as opposed to a single code for the Lao script.
The need arises because there are two Thai script writing systems and two families of Lao script writing systems for Lao. This causes a few ambiguities in transliteration, such as between alphabetic สุตวา (sutvā) and abugidic สุตวา (sutavā). All these systems are alphasyllabaries, so I don't know what the Wiktionary classification means by 'abugida'. Under Daniels' definition, the Thai language is written in an abugida while the modern Lao language as prescribed is an alphabet, i.e. all vowels are written.
- Correction: writing systems for Pali.
The 'script' for the alphabetic Thai script writing system, the dominant script for Pali by Internet use, though not the most prestigious one, can be described by an entry
m["pi-Thai"] = { canonicalName = "Thai", characters = m["Thai"].characters, systems = {"alphabet"}, -- No implicit vowels! parent = "Thai" }
On the internet, the other Thai writing system is the second commonest, and Latin then comes in in third place.
You will need to preserve the comment if you decide that the system should be 'abugida' because it is an alphasyllabary. The name may be objected to on the grounds that the prestigious writing system for Pali is the abugida, the system currently recorded. In my test cases I have been using the script code "aThai" because it writes the vowel 'a', which is implicit in the other, abugidic system.
The script system for the Lao language is currently documented as
m["Laoo"] = { canonicalName = "Lao", characters = "ກ-ໟ", systems = {"abugida"}, }
I suggest adding a comment '-- Implicit vowels deprecated' to that description if its being an alphasyllabary overrides its being an alphabet in Daniels' terminology.
The apparently higher prestige system for the Pali in the Lao script should have fields
{ canonicalName = "Lao", characters = m.Laoo.characters, systems = {"abugida"}, -- Uses implicit vowels! parent = "Laoo" }
I don't know what you should call it. In my test cases, I'd been using 'aLaoo' for the system closer to modern Lao. Perhaps you could call this one 'iLaoo' as it has implicit vowels. The other, alphabetic system seems to be the commoner. RichardW57m (talk) 12:55, 21 April 2021 (UTC)
P.S. The traditional Lao writing system for Pali uses the Tai Tham script. RichardW57m (talk) 12:57, 21 April 2021 (UTC)
- I see I can actually edit this data module myself. So, unless I hear to the contrary, I will add scripts "pi-Thai" and "iLaoo" this weekend. Then I will ask that Sanskrit be switched from using script "Laoo" to using script "iLaoo", so I can default transliteration from "Laoo" Pali and Sanskrit to assuming explicit vowels in the absence of easy evidence to the contrary. @Octahedron80, Kutchkutch, Wyang, Bhagadatta, AryamanA, Hintha, ZxxZxxZ, -sche, Erutuon, Suzukaze-c, Atitarev, Benwing2, Victar, Mahagaja. RichardW57 (talk) 16:56, 23 April 2021 (UTC)
- @RichardW57: I would suggest Laoo prefixed with a language code and - (like fa-Arab). iLaoo doesn't follow any of the patterns of our script codes. But what is the purpose of the script code? We have generally used script codes to enable us to specify different fonts in MediaWiki:Common.css, but you haven't mentioned anything about fonts. If it's just the systems value, that's only used in categories like Category:Lao script, so I'd suggest systems = {"abugida", "alphabet"}. Or is the purpose so that transliteration modules will have something to latch onto to give a different transliteration for the same language and script? I.e., require "Module:something-translit".tr("...", "pi", "pi-Thai") emitting a different transliteration for the same "..." from require "Module:something-translit".tr("...", "pi", "Thai"). — Eru·tuon 19:23, 23 April 2021 (UTC)
- @Erutuon: The issue is the transliteration. Different rules are needed to transliterate สุตวา (and ສຸຕວາ) depending on whether they are written in an abugida (as originally defined), i.e. with short /a/ not written, or in an alphabet (Daniels' definition), i.e. with short /a/ written (in which case the CV combination in the Lao script is ກັ or ກະ). In either case, the writing system is an alphasyllabary.
- Now there are font issues, but they divide things differently. In the Thai script, for abugidic Pali and some minority living languages, one needs an easily visible phinthu. It's also helpful for reading phonetic respelling in the Thai script. The Thai language is written using an abugida.
- We have identified three major systems of alphabetic Pali in the Lao script:
- Ambiguous Pali - Plain and aspirated voiced Pali consonants are not distinguished. Their Lao pronunciation is identical. This is the only system that was supported by Unicode prior to Unicode 13.0, and has the greatest Internet presence of the Lao script Pali writing systems.
- Unambiguous Pali using the Buddhist Institute extra characters that were encoded for Unicode 13.0.
- Unambiguous Pali using a 'nukta' - the nukta is the same character (we hope!) as the virama in the abugidic Pali. This lacked Unicode support until Unicode 13.0.
- There only seems to be one basic system of abugidic Pali in the Lao script. This one needs the characters added in Unicode 13.0.
- If we know that the writing system is alphabetic, and we apply the principle that transliteration should be based only on spelling and pronunciation, then apart from some worries over interpreting LAO LETTER NYO, a single algorithm can transliterate all three Lao script alphabetic writing systems to the Latin script. (OK, there may be some issues with preposed vowels, but Pali phonology can probably resolve them if they are ever encountered.)
- The Lao language is basically alphabetic, though there is some debate as to whether historically anaptyctic vowels should be written. It is thus the abugidic Pali that is most different from the Lao writing system.
- It might be simpler to pass a writing system identifier (or set of flags); Eastern Nagari is looking potentially complicated. As it is, I am not confident that Pali and Sanskrit can use the same logic to handle it. Octahedron80 and I had some fun and games with preventing the character being used for /v/ from forming rephas! Unfortunately, the interfaces don't support that, but require language and script code to suffice.
- @Benwing2 I'd also like to control whether the transliteration generates links. This is particularly relevant when participles need their own inflectional or semantic notes. Present participles seem to have some inflectional complications, and past participles easily acquire extra meanings. I'd like the reader to be able to jump to the non-Latin or Latin form using only a single click. Unfortunately, for ambiguous Lao, the transliteration into Latin is not the Latin script form. A case in point is ທັມມະ (damma), whose Latin script form may be damma (“to be tamed”) or dharma (“Dharma”). Thus the current mechanisms don't seem to easily support the necessary logic, but I think I can tweak my way round the lack. RichardW57 (talk) 21:02, 23 April 2021 (UTC)
- Correction: damma (“to be tamed”) or dhamma (“Dharma”). RichardW57 (talk) 22:13, 23 April 2021 (UTC)
- I forgot to answer your question. The upshot is that as Pali in Lao script seems usually to be written alphabetically, "pi-Laoo" does not seem appropriate for the abugidic writing system. RichardW57 (talk) 21:07, 23 April 2021 (UTC)
- "pi" and "Laoo" just follow ISO standard and many modules rely on it. You should not invent anything out of the line. Thai and Lao scripts are basically abugida among South and Southeast Asia as they are written for many languages, so it is quite useless to change their types. (Abugida means a vowel modifies a consonant; even they break into pieces today.) Sub-entry of a script is useful only when a language use a subset of the whole script range. (Many languages of Latin script don't split however.) In case of Pali and Sanskrit, only little amount of characters is never used; it is not worth to create a new set for them. If you are here for solving different transcription/transliteration system, you are in the wrong place. --Octahedron80 (talk) 22:24, 23 April 2021 (UTC)
- @RichardW57: I'm having a lot of trouble following what you're saying. I don't see a clear answer to my question. Currently transliteration is tied to the combination of language (Pali, Thai, Lao) and script (Thai, Lao). A transliteration function returns the same thing when given the same sequence of code points, language code, and script code. Does any particular combination of language and script require more than one transliteration function? To try to be very specific, if you've got a sequence of x script code points in language y, are there multiple sequences of Latin code points that the sequence of x script code points should be automatically converted to by the language y script x transliteration function? That would be a possible reason to split x script code into multiple script codes, one for each possible sequence of Latin code points, so that the language y transliteration function(s) can behave differently for each script code. — Eru·tuon 22:52, 23 April 2021 (UTC)
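To make the split being asked about concrete: a transliteration module could branch on the script code it is handed, along the lines of the sketch below. The rule functions are placeholders, not an existing module.

local export = {}

-- Placeholder rules: the real logic would live in something like Module:pi-translit.
local function tr_abugidic(text) return text end   -- bare consonant implies short a
local function tr_alphabetic(text) return text end -- no implied vowels

function export.tr(text, lang, sc)
    if sc == "pi-Thai" then
        return tr_alphabetic(text)
    end
    return tr_abugidic(text) -- plain "Thai"
end

return export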
- @Erutuon: In the second paragraph of this section, I gave the example of alphabetic สุตวา (sutavā, “sutvā”) and abugidic สุตวา (sutavā, “sutavā”). The script is Thai, the language is Pali, and they are similar but different inflections of the verb. The first is the formation usually called the absolutive or gerundive, found in Indic and Dravidian, and the second is, for various genders, the nominative and vocative singular of the perfect active participle (follow the links for details). They may be related, just as the Russian gerundive and present participle are related. The ambiguity will typically arise when the absolutive and past participle attach the 't' to the stem the same way (which is usually the case with underived verbs) and the stem does not contain /a/. The same ambiguity arises for Lao. RichardW57 (talk) 23:59, 23 April 2021 (UTC)
- @Erutuon, Octahedron80: Daniels' definitions of alphabet and abugida are a neat shorthand for the differences between the two systems. Basically, the difference is whether the lack of a vowel symbol means no vowel (alphabet) or /a/ (abugida). In this sense, the writing system of the Lao language has been converted to an alphabet, while Thai is mostly an abugida, except that word-final vowels are almost always written. Both remain alphasyllabaries. RichardW57 (talk) 23:59, 23 April 2021 (UTC)
- @Octahedron80:"Beng" and "as-Beng" are already being used for subtle, non-conflicting usage differences in Sanskrit. If we find Pali being significantly used in an Assamese style, we would have a real conflict for Pali, as it has been concluded that the Bengali usage uses U+09F0 for /v/, while the Assamese usage would probably use the same character for /r/ as in Assamese. RichardW57 (talk) 23:59, 23 April 2021 (UTC)
- @Erutuon, Octahedron80: An option that would avoid the need for additional script codes would be to replace, for Pali use, {{link}}, {{mention}} and function full_link in Module:links by templates and functions "pi-link", "pi-mention" and "pi-full-link", and have backdoor access to Module:pi-translit instead of going through the standard interface. I think so doing would upset some. RichardW57 (talk) 00:29, 24 April 2021 (UTC)
- @Erutuon, Benwing2, Octahedron80: So, which should I do? The 'script' name "iLaoo" has been objected to as being as unsystematic as "polytonic" and "Latinx", but no reasonable alternative has been proposed. Perhaps sa-Laoo would work, but I'd like to see some quotable Lao script Sanskrit. Is using back doors into Module:pi-translit a tolerable alternative? The boilerplate for its description says that one should not even invoke it explicitly, with the implication that language and 'script' will provide the necessary information. I'd rather use the back door solution - the back door would be to export a transliteration function with extra arguments. RichardW57m (talk) 12:08, 26 April 2021 (UTC)
- Incidentally, I would use the child script names as a last resort for when I cannot straightforwardly adequately deduce the writing system from the string to be transliterated. RichardW57m (talk) 12:08, 26 April 2021 (UTC)
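The "back door" described here could simply be an extra exported entry point that takes the writing system as an argument, while the standard tr(text, lang, sc) keeps its usual behaviour. A rough sketch with hypothetical names, not the actual Module:pi-translit:

local export = {}

-- Extra entry point: "system" is "abugida" or "alphabet".
function export.tr_with_system(text, lang, sc, system)
    -- placeholder: the real rules would differ in whether a bare
    -- consonant is read with an implicit short a
    return text
end

-- Standard interface, kept for templates that only pass language and script.
function export.tr(text, lang, sc)
    return export.tr_with_system(text, lang, sc, "abugida")
end

return export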
Clarification
@Surjection Thanks! Just to clarify, I was about to revert it myself; I realised what a mistake that was. نعم البدل (talk) 16:17, 27 June 2022 (UTC)
- Don't do test edits on modules that are used on a ton of pages. What were you even trying to do, anyway? — SURJECTION / T / C / L / 16:18, 27 June 2022 (UTC)
- My bad - I will refrain from doing something like that again. I was trying to embed the Nastaliq script code "Aran" as an alias or a child of the "Arab" code. نعم البدل (talk) 16:20, 27 June 2022 (UTC)
- There's a comment saying "Aran (Nastaliq) is subsumed into Arab". — SURJECTION / T / C / L / 16:25, 27 June 2022 (UTC)
- @Surjection I read that, hence why I didn't just add "Aran" as an entirely separate script, but the "Aran" code wasn't working when I tried to make Template:User Aran. I was going to revert my edit and leave it be, before you reverted it for me. نعم البدل (talk) 16:33, 27 June 2022 (UTC)
Unicode script definitions, normalization
@theknightwho: Probably more maintainable to use Unicode script definitions, as in your edit. We did have different patterns for Cyrs and Cyrl that permitted {{scripttaglinks}} and JavaScript, using data derived from the patterns, to differentiate them (User:Erutuon/scripts/scriptRecognition.js, User:Erutuon/scripts/watchlistScriptTagging.js), but I don't know if the Cyrs pattern was correct.
The script data also had simpler code point ranges that included unassigned characters (which is why polytonic had ἀ-῾ rather than what you changed it to), because they should not occur in Wiktionary entries, we would ignore Unknown-script characters in script tagging, and it would probably somewhat speed up the script detection of languages with multiple scripts to collapse as many code point ranges as we can. We could automatically process the code point ranges to add unassigned characters if it's worth it. The new exact character ranges are also more confusing to read.
About my reversion of the polytonic change: some of the characters in the ranges of code points were NFC-normalized into characters with lower code points when the module text was saved, making the character range invalid. For instance, ΐ (U+1FD3, GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA) was normalized to ΐ (U+0390, GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS), lower than the first character of the range, ῆ (U+1FC6, GREEK SMALL LETTER ETA WITH PERISPOMENI). This can be avoided by generating the character with mw.ustring.char(0x1FD3). Or we could just omit the non-NFC characters, as someone did with most of the CJK Compatibility Ideographs. (The Hani pattern only contains the compatibility ideographs 﨎﨏﨑﨓﨔﨟﨡﨣﨤﨧﨨﨩, which I guess are the only ones that don't have NFC equivalents.) The non-NFC characters can only occur in Lua or in a text box before the preview button is pressed, not in saved wikitext. — Eru·tuon 16:13, 23 December 2022 (UTC)
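As a minimal sketch of that workaround (not the actual polytonic entry), a range whose endpoint would otherwise be NFC-normalized on save can be assembled at run time:

-- U+1FD3 written literally would be saved as U+0390, which is lower than
-- U+1FC6 and would make the range invalid; building it in Lua avoids that.
local iota_dialytika_oxia = mw.ustring.char(0x1FD3)
local range = "ῆ-" .. iota_dialytika_oxia  -- U+1FC6 through U+1FD3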
- @Erutuon Thanks - somewhat shamefully, I had to go do something else and saved what I'd done up to that point. Looking back, you're right that there must be a more manageable way to do this.
- In particular, what I'd really like us to include are the appropriate common characters that are shared between certain scripts in all the relevant scripts, because our previous method certainly did not include many of them at all, and those we did include tended to be limited only to their original script.
- Regarding Cyrillic, I hadn't realised that we relied on this to distinguish them. Our Cyrillic definition was definitely too restrictive, but I agree that it was rash of me to just set Early Cyrillic to being the same. That being said, are there any languages that use both (excluding Translingual)? If not, it may not be necessary to be so specific anyway, as we can determine the difference based on the language (e.g. Russian vs OCS). (I can also confirm that you're correct about 﨎﨏﨑﨓﨔﨟﨡﨣﨤﨧﨨﨩, too - and in fact they're not actually compatibility ideographs at all, despite the block they're in. They're just ordinary CJK characters, as far as Unicode is concerned.) Theknightwho (talk) 16:54, 23 December 2022 (UTC)
- @Theknightwho: I wrote out a Rust program to analyze script and block data from the Unicode database and figure out whether it makes sense to ignore Unknown-script code points. I looked at the gaps where Unknown-script code points (I guess generally unassigned) are surrounded by two code points of the same script. If I've coded it right, the gaps generally belong to the same block as one or the other of the two code points surrounding them. The only cases where the code points in a gap belong to different blocks (ignoring code points that have no block) is with two Common-script gaps, U+1ECB5..U+1ED00 in the Indic Siyaq Numbers (U+1EC70..U+1ECBF) and Ottoman Siyaq Numbers (U+1ED00..U+1ED4F) blocks, and U+FFEF..U+FFF8 in the Halfwidth and Fullwidth Forms (U+FF00..U+FFEF) and Specials (U+FFF0..U+FFFF) blocks. We don't have a "Common" script, so this doesn't affect us. Not sure if that's important, but it's one way of analyzing what the gaps actually are. The program also has a function to generate Lua-appropriate patterns.
- I noticed you ran some of the patterns through mw.ustring.toNFC. That doesn't have any effect unless some of the characters in the input to the function are generated by mw.ustring.char because all literal text saved on a page is already NFC. What's the goal there? — Eru·tuon 19:09, 31 December 2022 (UTC)
- @Erutuon: Just on your last point: for some reason, I’d forgotten that certain characters are excluded from composition under NFC, and had mistakenly assumed that the issue you mentioned (where they were decomposing on input) could be fixed by normalization. Obviously that doesn’t actually make sense if WP itself is running NFC automatically when pages are saved, so I’ll remove them shortly. I’ll give a fuller reply in a bit, as it sounds interesting! Have you incorporated Unicode’s ScriptExtensions.txt as well? Theknightwho (talk) 19:35, 31 December 2022 (UTC)
- I haven't worked with script extensions. I'm not sure what it means for script recognition if a character is in two script sets of the same language (which likely would happen in Sanskrit), and I had the impression many of the characters that would be encountered in Wiktionary would be diacritical marks or punctuation. But I'm looking at the script extensions with my program now. — Eru·tuon 21:46, 31 December 2022 (UTC)
- @theknightwho: The difference between script and script-and-script-extension patterns can be seen here (though it's not very illuminating because it's a diff between Lua patterns). The script extensions are mainly symbols (222 code points), numbers (206), marks (81), and punctuation (57). There are only 28 letter code points. So it's more complicated than I thought, but it might improve script recognition sometimes. — Eru·tuon 16:15, 2 January 2023 (UTC)
- @Erutuon Thanks! I think it's worth implementing these, as in most cases the question of which script should be chosen (if more than one matches) is merely down to which is first in the list for that particular language. There are two issues which are a bit more tricky:
- Hani, Hant and Hans are currently set to override any other script if they match at least one character, which realistically only occurs with Latn in Chinese languages (because Jpan and Kore bypass this). Hant and Hans don't share any characters with non-Chinese scripts, but Hani does: if we have a purely Latin string that contains (e.g.) any of U+A700 to U+A707 - which are in Hani and Latn - then we don't want the override to occur. This is easily possible to solve, but it'll need to be done relatively carefully in order to remain efficient.
- There is also the question of which script should be returned by findBestScriptWithoutLang. I'm not sure what the best way to deal with this is.
- Theknightwho (talk) 17:50, 2 January 2023 (UTC)
- The only case I can think of where taking note of script extensions might help is Garshuni, where we might want to select a font that can handle Syriac consonants with Arabic vowels, as opposed to one that can only handle Syriac vowels with Syriac consonants. Now, I may be overlooking categories based on the presence of stray characters, but that's the only example I can think of. A major bug with the Unicode property is that a few years ago it was purged of usages not recorded in CLDR, overlooking the fact that the requirements for a CLDR entry effectively excluded languages like Sanskrit that tend not to have standard formats for time of day. That was when the dandas were dropped as Latin-script extensions, even though they're quite common in Roman script Sanskrit text. --RichardW57m (talk) 15:34, 3 February 2023 (UTC)
- We have a rule for choosing between scripts - choose the script which matches the most characters. We need this for {{sa-sc}} to have a prayer of distinguishing between the Assamese and Bengali Wiktionary scripts. --RichardW57m (talk) 15:34, 3 February 2023 (UTC)
- It can tell them apart if there is an 'r' in the word, and 'v' is a giveaway if it is Assamese. The difference is really only relevant for inflection, which we don't yet do for Sanskrit for scripts other than Devanagari. --RichardW57m (talk) 15:55, 3 February 2023 (UTC)
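A rough sketch of that most-matching-characters rule; the function names and data shape here are assumptions, not the actual Module:scripts code. It counts how many characters of the string fall in each candidate script's character class and picks the highest.

-- Count how many characters of `text` fall in a script's character class.
local function countMatches(text, characters)
    local _, n = mw.ustring.gsub(text, "[" .. characters .. "]", "")
    return n
end

-- Pick the candidate script code whose pattern matches the most characters.
local function chooseScript(text, candidates, data)
    local best, bestCount
    for _, code in ipairs(candidates) do
        local n = countMatches(text, data[code].characters)
        if bestCount == nil or n > bestCount then
            best, bestCount = code, n
        end
    end
    return best
end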