User talk:Benwing2/test-th-translit
@Atitarev Here I have copied your test examples using {{m}} and, below each one, I've placed the output of the three functions that need to be overridden: `translit` (which generates the translit), `makeEntryName` (which generates the output form with links in it) and `makeDisplayText` (which generates the output form without links in it, at least I think; User:Theknightwho can you clarify what the difference between `makeEntryName` and `makeDisplayText` is?). The font is smaller than normal because this is raw output without script or language tagging. I am going to create an etym-only language 'th-new' for testing purposes (a child of 'th') that uses these three functions, so we can test it out using {{m}}, {{uxi}} and the like without disturbing existing Thai entries. User:Theknightwho you said you might be able to provide a flag to send completely unprocessed output to the translit and other functions, can you do that for 'th-new'? Benwing2 (talk) 04:27, 2 December 2023 (UTC)
- @Theknightwho I suspect what I'm trying to do in the `makeEntryName` function, which is to insert links around the space-separated terms, can't be done in the current setup, and we may need another hook function to allow this. Benwing2 (talk) 04:44, 2 December 2023 (UTC)
- @Theknightwho Perhaps it's enough to change the code that uses `makeEntryName` and adds links to it so that if the output of `makeEntryName` contains links, the linking code reparses the output for links rather than trying to add a gigantic link around everything. Benwing2 (talk) 05:50, 2 December 2023 (UTC)
- @Theknightwho Thanks for your response about `makeEntryName` and `makeDisplayText`. I need more info about where these are used; I'll take a look in the code. Did you see my comments here? Benwing2 (talk) 11:03, 2 December 2023 (UTC)
- @Theknightwho I think that `makeEntryName` and `makeDisplayText` for Thai should behave the same, essentially removing single spaces and converting double spaces to single spaces (unless one runs before the other and is the input to the other)? Whereas we need a new function `preprocessLinks` or similar, that is run by full_link() and language_link() and adds links to space-delimited Thai words. Can you help me implement this, along with disabling processing for Thai? Benwing2 (talk) 20:46, 2 December 2023 (UTC)
- One more thing: I added a translit method to 'th-new' in MOD:etymology languages/data, but it isn't working, in that it isn't getting called. You can see this by adding an error() call at the beginning of the tr() method in Module:User:Benwing2/th-scraping-translit and previewing User:Benwing2/test-th-translit; the calls at the very beginning of the file use {{m|th-new|...}} and should be transliterating the text, but the error() call doesn't get tripped. Benwing2 (talk) 22:24, 2 December 2023 (UTC)
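The space handling proposed above for Thai can be sketched as follows. This is a language-neutral illustration in Python, not the Lua code in Module:languages or Module:links; the function names `collapse_spaces` and `link_words`, the sample phrase, and the treatment of three or more spaces as a single real space are all assumptions made for the example.

```python
import re


def collapse_spaces(text: str) -> str:
    """Remove single spaces; turn a run of two or more spaces into one space.

    Mirrors the behaviour suggested for makeEntryName/makeDisplayText.
    Treating runs longer than two as one real space is an assumption.
    """
    return re.sub(r" +", lambda m: " " if len(m.group()) >= 2 else "", text)


def link_words(text: str) -> str:
    """Wrap each space-delimited word in [[...]].

    A single space is only a word boundary; a double space marks a real
    space that is kept (as one space) in the output. This is the role the
    hypothetical preprocessLinks hook would play.
    """
    phrases = re.split(r" {2,}", text)
    return " ".join(
        "".join(f"[[{word}]]" for word in phrase.split(" ") if word)
        for phrase in phrases
    )


if __name__ == "__main__":
    sample = "ผม รัก  คุณ"  # illustrative input: single and double spaces
    print(collapse_spaces(sample))
    print(link_words(sample))
```

Keeping the two steps separate reflects the point made above: the page-name and display functions only need to normalize spaces, while the linking step needs the original word boundaries.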
- @Benwing2: Thank you. It looks interesting and a lot of work already done.
- I wonder if you need to know anything about respelling rules in transliterations.
1. We already talked about ฿, which should just produce ฿.
2. Thai numerals should convert to Arabic numerals, e.g. ๕๐๐ (500), which they do already.
3. ๆ is probably the most complex symbol for transliteration purposes, but there are others.
4. In กรุงเทพฯ (grung-têep), the abbreviation symbol ฯ means something like "etcetera". I guess all undefined words should have a respelling but maybe the symbol itself should display something as well, e.g. "~"? Because กรุงเทพ (grung-têep) is a misspelling of กรุงเทพฯ (grung-têep), which has the symbol. No strong opinion yet.
5. Thai translit handles monosyllabic words with consonant clusters for letters that can actually form clusters, as in กรีน (griin). Perhaps, if a word contains a suspicious cluster, it should demand a respelling {กฺรีน} with ◌ฺ (phinthu), as in กรีน (griin). ◌ฺ is used in respellings. This happens when you try to create a new entry with controversial/ambiguous spelling using {{th-new}}. (Don't create the entry yet, let's keep it as a test case)
6. The module currently fails on some Thai punctuation symbols, for which there are no respellings, as is the case with ฿ (bàat), but this fix I made is wrong. Failing symbols are {{th-xi|๏}}, {{th-xi|๛}}, etc. (they are uncommon but perhaps they should produce nothing or some default or assigned value)
- Anatoli T. (обсудить/вклад) 05:13, 2 December 2023 (UTC)
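The numeral behaviour described in this comment (Thai digits become Arabic digits, while ฿ is passed through unchanged) is simple enough to illustrate. The following is a minimal Python sketch, not the module's Lua code; the name `convert_thai_digits` is hypothetical.

```python
# Thai digits ๐..๙ occupy the contiguous code points U+0E50..U+0E59,
# so the translation table can be built from the first digit.
THAI_DIGITS = {ord("๐") + i: str(i) for i in range(10)}


def convert_thai_digits(text: str) -> str:
    """Replace Thai digits with Arabic digits; leave everything else alone."""
    return text.translate(THAI_DIGITS)


if __name__ == "__main__":
    print(convert_thai_digits("๕๐๐"))   # 500
    print(convert_thai_digits("฿๕๐๐"))  # ฿500
```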
- @Atitarev Can you create test cases for #1, #2, #4, #6? I tried to make it handle the baht symbol ฿ correctly by excluding it from the set of symbols considered as part of Thai words, so it may already work correctly. Also, complex examples that mix Thai characters with non-Thai characters (e.g. boldface, Arabic numerals, Thai numerals) would be good if you can create them. As for #5, that can be implemented if you give me the rules for determining what counts as a "suspicious cluster". Benwing2 (talk) 05:47, 2 December 2023 (UTC)
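One way to picture "excluding ฿ from the set of symbols considered as part of Thai words" is to split the input into runs of Thai word characters and runs of everything else, so that only the former reach the transliteration step. The Python sketch below is an illustration under assumptions: the exact character ranges are a guess (they leave out ฿ U+0E3F, ๏ U+0E4F, the Thai digits and ๚/๛), not the set used in Module:User:Benwing2/th-scraping-translit, and `split_runs` is a hypothetical name.

```python
import re

# Assumed "Thai word" characters: U+0E01-U+0E3A and U+0E40-U+0E4E.
# This deliberately excludes ฿, ๏, Thai digits, ๚ and ๛.
THAI_WORD_RUN = re.compile(r"[\u0E01-\u0E3A\u0E40-\u0E4E]+")


def split_runs(text: str) -> list[tuple[bool, str]]:
    """Return (is_thai_word, substring) pairs covering the text in order."""
    runs = []
    pos = 0
    for match in THAI_WORD_RUN.finditer(text):
        if match.start() > pos:
            # Non-Thai material (markup, digits, ฿, ...) before this word.
            runs.append((False, text[pos:match.start()]))
        runs.append((True, match.group()))
        pos = match.end()
    if pos < len(text):
        runs.append((False, text[pos:]))
    return runs


if __name__ == "__main__":
    print(split_runs("ราคา ฿500"))
    print(split_runs("'''กรีน'''"))
```

With this split, boldface apostrophes, Arabic numerals and symbols such as ฿ fall into the non-Thai runs and can be copied through or handled separately.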
- @Benwing2: Hi. Done. Please take a look at User:Benwing2/test-th-translit#Some_special_cases.
- Probably no need to worry too much about my explanations on Thai consonant clusters unless you're dealing with a situation.
- Note: symbol "฿" is not working as expected. It was in the same boat as "๏", "๛" until I made a Thai entry with {{th-pron}}. Anatoli T. (обсудить/вклад) 09:13, 2 December 2023 (UTC)
- @Atitarev I fixed things so that the translit of "฿" and other symbols like "๏", "๛" should work fine. Unfortunately there's currently a bug in {{m|th-new|...}} (see above) but you can see the output of directly calling the translit() function in Module:User:Benwing2/th-scraping-translit, and it is correctly handling these symbols as well as triple apostrophes and curly quotes and such. Benwing2 (talk) 22:45, 2 December 2023 (UTC)
- @Benwing2: Hi. Please let me know if you're waiting for any input from me or want me to check/confirm any cases. You probably know yourself what is working and what is not. Otherwise, I'll leave it with you. It seems very complex. I don't know if I can help with anything. Thank you! Anatoli T. (обсудить/вклад) 00:54, 5 December 2023 (UTC)
- @Atitarev I'm just waiting on responses to my questions from User:Theknightwho, as we'll need to implement some changes in Module:links and Module:languages in order to support this. Benwing2 (talk) 01:09, 5 December 2023 (UTC)
- @Benwing2 `makeEntryName` generates the page name (e.g. it removes acute and grave accents in Russian). `makeDisplayText` generates the text used as the display form (which is unchanged for most languages, but is used for things like normalizing palochkas since it’s common for them to be entered with the wrong character). Theknightwho (talk) 06:12, 2 December 2023 (UTC)
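To make the Russian example concrete: stripping acute and grave accents to get a page name could look roughly like the Python sketch below. It is an illustration only, not the actual `makeEntryName` implementation; `strip_stress_marks` is a hypothetical name, and the approach (decompose, drop U+0301 and U+0300, recompose) is one reasonable way to do it.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent
GRAVE = "\u0300"  # combining grave accent


def strip_stress_marks(text: str) -> str:
    """Drop combining acute/grave accents while keeping letters like ё and й."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch not in (ACUTE, GRAVE))
    # Recompose so that е + diaeresis becomes ё again, и + breve becomes й.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_stress_marks("молоко\u0301"))  # молоко
```

Only the two accent characters are removed, so other diacritics that are part of the spelling survive the round trip through decomposition.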