Module talk:typing-aids/data/sa


Deva reverse transliteration module


@Erutuon Hello, hello. I thought you might enjoy a break from your normal work to do an easy side project. A while ago, while working on the long-stalled Sanskrit declension module at Module:User:JohnC5/Sandbox, I realized that doing the entire module in Devanagari was too annoying given its alphabetic properties. Also, someone (I don't recall who) asked me to make this module extensible to other scripts. This led me to the realization that I should have the internals of the module be in IAST, with transliterators and reverse-transliterators at either end. I was wondering whether you could either make me an IAST-to-Devanagari transliterator or make a module-accessible entry point to this code. The one extra requirement would be for this to ignore accents on the letters (so ápas would become अपस्). Theoretically, adding coverage for other scripts in the future would just involve creating transliterators for those scripts and plugging them in. Does this make more sense as a separate module or as functionality of this one? Thanks. --JohnC5 18:20, 30 August 2017 (UTC)
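The accent-ignoring requirement above could be met by a pre-processing step along the following lines. This is only a sketch, written in Python rather than the module's Lua so that it runs outside MediaWiki; the function name `strip_accents` and the choice of accent marks (combining acute and grave) are assumptions, not anything taken from the module.

```python
import unicodedata

# Assumption: only the combining acute and grave mark pitch accent here.
# Other diacritics (macron, dot below, ring below) are part of the letter.
ACCENT_MARKS = {"\u0301", "\u0300"}

def strip_accents(text):
    """Remove accent marks but keep the diacritics that distinguish letters."""
    # Decompose so that an accent is a separate code point even on letters
    # such as a-macron that also carry another diacritic.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in ACCENT_MARKS)
    # Recompose so that downstream tables can keep using precomposed letters.
    return unicodedata.normalize("NFC", kept)

print(strip_accents("\u00e1pas"))  # the accented example from the request
```

The result would then be handed to whatever IAST-to-Devanagari conversion is in use.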

@Erutuon: Never mind. I made a series of functions to convert between Devanagari, IAST, and SLP1 (you can see testcases here). SLP1 will be the easiest for writing the declension module because of its one-to-one character relationship. --JohnC5 07:10, 1 September 2017 (UTC)

Prakrit Support


@Svartava, Kutchkutch There were problems in the Latn-to-Deva conversion that affected short vowels and retroflex laterals in inflection tables and PoS headers. I haven't investigated the question of short vowels in Prakrit in the Kannada script. I haven't found much evidence beyond the fact that the Kannada script was using the long vowel symbols for Sanskrit /e/ and /o/ by the end of the 19th century, so I haven't tested its handling of short vowels. --RichardW57 (talk) 22:58, 6 June 2022 (UTC)

The inflection code used, and still uses, the transliteration data tables.

The most striking difference between Module:typing-aids/data/sa and Module:typing-aids/data/inc-pra-Deva is that the former converts ĕ and ŏ to ए and ओ whereas the latter converts them to ऎ and ऒ. I propose we keep it that way. The code in Module:typing-aids/data/sa needs to be modified to work out whether 'l' with underdot is a consonant or a vowel. I'm pretty sure that as a vowel, it does not occur next to a vowel except for a very small chance after 'a' (MW documented a case of initial ar̥-), and as a consonant ḷ occurs in only one cluster, ḷh. That should be enough to cope with real words. I've replaced at least one usage of the 'inc-pra-Deva' code in the headword template family by 'sa', improving a headword link as a result.

My changes in this area were started on 6 June, and have been done under user names RichardW57 and RichardW57m.

If the disambiguation logic fails, the inflection module, when outputting Devanagari, can first do a global edit to replace 'ḷ' by 'L', and Module:typing-aids/data/sa can be made to recognise that as the consonant. The headword templates contain an override parameter |deva=, so the system can still work if I can't produce an adequate resolution algorithm. --RichardW57 (talk) 22:58, 6 June 2022 (UTC)
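The heuristic proposed above (ḷ is a consonant when a vowel follows, either directly or after h, and a vowel otherwise) can be sketched as follows. This is an illustration in Python, not the module's Lua code; the function name `mark_consonant_l` and the reduced vowel list are assumptions, and 'L' is used as the explicit consonant marker, as suggested for the fallback.

```python
import re

L_UNDERDOT = "\u1e37"  # l with dot below: consonant or vowel, depending on context

# Assumed, simplified vowel set: a, a-macron, i, i-macron, u, u-macron,
# e, e-macron, o, o-macron (single code points only).
VOWELS = "a\u0101i\u012bu\u016be\u0113o\u014d"

# l-underdot is taken to be the consonant when followed by a vowel,
# or by h plus a vowel (the one cluster said to occur).
CONSONANT_L = re.compile(L_UNDERDOT + "(?=h?[" + VOWELS + "])")

def mark_consonant_l(text):
    """Rewrite consonantal l-underdot as 'L'; leave vocalic l-underdot alone."""
    return CONSONANT_L.sub("L", text)

print(mark_consonant_l("\u012b\u1e37e"))   # consonant between vowels
print(mark_consonant_l("k\u1e37pta"))      # vowel before a consonant: unchanged
```

A converter downstream would then map 'L' to the consonant letter and any remaining ḷ to the vowel letter or vowel mark.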

@Svartava, Kutchkutch I have implemented resolution logic. How are the module's testcases progressing? --RichardW57 (talk) 11:38, 7 June 2022 (UTC)
@RichardW57, RichardW57m: By module's testcases, are you referring to Module:typing-aids/testcases? Kutchkutch (talk) 01:16, 8 June 2022 (UTC)
@Kutchkutch, Svartava: I managed not to see that. I am not sure how to reference those tests clearly. Write a testcase module that incorporates them? I've done something similar at Module:pi-decl/noun/Brah/testcases. Alternatively, mention them in the documentation, hoping that the experienced reader doesn't see the red-linked testcases and read no further? --RichardW57m (talk) 09:35, 8 June 2022 (UTC)
I think I've broken the intra-Latin conversion, but the test cases haven't picked it up. --RichardW57m (talk) 09:35, 8 June 2022 (UTC)

Multiple-code-point vowels


@RichardW57: I was looking at local Lvowels = "āeĕēiïīoŏōuŭuṛṝl̥̄l̥ḹ", which you introduced here and it contains two grapheme clusters that have multiple code points:

  • U+006C LATIN SMALL LETTER L
  • U+0325 COMBINING RING BELOW
  • U+0304 COMBINING MACRON

and

  • U+006C LATIN SMALL LETTER L
  • U+0325 COMBINING RING BELOW

mw.ustring patterns operate on code points, not grapheme clusters, so character classes built from Lvowels don't work as intended. For instance, local Lvowel1 = "[" .. Lvowels .. "]" and local Lvowel2 = "[" .. Lvowels .. "a]" will match (among other things) a bare small letter l, a combining ring below, or a combining macron. {{subst:chars|m|sa|ḷhl}} resolves to ळ्ह्ल् (ḷhl) because of {"(ḷ)(h?"..Lvowel2..")", "L%2"}, even though the last letter isn't a vowel. That's a contrived example that might never occur in real inputs, but it illustrates what's going on. You may have a better idea of whether Lvowel1 and Lvowel2 will match where they shouldn't, or fail to match where they should, and so cause real inputs to the function to give the wrong outputs. -- Eru·tuon 17:54, 19 September 2023 (UTC)
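The effect can be reproduced outside MediaWiki. The sketch below uses Python's `re` module, whose character classes on `str` are also code-point based, as a stand-in for mw.ustring; the shortened vowel list and the variable names are illustrative assumptions, and the alternation shown is just one possible way to match whole clusters, not the fix adopted in the module.

```python
import re

RING_BELOW = "\u0325"
MACRON = "\u0304"
L_UNDERDOT = "\u1e37"

# Shortened, hypothetical vowel list: a-macron, r-underdot, and the two
# multi-code-point clusters l+ring+macron and l+ring.
vowels = "\u0101\u1e5b" + "l" + RING_BELOW + MACRON + "l" + RING_BELOW

# Naive class: every code point becomes a member, including bare "l"
# and the two combining marks.
naive = "[" + vowels + "a]"
rule_naive = re.compile("(" + L_UNDERDOT + ")(h?" + naive + ")")

# Alternative: spell the clusters out as alternatives, longest first, and
# keep only genuinely single-code-point vowels in the class.
safe = "(?:l" + RING_BELOW + MACRON + "|l" + RING_BELOW + "|[\u0101\u1e5ba])"
rule_safe = re.compile("(" + L_UNDERDOT + ")(h?" + safe + ")")

word = L_UNDERDOT + "hl"                      # the contrived input from above
naive_result = rule_naive.sub(r"L\2", word)   # rewritten, because "l" is in the class
safe_result = rule_safe.sub(r"L\2", word)     # unchanged, because bare "l" is no vowel
print(naive_result, safe_result)
```

With the naive class, the bare l at the end is accepted as a "vowel" and the rule fires; with the alternation, it fires only when a real vowel or a complete cluster follows.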

@Erutuon: Well spotted! I think this bug won't happen in real life, but that was not and should not be part of the design. My solution should even speed the code up. My plan, which I am in no great hurry to implement, is:
1. In Lvowels, replace the syllabic vowels written with a ring below by the 4 Devanagari vowel letters for syllabic consonants (ऋ ॠ ऌ ॡ).
2. As the first substitution, replace all four syllabic vowels written with a ring below by those Devanagari vowel letters. That also reduces 3 gsub calls (it ought to be 4, but r̥̄ is currently not handled) to one. For timing, I should also check doing short and long vowels separately.
@Kutchkutch: We should also add some test cases for this resolution.
I might even reduce the conversion of vowel letters to vowel marks ('diacritics') to two substitutions while I'm at it, now I've seen the testcases. --RichardW57 (talk) 20:30, 19 September 2023 (UTC)
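The plan above can be sketched as a single substitution pass. This is a Python illustration under stated assumptions, not the eventual Lua implementation: the names `RING_VOWELS` and `replace_ring_vowels` are hypothetical, and the input is assumed to have the ring below before the macron, as in the code-point listing earlier in this thread.

```python
import re

RING_BELOW = "\u0325"
MACRON = "\u0304"

# Hypothetical table: ring-below syllabic vowels to Devanagari vowel letters.
RING_VOWELS = {
    "r" + RING_BELOW: "\u090b",           # vocalic r
    "r" + RING_BELOW + MACRON: "\u0960",  # vocalic rr
    "l" + RING_BELOW: "\u090c",           # vocalic l
    "l" + RING_BELOW + MACRON: "\u0961",  # vocalic ll
}

# One pattern covers all four; the optional macron is matched greedily,
# so the long vowels are never split into short vowel plus stray macron.
RING_VOWEL = re.compile("[rl]" + RING_BELOW + MACRON + "?")

def replace_ring_vowels(text):
    """Replace all four ring-below vowels in one substitution pass."""
    return RING_VOWEL.sub(lambda m: RING_VOWELS[m.group(0)], text)

print(replace_ring_vowels("kl" + RING_BELOW + "pta"))
```

Because each replacement is a single code point, a character class that lists these four Devanagari letters instead of the ring-below clusters would match only complete vowels.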