Module talk:sa-convert
@AryamanA Hello. I tried to take what you did with Pali and create a converter for Devanagari to all the various scripts for Sanskrit. Started with Gujarati, but I can't get the module to work exactly how I'd like it. Check my contributions to see what I've made changes to. DerekWinters (talk) 08:31, 13 May 2018 (UTC)
- Also, @Wyang if you can make any sense of my mess. DerekWinters (talk) 08:33, 13 May 2018 (UTC)
- @DerekWinters: Should be working now. The Pali transcription was not my idea or creation btw, it was made by @Octahedron80, who might be able to help out here as well. —AryamanA (मुझसे बात करें • योगदान) 13:39, 13 May 2018 (UTC)
- @AryamanA: Oh, whoops. Thanks though! DerekWinters (talk) 13:39, 13 May 2018 (UTC)
- @AryamanA What did you do with the vowel diacritics to get them to adjoin properly? DerekWinters (talk) 14:27, 13 May 2018 (UTC)
- @DerekWinters: I used the Gujarati equivalents from AP:Unicode/Gujarati. Same for virama, candrabindu, visarga, and anusvara. —AryamanA (मुझसे बात करें • योगदान) 14:36, 13 May 2018 (UTC)
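Taking each sign's counterpart from the Gujarati chart works because the Gujarati block (U+0A80 to U+0AFF) is laid out in parallel with the Devanagari block (U+0900 to U+097F). A minimal Python sketch of that idea follows. It is an illustration only: the function name `deva_to_gujr` is hypothetical, it uses a fixed code point offset rather than the explicit tables of the Lua module, and it does not check whether the shifted code point is actually assigned in Gujarati.

```python
DEVA_START, DEVA_END = 0x0900, 0x097F
GUJR_OFFSET = 0x0A80 - 0x0900  # the two blocks share the same internal layout


def deva_to_gujr(text: str) -> str:
    """Shift every Devanagari code point into the Gujarati block (sketch)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVA_START <= cp <= DEVA_END:
            # Letters, vowel signs, virama, candrabindu, anusvara and visarga
            # all sit at the same relative position in both blocks.
            out.append(chr(cp + GUJR_OFFSET))
        else:
            out.append(ch)  # spaces, punctuation, etc. pass through
    return "".join(out)


print(deva_to_gujr("\u092e\u0947\u0918"))  # Devanagari MA + sign E + GHA
```

Because the vowel sign is shifted together with its consonant, the output contains a genuine Gujarati combining sign, which is what lets it attach to the preceding letter.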
@Wyang Hello! Please help if you can. Adding {{sa-alt}} to any Sanskrit page (currently I'm trying मेघ (megha)) creates an error saying:
- Lua error: attempt to index field '?' (a nil value)
- Backtrace:
- [C]: in function "v"
- mw.ustring.lua:84: in function "gsub"
- Module:sa-convert:126: in function "translit"
- Module:sa-headword:31: in function "chunk"
- mw.lua:511: ?
- [C]: ?
I'm not yet good enough to figure out how to fix the gsub issue. DerekWinters (talk) 21:15, 13 May 2018 (UTC)
- Test: Alternative scripts
- মেঘ (Assamese script)
- ᬫᬾᬖ (Balinese script)
- মেঘ (Bengali script)
- 𑰦𑰸𑰑 (Bhaiksuki script)
- 𑀫𑁂𑀖 (Brahmi script)
- မေဃ (Burmese script)
- मेघ (Devanagari script)
- મેઘ (Gujarati script)
- ਮੇਘ (Gurmukhi script)
- 𑌮𑍇𑌘 (Grantha script)
- ꦩꦺꦓ (Javanese script)
- 𑂧𑂵𑂐 (Kaithi script)
- ಮೇಘ (Kannada script)
- មេឃ (Khmer script)
- ເມຆ (Lao script)
- മേഘ (Malayalam script)
- ᠮᡝᢚᠠ (Manchu script)
- 𑘦𑘹𑘑 (Modi script)
- ᠮᠧᠺᠾᠠ᠋ (Mongolian script)
- 𑧆𑧚𑦱 (Nandinagari script)
- 𑐩𑐾𑐑 (Newa script)
- ମେଘ (Odia script)
- ꢪꢾꢕ (Saurashtra script)
- 𑆩𑆼𑆔 (Sharada script)
- 𑖦𑖸𑖑 (Siddham script)
- මෙඝ (Sinhalese script)
- 𑩴𑩔𑩟 (Soyombo script)
- 𑚢𑚲𑚍 (Takri script)
- மேக⁴ (Tamil script)
- మేఘ (Telugu script)
- เมฆ (Thai script)
- མེ་གྷ (Tibetan script)
- 𑒧𑒹𑒒 (Tirhuta script)
- 𑨢𑨄𑨎 (Zanabazar Square script)
- @DerekWinters Good to see this being created, it seems to be quite an ambitious (but worthwhile) project. :) I've made the errors and currently unsupported languages go away. Some of the codes in sa-convert are not listed as a script for Sanskrit at the moment, hence undisplayed. By the way, I wrote the first version of Module:pi-Latn-translit... but this module is using a different type of input and logic. It would be good to have the reverse function of this module in place in Module:sa-translit (or elsewhere), to handle the Assamese-script Sanskrit links in Assamese etymologies, currently manually linked to their Devanagari equivalents. Wyang (talk) 09:31, 14 May 2018 (UTC)
- @Wyang: Thank you! I'm very excited for this to come to fruition. @AryamanA, could you add some of these (at least Java, Bali, Saur, and Tirh) to Sanskrit? Also, because Assamese uses a different subset than Bengali, how would we handle them being separate (particularly for Assamese vs Bengali script Sanskrit links in etymologies)? DerekWinters (talk) 15:18, 14 May 2018 (UTC)
- Also, I'm not sure how to handle the reverse function, especially because not all Devanagari letters have equivalents in other scripts (most frequently, ळ and ऽ (avagraha) are missing). DerekWinters (talk) 15:56, 14 May 2018 (UTC)
- @DerekWinters: Yep, added the script codes to MOD:languages/data2. I had the same problem while adding Gurmukhi transliteration; I think Devanagari to other scripts is good enough without reverse translit for now, since Devanagari is our standardized primary script for Sanskrit. Also, I'm not sure about linking to other script forms, is it necessary (or useful) to make entries for such a wide variety of script variants? —AryamanA (मुझसे बात करें • योगदान) 20:27, 15 May 2018 (UTC)
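The "attempt to index field '?'" error reported earlier in this thread is consistent with a lookup table being absent for one of the requested scripts, and the remark about ळ and ऽ is the per-character version of the same problem. The Python sketch below shows one defensive pattern under those assumptions; the `TABLES` data and the `convert` function are invented for illustration and are not the module's actual Lua code.

```python
from typing import Optional

# Hypothetical, deliberately tiny conversion tables keyed by script code.
TABLES = {
    "Gujr": {
        "\u092e": "\u0aae",  # MA
        "\u0947": "\u0ac7",  # VOWEL SIGN E
        "\u0918": "\u0a98",  # GHA
    },
}


def convert(text: str, script: str) -> Optional[str]:
    """Convert Devanagari text, tolerating missing scripts and letters."""
    table = TABLES.get(script)
    if table is None:
        return None  # unsupported script: the caller can simply skip it
    # dict.get with a default leaves characters without an equivalent as-is
    return "".join(table.get(ch, ch) for ch in text)
```

A caller that builds the list of alternative scripts can then drop every script for which `convert` returns `None` instead of raising an error for the whole page.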
Burmese section
@Hintha Hello! Could you help out here with the Burmese section? Basically, we want to convert Sanskrit text written in Devanagari to Burmese script (like what you'd find in Sanskrit Buddhist material in Myanmar). I'm not sure exactly how it would work though. DerekWinters (talk) 00:08, 23 May 2018 (UTC)
- Unfortunately, my familiarity with Sanskrit is quite limited. Unlike Pali, which is commonly used in Burmese text, Sanskrit spellings are quite obscure and rare. I would assume that there would be a fair amount of overlap with Pali, but I'm not familiar with how to map the following Sanskrit-specific letters because we do not encounter them in day-to-day life:
- 1050 ၐ MYANMAR LETTER SHA
- 1051 ၑ MYANMAR LETTER SSA
- 1052 ၒ MYANMAR LETTER VOCALIC R
- 1053 ၓ MYANMAR LETTER VOCALIC RR
- 1054 ၔ MYANMAR LETTER VOCALIC L
- 1055 ၕ MYANMAR LETTER VOCALIC LL
- 1056 $ၖ MYANMAR VOWEL SIGN VOCALIC R
- 1057 $ၗ MYANMAR VOWEL SIGN VOCALIC RR
- 1058 $ၘ MYANMAR VOWEL SIGN VOCALIC L
- 1059 $ၙ MYANMAR VOWEL SIGN VOCALIC LL
- --Hintha (talk) 03:51, 23 May 2018 (UTC)
- @Hintha: These ones are in fact very easy to map! I'll start it and I'll ping you when it's ready, so that you can verify that it's correct. DerekWinters (talk) 02:50, 27 May 2018 (UTC)
- @Hintha So I finished, but there are many issues still. @Wyang Can you help with the Burmese issues (some are referenced here: 1)? DerekWinters (talk) 03:32, 27 May 2018 (UTC)
- @DerekWinters It should be okay now. It requires special font to display - the Burmese1 font renders it like this, which looks to be correct, if one ignores the unsupported -ry- cluster and the varying font sizes... Wyang (talk) 04:50, 27 May 2018 (UTC)
- @Wyang: Thanks! That sounds good! Can you take a look at the Javanese and Balinese issues (listed on the documentation page)? The issue stems mostly from the lack of spaces in the two. DerekWinters (talk) 16:44, 27 May 2018 (UTC)
- @DerekWinters Most if not all should be fixed. I added the module outputs in front of the slashes for a better comparison. Could you please check, specifically: (1) there should be no space in the output at all in the two; (2) Javanese for पुत्र and पुत् र are correct - the doc page says the expected output should be ᬧᬸᬢ᭄ᬭ with adeg adeg virama ᭄ + ra ᬭ, without cakra ꦿ. Thanks! Wyang (talk) 08:09, 28 May 2018 (UTC)
- @Wyang: Thank you, you're truly a godsend. I'll check for the two. DerekWinters (talk) 19:38, 31 May 2018 (UTC)
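For the Sanskrit-specific Myanmar letters Hintha lists earlier in this section, one straightforward reading is to pair each with the Devanagari character of the same Unicode name (SHA with SHA, VOWEL SIGN VOCALIC R with VOWEL SIGN VOCALIC R, and so on). The Python sketch below assumes exactly that pairing; it has not been checked against attested Burmese-script Sanskrit, and the function name is hypothetical.

```python
# Devanagari code point -> Myanmar code point, paired by Unicode character name.
SANSKRIT_ONLY = {
    0x0936: 0x1050,  # LETTER SHA
    0x0937: 0x1051,  # LETTER SSA
    0x090B: 0x1052,  # LETTER VOCALIC R
    0x0960: 0x1053,  # LETTER VOCALIC RR
    0x090C: 0x1054,  # LETTER VOCALIC L
    0x0961: 0x1055,  # LETTER VOCALIC LL
    0x0943: 0x1056,  # VOWEL SIGN VOCALIC R
    0x0944: 0x1057,  # VOWEL SIGN VOCALIC RR
    0x0962: 0x1058,  # VOWEL SIGN VOCALIC L
    0x0963: 0x1059,  # VOWEL SIGN VOCALIC LL
}


def map_sanskrit_letters(text: str) -> str:
    """Replace only the ten Sanskrit-specific characters; leave the rest."""
    return text.translate(SANSKRIT_ONLY)
```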
Additional nuance re: ာ vs. ါ
- @DerekWinters Great work on supporting Burmese! Just wanted to note a nuance in transcribing <ā> (ာ) in Burmese. When combined with the initial consonants <kh> <g> <ng> <d> <p> <v> in Burmese, ာ takes the form of ါ to disambiguate from otherwise similar-looking letters (i.e., ခါ ဂါ ငါ ဒါ ပါ ဝါ instead of ခာ ဂာ ငာ ဒာ ပာ ဝာ which respectively resemble the Burmese letters ဆ <ch> က <k> က <k> အ <a> ဟ <h> တ <t>). Just came across this while reviewing the Burmese output for वाराणसी, which yields ဝာရာဏသီ instead of ဝါရာဏသီ, which is the correct spelling. Additional reference here: [1] Thanks! -Hintha (talk) 02:53, 28 August 2018 (UTC)
င်္ for <-ṅ->
- @DerekWinters Just wanted to flag another note to your attention. The kinzi (င်္) is used to denote an intervening <ṅ> when immediately preceding a letter from the <k> group in Burmese (က ခ ဂ ဃ င). For instance, कलिङ्ग would be spelled ကလိင်္ဂ, not ကလိင္ဂ. --Hintha (talk) 03:12, 28 August 2018 (UTC)
- This applies to all -ṅX- conjunctions, not only the k group (if it happens, it'll be a rare case). This also happens in Tai Tham and Khmer too; there are extra glyphs for the case. --Octahedron80 (talk) 00:08, 11 April 2021 (UTC)
- I think the Khmer cases are just uses of anusvara for homorganic nasal. RichardW57 (talk) 08:19, 11 April 2021 (UTC)
- Sorry, I was confused. --Octahedron80 (talk) 00:17, 12 April 2021 (UTC)
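The kinzi rule in this section can be sketched as a single string replacement: wherever NGA is followed directly by the stacking VIRAMA, insert ASAT between them, giving the sequence U+1004 U+103A U+1039. The Python below is a sketch under that assumption, following the reply that the rule applies to every -ṅX- cluster; the function name is hypothetical.

```python
NGA, ASAT, VIRAMA = "\u1004", "\u103A", "\u1039"


def apply_kinzi(text: str) -> str:
    """Turn a stacked NGA (NGA + VIRAMA) into kinzi (NGA + ASAT + VIRAMA)."""
    # An existing kinzi has ASAT right after NGA, so it never matches again
    # and the function is safe to run more than once.
    return text.replace(NGA + VIRAMA, NGA + ASAT + VIRAMA)


kalinga = "\u1000\u101c\u102d\u1004\u1039\u1002"  # ka, la, sign I, nga, virama, ga
print(apply_kinzi(kalinga))
```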
Burmese Visible Virama
Visible virama should not be U+1039 VIRAMA. The glyph that appears for it does not occur in finished Burmese script text; rather, it is a reminder that something is missing, borrowed from the encoding of Khmer, and now largely also borrowed for Tai Tham. It's become the conventional reminder of the presence of an unquenched invisible stacker. I believe the correct encoding is U+103A ASAT, but one should find some Burmese script text to check. Thus, at the end of a phrase, न् should correspond not to န္ but to န်.
The genealogy of Christ at https://www.bible.com/bible/2158/MAT.1.SANBU shows plenty of final consonants. I don't know how happy you are to trust that sample. --RichardW57m (talk) 14:47, 5 December 2019 (UTC)
- This is easy to fix. I have done that final VIRAMA -> ASAT. The U+1039 is a functional character for stacking that should not be left as a plus sign anywhere. Same as Tai Tham U+1A60 and Khmer U+17D2, they must be replaced with KARAN(or else) and VIRIAM at the end of term. --Octahedron80 (talk) 03:46, 10 April 2021 (UTC)
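The final VIRAMA to ASAT change described here can be sketched as follows: a VIRAMA that is still followed by a consonant is doing its stacking job and stays, while any other VIRAMA becomes a visible ASAT. This Python sketch is not the module's code; the set of characters treated as stackable consonants (U+1000 to U+1021 plus SHA and SSA) is an assumption made for illustration.

```python
import re

VIRAMA, ASAT = "\u1039", "\u103A"

# Assumed set of letters that may be stacked under a VIRAMA:
# the basic consonants plus the Sanskrit letters SHA and SSA.
STACKABLE = "\u1000-\u1021\u1050\u1051"

# A VIRAMA not followed by a stackable letter (end of word, space, punctuation).
_LOOSE_VIRAMA = re.compile(f"{VIRAMA}(?![{STACKABLE}])")


def final_virama_to_asat(text: str) -> str:
    """Replace every non-stacking VIRAMA with a visible ASAT."""
    return _LOOSE_VIRAMA.sub(ASAT, text)
```

Using a negative lookahead rather than an end-of-string test means word-final consonants inside a longer phrase are handled as well.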
Burmese script error
@Benwing2 At ဂင်္ဂါ (gaṅgā), {{sa-alt}} is not generating the correct Burmese script form (the page name), nor is it adding the label "(Burmese script)". —Mahāgaja · talk 10:11, 8 April 2021 (UTC)
- @Mahagaja It looks like the existing code is written such that it skips the entry entirely for the script corresponding to the pagename if the "transliteration into that script" is the same as the pagename; if it's not, it displays that transliteration but not the label. I don't know why it does that; it looks like User:DerekWinters wrote the code, and he's no longer active. See ઇતિહાસ for an example of a page where the script of the pagename (Gujarati) gets skipped. I also don't know exactly why the "transliteration" from Burmese script to itself gives a different result than the pagename, but it has something to do with the post_replace_fix() entries for Burmese script in Module:sa-convert (lines 710-718). What is the correct behavior? Presumably it should always skip the script of the pagename? Benwing2 (talk) 03:39, 9 April 2021 (UTC)
- Yeah, the script of the page name should be skipped; at least, that's what {{pi-alt}} does. Apparently {{pi-alt}} and {{sa-alt}} generate multiple possibilities for Burmese script, so in those cases only the actual page name is omitted, and the other Burmese-script options are shown, with the label included; see Pali ဂင်္ဂါ (gaṅgā), where the alternative forms box shows the Shan variants ၷင်္ၷႃ (gaṅgā) and ၷင်ၷႃ (gaṅgā) and labels them Burmese. But the Sanskrit module (only, not the Pali module) generates incorrect Burmese-script forms, namely ဂင်္ဂာ (gaṅgā), which (AFAIK) is simply wrong and should be ဂင်္ဂါ (gaṅgā). Maybe the editors of Module:pi-headword (@AryamanA, Octahedron80, RichardW57) can help make Module:sa-headword behave the same way as the Pali module does. —Mahāgaja · talk 06:42, 9 April 2021 (UTC)
- Sanskrit is more complex than Pali, especially with more clusters. I think solving Burmese script with Pali logic won't be sufficient. I had tried something earlier but failed. (They still aren't fixed.) Oh, Tai Tham is going to be hell too. --Octahedron80 (talk) 06:50, 9 April 2021 (UTC)
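The skipping behaviour discussed above (omit only the form identical to the page name, keep other forms in the same script together with their label) can be sketched in Python as below. The function name and the `(label, form)` data shape are hypothetical, and the sketch assumes the generated form and the page name are encoded identically when they are the same word.

```python
from typing import List, Tuple


def alt_forms(pagename: str, candidates: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (script label, form) pairs, dropping only the page's own form."""
    return [(label, form) for label, form in candidates if form != pagename]


page = "ဂင်္ဂါ"
shown = alt_forms(page, [
    ("Burmese", page),        # identical to the page name: omitted
    ("Burmese", "ၷင်္ၷႃ"),    # other Burmese-script variant: kept, with label
    ("Devanagari", "गङ्गा"),
])
print(shown)
```

Filtering on the form rather than on the script is what keeps the label for the remaining variants.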
- I suggest you collect Tai Tham evidence before writing the transliteration into the script. I get the impression the Sanskrit corpus amounts to a single slab. I suspect that, as in Grantha, complex consonant clusters would be split into multiple stacks. I don't see your implementation of such rules for Grantha. When writing in Grantha, the splitting is a job for the editor, not the font. RichardW57 (talk) 15:51, 9 April 2021 (UTC)
- You really need a set of test cases. This will be particularly helpful with scripts where practice varies. (Where does the vowel go in -dhve in the Thai script?) I'm told that different nations using the Burmese script have different rules for choosing between round aa and tall aa, and that amongst the Burmese, Buddhists and Christians have different rules for the Burmese language. I'll have a go at porting the current Pali rules, but I'll assume that an ascending subscript prevents tall aa but that a mark above does not. (Marks above do discourage tall aa in Northern Thai.) I insist that my code be allowed to be readable in the Wikimedia web editor. RichardW57 (talk) 15:51, 9 April 2021 (UTC)
- @Mahagaja, Benwing2, AryamanA, Octahedron80: I've got tall AA being generated now. I'll now have a go at setting up some testcases for regression testing. RichardW57 (talk) 21:28, 9 April 2021 (UTC)
- @Mahagaja, Benwing2, AryamanA, Octahedron80: Medial RA does inhibit TALL AA in the Burmese Sanskrit I can find. I've fixed that just now, and also the generation of 'TALL O' and 'TALL AU'. RichardW57 (talk) 19:57, 11 April 2021 (UTC)
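A rough Python sketch of the tall AA behaviour described in these updates, using the consonant list Hintha gives on this page (kha, ga, nga, da, pa, wa): round AA becomes tall AA when it follows one of those consonants directly or with only vowel sign E in between (which covers O and AU). This is not RichardW57's implementation. Because the pattern requires adjacency, any medial sign between consonant and vowel, medial RA included, leaves round AA in place; treating the other medials the same way is an assumption of the sketch.

```python
import re

# Consonants listed as taking tall AA: kha, ga, nga, da, pa, wa.
TALL_AA_BASES = "\u1001\u1002\u1004\u1012\u1015\u101D"
SIGN_E, ROUND_AA, TALL_AA = "\u1031", "\u102C", "\u102B"

# Base consonant, optional vowel sign E, then round AA.
_TALL = re.compile(f"([{TALL_AA_BASES}]{SIGN_E}?){ROUND_AA}")


def fix_tall_aa(text: str) -> str:
    """Swap round AA for tall AA after the listed consonants."""
    return _TALL.sub(lambda m: m.group(1) + TALL_AA, text)


print(fix_tall_aa("\u101d\u102c\u101b\u102c\u100f\u101e\u102e"))  # varanasi
```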
- What's your policy on obligatorily round subscript WA? A lot of fonts refuse to support it. According to Michael Everson, it is encoded <VIRAMA, WA>, but Martin Hosken doesn't allow it in his encoding standard. In the Burmese script, do you want -gvr- to be practically unrenderable (ဂွြ) or merged with -grv- (ဂြွ)? RichardW57 (talk) 15:51, 9 April 2021 (UTC)
- It is better to render it as the latter in any case. You can test conversion at https://aksharamukha.appspot.com/converter (this works for every script and variation).
- kyra krya -> same ကျြ
- kvra krva -> same ကြွ
- kyva kvya -> same ကျွ
- For the example sentence, it results as:
- ဥုံ တျြမ္ဗကံ ယဇာမဟေ သုဂန္ဓိံ ပုၑ္ဋိဝရ်္ဓနမ် ၊ ဥရ်္ဝါရုကမိဝ ဗန္ဓနာန် မၖတျောရ် မုက္ၑီယ မာ'မၖတာတ် ။
- ကး ခဂေါ်ဃာငစိစ္ဆော်ဇာ ဈာဉ္ဇ္ဉော'ဋော်ဌီဍဍဏ္ဎဏး၊ တထောဒဓီန် ပဖရ်္ဗာဘီရ်္မယော'ရိလွာၐိၑာံ သဟး။
- (It also has Mon and Shan too.) I think this GitHub project is useful in its logic. Could someone break the code? --Octahedron80 (talk) 04:27, 10 April 2021 (UTC)
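The three "same" results listed in this reply amount to putting the medial signs into one fixed order regardless of the order of the Sanskrit consonants. The Myanmar medials YA, RA, WA and HA are encoded at U+103B to U+103E in that order, so sorting each run of medials by code point gives that result. The Python sketch below assumes an earlier step has already turned subscript y, r, v and h into medial signs; the function name is hypothetical.

```python
import re

# One or more consecutive medial signs (YA, RA, WA, HA).
_MEDIAL_RUN = re.compile("[\u103B-\u103E]+")


def order_medials(text: str) -> str:
    """Sort each run of medial signs into the order YA, RA, WA, HA."""
    return _MEDIAL_RUN.sub(lambda m: "".join(sorted(m.group())), text)


# kra+ya and kya+ra both end up as KA, MEDIAL YA, MEDIAL RA.
print(order_medials("\u1000\u103c\u103b") == order_medials("\u1000\u103b\u103c"))
```

The cost of this choice, as noted in the question being answered, is that clusters such as -gvr- and -grv- are no longer distinguished in the output.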
Sinhala Script Issues
@AleksiB 1945: Do you have a font for viewing your results? (What's standardly available on Windows 10 is nowhere near good enough.)
As far as I am aware, Sanskrit should use conjuncts where possible, so virama should be mapped to the sequence <virama, ZWJ> for these. However, it seems that it falls back to touching letters, for which the joiner is <ZWJ, virama>. The existing testcases need to be checked. There is a specimen of Sinhala Sanskrit at Unicode proposal L2/18-0602, but it seems to be mixed in with Sinhalese. I deal with the issue at Module:pi-Latn-translit for Pali, but Pali has only a few conjuncts. The module testcases do not pick these up; instead they fall back to testing that Devanagari and alleged Sinhala transliterate to the same Latin, a generally necessary but insufficient condition for correctness. (There may be issues with the script/region's writing of homorganic nasals - there was a time when European and Indian printers used different rules! Alternatively, we may be able to just make the same choice as the Devanagari, though it isn't necessarily the case that both are right.)
You are mishandling the vowels /e/ and /o/, and the module testcases pick these errors up. Sanskrit uses the same characters as the Sinhalese short vowels; so far as I am aware, al-lakuna should only appear in words ending in consonants. --RichardW57 (talk) 03:15, 4 October 2022 (UTC)
- (Notifying AryamanA, Bhagadatta, Svartava, JohnC5, Kutchkutch, Inqilābī, Getsnoopy, Rishabhbhat, Dragonoid76): I've now changed the Sinhala conversion for /e/ and /o/ to use the same convention as Pali, and for consonant clusters I've changed to using the rules in Module:sa-utilities/translit/post replace fix/Sinh, (almost) the same data file as used by Module:sa-utilities/translit/SLP1-to-Sinh. I'm still looking for strong evidence for the Sinhala handling of nasals + voiced stops. I have weak evidence that the same glyph is used as for the Sinhalese prenasalised consonants, and I use it in Module:sa-utilities/translit/SLP1-to-Sinh/testcases. --RichardW57 (talk) 23:32, 20 December 2023 (UTC)
- Corrected module name. --RichardW57 (talk) 08:18, 21 December 2023 (UTC)
Fortunately, candrabindu is a minority taste in Sri Lankan Sinhala-script Sanskrit. I couldn't get a straight answer from Unicode on how the character was to be placed in encoded phrases. --RichardW57 (talk) 03:15, 4 October 2022 (UTC)
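The two joiner sequences mentioned in this section, <virama, ZWJ> for a conjunct and <ZWJ, virama> for touching letters, can be illustrated with the Python sketch below. It applies one choice to every consonant cluster, which is a simplification: the discussion indicates that the module decides per cluster from a data file. The consonant range U+0D9A to U+0DC6 and the function names are assumptions made for illustration.

```python
import re

AL_LAKUNA, ZWJ = "\u0DCA", "\u200D"
CONSONANT = "[\u0D9A-\u0DC6]"  # assumed range for Sinhala consonant letters

# An al-lakuna sitting between two consonants, i.e. inside a cluster.
# A word-final al-lakuna is not followed by a consonant and is left alone.
_CLUSTER = re.compile(f"({CONSONANT}){AL_LAKUNA}(?={CONSONANT})")


def as_conjunct(text: str) -> str:
    """Request conjunct forms: consonant, al-lakuna, ZWJ, consonant."""
    return _CLUSTER.sub(lambda m: m.group(1) + AL_LAKUNA + ZWJ, text)


def as_touching(text: str) -> str:
    """Request touching letters: consonant, ZWJ, al-lakuna, consonant."""
    return _CLUSTER.sub(lambda m: m.group(1) + ZWJ + AL_LAKUNA, text)
```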
Tamil
The Tamil script can't be, and never was, used to write Sanskrit; historically Tamils used the Grantha script and now they use Devanagari. AleksiB 1945 (talk) 14:27, 21 December 2022 (UTC)
- Wrong; the Tamil script has been and still is used to write Sanskrit. Various systems are used to distinguish the phonations of the plosives, including the use of small Western Arabic digits. Another system is to use different fonts. Finally, the writing may simply be ambiguous, and that is where documenting Tamil script usage would be most useful. (Of course, you may agree that Wiktionary should be useful.) RichardW57 (talk) 07:46, 22 December 2022 (UTC)
- When were the Hindu-Arabic numerals introduced to South Asia? Historically the Grantha script was the extension that was used to write Sanskrit; now both scripts are distinguished and Devanagari is used predominantly in South Asia (even in Kerala, Karnataka, AP). Where did this letter with subscript numbers come from? Is it actually used outside the internet? AleksiB 1945 (talk) 11:47, 30 August 2023 (UTC)
- I don't know the history. The position numbers could well have started out being in an Indian system, being replaced by Arabic numerals when the system got computerised. The system doesn't work well with Microsoft renderers, so it's probably not Unicode-based. The only transmission mechanism I've seen is PDF, though it probably goes back to early PostScript. --RichardW57m (talk) 12:56, 1 September 2023 (UTC)
- Is it used outside the Internet, are there any inscriptions or texts with it? AleksiB 1945 (talk) 10:11, 2 September 2023 (UTC)
- I suspect there are texts, but I don't know where to find them. There are unlikely to be any also available in the Internet in 'plain text'. --RichardW57 (talk) 11:30, 3 September 2023 (UTC)
- There are pictures of texts at https://www.unicode.org/L2/L2010/10379--extended-tamil.pdf. That year saw a fair bit of discussion - see the submissions list at https://www.unicode.org/L2/L2010/Register-2010.html. There is also plain text in that general writing-style under 'Categories' at https://sanskritdocuments.org/tamil/. So my last point was wrong. --RichardW57m (talk) 15:39, 12 September 2023 (UTC)
- @RichardW57m I see Tamil script recently got added to the module, but the implementation (at least on my computer) is buggy, as the vowel signs don't combine with the 'numbered' consonants, e.g. at देव. Do you know whether this implementation will improve? Exarchus (talk) 21:47, 20 January 2024 (UTC)
- Yes, it will improve by midnight ending 20 January UTC. The crucial lines were in the wrong place, but they should also be made less obtrusive. They also ought to be made readable. Publishing the change will regenerate all the pages invoking {{sa-alt}}, so I'd prefer to publish a change just once. --RichardW57 (talk) 22:52, 20 January 2024 (UTC)
- Done. To enable test cases, we need to set up transliteration from Tamil Sanskrit - or, better, enter the desired outcome. I will do that in slower time, and make the testcases module a bit cleverer about missing (or unconnected) transliteration modules. --RichardW57 (talk) 23:58, 20 January 2024 (UTC)
- @Exarchus. See the above. --RichardW57 (talk) 23:59, 20 January 2024 (UTC)
- @RichardW57m Now देव looks OK, but the superscript number is in the wrong place at दान. Exarchus (talk) 09:54, 21 January 2024 (UTC)
- @Exarchus: Can you please supply evidence of that? I just rearranged someone else's code, hiding behind that, and I consciously only promised 'improvement'. The best form of the evidence would be an entry for the Tamil script form of दान (dāna), including a quotation (probably a picture or some ugly way of linking to it) and an addition of the test case. The test case offers a way of adding a link via the field Taml_why, and you may prefer not to struggle with the Wiktionary entry. I have a feeling the correct Tamil form is த³ாந (without the dotted circle), but I couldn't lay my hands on an example. We already have a difficult case with जिघांसा (jighāṃsā) - see Module:sa-convert/testcases#Tamil. Today I'm planning to work on the transliteration, using the generation of that last segment as the test case, but I'll add an easily extendable initial form of the test case for the simple word - it's also probing for Burmic scripts. --RichardW57 (talk) 11:11, 21 January 2024 (UTC)
- Equivalently Module:sa-convert/testcases/Tamil. --RichardW57 (talk) 11:28, 21 January 2024 (UTC)
- @RichardW57 I was looking at the document you linked to above (https://www.unicode.org/L2/L2010/10379--extended-tamil.pdf) and at page 3 you have ப⁴ா (without circle). Page 5 has 'த³ா' (lines 4 and 8 of the Hindi text), the Kannada text has 'த⁴ா'. Exarchus (talk) 11:52, 21 January 2024 (UTC)
- By the way, I doubt the correct rendering can be done without changes to the font. (Or without encoding த³ etc. as a separate Unicode character.) It doesn't work for me with 'Noto Sans Tamil' either. Exarchus (talk) 12:47, 21 January 2024 (UTC)
- @Exarchus I'll put them forth as evidence. By the principle initially used to exclude Roman script Sanskrit from Wiktionary, the example on p3 is Tamil, not Sanskrit. I'll see if I can get a ruling from Unicode, but it looks like a slow process and may get nowhere. In the short term, there are two possible ways of getting the right appearance:
- Make a font that swaps the number and AA glyphs. Possibly not Unicode compliant.
- Make a font that deletes the dotted circle. Unicode compliant provided it claims not to support U+25CC DOTTED CIRCLE. There is also a valid way to do this on HarfBuzz renderers; I haven't tried with CoreText, and the trick fails with Uniscribe/DirectWrite.
- There is also a third possibility. Perhaps the glyph sequence can't be achieved in Unicode (without recourse to the Private Use Areas), and so such words fail CFI. --RichardW57 (talk) 13:01, 21 January 2024 (UTC)
- @RichardW57 This document (from the same person) says: "in most forms of Extended Tamil (including the Gita book mentioned previously running to almost 420,000 copies) the diacritics are placed between the consonant and any vowel signs placed to the right. We have also remarked (in L2/10-085 p 11) that the diacritic should rightfully semantically gravitate to the glyph that it qualifies. Thus in the interests of standardization, one would prefer that even in this ‘V-I system’ of Extended Tamil, the diacritic(s) is/are placed immediately after the consonant." Exarchus (talk) 16:22, 21 January 2024 (UTC)
- @Exarchus: I've tweaked the rearrangement code, post_replace_fix["Taml"], to yield digit before right matra, leaving the working code for the other way round in but commented out. I've allowed for the Grantha-like use of SIGN O; I should allow for SIGN E as well. I'm not sure whether anusvara goes above the consonant or the right matra in these cases, so I've included it in the characters to be re-arranged, but left it at the end in the output. The rearrangement code also handles subscript digits; so doing doesn't complicate the code. I think it's time to look at generating alternative code, without breaking the interface of export.tr.
- We might get some more positive replies (at https://corp.unicode.org/pipermail/unicode/2024-January/thread.html) to the encoding question when the USA starts work today. --RichardW57m (talk) 11:49, 22 January 2024 (UTC)
- The answer to anusvara placement may be that U+0B82 TAMIL SIGN ANUSVARA is the wrong character - there's a condemnation of it in Unicode documents L2/12-018 and L2/12-051. --RichardW57m (talk) 14:44, 22 January 2024 (UTC)
- From what I'm reading between the lines, there may have been some 'Tamil purists' that didn't like this proposal of encoding these extra letters... Exarchus (talk) 16:28, 21 January 2024 (UTC)
- @Exarchus: Rereading, I see that the word on p3 is Tamil, except in so far as it is coincidentally(?) the Sanskrit accusative singular. Ah well, maybe we can hook it up one day, or find a better test case. --RichardW57m (talk) 15:55, 22 January 2024 (UTC)
- @RichardW57m The Unicode Standard 15.0 says "... ப² = pha, ப³ = ba, and ப⁴ = bha. The characters U+00B2, U+00B3, and U+2074 can be used to preserve this distinction in plain text." That would suggest these superscript digits are somehow supported by Unicode for use in Tamil script, which is not really the case in practice. Exarchus (talk) 17:05, 22 January 2024 (UTC)
- @Exarchus: Sharma has just clarified in a Unicode post today that the digits always precede the right matra. If that is so, then the question is whether a rendering system may assume that such digits after a right matra may be rearranged to give the right appearance. One could argue that if they can't be, then they should be separated from the right matra by a ZWNJ, just as has to be done with a final virama to ensure it isn't rendered by stacking.
- We come close to that situation with the use of superscript digits for footnotes in inflection tables, but I think we have enough attribute changes between the inflected word and the superscript digit that they won't be interpreted in the same run and therefore rearranged.
- Now, MS Edge has some weird difference between U+00B3 and U+2074. I'm getting a dotted circle and the vowel after U+00B3, but just the missing glyph (tofu) after U+2074. That could indicate that a font might handle one but fail with the second, but I can't see where that is Unicode's fault. I get the dotted circle and vowel in Word after either digit, so perhaps it's a Chromium bug. --RichardW57m (talk) 18:04, 22 January 2024 (UTC)
- @Exarchus, (Notifying AryamanA, Bhagadatta, Svartava, JohnC5, Kutchkutch, Inqilābī, Getsnoopy, Rishabhbhat, Dragonoid76): I've reverted the code by @sbb1413 so that we get the visual order digit, AA, which is what quotations that count for CFI attest. I've left it easy to switch to the AA, digit order by changing the value of variable Tamlaa4opt. --RichardW57 (talk) 12:29, 6 February 2024 (UTC)
- @Exarchus If Unicode document L2/10-440 [2] is valid, the previous code that I said I would improve 'before midnight' was valid - the superscripts should be treated as nuktas! It's then just a problem with fonts and, I fear, browsers. L2/10-435 shows them working nicely under AAT (nowadays coretext). I think we will have to support multiple encodings if we are to support Tamil script Sanskrit from the wild. --RichardW57m (talk) 14:45, 23 January 2024 (UTC)
- @RichardW57m I really don't know enough about fonts/encodings to help you there, so up to you to see how much time/effort you want to spend on this. Exarchus (talk) 15:21, 23 January 2024 (UTC)
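The rearrangement discussed in this thread, whether a superscript digit is stored before or after the right-hand vowel sign AA, can be sketched in Python as below. This is not the module's `post_replace_fix["Taml"]` code: it handles only TAMIL VOWEL SIGN AA and the superscript digits 2, 3 and 4, and the `digit_first` parameter is a hypothetical switch that merely illustrates keeping both orders available.

```python
import re

SIGN_AA = "\u0BBE"              # TAMIL VOWEL SIGN AA (a right-side matra)
DIGITS = "\u00B2\u00B3\u2074"   # superscript 2, 3, 4

_AA_THEN_DIGIT = re.compile(f"({SIGN_AA})([{DIGITS}])")
_DIGIT_THEN_AA = re.compile(f"([{DIGITS}])({SIGN_AA})")


def arrange(text: str, digit_first: bool = True) -> str:
    """Put the superscript digit before (default) or after vowel sign AA."""
    if digit_first:
        return _AA_THEN_DIGIT.sub(r"\2\1", text)
    return _DIGIT_THEN_AA.sub(r"\2\1", text)


dana = "\u0ba4\u0bbe\u00b3\u0ba8"  # ta, sign AA, superscript 3, na
print(arrange(dana))
```

With `digit_first=True` the digit directly follows the consonant it qualifies, which matches the order described above as the attested one; whether a given font then renders it without a dotted circle is a separate question.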
- Yes, it will improve by midnight ending 20 January UTC. The crucial lines were in the wrong place. but they should also be made less obtrusive. They also ought to be made readable. Publishing the change will regenerate all the pages invoking
- @RichardW57m I see Tamil script recently got added to the module, but the implementation (at least on my computer) is buggy, as the vowel signs don't combine with the 'numbered' consonants, e.g. at देव. Do you know whether this implementation will improve? Exarchus (talk) 21:47, 20 January 2024 (UTC)
- There are pictures of texts at https://www.unicode.org/L2/L2010/10379--extended-tamil.pdf. That year saw a fair bit of discussion - see the submissions list at https://www.unicode.org/L2/L2010/Register-2010.html. There is also plain text in that general writing-style under 'Categories' at https://sanskritdocuments.org/tamil/. So my last point was wrong. --RichardW57m (talk) 15:39, 12 September 2023 (UTC)
- I suspect there are texts, but I don't know where to find them. There are unlikely to be any also available in the Internet in 'plain text'. --RichardW57 (talk) 11:30, 3 September 2023 (UTC)
- Is it used outside the Internet, are there any inscriptions or texts with it? AleksiB 1945 (talk) 10:11, 2 September 2023 (UTC)
- I don't know the history. The position numbers could well have started out being in an Indian system, being replaced by Arabic numerals when the system got computerised. The system doesn't work well with Microsoft renderers, so it's probably not Unicode-based. The only transmission mechanism I've seen is PDF, though it probably goes back to early PostScript. --RichardW57m (talk) 12:56, 1 September 2023 (UTC)
- When were the Hindu-Arabic numerals introduced to South Asia? Historically the Grantha script was the extension that was used to write Sanskrit; now both scripts are distinguished and Devanagari is used predominantly in South Asia (even in Kerala, Karnataka, AP). Where did this letter with subscript numbers come from? Is it actually used outside the Internet? AleksiB 1945 (talk) 11:47, 30 August 2023 (UTC)
Encoding choice
@Sbb1413 What is your source of the information for the conversion? I am particularly curious about your choice for some of the obscurer characters:
["ँ"] = "ஂ", ["ं"] = "ஂ", ["ऋ"] = "ரி", ["ॠ"] = "ரி", ["ऌ"] = "லி", ["ॡ"] = "லி",
For anunasika and anusvara, other sources indicate a preference for using variations of ம் or Grantha characters, and for the syllabic consonants, variations of ரு and லு. --RichardW57m (talk) 14:37, 22 January 2024 (UTC)
Are all those scripts actually used to write Sanskrit?
@DerekWinters, AryamanA, some don't have necessary letters like ṣ, ṛ, ḷ in Takri; some of them, like Tai Tham, just can't be used to write Sanskrit, as they don't have most of the letters, even breathy voicing; at least for Tibetan there are modified letters specifically for Sanskrit, like ḍ from rotated d, etc. The existing ones also need fixing, see Wiktionary:Requests_for_verification/Non-English#ශ්රී. AleksiB 1945 (talk) 13:29, 30 August 2023 (UTC)
- The problem with Tai Tham is not the lack of characters, though I'd like to see how ṝ is written; I would not be surprised to learn that it was ᩂᩣ, as in Thai. See Module:pi-translit for details, or, without the need for a special font, w:Tai_Tham_script#Sanskrit_consonants_in_Tai_Tham_script, which latter page doesn't address the Sanskrit-specific vowels. Deep stacks might be an issue, but that might easily be resolved in the same way as Grantha, i.e. by splitting the stacks with SAKOT. I have been directed to an article about writing Sanskrit in the Tai Tham script, which I ought to get round to reading. I have heard of one Sanskrit inscription in the Tai Tham script. --RichardW57m (talk) 12:43, 1 September 2023 (UTC)
Tibetan Script Issues
@Theknightwho, AleksiB 1945, (Notifying AryamanA, Bhagadatta, Svartava, JohnC5, Kutchkutch, Inqilābī, Getsnoopy, Rishabhbhat, Dragonoid76): What's the evidence for intersyllabic tsheg (U+0F0B) within Sanskrit words? The evidence I've gleaned from the Internet, e.g. by searching for མཎི (mṇi, “jewel”), suggests that it is omitted within Sanskrit words and often enough from religious loans from Sanskrit, e.g. https://linguistics.stackexchange.com/questions/46548/what-are-the-rules-for-creating-multiple-syllable-words-in-tibetan-without-tsheg. The tsheg was added in this change. --RichardW57m (talk) 15:34, 1 September 2023 (UTC)
- @RichardW57m That link doesn't support your point at all - it merely says
However, loanwords, especially those from Sanskrit (henceforth abbreviated Skt.), were sometimes spelled without the tsheg in traditional orthography.
I also have serious doubts that generic Google searches are going to be reliable or representative for this, either, just as they weren't for Mongolian. Theknightwho (talk) 15:40, 1 September 2023 (UTC)
- @Theknightwho: But ask yourself why Tibetan script "Om manipadme hūṃ" so often appears with no tshegs on the central two words. We also have the example of an 'entire phrase taken from Sanskrit', namely ནམོ་གུརུ་མཉྫུ་གྷོཥཱ་ཡ། (namo guru mandzu ghoṣ'a ya), which would be from namo guru mañjughoṣaya, which the tshegs chop up at the more obvious morpheme boundaries. That's worrying for predictive text conversion, and might cause 'fun' for inflection tables. Assuming you're not being deliberately obstructive, it looks as though you may not have good evidence of the conventions either, but just assumed that the rules of Tibetan applied, risking a booby trap like the writing convention of Hindi for homorganic nasals before stops seemingly no longer applying to Indian Devanagari Sanskrit. --RichardW57m (talk) 16:57, 1 September 2023 (UTC)
- On the lemming principle, we can note that the New Testament in Sanskrit in Tibetan script does use tshegs - but that's an automatic transliteration of the Devanagari, so somewhat dubious. I've also found chunks of Pali in Tibetan script, but again, that's probably not very authentic. However, I did find this quote from Chris Fynn in w:Talk:Tibetan_script#Contradiction:
- "It definitely should be ཎི U+0F4E U+0F72 and is always written that way by Tibetans. However both ཨོཾ་མ་ཎི་པདྨེ་ཧཱུྃ and ཨོཾ་མ་ཎི་པདམེ་ཧཱུྃ are correct. The former being more usual, but the latter distinguishing the six syllables more clearly. It is also not strictly necessary to use the tsek character or dot between the syllables when writing Sanskrit mantras in Tibetan script. In fact when these mantras are placed in Buddha images and so on, then the tsek character (U+0F0B) should never be used."
- The text in the page from the Weir collection in w:Tibetan script#Extended use seems to be some unidentified Indo-European language (I think if it were Sanskrit we would know that it was), but again suggests that Sanskrit text should be tshegless. --RichardW57m (talk) 13:08, 4 September 2023 (UTC)
- @RichardW57m I think that's an interesting question about the tsek. From what I understand it basically functions as a space in Tibetan, as most Tibetan words are monosyllables (with often complex consonant clusters, at least in Classical Tibetan: compare the original pronunciation of བརྒྱད with the current one in Lhasa). Using the tsek in a similar way for Sanskrit would just be... not very useful. Exarchus (talk) 23:05, 21 January 2024 (UTC)
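For anyone who wants to compare the two conventions side by side, here is a minimal sketch in Python (for illustration only; it is not the module's Lua). The helper name `finalize_tibetan` and the `keep_tsheg` switch are hypothetical. The sketch only shows what tsheg-free output would look like inside a single word; it takes no position on which convention is correct, which is the open question in this thread.

```python
TSHEG = "\u0F0B"  # TIBETAN MARK INTERSYLLABIC TSHEG


def finalize_tibetan(word: str, keep_tsheg: bool = False) -> str:
    """Return a converted word with or without word-internal tshegs.

    Hypothetical helper: `word` is assumed to be one already-converted
    Sanskrit word in Tibetan script.
    """
    if keep_tsheg:
        return word
    # Drop every tsheg inside the word; nothing else is touched.
    return word.replace(TSHEG, "")


# The syllables ma + tsheg + ṇi, with and without the tsheg
sample = "\u0F58\u0F0B\u0F4E\u0F72"
print(finalize_tibetan(sample))
print(finalize_tibetan(sample, keep_tsheg=True))
```

A switch like this would let test cases record both spellings until attested usage settles the matter.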
Fixes
Some, like Takri, Prachalit, Kaithi, Soyombo and Zanabazar Square, don't appear in the module though they were added.
- সংস্কৃত (Assamese script)
- ᬲᬂᬲ᭄ᬓᬺᬢ (Balinese script)
- সংস্কৃত (Bengali script)
- 𑰭𑰽𑰭𑰿𑰎𑰴𑰝 (Bhaiksuki script)
- 𑀲𑀁𑀲𑁆𑀓𑀾𑀢 (Brahmi script)
- သံသ္ကၖတ (Burmese script)
- संस्कृत (Devanagari script)
- સંસ્કૃત (Gujarati script)
- ਸਂਸ੍ਕ੍ਰਤ (Gurmukhi script)
- 𑌸𑌂𑌸𑍍𑌕𑍃𑌤 (Grantha script)
- ꦱꦁꦱ꧀ꦏꦽꦠ (Javanese script)
- 𑂮𑂁𑂮𑂹𑂍𑃂𑂞 (Kaithi script)
- ಸಂಸ್ಕೃತ (Kannada script)
- សំស្ក្ឫត (Khmer script)
- ສໍສ຺ກ຺ຣິຕ (Lao script)
- സംസ്കൃത (Malayalam script)
- ᢀ᠋ᠰ᠌ᠠᠰ᠌ᡬᡵᡳᢠᠠ (Manchu script)
- 𑘭𑘽𑘭𑘿𑘎𑘵𑘝 (Modi script)
- ᢀ᠋ᠰᠠᠰᢉᠷᠢᢐᠠ᠋ (Mongolian script)
- 𑧍𑧞𑧍𑧠𑦮𑧖𑦽 (Nandinagari script)
- 𑐳𑑄𑐳𑑂𑐎𑐺𑐟 (Newa script)
- ସଂସ୍କୃତ (Odia script)
- ꢱꢀꢱ꣄ꢒꢺꢡ (Saurashtra script)
- 𑆱𑆁𑆱𑇀𑆑𑆸𑆠 (Sharada script)
- 𑖭𑖽𑖭𑖿𑖎𑖴𑖝 (Siddham script)
- සංස්කෘත (Sinhalese script)
- 𑪁𑪖𑪁 𑪙𑩜𑩙𑩫 (Soyombo script)
- 𑚨𑚫𑚨𑚶𑚊𑚙 (Takri script)
- ஸஂஸ்க்ரித (Tamil script)
- సంస్కృత (Telugu script)
- สํสฺกฺฤต (Thai script)
- སཾ་སྐྲྀ་ཏ (Tibetan script)
- 𑒮𑓀𑒮𑓂𑒏𑒵𑒞 (Tirhuta script)
- 𑨰𑨸𑨰𑩇𑨋𑨼𑨉𑨙 (Zanabazar Square script)
The scripts should be in alphabetical order, except that Devanagari comes first and similar scripts are kept together (Oxomiya/Bangla, Mongol/Manchu): for example Devanagari, Assamese, Bengali, Balinese, Bhaiksuki, Brahmi, Burmese, and likewise Malayalam, Manchu, Mongol, Modi.
The Tamil one needs fixing; the text from the main page:
>க²க³ௌக⁴ாஙசிச்ச²ௌஜா ஜ²ாஞ்ஜ்ஞோ𑌽டௌட²ீட³ட³ண்ட⁴ணஃ। தத²ோத³த⁴ீந் பப²ர்ப³ாப⁴ீர்மயோ𑌽ரில்வாஶிஷாஂ ஸஹஃ AleksiB 1945 (talk) 10:59, 16 December 2023 (UTC)
- @AleksiB 1945: The first question, of course, is whether the Tamil script Unicode character sequence is wrong. Do you have evidence that it is wrong? I suspect a renderer bug. On the other hand, does anyone have evidence that it is correct? --RichardW57 (talk) 11:58, 19 December 2023 (UTC)
- @AleksiB 1945: Prachalit does appear, under the name Newa.
- Takri, Kaithi, Soyombo and Zanabazar Square don't appear because they aren't listed in Module:languages/data/2 as scripts used for Sanskrit.
- Take the ordering request to the Grease Pit, Module talk:languages/data/2 or, if we're going to have more control over it, Module talk:sa-headword. At present the order presented is the order the scripts appear in Module:languages/data/2. --RichardW57m (talk) 12:06, 24 January 2024 (UTC)
- Soyombo also seems to need fixing, as random spaces are appearing within words, as in 𑪁𑪖𑪁_𑪙𑩜𑩙𑩫 and 𑩿_𑪙𑩼𑩑𑩛. The Tamil one seems fixable if the superscript numbers are shifted to the right of the vowel sign. 13:16, 5 February 2024 (UTC)
- Discussed at #Tamil, and implemented where the rendering looks right. --RichardW57 (talk) 01:47, 6 February 2024 (UTC)
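The reordering suggested above (moving the superscript number to the right of the vowel sign) can be sketched as follows. This is a Python illustration, not the module's Lua code, and `shift_superscripts` is a hypothetical name. It assumes the superscripts are U+00B2, U+00B3 and U+2074 and that only the dependent vowel signs U+0BBE to U+0BCC are involved; the virama is deliberately left alone because the thread does not say how it should interact with the numbers.

```python
import re

# Assumption: superscript digits used for the 'numbered' consonants
SUPERSCRIPTS = "\u00B2\u00B3\u2074"
# Assumption: Tamil dependent vowel signs, U+0BBE..U+0BCC (virama excluded)
VOWEL_SIGNS = "\u0BBE-\u0BCC"

# A superscript followed by one or more vowel signs
_PATTERN = re.compile(f"([{SUPERSCRIPTS}])([{VOWEL_SIGNS}]+)")


def shift_superscripts(text: str) -> str:
    """Move each superscript digit to the right of the vowel sign(s) after it."""
    return _PATTERN.sub(r"\2\1", text)


# ta + superscript three + aa + na, the form quoted earlier for dāna
before = "\u0BA4\u00B3\u0BBE\u0BA8"
after = shift_superscripts(before)
print(before, after)
```

Whether the shifted order is the better encoding, or merely the one that renders acceptably with current fonts, is the question discussed at #Tamil.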
Nasalised Consonants
(Notifying AryamanA, Bhagadatta, Svartava, JohnC5, Kutchkutch, Inqilābī, Getsnoopy, Rishabhbhat, Dragonoid76, RichardW57): When converting between scripts, can it be reasonable to systematically insert a word separator into a (half-)nasalised geminate consonant in Devanagari? So far as I am aware, these half-nasalised geminates only occur as an alternative to anusvara + y/l/v in external sandhi. (A year or two back, only Windows renderers allowed the distinction between a nasalised consonant and a nasalised vowel to be encoded properly and distinguished visually, but MS Edge now handles it.) The visual distinction only arises in Devanagari when a half-form is used for the conjunct.
An alternative view is that this form of sandhi without a word separator is not in accord with the Unicode standard (TUS 15.0, Chapter 12 R10), and therefore shall not be recorded in Wiktionary.
A reality-based test would be for conversion to Tamil script to split प्रळयय्ँयान्ति (praḷayaym̐yānti) into ப்ரளயய𑌁 (praḷayay̐) யாந்தி (yānti). (Transliteration isn't yet set up to handle GRANTHA CANDRABINDU when transliterating from the Tamil script.) It's a shame that the letter ள (ḷa) seems to be a typo for ல (la). --RichardW57m (talk) 14:55, 7 February 2024 (UTC)
- Transliteration now handles candrabindu in such combinations. --RichardW57m (talk) 09:56, 9 February 2024 (UTC)
- As one can already write this as प्रलयय्ँ (pralayaym̐) यान्ति (yānti) in Devanagari, at least according to Whitney's grammar, I think I don't have so much need to split it, which is just as well, because I suspect the word सय्ँयुग (saym̐yuga) doesn't get split in the Tamil script. Or is it a non-existent variant of संयुग (saṃyuga)? --RichardW57 (talk) 14:08, 8 February 2024 (UTC)
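The systematic split asked about at the top of this section could look roughly like the sketch below (Python for illustration, not the module's Lua; `split_nasalised_geminate` is a hypothetical name). It assumes the Devanagari sequence is consonant, virama (U+094D), candrabindu (U+0901), followed by the same consonant, and that only y, l and v take part, as stated above. Note that it would also split a word-internal case such as the सय्ँयुग mentioned above, which is exactly the doubt raised there.

```python
import re

VIRAMA = "\u094D"       # DEVANAGARI SIGN VIRAMA
CANDRABINDU = "\u0901"  # DEVANAGARI SIGN CANDRABINDU
SEMIVOWELS = "\u092F\u0932\u0935"  # ya, la, va

# Nasalised consonant (C + virama + candrabindu) followed by the same C
_PATTERN = re.compile(f"([{SEMIVOWELS}]){VIRAMA}{CANDRABINDU}(?=\\1)")


def split_nasalised_geminate(text: str, separator: str = " ") -> str:
    """Insert a word separator inside a half-nasalised geminate."""
    return _PATTERN.sub(lambda m: m.group(0) + separator, text)


# y + virama + candrabindu + y, the core of the example above
print(split_nasalised_geminate("\u092F\u094D\u0901\u092F"))
```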
Gurmukhi and Kaithi vocalic r
@AleksiB 1945 @RichardW57m Surely ृ in Gurmukhi is rather ੍ਰ instead of current ੍ਹ ?
The Kaithi version of जॄ (jṝ) currently links to the letter 𑂔 as long vocalic r doesn't have a corresponding symbol. Wouldn't it be better to either show no transliteration or else (in this case) show the transliteration of जृ (jṛ), as using long ॄ for these roots is artificial anyway. Exarchus (talk) 16:34, 30 March 2024 (UTC)
Gurmukhi Gemination
I'm wondering if Gurmukhi shouldn't use the gemination diacritic ੱ rather than simply writing the letter twice and using a ੍ to suppress the vowel. ChromeBones (talk) 06:17, 21 April 2024 (UTC)
- @ChromeBones You're probably right. To be sure, it would be good to have an example of it being used in a Sanskrit text, and I'm interested to know where Gurmukhi is actually used to write Sanskrit. Exarchus (talk) 16:01, 25 April 2024 (UTC)
- @Exarchus Gurmukhi was used to write the Sanskrit parts of the Sikh holy book. I'm at the subway station right now; when I get a chance I'll pull some of that up and check to see how gemination is represented there. ChromeBones (talk) 21:35, 25 April 2024 (UTC)
- @ChromeBones I suppose that for a geminate nasal, the tippi sign ੰ would be used instead of ੱ. And is the Gurmukhi equivalent of ऋ really ਰ or could it be ਰਿ ? Exarchus (talk) 06:40, 26 April 2024 (UTC)
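If the addak turns out to be attested, the proposed change could be sketched like this (Python for illustration, not the module's Lua; `use_addak` is a hypothetical name). It assumes the current output writes a geminate as consonant, virama (U+0A4D) and the same consonant, and that the addak (U+0A71) is placed before the consonant it doubles. Geminate nasals are left untouched, since the tippi has been suggested for those and nothing is confirmed yet.

```python
import re

VIRAMA = "\u0A4D"  # GURMUKHI SIGN VIRAMA
ADDAK = "\u0A71"   # GURMUKHI ADDAK
CONSONANTS = "\u0A15-\u0A39"
# nga, nya, nna, na, ma: skipped here, pending evidence on the tippi
NASALS = set("\u0A19\u0A1E\u0A23\u0A28\u0A2E")

# Consonant + virama + the same consonant
_PATTERN = re.compile(f"([{CONSONANTS}]){VIRAMA}\\1")


def use_addak(text: str) -> str:
    """Rewrite C + virama + C as addak + C for non-nasal consonants."""
    def repl(match: re.Match) -> str:
        consonant = match.group(1)
        if consonant in NASALS:
            return match.group(0)  # leave geminate nasals as they are
        return ADDAK + consonant
    return _PATTERN.sub(repl, text)


# ta + virama + ta
print(use_addak("\u0A24\u0A4D\u0A24"))
```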