Module talk:as-translit
@Wyang The testcases page is broken, would you mind checking it? Thanks! —Aryaman (मुझसे बात करो) 04:18, 25 July 2017 (UTC)
Problems
@Wyang, Aryamanarora, Sagir Ahmed Msa, this module still has broken test cases and transcriptions like আর্ডৱার্ক which appears at aardvark. Could this be fixed? —JohnC5 10:54, 21 October 2017 (UTC)
- @JohnC5: Hi, you wrote আ র্ ড ৱা র্ ক, but it should be written আ ৰ্ ড ৱা ৰ্ ক. Even then it will be transliterated wrongly, as আৰ্ডৱাৰ্ক (ardowark). --Msa
- @Sagir Ahmed Msa: To clarify, I copied it from aardvark. I take no responsibility for that spelling. —JohnC5 11:49, 21 October 2017 (UTC)
- I made the module output nil if romanisation fails. Wyang (talk) 11:39, 21 October 2017 (UTC)
- @JohnC5: Schwa-dropping is often irregular with loanwords in New Indo-Aryan languages, so I don't think this is that big of a problem. MOD:hi-translit has similar problems: आर्डवर्क (ārḍavark). —Aryaman (मुझसे बात करो) 16:12, 21 October 2017 (UTC)
Can you use Assamese letters instead of romanisation? Sagir Ahmed Msa (talk) 02:18, 23 October 2017 (UTC)
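One way to picture the "output nil if romanisation fails" behaviour mentioned above is a guard that declines to romanise input containing the Bengali letter র (U+09B0) where Assamese spelling expects ৰ (U+09F0). This is only a sketch in Python rather than the module's Lua. `romanise_or_none`, `suggest_assamese_spelling` and the `romanise` callback are hypothetical names, not functions of Module:as-translit, and the failure condition the module actually uses is not stated in this thread.

```python
from typing import Callable, Optional

BENGALI_RA = "\u09B0"   # র: Bengali spelling of "r"
ASSAMESE_RA = "\u09F0"  # ৰ: Assamese spelling of "r"


def romanise_or_none(word: str, romanise: Callable[[str], str]) -> Optional[str]:
    """Return None instead of a misleading romanisation (assumed policy)."""
    if BENGALI_RA in word:
        # Not Assamese orthography: give up so that an editor can
        # supply a manual transliteration instead.
        return None
    return romanise(word)


def suggest_assamese_spelling(word: str) -> str:
    """Swap Bengali র for Assamese ৰ, as suggested for the aardvark entry."""
    return word.replace(BENGALI_RA, ASSAMESE_RA)
```

A caller would treat a `None` result as "no automatic romanisation available" and fall back to whatever the editor typed by hand.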
Inherent vowel deletion
@AryamanA Hi, can you remove inherent vowel deletion in medial position? Msasag (talk) 04:31, 10 April 2018 (UTC)
- @Msasag, AryamanA: I wouldn't rush with this. We need to know the exact rules for schwa-dropping, i.e. which combinations should allow it and which should not, e.g. yes in rAkAtāt but no in rAktAt. There are too many scenarios. Please make plenty of test cases first and try to describe the rules. --Anatoli T. (обсудить/вклад) 04:55, 10 April 2018 (UTC)
- @Atitarev Hi, in non-compound tadbhava words it's simple, i.e. there is no schwa dropping in medial position.
- But for compound words:
- Tadbhava: ফুলবাৰী (ful.bari), মাছমৰীয়া (mas.moria), কলগছ (kol.gos), তেজপুৰ (tez.pur), যোৰহাট (zür.hat), কাঠৰোকা (kath.rüka)
- Non-Sanskrit borrowings: চৰকাৰ (sor.kar), ফুটবল (fut.bol), দৰকাৰ (dor.kar), কাৰখানা (kar.khana). Some are written with a hosonto/virama: চৰ্দাৰ (sor.dar)
- Sanskrit borrowings: জনসংখ্যা (zono.xoiṅkha), গণতন্ত্ৰ (gono.tontro), মালভূমি (mal.bhumi, malo.bhumi), ৰাজনীতি (raz.niti), সমতল (xomo.tol), জলচক্ৰ (zolo.sokro)
- etc
- Many compound words are also written with a "-", especially tadbhavas.
- Some non-Sanskrit borrowings, especially those that came through Bengali and other Indo-Aryan languages, are unpredictable in Assamese, since they follow Bengali phonology: চছমা (sosma), আহমেদ (ahmed), ইছলাম (islam), অলমাৰী (almari), আলজেৰিয়া (alzeria). Some are written with a hosonto: কোপ্তা (kupta), মোকৰ্দমা (mukordoma), গ্ৰেপ্তাৰ (greptar).
- In final position it's irregular in Sanskrit loanwords, like বিগত (bigoto), গঠিত (gothito), মৃগ (mrigo), and conjuncts in such loanwords end with the inherent vowel.
- Other conjuncts mostly don't end with an inherent vowel, and their last consonant is not pronounced, for example:
- ইংলেণ্ড (iṅ.lend /iŋlɛn/), ইংলেণ্ডত (iṅ.lendot /iŋlɛndɔt/), বান্ধ (bandh /ban/), বান্ধিলি (bandhili /bandʱili/).
- The rules for morphemes were added. Msasag (talk) 06:12, 10 April 2018 (UTC)
- @Msasag: Thanks for the reply. I won't be able to implement the code but just mentioned what needs to happen. Just removing the deletion might be easy, but the module can't be based on rules such as Sanskrit borrowing/native word or solid/compound word. The module can't tell the difference. It can only be based on consonant/vowel patterns or special symbols, such as hosonto/virama. "-" is not part of the native Assamese script, it's just a convenience provided by some dictionary publishers for English speakers in the transliteration. If the implementation is not possible based on rules alone, then editors need to be prepared to provide the transliteration manually if the module fails. The more rules are described and implemented, the better, but it may never be perfect. The Hindi module's implementation seems pretty good but the rules in Hindi and Assamese are not the same. --Anatoli T. (обсудить/вклад)
- @Atitarev Hi, the "-" is also frequently used in compound words (not only in transliteration), and some other characters are used in modern Assamese orthography too, e.g. ’ is a vowel. It's hard to determine the patterns. I ran a random paragraph through the module and the current rules don't seem to work well: words like চৰকাৰ (sorokar, “government”) should be "sorkar" and কিছমিছ (kisomis, “raisins”) should be "kismis", while পাহৰিলে (pahrile, “forgot”) should be "pahorile" and নকৰিবা (nokriba, “don't do”) should be "nokoriba". Since there are more words with no schwa deletion (in the absence of specific deleters like hosonto/virama or "-"), I think the new rule will be fine. Msasag (talk) 10:34, 10 April 2018 (UTC)
- @Msasag: I am not sure if getting rid of all of the schwa-dropping rules medially is a good idea. The examples you gave in your third comment are either (1) borrowings from Persian or Arabic (which will have different schwa-dropping rules), or (2) inflections of native words. নকৰিবা (nokriba, “don't do”) = ন-কৰিবা (no-koriba), and পাহৰিলে (pahrile, “forgot”) is from পাহৰা (pahora). The transliterations could be hardcoded for those forms in the inflection tables. Even Hindi has those same problems occasionally. I also found a paper on Assamese phonology but I don't know if it will help. —AryamanA (मुझसे बात करें • योगदान) 12:37, 10 April 2018 (UTC)
- @AryamanA:
- 1মঙলদৈ 2নাহৰকটীয়া নাহৰফুটুকী ফুলবাৰী 3কলগছ চাহবাগান 4তেজপুৰ 5যোৰহাট কাঠৰোকা 7মাছমৰীয়া 8ফুটবল জনসংখ্যা গণতন্ত্ৰ মালভূমি 9ৰাজনীতি সমতল জলচক্ৰ গোলমৰিচ কনকলতা শৌলমাৰি দলবাৰী মিকিৰভেটা 10মিকিৰভেটাত বৰপেটা 11বৰপেটাত শুৱালকুচি 12তেজপিঁয়া 13গহপুৰ 14গহপুৰত জলকীয়া টনকিয়াল
- 1moṅlodoi 2nahrokotia nahorphutuki phulbari 3kologos sahbagan 4tezopur 5zürohat kathrüka 7masomoria 8phutobol zonoxoṅkhya gonotontro malbhumi 9razoniti xomotol zolosokro gülmoris konoklota xoulmari dolbari mikirbheta 10mikirbhetato borpeta 11borpetato xualkusi 12tezopĩya 13gohopur 14gohopurt zolokia tonokial
- Some other words. 14 of these have mistakes. Most of these are native. Since such native words are not very common, I used names of some cities of Assam. These are nouns and these are written with non-module transliteration in the declension templates. I think it's better to have the inherent vowels than its absence in the wrong positions. In verb conjugation templates where almost all words are native non compound, and in example templates it's hard to change keyboards and write the whole line respectively. Msasag (talk) 14:01, 10 April 2018 (UTC)
- @Msasag: Okay, I see your point. But I'm not sure how to keep only medial schwas; if I remove the schwa dropping code all of the final schwas are kept too. —AryamanA (मुझसे बात करें • योगदान) 00:48, 11 April 2018 (UTC)
@AryamanA Hmm, I think it's doing more harm than good. Is there any other module where only the final inherent vowel is dropped? Msasag (talk) 14:19, 15 April 2018 (UTC)
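The rule proposed in this section (keep the inherent vowel medially unless a hosonto follows, drop it only word-finally) can be sketched as follows. This is Python rather than the module's Lua, the character tables are a tiny hypothetical subset chosen to cover the examples from this thread, and `translit` is not the module's actual function. Final conjuncts, which the discussion above says often keep their inherent vowel in Sanskrit loanwords, are not handled.

```python
from typing import Optional

# Hypothetical, heavily reduced tables; not those of Module:as-translit.
CONSONANTS = {"ক": "k", "চ": "s", "দ": "d", "ন": "n", "প": "p",
              "ব": "b", "ৰ": "r", "ল": "l", "হ": "h"}
VOWEL_SIGNS = {"া": "a", "ি": "i", "ে": "e"}
HOSONTO = "্"    # virama: explicitly suppresses the inherent vowel
INHERENT = "o"


def translit(word: str) -> Optional[str]:
    """Keep medial inherent vowels; drop only the word-final one."""
    out = []
    for i, ch in enumerate(word):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else None
            if nxt is None:
                continue  # word-final consonant: no inherent vowel
            if nxt in VOWEL_SIGNS or nxt == HOSONTO:
                continue  # an explicit vowel sign or hosonto follows
            out.append(INHERENT)  # medial position: keep the vowel
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch == HOSONTO:
            continue
        else:
            return None  # unknown character: give up rather than guess
    return "".join(out)


print(translit("নকৰিবা"))   # nokoriba
print(translit("পাহৰিলে"))  # pahorile
print(translit("চৰ্দাৰ"))    # sordar
print(translit("চৰকাৰ"))    # sorokar
```

The last line shows the trade-off discussed above: without a hosonto the sketch gives "sorokar" for চৰকাৰ rather than "sorkar", so loanwords of that kind would still need a manual transliteration.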
অৱগ্ৰহ
@Msasag How could we resolve the issue with the owogroho? মাঽমৃতাত্ (maomritat) should be just ma'mritat, right? I know this won't be a major issue, but I'm asking for completeness of the script (and for the sa-convert module I'm currently making). DerekWinters (talk) 16:53, 14 May 2018 (UTC)
@DerekWinters I have not seen the character used in Assamese texts, but from the Bengali alphabet wiki article I learnt that it prolongs the vowel before it, so the word would be maamritat. In Sanskrit it has a different usage, which I know almost nothing about. Msasag (talk) 15:41, 15 May 2018 (UTC)
- @Msasag: In Sanskrit it's the result of sandhi of the schwa, so सः अहम् (saḥ aham, “I am that”) becomes सो ऽहम् (so ʼham). But I've also seen it as a lengthener in Hindi, especially as ओऽम् to show a long "oooom". —AryamanA (मुझसे बात करें • योगदान) 20:36, 15 May 2018 (UTC)
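The two readings of the owogroho discussed here (an apostrophe marking sandhi, as in Sanskrit, versus a lengthener of the preceding vowel) could be made a switch in a post-processing step. The sketch below is in Python, not the module's Lua, and assumes the rest of the word has already been romanised with the sign ঽ (U+09BD) left in place. `resolve_avagraha` and its `mode` values are hypothetical names, not part of Module:as-translit or sa-convert.

```python
AVAGRAHA = "\u09BD"        # ঽ
VOWELS = set("aeiouü")     # romanised vowel letters seen in this thread


def resolve_avagraha(romanised: str, mode: str = "sandhi") -> str:
    """Replace ঽ in an otherwise romanised string.

    mode="sandhi":   write an apostrophe (Sanskrit-style elision).
    mode="lengthen": repeat the preceding vowel, if there is one.
    """
    if mode not in ("sandhi", "lengthen"):
        raise ValueError("mode must be 'sandhi' or 'lengthen'")
    out = []
    for ch in romanised:
        if ch != AVAGRAHA:
            out.append(ch)
        elif mode == "sandhi":
            out.append("'")
        elif out and out[-1] in VOWELS:
            out.append(out[-1])  # lengthen: double the previous vowel
        # otherwise (lengthen with no preceding vowel) the sign is dropped
    return "".join(out)


print(resolve_avagraha("ma\u09BDmritat"))              # ma'mritat
print(resolve_avagraha("ma\u09BDmritat", "lengthen"))  # maamritat
```

Which mode should be the default for Assamese is exactly the open question in this section, so the sketch leaves that choice to the caller.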