Module talk:mr-translit/testcases

च: c and ċ, ज: j and j̈, झ: jh and j̈h

@Kutchkutch Are there rules for when ज is j or j̈? —Aryaman ^{(मुझसे बात करो)} 13:43, 17 September 2017 (UTC)

@Aryamanarora c and ċ, j and j̈, jh and j̈h are each a pair of distinct phonemes that can distinguish two words even when all the other phonemes in the two words are exactly the same. This fact is well established in many sources.

For example, जपणे represents two distinct words that differ only by j or j̈ as the first sound, namely japṇe and j̈apṇe. Definitions for both of these words (with the same spelling जपणे) can be found in Molesworth’s Marathi dictionary. japṇe is equivalent to Hindi जपना (japnā) and j̈apṇe means ‘To do or perform carefully and cautiously’. Despite japṇe and j̈apṇe being pronounced similarly and having somewhat similar meanings, their pronunciations and meanings are nonetheless distinct. (Among minimal pairs that I’ve found, the ones for c and ċ are more drastically different in meaning, which might be why you specifically asked about j and j̈.)

~~However, there are probably no general rules to predict when one phoneme or the other appears.~~ Perhaps there are rules for specific cases, such as “ज represents j̈ in codas of syllables, except in loan words such as ताज” and “loan words from Hindustani preserve j”, but these and other such rules that I observe might be incorrect or difficult to corroborate.

Thus, the distinction between the two phonemes of a pair such as j and j̈ is significant enough to be represented in the transliteration if possible. Assuming that no rules exist to distinguish between the phonemes in each pair, I would suggest adding some kind of argument to select one phoneme over the other, such as {{m|mr|जपणे|1}} to indicate that the first phoneme should be the borrowed j instead of the default j̈. I don’t have any experience with editing modules so I wouldn’t want to mess it up beyond adding test cases. I did notice that the module merely imports {{Module:hi-translit}} without making any language-specific changes. Kutchkutch (talk) 08:27, 18 September 2017 (UTC)

@Aryamanarora Page 5 of Navalkar’s 1880 Marathi grammar book, Page 7 of Navalkar’s 1894 Marathi grammar, and Molesworth’s dictionary [1][2] do provide a rule for distinguishing the two phonemes:

Native and assimilated words: The phonemes ċ, j̈, j̈h occur before अ, आ, उ, ऊ, ऐ, ओ, औ and c, j, jh occur before इ, ई, ए, ्य.

Sanskrit and Hindustani loanwords: Sanskrit and Hindustani loanwords have c, j, jh. Exception: Some Hindustani loanwords have been assimilated, and therefore have ċ, j̈, j̈h before अ, आ, उ, ऊ, ऐ, ओ, औ, as in लचक (laċak).

Persian and Arabic loanwords: ظ ,ض ,ز ,ذ followed by /i/, /iː/, /e/ are j such as जिल्हा (jilhā) (equivalent to Hindi ज़िला (zilā)). Persian and Arabic ج and جا are sometimes j as in जवानी (javānī) and जादू (jādū).

Regional exception: The genitive case marker चा, चे, ची (equivalent to Hindi का, के, की) is always ċ in the Koṅkaṇ region, and the relative pronoun/conjunction जे (comparable to Hindi जो) is always j̈ in the Koṅkaṇ region.
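
The native-word part of this rule, together with the blanket c, j, jh reading for unassimilated Sanskrit and Hindustani loanwords, could be sketched roughly as follows. This is only an illustration in Python rather than the module's own language, and it makes several assumptions that are not in the sources: the function and table names are hypothetical, a consonant followed by no vowel sign is treated as standing before the inherent अ (so coda and word-final positions come out as ċ, j̈, j̈h), and the Persian/Arabic and Koṅkaṇ cases are not handled at all.

```python
# Hypothetical sketch of the rule above; none of these names exist in Module:mr-translit.
PALATAL = {"च": "c", "ज": "j", "झ": "jh"}
ALVEOLAR = {"च": "ċ", "ज": "j̈", "झ": "j̈h"}

# Dependent vowel signs for इ, ई, ए (the vowels that select c, j, jh).
FRONT_SIGNS = {"\u093F", "\u0940", "\u0947"}
VIRAMA = "\u094D"  # ्
YA = "\u092F"      # य


def affricate_reading(word: str, index: int, borrowed: bool = False) -> str:
    """Return the romanisation of the च/ज/झ found at word[index]."""
    letter = word[index]
    if letter not in PALATAL:
        raise ValueError(f"{letter!r} is not one of च, ज, झ")
    if borrowed:
        # Unassimilated Sanskrit and Hindustani loanwords keep c, j, jh.
        return PALATAL[letter]
    following = word[index + 1 : index + 3]
    if following[:1] in FRONT_SIGNS or following == VIRAMA + YA:
        # Before इ, ई, ए or ्य the reading is c, j, jh.
        return PALATAL[letter]
    # Assumption: every other context (inherent अ, other vowel signs,
    # word-final position) takes ċ, j̈, j̈h.
    return ALVEOLAR[letter]
```

With these assumptions, झिजणे gets jh for its first letter (before ि) and j̈ for its third (before the inherent vowel), and जपणे gets j̈ by default or j when `borrowed=True` is passed.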

I have not been able to disprove this rule~~, and I have not seen this rule in any other source, but it’s better than no rule at all~~. (Although sources from the 1800s say ċh is a phoneme, newer sources say that it is extremely rare, so it is not a unique phoneme of Marathi, but it does exist systematically in words such as उत्सव (utsav, “festival”), वत्स (vatsa, “young of any animal”), and मत्सर (matsar, “jealousy”) where त्स represents ċh.) Kutchkutch (talk) 10:06, 18 September 2017 (UTC)

@Kutchkutch: Hmm, perhaps it would be better to just do manual transliteration... I don't know whether the module can be made to tell if a word is in CAT:Marathi terms borrowed from x. Perhaps making an {{mr-IPA}} that takes such an argument as you suggested is a better idea. A random question, do any words have both realizations? And उत्सव (utsav, “festival”) would be /ʊt͡sʰəʋ/ then? Thanks for all the information. —Aryaman ^{(मुझसे बात करो)} 22:05, 18 September 2017 (UTC)

@Aryamanarora Thanks for taking the time to read what I’ve written (including the background info) and thinking about it. I do recognise that this is a bit of a complex issue, so I won’t be disappointed if this cannot be implemented in the module in the near future, but hopefully it can be implemented someday. I did notice that this issue has been discussed at Module_talk:mr-translit. The conclusion of that discussion and your reply seem to be that for now the best way to address this issue is by manually transliterating using the argument {{m|tr=transliteration}} and adding a note in the pronunciation section. Using the rule, I was hoping that the module would be able to detect whether the etymology section of the word’s article (if the article and etymology section exist) contains {{bor}}. Your idea of searching the borrowed category is another good way of performing the first step. If the first step can be overcome, it would be relatively easy to perform the subsequent steps. I have noticed that unfortunately {{mr-IPA}} does not exist yet as it does for Hindi.

Words that I know have one palato-alveolar affricate and one alveolar affricate include चीज (cīj̈) (equivalent to Hindi चीज़ (cīz)), which is a borrowing that uses j̈ to preserve the Persian letter ز, and the noun झीज (jhīj̈, “erosion”) along with its verb form झिजणे (jhij̈ṇe), which are native words ultimately derived from Sanskrit. (The article for झिजणे (jhij̈ṇe) shows its transliteration as jhijaṇe.)

Yes, the consonant cluster त्स as seen in उत्सव (utsav, “festival”) is [t͡sʰ] in IPA. However, [t͡sʰ] is not a phoneme of Marathi, so [t͡sʰ] is a phonetic transcription that is a result of the phoneme /t̪/ followed by the phoneme /s/. Thus, the phonetic rule that would describe this is /t̪s/ → [t͡sʰ]. (उ in Marathi is usually transcribed in IPA as /u/ instead of the Hindustani /ʊ/, and /ʋ/ may behave a bit differently than it does in Hindustani.) Kutchkutch (talk) 23:28, 18 September 2017 (UTC)
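
As a rough illustration, the rule /t̪s/ → [t͡sʰ] amounts to a single rewrite on a phonemic string. The sketch below is in Python, the function name is hypothetical, and the phonemic form used for उत्सव in the usage note is only an assumed example.

```python
def apply_ts_rule(phonemic: str) -> str:
    """Rewrite the phoneme sequence /t̪s/ as the phonetic affricate [t͡sʰ]."""
    dental_t_plus_s = "t\u032As"        # t + combining bridge below, then s
    aspirated_affricate = "t\u0361s\u02B0"  # t + tie bar + s + superscript h
    return phonemic.replace(dental_t_plus_s, aspirated_affricate)
```

Applied to an assumed phonemic string such as `"ut\u032Asəʋ"`, the function returns the same string with the cluster replaced by t͡sʰ; strings without the cluster are returned unchanged.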

@Wyang Is it possible to detect if a page is CAT:Marathi terms borrowed from... and use that information to change the transliteration? —Aryaman ^{(मुझसे बात करो)} 00:56, 20 September 2017 (UTC)

@Aryamanarora It is possible, but it would be a clunky way to solve the unpredictability issue, and is potentially buggy. How often do such issues arise, and how often does a singular Marathi spelling correspond to multiple pronunciations? Depending on the situation, one of two solutions can be used: (1) an automatic romanisation module and the exceptions handled manually; (2) (if it is quite irregular) a Thai-like system with respellings stored in the entry. Wyang (talk) 01:49, 20 September 2017 (UTC)

@Wyang Thanks for your input as well. Three characters (च, ज, झ) each have two contrastive pronunciations, and while they are probably not the three most frequent characters, they’re still fairly frequent (च and ज are probably equally frequent while झ is perhaps a bit less frequent than the other two). The issue occurs every time one or more of those three characters appears in a word. Kutchkutch (talk) 09:22, 20 September 2017 (UTC)

@Wyang Personally I'm advocating for manual transliteration now, but the exceptions are quite common I think. Maybe {{mr-IPA}} could show the phonetic romanization manually and we use the translit module only for direct romanization? In that case {{mr-new}} could be made to automatically look for {{bor}} and add the phonetic respelling to {{mr-IPA}}, which would be easier. —Aryaman ^{(मुझसे बात करो)} 11:52, 20 September 2017 (UTC)

@Aryamanarora, Kutchkutch I'm a little confused... by direct romanization do you mean romanizing from the phonetic respelling, Aryaman? Wyang (talk) 12:00, 20 September 2017 (UTC)

@Wyang: Yes, meaning this module should not handle c/ċ, j/j̈, and jh/j̈h allophony. Rather, that kind of information would be provided to {{mr-IPA}}. That was what DerekWinters suggested at Module talk:mr-translit as well. —Aryaman ^{(मुझसे बात करो)} 12:36, 20 September 2017 (UTC)

@Aryamanarora So... is this correct: Words with irregular pronunciations will have a respelling stored as an argument in {{mr-IPA}} on their entry, otherwise {{mr-IPA}} is used without a respelling argument. When {{m|mr}} or {{l|mr}} is used with Marathi word A, it will see if the entry for A has a respelling B on the entry, in the pronunciation template; if so it will relay B to Module:mr-translit for 'regular' romanisation, if not it will relay A to Module:mr-translit for romanisation. And when creating new entries, {{mr-new}} will attempt to generate a respelling automatically if the title contains any of the ambiguous letters and the etymology says it's a borrowing, but one can also provide a respelling manually to override that. Wyang (talk) 12:46, 20 September 2017 (UTC)
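
The flow described in this comment might look roughly like the sketch below. It is written in Python purely for illustration; every name in it is hypothetical, the stored respellings are modelled as a plain mapping from entry title to respelling, and the romaniser and the respelling generator are passed in as stand-in functions because neither exists in this form.

```python
from typing import Callable, Mapping, Optional

AMBIGUOUS_LETTERS = frozenset("चजझ")


def text_for_translit(word: str, respellings: Mapping[str, str]) -> str:
    """Return the stored respelling B for word A if there is one, else A itself."""
    return respellings.get(word, word)


def romanise_link(
    word: str,
    respellings: Mapping[str, str],
    translit: Callable[[str], str],
) -> str:
    """What {{m|mr}} or {{l|mr}} would do: relay B (or A) to the romaniser."""
    return translit(text_for_translit(word, respellings))


def respelling_for_new_entry(
    title: str,
    is_borrowing: bool,
    generate: Callable[[str], str],
    manual: Optional[str] = None,
) -> Optional[str]:
    """What {{mr-new}} would do when an entry is created."""
    if manual is not None:
        # A manually supplied respelling always overrides the automatic one.
        return manual
    if is_borrowing and any(ch in AMBIGUOUS_LETTERS for ch in title):
        return generate(title)
    # No respelling is stored; the romaniser will work from the spelling.
    return None
```

The point of this arrangement is that the romaniser itself stays regular: all the irregularity lives in the stored respellings, and entries without a respelling simply fall back to the spelling.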

@Wyang: Yes, that's right. (Sorry, don't have time right now) —Aryaman ^{(मुझसे बात करो)} 13:32, 20 September 2017 (UTC)

I agree that this phenomenon is allophony at its core, for native and assimilated words. The allophones become individual phonemes when the loanwords are taken into consideration.

It was a bit hard at first to follow the discussion after my last post. What I understand is that {{mr-IPA}} would have an optional argument to specify a manual phonetic transcription. If the optional argument is not present and if the word does not contain any of the ambiguous letters, {{mr-IPA}} would use its own rules to create an auto-generated phonetic transcription for the whole word. If the optional argument is not present and if the word contains any of the ambiguous letters, {{mr-new}} would be given the word and would return an auto-generated phonetic transcription for any ambiguous letters to {{mr-IPA}} by checking if the word is a borrowing and applying the rule. {{mr-IPA}} would then create an auto-generated phonetic transcription for the unambiguous letters using its own rules and combine that with what {{mr-new}} returned to create an auto-generated phonetic transcription for the whole word. {{mr-IPA}} would give its phonetic transcription to {{Module:mr-translit}}, and {{Module:mr-translit}} would have a way to create the transliteration based on the phonetic transcription that was passed to it.

When {{m|mr}} or {{l|mr}} is used on a word that has a phonetic transcription stored in {{mr-IPA}}, {{Module:mr-translit}} would transliterate the word based on the phonetic transcription. When {{m|mr}} or {{l|mr}} is used on a word that has no phonetic transcription stored in {{mr-IPA}} (because the entry for the word was created before the algorithm was implemented), {{Module:mr-translit}} would transliterate the word based on orthography. Kutchkutch (talk) 08:58, 21 September 2017 (UTC)

Other Rules

@Aryamanarora Thanks for starting to implement some of the relatively more straightforward rules. Kutchkutch (talk) 09:22, 20 September 2017 (UTC)

Anusvāra

The anusvāra rule is divided into two classes according to Page 17 of Navalkar’s 1880 Marathi grammar book. The sequences covered by the rule are of the form C₁V₁V₂(C₂)C₂. The most common consonant for C₁ appears to be स. V₁ is either a vowel that is indicated in the orthography or the inherent schwa. V₂ is the transliteration of the anusvāra. C₂ determines the class.

Provincial Class: Anusvāra before र, श, ष, स

In the Provincial Class, the anusvāra is transliterated as u.

Classical Class: Anusvāra before य, ल, व

In the Classical Class, the anusvāra is transliterated as u (except before य, where nasalisation is represented with i since [j] is a semivowel that is more closely associated with i than u). In addition, य, ल, व are doubled to imitate the Sanskrit pronunciation of the word.
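
The two classes could be sketched as a small lookup on C₂, as below. This is a Python illustration only: the names are hypothetical, the letters u and i are taken exactly as written above, and the function only reports how the anusvāra is written and whether C₂ is doubled, without building a full transliteration.

```python
from typing import Optional, Tuple

PROVINCIAL_C2 = frozenset("रशषस")  # anusvāra before र, श, ष, स
CLASSICAL_C2 = frozenset("यलव")    # anusvāra before य, ल, व


def anusvara_before(c2: str) -> Optional[Tuple[str, bool]]:
    """Return (V2, double_c2) for an anusvāra standing before consonant c2.

    Returns None when c2 belongs to neither class, i.e. when this rule
    does not apply.
    """
    if c2 in PROVINCIAL_C2:
        return ("u", False)
    if c2 in CLASSICAL_C2:
        # Before य the nasalisation is written i; before ल and व it is u.
        # In the Classical Class C2 is also doubled.
        return ("i" if c2 == "य" else "u", True)
    return None
```

For example, `anusvara_before("स")` gives `("u", False)` and `anusvara_before("य")` gives `("i", True)`, while a consonant outside both classes, such as क, gives `None`.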

Eyelash र

‘र्‍’ is the ‘eyelash र’, the form that ‘र’ takes when it occurs as the first constituent of a consonant cluster in the onset of a syllable, as in ‘ऱ्हास’, which means ‘decrease or decay’. (Unicode encodes ‘eyelash र’ as ‘ऱ्’ using ‘ऱ’.)