Module talk:fa-cls-translit

From Wiktionary, the free dictionary

Auto transliteration for Classical Persian

@Ariamihr, Atitarev, Benwing2, Dijan, Mazsch, Rodrigo5260, ZxxZxxZ, Saranamd

I was able to reuse the ur-translit module for Classical Persian, since Urdu's diacritics are based on Classical Persian. It's a little messy, but I was able to get all of these test cases working; the only problem is that I can't hide the word-final -he. Since I was only copying and pasting code from other modules (I can't write any code myself), this was the best I could do. I couldn't help with Iranian Persian because its vocalization system is pretty uncommon. Maybe the ezafe stuff can be used there?

This can't be used in entries yet, since they don't currently support multiple transliterations. In the meantime, can it be used in etymologies for words borrowed from Classical Persian (after this template is reviewed, of course)? سَمِیر | sameer (talk) 18:28, 11 July 2023 (UTC)

Activation

@Benwing2 Hi, the module now supports diacritic detection (I made it even stricter than Urdu's) and Arabic al- assimilation, and it has no known errors. Could you turn on this module as the transliteration module for "fa-cls" so I can test it? (And also for "prs", "haz", and "tg", since there is no point in making specific ones for them.) It would help me a lot because I've been working on adding support for vocalized text input to Module:fa-IPA (see Module:fa-IPA/romanize). I can test that regardless, but it would be easier to use the simple transliterate command.


(I'll work on the Iranian one soon, but do you think that, for "fa" links, it would be possible to use Module:fa-IPA/romanize with this module to generate both readings from the classical vocalization?) سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 00:00, 8 November 2023 (UTC)

@Sameerhameedy Hi. Do you have any test cases that I can use to verify that the translit module works? They should have examples of vocalized and unvocalized text, at least for fa-cls, fa alone, and fa-ira. As for using Module:fa-IPA/romanize to generate both readings, can you explain what you mean? Benwing2 (talk) 03:09, 8 November 2023 (UTC)Reply
@Benwing2 There are test cases for this module on the documentation page. They include unvocalized text, plus some test cases for words with Arabic al- and some rare dialectal characters.
Module:fa-IPA/romanize is a different thing: it's supposed to generate the classical, Dari, Iranian, and Tajik readings from the classical vocalization. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 03:39, 8 November 2023 (UTC)
@Sameerhameedy Thanks. By "reading" you mean transliteration? If so, how does it differ from Module:fa-cls-translit and such? Does it supersede them? Also as discussed before, we don't yet have support for multiple transliterations attached to a single source term. We'd need to come up with a design for how to display them as well as how to allow them to be input into the tr= param. If I'm misunderstanding what your question was about, please clarify, thanks! Benwing2 (talk) 03:55, 8 November 2023 (UTC)Reply
@Sameerhameedy, @Benwing2: Indeed, what Sameer is asking is not very clear.
We have multiple language codes for Persian (excluding Tajik "tg"): "fa", "fa-ira", "fa-cls", "prs".
I understand Sameerhameedy wants "fa-cls" enabled for entries using "fa-cls" code. Please confirm. دَقِیقَه (daqīqa) should produce "daqīqa"
I also wonder what transliteration "fa" should be using, it doesn't have a module for it. Anatoli T. (обсудить/вклад) 04:02, 8 November 2023 (UTC)Reply
@Benwing2 Yes, I mean transliteration. Module:fa-IPA uses Module:fa-cls-translit. In practice, it gets the transliteration from fa-cls-translit and converts it to the Iranian transliteration.
To give an example, I can take the text:
کَبوتَر عِشْقِ مَن نَشَسْتَه روحِ دَسْتِت چِقَدْر نَوَازَش دَارَد نِگَاه شُوخِ مَسْتِت

and Module:fa-cls-translit can use that to get
kabōtar išq-i man našasta rōh-i dastit čiqadr nawāzaš dārad nigāh šūx-i mastit

and Module:fa-IPA/romanize can use that to get
Module error: No such module "fa-IPA/romanize".

Both of those were generated from the same vocalization; it would be cool if we could use this generally. It may need overriding occasionally, but it should be correct most of the time.
- سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 04:15, 8 November 2023 (UTC)Reply
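
The idea of deriving a second reading from the classical transliteration can be sketched as a single-pass character mapping. This is only an illustration, not the code of Module:fa-IPA/romanize (which is not shown in this thread): the function name `classical_to_iranian` and the vowel table are assumptions, inferred from classical/Iranian pairs quoted in this discussion such as muğūlistān/moğulestân.

```python
import unicodedata

# Hypothetical classical -> Iranian correspondences, for illustration only.
# This is NOT the table used by any Wiktionary module.
_PAIRS = {
    "i": "e", "u": "o",   # short vowels
    "ī": "i", "ū": "u",   # long vowels
    "ē": "i", "ō": "u",   # classical majhul vowels merge in Iranian Persian
    "ā": "â",
    "w": "v",
}

# Normalise keys and values to NFC so every key is a single code point.
_TABLE = str.maketrans({
    unicodedata.normalize("NFC", k): unicodedata.normalize("NFC", v)
    for k, v in _PAIRS.items()
})


def classical_to_iranian(translit: str) -> str:
    """Map a classical transliteration to an approximate Iranian one.

    str.translate works in one pass, so "ū" -> "u" is not then turned
    into "o" by the "u" -> "o" rule.
    """
    return unicodedata.normalize("NFC", translit).translate(_TABLE)


print(classical_to_iranian("muğūlistān"))  # moğulestân
```

The sketch deliberately ignores word-final -a, the xw cluster, and lexical exceptions, which is where manual overrides would still be needed.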
@Sameerhameedy: It seems you're trying to produce non-classical transliteration from classical. (fa-cls) دَقِیقَه (daqīqa) should give "daqīqa". Can you give examples to @Benwing2 for "fa", "fa-ira" and "prs"? I am not clear what you're requesting. Anatoli T. (обсудить/вклад) 04:23, 8 November 2023 (UTC)Reply
@Atitarev I think Sameer's proposal is to use the classical vocalization everywhere, and transliterate it differently according to the particular language code (maybe?). Benwing2 (talk) 04:24, 8 November 2023 (UTC)Reply
@Benwing2: that's my guess too. Does that mean that Iranian Persian will use Classical vocalisation? Anatoli T. (обсудить/вклад) 04:26, 8 November 2023 (UTC)Reply
@Benwing2: BTW, I support activation too, it's long overdue but it's confusing how it's going to work.
It actually seems easier to work with a transliteration module for each variety, rather than trying to get a classical transliteration first and work off that for each individual variety.
The Latin to Perso-Arabic conversion used by {{fa-IPA}} works well from classical to e.g. Dari or Iranian, but the Perso-Arabic to Latin conversion should be split, even though there are similarities. Just my opinion, if I understand the intentions correctly. Anatoli T. (обсудить/вклад) 04:32, 8 November 2023 (UTC)
@Atitarev @Benwing2 Well, I was mainly talking about usage examples, quotes, and (possibly) links, where it's possible to generate both. In Atitarev's example, a link would hypothetically say دَقِیقَه (daqīqa/Module error: No such module "fa-IPA/romanize".). (We can also hide diacritics or change the order.) Besides the obvious benefit of generating both transliterations, I have been struggling with fa-ira-translit due to Iranian Persian's inconsistent usage of diacritics. In مغولستان and تو, for example, the vaav/waaw implies a long vowel, but in Iran it is really pronounced with a short vowel (according to their pronunciation sections). I suspect Iranian Persian's inconsistent vowel notation is why an Iranian Persian transliteration module was never created before. In fact, ZxxZxxZ brought this up before and mentioned that Iranian dictionaries don't typically include vocalizations, usually just a Latin transcription. The ones I have seen that do include a vocalization use some elements of classical vocalization, such as classical diphthongs. - سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 04:48, 8 November 2023 (UTC)
@Sameerhameedy:
In both cases مُغولِسْتان (moğulestân) or تو (tu) will produce the incorrect "moğulestân" and "tu", which is fine, as you said yourself, it will work for 90% of cases. We will need to manually enter |tr= مُغولِسْتان (moğolestân) and تو (to).
Cases where Arabic "ي" and "و" are used to represent short vowels in Arabic are common as well, especially in loanwords. They are manually transliterated.
Modern Persian vocalisations are used but are less common. I use resources that do include them. Adding vocalisations is very beneficial to learners, and contrasting classical with modern Iranian will give extra insight into how they evolved and are used. Anatoli T. (обсудить/вклад) 05:07, 8 November 2023 (UTC)
@Benwing2: Hi. Re diff
I have tried both "fa" and "fa-cls": مُغُولِسْتان or مُغُولِسْتان. No transliteration is displayed yet. Anatoli T. (обсудить/вклад) 05:28, 8 November 2023 (UTC)Reply
@Atitarev مُغُولِسْتَان (muğūlistān) works. It seems the module requires fatḥa before a long alif. Benwing2 (talk) 05:37, 8 November 2023 (UTC)Reply
@Sameerhameedy: Was your intention to require a zabar (fatha)? Anatoli T. (обсудить/вклад) 05:52, 8 November 2023 (UTC)Reply
@Atitarev Yeah, I decided to do that to prevent false positives, since semivowels can be unvoweled. I guess it's not written as often in Persian as in Arabic, so I can undo it, but I thought it would be better to make the diacritic requirements as strict as possible and relax them if needed, rather than tightening them later. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 05:57, 8 November 2023 (UTC)
@Sameerhameedy: Ok, no worries. The {{fa-IPA}} is behind then, since it's not inserting a zabar (fatha) at مُغُولِسْتَان (muğūlistān). Anatoli T. (обсудить/вклад) 05:59, 8 November 2023 (UTC)Reply
Yes I know, I will be working on fa-IPA as a whole later but it will probably be a while before I can upload the drafted version I have in my userspace. سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 19:10, 8 November 2023 (UTC)Reply
@Atitarev True, but it would probably need the same amount of overriding regardless, which is why I think it may be worth trying to implement something like that. It would also help with a lot of the issues preventing us from activating transliteration for "fa" links. If we did have "fa" generate multiple transliterations, we could probably safely exclude Dari from "fa" links, since classical gives more information than Dari does, so classical would be more helpful to the reader. - سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 19:53, 8 November 2023 (UTC)

Why is zabar (fatha) before alif required?

Hi @Sameerhameedy:

Why is zabar required before the alif? Even in Arabic it's not always required: compare Arabic اِنْقِلَاب (inqilāb) and اِنْقِلاب (inqilāb), which both work, whereas Classical Persian اِنْقِلَاب (inqilāb) works and اِنْقِلاب (without the zabar) doesn't. What are the cases that require the zabar? Anatoli T. (обсудить/вклад) 00:20, 17 November 2023 (UTC)

It always requires zabar to prevent nil translits, since, unlike in Arabic, semivowels in Persian don't need a diacritic. Without that requirement, words like باکو would transliterate despite not being vocalized. The Iranian module doesn't require zabar. - سَمِیر | Sameer (مشارکت‌هاکتی من گپ بزن) 22:08, 17 November 2023 (UTC)
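
The strictness rule described here can be sketched as a simple per-word check. This is an illustration under stated assumptions, not the module's actual logic: the function name `medial_alifs_have_zabar` is hypothetical, and the sketch only looks at the character immediately before each non-initial alif.

```python
ALIF = "\u0627"   # ا
FATHA = "\u064E"  # zabar / fatha


def medial_alifs_have_zabar(word: str) -> bool:
    """Return True if every non-initial alif directly follows a zabar.

    A word containing no medial alif passes trivially; a real module
    would need further checks before treating such a word as vocalized.
    """
    return all(
        word[i - 1] == FATHA
        for i, ch in enumerate(word)
        if ch == ALIF and i > 0
    )


# اِنْقِلَاب (zabar written before the alif)
with_zabar = "\u0627\u0650\u0646\u0652\u0642\u0650\u0644\u064E\u0627\u0628"
# اِنْقِلاب (zabar omitted)
without_zabar = "\u0627\u0650\u0646\u0652\u0642\u0650\u0644\u0627\u0628"

print(medial_alifs_have_zabar(with_zabar))     # True
print(medial_alifs_have_zabar(without_zabar))  # False
```

Under a rule like this, a transliterator would return nil whenever the check fails, which is how a fully unvocalized word such as باکو gets rejected instead of being transliterated from its bare letters.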
Hi @Sameerhameedy:
I don't think I agree but can we please look at those cases? I'd remove that requirement and work with you on those cases.
Could you list or give examples of the false positives that you mentioned in the topic above? Are they to do with the Arabic الـ or ـة?
This is what I think should happen with various combinations. Do you agree with these cases?
  1. باب should give "bāb"
  2. اب should give (NIL)
  3. اَب should give "ab"
  4. اُب should give "ub"
  5. اِب should give "ib"
  6. آب should give "āb"
  7. بِاب should give (NIL)
  8. بُاب should give (NIL)
  9. بَ اب should give (NIL)
  10. بِ اب should give (NIL)
  11. بُ اب should give (NIL)
  12. بَ ٱب should give "ba b"
  13. بِ ٱب should give "bi b"
  14. بُ ٱب should give "bu b"
@Benwing2, we may need some of your help. I think the requirement to use zabar (fatha) before alif is redundant but let's see what cases Sameer has in mind. Anatoli T. (обсудить/вклад) 08:30, 20 November 2023 (UTC)Reply
@Atitarev I actually see why Sameer is requiring it and it seems reasonable to me; the idea is that otherwise there may be no way to determine whether someone was intentionally leaving off a diacritic before a semivowel or just not vocalizing a word. There will still be cases where the determination can't be made, but at least there will be fewer of them. Ideally we would have diacritics before semivowels in all cases but apparently that's not possible, so requiring the fatha before alif is the next best thing we can do. Benwing2 (talk) 08:56, 20 November 2023 (UTC)Reply
@Benwing2: Thanks, it makes sense to add these restrictions for the semivowels و and ی but not for the alif. If it's meant to be two words or a compound instead of one word, then the parts are separated by a space or a ZWNJ. If alif is a short vowel (اَ, اُ or اِ), then it's only at the beginning of a word or after a ZWNJ. I have already listed cases where it should definitely be (NIL). Anatoli T. (обсудить/вклад) 09:14, 20 November 2023 (UTC)

Should Dari be using this module or should it use a different vocalisation?

@Sameerhameedy: I am less familiar with Dari, but is it correct to use the Classical Persian translit module for Dari? خْوَدْکُشِی (xwadkušī) gives "xwadkušī", but shouldn't it be "xudkušī" for Dari specifically? Or should the vocalisation be خُودْکُشِی (xūdkušī), resulting in "xūdkušī"? Or is it an exception? Dari translit was added by @Fenakhay. Anatoli T. (обсудить/вклад) 03:19, 29 November 2023 (UTC)

@Atitarev Hi, I spent a while writing the usage notes, which already answer this question. Writing it as خوَد (xud) works. - سَمِیر | Sameer (مشارکت‌ها · بحث) 03:24, 29 November 2023 (UTC)

Dari پنسل - pensel or pinsil

@Sameerhameedy: Hi.

Does a short "e" occur in Dari and is it common, e.g. پِنْسِل (pinsil) should actually be "pensel", not "pinsil"? Anatoli T. (обсудить/вклад) 18:09, 2 December 2023 (UTC)Reply

@Atitarev My family usually pronounces this as "pensil"; it's just a weird irregularity of English transliterations. There's no distinction between a short "i" and "e", and the pronunciations "pinsel" and "pinsil" would not be perceived as different. So in my opinion this should just always be "i", since it's not even distinguished. - سَمِیر | Sameer (مشارکت‌ها · بحث) 22:00, 2 December 2023 (UTC)
@Sameerhameedy: OK, thanks. I understand the transliteration can use "i" but pronunciation can vary. I've made both Dari پِنْسِل (pinsil) and Urdu پِنْسِل (pinsil) entries. Pls check. Anatoli T. (обсудить/вклад) 01:24, 3 December 2023 (UTC)Reply

hamze (hamza)

@Sameerhameedy: Hi.

Should some cases require an alif with a hamza in Persian, as in مُتَأَهِّل (muta'ahhil), to separate the vowels? Maybe for transliteration and header purposes only, so that it actually links to متاهل (the page title being without the hamza)? That is, make أ link to ا, resulting in the following display and link: مُتَأَهِّل (muta'ahhil). I think the case is different from سُؤَال (su'āl), where hamza is standard in writing. Anatoli T. (обсудить/вклад) 11:01, 6 December 2023 (UTC)

@Sameerhameedy: I went ahead and created the Persian entry مُتِأَهِّل (mote'ahhel) with letter أ.
Perhaps we should add tracking or usage notes, since hamzated alif أ, إ are rarely used in Persian but may be useful for transliteration policies in a stricter spelling?
@Erutuon, @Benwing2, @Fenakhay: Any thoughts? Should we add tracking for أ, إ? Is the letter أ even valid in the Persian orthography?
We should have tracking categories like Category:Persian terms spelled with أ, Category:Persian terms spelled with إ, Category:Persian terms spelled with ة, etc. but I assume it may only happen when the Persian headword is modularised. Anatoli T. (обсудить/вклад) 03:38, 7 December 2023 (UTC)Reply
@Atitarev This is actually determined by a setting in Module:languages/data/2. If we set standardChars for Persian to a string containing the normal characters of the language, any headwords with any other characters will be added to such categories. Can you make a list of the normal characters in Persian? It will be similar to but not the same as the "normal" characters for Arabic. It appears BTW that the standardChars string does not need to contain diacritics in it, as these are stripped out before checking the headword for nonstandard characters. Benwing2 (talk) 04:19, 7 December 2023 (UTC)Reply
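
The check described above (strip diacritics, then flag any headword character outside a "standard" set) can be sketched as follows. This is an assumption-laden illustration, not the code behind standardChars in Module:languages/data/2: the function name `nonstandard_chars` is hypothetical, and the character list is a placeholder rather than the list requested in this thread.

```python
import unicodedata

# Placeholder set for illustration only; NOT the real standardChars value,
# which had not been drawn up at this point in the discussion.
HYPOTHETICAL_STANDARD_CHARS = "آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهیءئؤ"


def nonstandard_chars(headword, standard_chars=HYPOTHETICAL_STANDARD_CHARS):
    """List the characters of a headword that fall outside the standard set.

    Combining marks (Unicode category Mn, which covers the Arabic vowel
    diacritics) are dropped first, so the set itself needs no diacritics.
    """
    stripped = "".join(c for c in headword if unicodedata.category(c) != "Mn")
    found = []
    for c in stripped:
        if c not in standard_chars and not c.isspace() and c not in found:
            found.append(c)
    return found


# مُتَأَهِّل, spelled with a hamzated alif (U+0623)
word = "\u0645\u064F\u062A\u064E\u0623\u064E\u0647\u0651\u0650\u0644"
print(nonstandard_chars(word))  # ['أ']
```

Each character returned would correspond to one tracking category of the kind "Persian terms spelled with أ".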
@Benwing2: Thanks! I will try that. I can see that Module:th-headword adds the tracking in the module itself. Module:languages/data/2#L-2127 doesn't have standardChars. Anatoli T. (обсудить/вклад) 04:54, 7 December 2023 (UTC)
@Atitarev Yeah this is yet another case of Wyang's inability to work with standard mechanisms; everything for the East Asian languages has been created bespoke for those particular languages. Benwing2 (talk) 05:01, 7 December 2023 (UTC)Reply
@Benwing2: I appreciate what he did for East Asian languages. No-one's perfect. Anatoli T. (обсудить/вклад) 07:17, 7 December 2023 (UTC)Reply
@Atitarev Indeed; I may be too harsh on him. My perspective is of a coder having to rewrite all the code because it wasn't written in a maintainable fashion. Benwing2 (talk) 07:38, 7 December 2023 (UTC)Reply
@Atitarev The only person I was able to ask who was actually educated in Afghanistan (my dad) said أ or إ are not letters he was taught in school (but he finished his education like a billion years ago). It's really hard to figure out what "standard" Dari is, since the Afghan government doesn't really publish this stuff on the internet. I try to ask my family who were educated in Afghanistan but they went to school years ago and don't remember everything.
BTW can you change Module:kk-translit to Auto-patrollers only? - سَمِیر | Sameer (مشارکت‌ها · بحث) 09:18, 7 December 2023 (UTC)Reply
@Sameerhameedy I changed the protection. Benwing2 (talk) 09:41, 7 December 2023 (UTC)Reply
@Sameerhameedy: OK, thanks, let's sit on it. I will wait for more information and ask other people. It may be (extremely) rare, but there is no other native tool to render the "a'a" separation. The Wikipedias (both English and Persian) list أ, and there is an entry fa:متأهل. Anatoli T. (обсудить/вклад) 10:42, 7 December 2023 (UTC)
@Atitarev IMO it's fine to use أ and إ for Dari; we can just normalize them in entries to not have the hamza. Vowel diacritics in general aren't "normal" for any variety of Persian but that doesn't mean we can't use them. Benwing2 (talk) 22:31, 7 December 2023 (UTC)Reply
@Benwing2: Great idea! It is important to find whether they are ever used in classical and Iranian Persian as well, not just Dari. Rather than hiding the hamza, perhaps, they should be used in the headword but not in the page title, as in my post: مُتَأَهِّل (muta'ahhil). Headword=مُتَأَهِّل (with أ), page title=متاهل (with ا)? Unless it's completely abnormal for Persian to use these characters anywhere. Anatoli T. (обсудить/вклад) 22:41, 7 December 2023 (UTC)Reply
I guess it's not abnormal. I just found an example in "Colloquial Persian" (published), which is almost the same as the one I added in the entry:
شُما مُتَأَهِّل هَسْتید؟ (šomâ mota'ahhel hastid?, "Are you married?")
I searched for "متأهل" "شما" in Google books. There are more hits. The spelling "متأهل" is more likely to be used when teaching Persian to foreigners, judging by the Google searches. Anatoli T. (обсудить/вклад) 23:03, 7 December 2023 (UTC)Reply
@Atitarev In that case I think it will work fine to have the hamzated alifs normalized to non-hamzated ones, unless we are planning on creating page names that have hamzated alifs in them. Benwing2 (talk) 23:09, 7 December 2023 (UTC)Reply
@Benwing2: I don't quite get it. Could you give an example please? I have edited both متاهل and متأهل. You can demonstrate using those.
Are you suggesting to have متاهل as the correct title with display "مُتَأَهِّل"? Anatoli T. (обсудить/вклад) 23:13, 7 December 2023 (UTC)Reply
Forgot to ping you on my findings. My searches in Google books confirm that "أ" can be used occasionally. This must be a case. @Sameerhameedy, @Benwing2 Anatoli T. (обсудить/вклад) 23:09, 7 December 2023 (UTC)Reply
@Atitarev Right, but what I'm saying is, is it correct to enter words like متأهل as Persian words? Currently we have Persian entries for both متأهل and متاهل, but this seems non-ideal. It might be best to normalize on the non-hamzated form, and if we have an entry for the hamzated equivalent at all, it should be a soft redirect. Maybe Sameer can comment more. Benwing2 (talk) 23:13, 7 December 2023 (UTC)Reply
@Benwing2: I see now what you mean. Yes, perhaps. Ignore my question, since you have explained.
I created "أ" entry to have a case. I may convert it to a soft redirect. Anatoli T. (обсудить/вклад) 23:15, 7 December 2023 (UTC)Reply
And we should enable tracking later, so these can be handled in bulk. Anatoli T. (обсудить/вклад) 23:18, 7 December 2023 (UTC)Reply
@Atitarev @Benwing2 Looking at some dictionaries from Afghanistan, I can confirm the characters do appear in some words, though not all dictionaries have them. It just seems that, at least in Afghanistan, these characters do exist but are only used hyperformally. Tolo News (the largest news agency in Afghanistan) occasionally uses these letters but tends to use a normal alif more frequently. - سَمِیر | Sameer (مشارکت‌ها · بحث) 02:18, 8 December 2023 (UTC)
@Atitarev @Benwing2 perhaps we should have "hamza-less" forms similar to "nuqta-less forms" in Hindi? - سَمِیر | Sameer (مشارکت‌ها · بحث) 02:23, 8 December 2023 (UTC)Reply
@Sameerhameedy See comment I just posted below, we need to decide where to lemmatize the terms. It depends IMO on what other dictionaries normally do and what will help speakers more; the nuqta-full forms are used for the lemmas in Hindi because speakers are familiar with them and dictionaries tend to lemmatize on them, but this may be different in the case of Persian. Benwing2 (talk) 02:25, 8 December 2023 (UTC)Reply
@Sameerhameedy Thanks. If these characters only occur occasionally in dictionaries and not regularly, this does suggest to me that we should follow the approach I mentioned above, which is to lemmatize on the non-hamzated version, map hamzated versions to non-hamzated versions in makeEntryName() (akin to removing diacritics in Arabic and acute accents in Russian), and create soft redirects from hamzated to non-hamzated versions. This is the opposite of the approach used in Arabic (with hamzated alif) and in Russian (with the ё character), but in both situations we're following the normal dictionary convention. Benwing2 (talk) 02:23, 8 December 2023 (UTC)
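
The normalization proposed here can be sketched as a function that derives the page title from the displayed headword. This is a minimal illustration, assuming only the two steps named in the comment (drop diacritics, map hamzated alifs to plain alif); `make_entry_name` is a hypothetical Python stand-in, not the makeEntryName() implementation used on Wiktionary.

```python
import unicodedata

# أ (U+0623) and إ (U+0625) both map to plain alif ا (U+0627).
HAMZATED_TO_PLAIN = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627"})


def make_entry_name(display_text: str) -> str:
    """Derive a page title: drop combining marks, then de-hamzate alifs."""
    without_marks = "".join(
        c for c in display_text if unicodedata.category(c) != "Mn"
    )
    return without_marks.translate(HAMZATED_TO_PLAIN)


# Display form مُتَأَهِّل -> page title متاهل
display = "\u0645\u064F\u062A\u064E\u0623\u064E\u0647\u0651\u0650\u0644"
print(make_entry_name(display))  # متاهل
```

With a mapping like this, a link written with أ would still land on the non-hamzated page, and a page titled with the hamzated spelling would only need to be a soft redirect.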
@Benwing2, @Sameerhameedy: These letters are on the Persian standard keyboard: إ ,أ ,ة, even Arabic only letters ي and ك (you need to use a shift to see them).
https://help.keyman.com/keyboard/persian11/1.1/persian11
I'm fine with both ways to lemmatise. Anatoli T. (обсудить/вклад) 03:26, 8 December 2023 (UTC)Reply
@Atitarev Not sure this says a lot; at least on Mac OS, the QWERTY-compatible Russian keyboard I use has Ukrainian-only and other non-Russian Cyrillic characters on it, available using the Option/Alt key (since the Shift key is used for uppercase). Benwing2 (talk) 04:33, 8 December 2023 (UTC)Reply
@Benwing2, @Sameerhameedy:
Hi. Just an update: I swapped the entries based on the reference provided by @ZxxZxxZ on his talk page (I pinged you there as well). Anatoli T. (обсудить/вклад) 22:11, 9 December 2023 (UTC)Reply
@Atitarev Yup, I saw that; this is fine with me. Benwing2 (talk) 22:22, 9 December 2023 (UTC)Reply

خوابیدن

@Sameerhameedy:

Hello. I couldn't find this particular case for Dari. How do I get "xābīdan"? Manually only? Is the vocalisation خَوابِیدَن (xābīdan) correct?

  1. Classical Persian: خْوَابِیدَن (xwābīdan)
  2. Dari: خَوابِیدَن (xābīdan)?
  3. Iranian Persian: خوابیدَن (xâbidan)

Anatoli T. (обсудить/вклад) 08:44, 30 April 2024 (UTC)Reply