Module talk:he-translit/old

From Wiktionary, the free dictionary
Latest comment: 3 years ago by Erutuon in topic Tokenization approach

Test cases

@Erutuon, if you are going to continue editing this module, could you create some test cases for the Biblical and Modern transliterations? —JohnC5 18:40, 28 February 2017 (UTC)

@Erutuon: I just added (and corrected) a whole bunch of test cases, all of which are theoretically doable, but demonstrate the difficulty of the problem. --WikiTiki89 20:28, 28 February 2017 (UTC)Reply
@Wikitiki89: Thanks for the testcases (and corrections). I couldn't think of all the exceptions to the rules that I had already added. — Eru·tuon 23:42, 28 February 2017 (UTC)Reply
@Wikitiki89: No problem. This regex replacement cycle solution can only go so far. At some point we're gonna have to do proper parsing. For example, in מְקוּוּוֹת (məquwwōṯ) (which is a rather unlikely spelling of מְקֻוּוֹת), the value of each of the three vavs depends on the value of the preceding letter, thereby necessitating some sort of state variable as you move forward in the word (because if the ק had had its own vowel, the value of all three vavs would change, and a stateless implementation would only be able to handle the first vav). --WikiTiki89 00:17, 1 March 2017 (UTC)Reply
@Wikitiki89: It's probably less elegant to handle waw with regex, but it might be possible. I guess I'll find out. I'm attempting to deal with מִצְוֹת (miṣwōṯ) at the moment. — Eru·tuon 01:36, 1 March 2017 (UTC)Reply
Theoretically, it's impossible, but in practice you might be able to make it work for all spellings that are expected to occur in the real world, but it would also be much more error-prone. I thought the only thing left for מִצְוֹת (miṣwōṯ) was to assume a silent shva after short vowels. --WikiTiki89 03:30, 1 March 2017 (UTC)Reply
I'm sure you can handle the מִצְוֹת (miṣwōṯ) just fine if you try special-casing 613 different rules for them... —Μετάknowledgediscuss/deeds 03:39, 1 March 2017 (UTC)Reply
@Metaknowledge: You can probably save some time by splitting up into the two groups that share functionality, מצוות עשה and מצוות לא תעשה. --WikiTiki89 03:50, 1 March 2017 (UTC)Reply
@Erutuon: I noticed a lot of the changes you made since my last edit have added a lot of overly specific and not generally correct rules. You also removed some of the features I added. Perhaps we should communicate better about what the best generalizations are. I've already spent several years thinking about these things. --WikiTiki89 03:50, 1 March 2017 (UTC)Reply
@Wikitiki89: Yeah, I'm growing unsatisfied with much of the regex I'm adding. It's very inelegant and confusing handling dagesh and mappiq, shewa, waw, and silent letters with regex. It would be better to do it in some other way. — Eru·tuon 04:55, 1 March 2017 (UTC)Reply

Explicit numbering

@Erutuon: If you add comments with the number, that kind of defeats the purpose of not having to maintain the numbers every time something is inserted or deleted. If you need to enforce a correlation with another piece of data, then it would be better to just add the explicit numbering back. However, I'm confused as to the purpose of the new rules structure, which seems not to be used anywhere. --WikiTiki89 00:28, 1 March 2017 (UTC)

Oh, the rules are used, just not explicitly. They are added to the biblical table by the for name, setOfRules in pairs(rules.biblical) do ... code. — Eru·tuon 00:56, 1 March 2017 (UTC)Reply
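
A minimal sketch of how named rule sets could be flattened into one ordered list, in the spirit of the for name, setOfRules in pairs(rules.biblical) loop mentioned above. The table contents and the names ruleSets, order, flatten, and applyRules are hypothetical and not taken from the module. One point that bears on the numbering question: pairs() does not guarantee any iteration order, so the sketch walks an explicit list of set names instead.

```lua
-- Hypothetical rule sets; the patterns are placeholders, not the module's rules.
local ruleSets = {
	consonants = { { "b", "B" }, { "g", "G" } },
	vowels     = { { "a", "A" } },
}

-- pairs() iterates in an unspecified order, so the order in which the sets
-- are applied is listed explicitly instead of being derived from the table.
local order = { "consonants", "vowels" }

-- Flatten the named sets into one array of { pattern, replacement } pairs.
local function flatten(sets, names)
	local flat = {}
	for _, name in ipairs(names) do
		for _, rule in ipairs(sets[name]) do
			flat[#flat + 1] = rule
		end
	end
	return flat
end

-- Apply every rule in sequence with string.gsub.
local function applyRules(text, rules)
	for _, rule in ipairs(rules) do
		text = text:gsub(rule[1], rule[2])
	end
	return text
end

local biblical = flatten(ruleSets, order)
print(applyRules("bag", biblical))
```

Within each set the array position already fixes the order, so nothing has to be renumbered when a rule is inserted or deleted.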

Vowel distinctions

@Wikitiki89, what's the situation with the quadripartite vowel distinction that Wikipedia uses (ō, ô, o, ŏ) and our tripartite system (ō, o, ŏ)? —JohnC5 00:53, 1 March 2017 (UTC)Reply

Some of Wikipedia uses a convention where long vowels are transcribed with a circumflex when spelled with a mater lectionis and with a macron when spelled without one. I think this is a rather useless distinction to make. There are many other aspects of the consonantal spelling that are masked by the transcriptions, such as when the mater lectionis differs from the expected one, or when one is used for a short vowel, or when there is a ktiv/qre pair; there is no point in trying to expose some of this but not all of it. If we want to show the consonantal spelling, we should add a separate consonantal transcription, which is not a bad idea, by the way. --WikiTiki89 03:27, 1 March 2017 (UTC)
To even more correctly transcribe the Tiberian vowel distinctions, there would be no distinction between u and ū, i and ī, and ā and o, and perhaps different symbols to convey the fact that the real difference between o and ō, and e and ē was height rather than length. But that would probably be too unconventional. — Eru·tuon 04:58, 1 March 2017 (UTC)Reply
The scholarly transcription is really more of a proto-Tiberian system. Indicating length is important grammatically. Plus, the Tiberians did have a length distinction, even though it was probably more allophonic. There is some interesting evidence for this in some of the Yiddish reflexes. Anyway, we shouldn't be transcribing based on how we think the Tiberians actually pronounced the words, but rather based on the underlying structure of the word. --WikiTiki89 22:21, 1 March 2017 (UTC)Reply

Final rules

@JohnC5: What is the purpose of the "final" replacement {'(%l)%1(%s)', '%1%2'}? It causes final k in וַיֵּבְךְּ (wayyēḇk) to be geminate. — Eru·tuon 04:23, 1 March 2017 (UTC)Reply

@Erutuon: See, that is a little mystifying to me. That one should be covered by the text = gsub(text, regex .. "$", replacement) rule. The primary issue for the initial and final rules is that, the way you had it set up before, the rule ['(%l)%1'] = '%1' would get rendered as ['(%l)%1(%s)'] = '%1%1' for the whitespace-adjacent case, which is wrong. The extra input I'm providing only gets executed for the whitespace situation and is only used when there are capture groups in the regex. It will not be used for the string-initial or -final cases. I'm not sure what is causing the error with וַיֵּבְךְּ (wayyēḇk) at the moment. My change did fix the error with עַד בֹּאֲךָ appearing as ʿaḏ bbōʾăḵā. —JohnC5 05:16, 1 March 2017 (UTC)
@JohnC5: But the regex .. "$" will not be used, because if type(do_finally) == "table" then ... else ... end excludes it. — Eru·tuon 05:20, 1 March 2017 (UTC)Reply
@Erutuon: Line 199? —JohnC5 05:24, 1 March 2017 (UTC)Reply
@JohnC5: Yeah; I mean, that if-statement only occurs once in the module. — Eru·tuon 05:27, 1 March 2017 (UTC)Reply
Oops, I misunderstood your code. — Eru·tuon 05:29, 1 March 2017 (UTC)Reply
@Erutuon: But I still don't know why it's not working in the wayyēḇk case at the moment. But you see why this solution is necessary in the case of initial and final regexes with capture groups? —JohnC5 05:33, 1 March 2017 (UTC)
@JohnC5: Yeah, it is necessary. It is frustrating how complicated it is to use regex for this. I would like to come up with a different way. — Eru·tuon 05:36, 1 March 2017 (UTC)Reply
@Erutuon: It would also appear that, based on the evidence of הוֹשִׁיעָה נָּא (hōšīʿā nnā), the dagesh degemination rule appears to function differently across whitespace boundaries after vowels and consonants. —JohnC5 05:47, 1 March 2017 (UTC)Reply
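
The capture-group problem discussed above can be reproduced with plain Lua patterns. This is only a sketch on ASCII stand-in text: the function names are hypothetical, and the module's own handling of initial and final rules may differ in detail. It shows why appending "(%s)" to a pattern that already has a capture means the replacement cannot simply be reused or extended with '%1'.

```lua
-- Degeminate a doubled lowercase letter at the end of a word (ASCII stand-in
-- for transliterated text). The base rule is '(%l)%1' -> '%1'.
local base_pattern, base_replacement = "(%l)%1", "%1"

-- String-final case: anchoring with "$" adds no capture group, so the
-- base replacement can be reused unchanged.
local function apply_string_final(text)
	return (text:gsub(base_pattern .. "$", base_replacement))
end

-- Whitespace-adjacent case: "(%s)" becomes capture 2, so the replacement
-- has to be supplied separately instead of being derived from '%1'.
local function apply_before_space(text, replacement)
	return (text:gsub(base_pattern .. "(%s)", replacement))
end

print(apply_string_final("abb"))            -- doubled letter reduced
print(apply_before_space("abb c", "%1%1"))  -- wrong: letter stays doubled, space is lost
print(apply_before_space("abb c", "%1%2"))  -- letter reduced, space kept
```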

@Erutuon: If you're gonna get a bunch of these rules to work, you're gonna need to implement the rules under w:Shva#Shva Na and then delete the remaining shvas. —JohnC5 06:18, 1 March 2017 (UTC)

  • In reality, there is no "dagesh degemination rule". A dagesh chazak always indicates gemination. A dagesh qal does not, but only appears on the letters ב ג ד כ/ך פ(/ף) ת. There's also the mappiq, which occurs only on ה and rarely א. Therefore, for all letters other than the gutturals and the six letters I just mentioned, a dagesh always indicates gemination. It's only on those six letters that there is ever ambiguity. The rules for those six letters are: (1) if it is preceded by a vowel, it is a dagesh chazak, (2) if it is preceded by a shva, it is a dagesh qal (and the shva is a shva nach, i.e. a null vowel), and (3) if it is at the beginning of a word, it is ambiguous but in the vast majority of cases it is dagesh qal. The only potential exceptions to these rules are בָּתִּים and שְׁתַּיִם (and their derivatives), for which it is still unclear what the Tiberians intended, but we can just apply the regular rules for them. --WikiTiki89 22:37, 1 March 2017 (UTC)
    Ahh, I see the purpose of your rule that I removed. I've restored it. — Eru·tuon 23:04, 1 March 2017 (UTC)Reply
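
The three rules for the six letters listed above can be written as a small decision function. This is a sketch under stated assumptions: the name classify_dot and the "vowel" / "shva" / nil encoding of the preceding context are hypothetical, the word-initial case simply takes the majority reading, and gutturals other than ה and א are assumed not to occur as input.

```lua
-- The six letters on which a dagesh may be qal (final forms included).
local begadkefat = {
	["ב"] = true, ["ג"] = true, ["ד"] = true,
	["כ"] = true, ["ך"] = true, ["פ"] = true, ["ף"] = true, ["ת"] = true,
}

-- Classify the dot inside `letter`, given what precedes the letter:
-- "vowel", "shva", or nil for the beginning of a word.
local function classify_dot(letter, preceding)
	if letter == "ה" or letter == "א" then
		return "mappiq"
	elseif not begadkefat[letter] then
		-- On the remaining (non-guttural) letters the dot always geminates.
		return "chazak"
	elseif preceding == "vowel" then
		return "chazak"
	else
		-- After a shva (taken to be shva nach), or word-initially, where
		-- dagesh qal is only the most likely reading, not a certainty.
		return "qal"
	end
end

print(classify_dot("ב", "vowel"), classify_dot("ב", "shva"), classify_dot("ל", nil))
```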

Shin behavior

According to Wikipedia, יִשָּׂשכָר (yiśśāḵār) is a special case. Is this really just an exception, and there is no rule surrounding it? —JohnC5 05:23, 2 March 2017 (UTC)

If you read my comment in the source code (is there a way to add comments that actually show up in the test cases?), I'm not 100% sure we should handle these sorts of cases. On the other hand, I don't see what's wrong with adding a rule that any non-final consonant without a corresponding vowel should be ignored. --WikiTiki89 16:40, 2 March 2017 (UTC)Reply
I naturally should ask whether there are any other examples of this phenomenon. —JohnC5 17:52, 2 March 2017 (UTC)Reply
I think so, but I can't remember any. Maybe not. (Well, there are many examples of this with the letters א ו י, but I'm not counting them.) --WikiTiki89 20:39, 2 March 2017 (UTC)Reply

Other characters

When we get around to modern Hebrew, should we handle the geresh? Also, what about cantillation? Should we just strip it out if entered? I might also like to add a maintenance category of transliterations having cantillation characters. —JohnC5 18:08, 2 March 2017 (UTC)

Yes, we should handle the geresh. I don't think we should handle Modern Hebrew separately from Biblical. Rather the Biblical transliteration can be post-processed into a modern one. Don't forget that "Biblical" doesn't only include the Bible itself, but also a long timespan of Rabbinic literature, before it makes sense to use the "Modern" transliteration. The problem with the geresh, is that in some cases, it is ambiguous whether it is modifying the letter or indicating an abbreviation or contraction. Cantillation marks can be interpreted as stress marks. Ideally, if font support were better, I would support adding metegs as stress marks to all non-finally stressed words, and we should be able to handle that. Anyway, as long as we are unable to determine stress, this module is never going to be used. Another consideration is adding a special symbol to disambiguate shvas. Some publications (mis-)use the rafe for this, which is not too bad of an idea. --WikiTiki89 20:48, 2 March 2017 (UTC)Reply
I was looking at metegs and thinking about using them. I had hesitated to bring up adding stress marks, but I would prefer if we could add the functionality, and I would prefer metegs as the vessel. What are your concerns about font support? —JohnC5 21:01, 2 March 2017 (UTC)
That if there is a meteg next to a vowel (which is always if the vowel is not a cholam or shuruk), many fonts will display the meteg overlapping with the vowel. --WikiTiki89 21:17, 2 March 2017 (UTC)Reply
Which fonts work correctly? We also have similar problems with macrons and breves in Ancient Greek but have decided that it is more important to have the information than to worry about display. —JohnC5 21:25, 2 March 2017 (UTC)Reply
Usually only the ones with good cantillation mark support can handle metegs well. Unfortunately, none of these fonts are ubiquitous enough for us to assume people have them, and I don't know if we have any in our web fonts. In Ancient Greek, this problem is not as bad because the diacritic is not that important to most people, so they can ignore the little mess that appears above the letter. In Hebrew, it's important to be able to read the vowel, and this can cause patach, qamatz, tzere, and segol to all look nearly identical, and the chiriq to be nearly invisible. Here are some examples for testing: כֶּֽתֶר (kéṯer), וַיָּֽקָם (wayyā́qom), פַּֽחַד (páḥaḏ), זֵֽכֶר (zḗḵer), אָשִֽׁירָה (ʾāšī́rā), וַיָּקֻֽמוּ (wayyāqū́mū). The following should work fine in most fonts: וַתֹּֽאמֶר (wattṓmer), קֽוּמָה (qū́mā). --WikiTiki89 22:05, 2 March 2017 (UTC)Reply
Might as well also point out that יְרוּשָׁלִַם already looks bad, with a meteg it's even worse: יְרוּשָׁלִַֽם. --WikiTiki89 22:14, 2 March 2017 (UTC)Reply

Also, should we set up the module to not provide a transliteration if no niqqud are provided? Or should we have separate entrance points for transliterations with and without niqqud? —JohnC5 18:42, 2 March 2017 (UTC)Reply

Probably. Or at least to provide a consonantal transliteration. --WikiTiki89 20:48, 2 March 2017 (UTC)Reply
IMO, the transliteration module should return nil if there is not enough niqqud, just like the Arabic module, but it would still be helpful to have a special template (for editors, not users) to show just the consonants. --Anatoli T. (обсудить/вклад) 21:35, 2 March 2017 (UTC)
In Arabic, consonantal transliterations are trickier (how would you transcribe alif?), while in Hebrew they are pretty straightforward. Also, in Hebrew there are many abbreviations that should just have consonantal transliterations, like רה״מ (RH"M). Anyway, we still have a long way to go before we're ready for this module to be used automatically, so we'll have time to decide. --WikiTiki89 21:53, 2 March 2017 (UTC)Reply
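
A rough sketch of the "return nil without niqqud" idea, assuming for simplicity that "not enough niqqud" means "no pointing at all". The names has_niqqud and transliterate and the tr_pointed callback are hypothetical. The check is done on raw UTF-8 bytes so that it runs in plain Lua; on-wiki, mw.ustring would be the more usual route.

```lua
-- True if `text` (UTF-8) contains at least one Hebrew point:
-- U+05B0..U+05BC (sheva through dagesh) or U+05C7 (qamats qatan).
-- U+05B0..U+05BC encode as byte 214 followed by a byte in 176..188;
-- U+05C7 encodes as bytes 215, 135.
local function has_niqqud(text)
	return text:find("\214[\176-\188]") ~= nil
		or text:find("\215\135", 1, true) ~= nil
end

-- Hypothetical entry point: give up (return nil) on unpointed input,
-- otherwise hand the text to the real transliteration function.
local function transliterate(text, tr_pointed)
	if not has_niqqud(text) then
		return nil
	end
	return tr_pointed(text)
end

print(has_niqqud("\215\144\214\184"))  -- alef + qamats
print(has_niqqud("\215\144\215\145"))  -- alef + bet, unpointed
```

An abbreviation such as רה״מ would fall into the nil branch here, which is where a separate consonantal transliteration could take over.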

My Biblical Hebrew book uses a little symbol above the letter to mark non-final stress. I can't find where it says what the symbol is called (or find it mentioned on Wikipedia), but it looks like this, only above the letter, not below: ˱. — Eru·tuon 22:30, 2 March 2017 (UTC)Reply

Oh, I found it in w:Cantillation#Names and shapes of the t'amim: ב֫. It's called ole (however that would be spelled in Hebrew). Could that be used for stress? — Eru·tuon 22:33, 2 March 2017 (UTC)Reply

That's another option, but I prefer the meteg. The ole is more old-school and specific to scholarly literature about biblical Hebrew/Aramaic and not other periods (confusingly, these systems also misuse the meteg to distinguish qamatz qatan from qamatz gadol). The meteg is actually used by and familiar to people who speak/read/write Hebrew today, even for Biblical Hebrew (for example, ArtScroll's Interlinear Tehillim (Psalms) use the meteg to indicate non-final stress, and the rafe to indicate non-initial shva na). --WikiTiki89 22:47, 2 March 2017 (UTC)Reply
Oh, yeah, my book uses the meteg to distinguish qamaṣ qaṭan, as well as for several other things. The reason I bring up the ole is it'd solve the problem of the meteg obscuring vowel diacritics below the letter. On the other hand, the meteg looks fine in the font I've specified for Hebrew in my common.css, Frank Ruehl (aside from it pushing the ם way over in יְרוּשָׁלִַֽם). — Eru·tuon 23:10, 2 March 2017 (UTC)Reply
Did you have to install it? --WikiTiki89 23:33, 2 March 2017 (UTC)Reply
Yeah, it's a font I downloaded and installed. — Eru·tuon 23:40, 2 March 2017 (UTC)Reply
That's what I thought, and that's the problem. We can't expect users to have special fonts installed just to read the vowels on basic Hebrew words. --WikiTiki89 23:44, 2 March 2017 (UTC)Reply
What about displaying forms with and without meteg in the headword? Perhaps something like כֶּתֶר‎ ‏[כֶּֽתֶר] kéṯer ... . (I'm not sure exactly how the form with meteg should be placed.) Then users who can see metegs can look at one version, and users who can't can look at the other. — Eru·tuon 00:15, 3 March 2017 (UTC)Reply
Or for users with JavaScript enabled, there could be a function to switch which form displays, and for users without, both could be displayed. — Eru·tuon 00:38, 3 March 2017 (UTC)Reply

Sorry, was away all weekend. What's the plan? —JohnC5 01:53, 6 March 2017 (UTC)

Short vowels spelled as long

I added rules to "shorten" ī and ū before double consonants, as in כּוּלָּם (kullām) and קִידּוּשׁ (qiddūš). But there might be counterexamples. — Eru·tuon 19:51, 2 March 2017 (UTC)Reply

First of all, stressed final closed syllables behave as if they are open in most circumstances (with the exception of some short particles, some nouns in the construct state, some finite verb forms, and some infinitives). Second of all, we shouldn't talk about vowels being lengthened or shortened, but rather we should talk about whether a given spelling implies a short vowel or a long vowel. Now here are the rules:
  • The only long vowels that can occur in closed syllables are tzere, holam, and qamatz, and only if that syllable is stressed (and in the case of qamatz, only if it is also in pausal position), with the potential exception of בָּתִּים and its derivatives.
  • The only short vowels that can occur in open syllables are patach and segol, and only if that syllable is stressed.
  • Gutturals, which cannot be geminated, are often "virtually" geminated, closing the preceding syllable (e.g. רִחֵם (riḥēm)). Gutturals with chataf vowels also usually close the preceding syllable (e.g. אָהֳלוֹ (ʾohŏlō)), even if the chataf vowel is promoted to a full short vowel before a following shva (e.g. אָהָלְךָ (ʾoholəḵā)).
  • The only vowels whose length is ambiguous are shuruk/kubutz, which can be u or ū, chiriq (whether chaser or male), which can be i or ī, and qamatz, which can be o or ā (it's interesting to note that in each of these pairs, the long and short vowels never alternate in the morphology, but rather alternate with other vowels entirely; for example, u lengthens to ō, not to ū, etc.). Thus, for these written vowels, the length should be assumed from the openness of the syllable (again, stressed final closed syllables should be considered open, except in a small set of exceptions: מִן (min), עִם (ʿim), אִם (ʾim), and כָּל (kol)), except that qamatz in a stressed closed syllable is ā. Only in ambiguous cases should we use the defaults (shuruk = ū, kubutz = u, chiriq male = ī, chiriq chaser = i, qamatz = ā). Ambiguous cases are before gutturals and non-silent shvas.
--WikiTiki89 20:09, 2 March 2017 (UTC)Reply
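
One possible reading of the last bullet above, written out as a lookup plus a few conditions. It is a sketch, not the module's logic: the names readings and resolve_vowel and the ctx fields are hypothetical, syllable structure and stress are assumed to be known already, and the small set of exceptions (min, ʿim, ʾim, kol) is not handled.

```lua
-- Short and long readings of the ambiguous written vowels, plus the
-- default to fall back on in ambiguous contexts.
local readings = {
	shuruk        = { short = "u", long = "ū", default = "ū" },
	kubutz        = { short = "u", long = "ū", default = "u" },
	chiriq_male   = { short = "i", long = "ī", default = "ī" },
	chiriq_chaser = { short = "i", long = "ī", default = "i" },
	qamatz        = { short = "o", long = "ā", default = "ā" },
}

-- ctx fields (all hypothetical): closed, stressed, final (booleans describing
-- the syllable) and ambiguous (before a guttural or a non-silent shva).
local function resolve_vowel(sign, ctx)
	local r = readings[sign]
	if ctx.ambiguous then
		return r.default
	end
	-- Qamatz in a stressed closed syllable is long.
	if sign == "qamatz" and ctx.closed and ctx.stressed then
		return r.long
	end
	-- Stressed final closed syllables count as open.
	local closed = ctx.closed and not (ctx.stressed and ctx.final)
	return closed and r.short or r.long
end

print(resolve_vowel("shuruk", { closed = true }))      -- as in kullām
print(resolve_vowel("chiriq_male", { closed = true })) -- as in qiddūš
```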

My apologies

@JohnC5, Erutuon: Hi, I just made a big change that I realize overwrote a whole series of your guys's changes. Since my change makes all the current test cases work, I hope you guys don't mind too much and can merge your changes back in (if they're still necessary). --WikiTiki89 23:55, 7 March 2017 (UTC)Reply

@Wikitiki89: No worries. I was just trying to decompose the output, but it doesn't matter. Do we work on the accent now? —JohnC5 00:06, 8 March 2017 (UTC)Reply
I'm currently working on handling letters with geresh. I guess stress is next. I think we should interpret any of the cantillation marks as stress marks (except the ones that don't necessarily appear on the stressed syllable). --WikiTiki89 00:14, 8 March 2017 (UTC)Reply
@Wikitiki89: I don't mind, because I wasn't sure what to do next anyway, and I won't bother re-adding my changes, since the module works now. — Eru·tuon 15:50, 8 March 2017 (UTC)Reply
Works is a strong word. It does more than it did before. --WikiTiki89 15:55, 8 March 2017 (UTC)Reply

Is this gonna continue happening?

@Wikitiki89, are we scrapping this project? —JohnC5 02:30, 22 May 2017 (UTC)Reply

Why should we scrap it? There's still work to do, and I'm not currently actively working on this, but eventually I or somebody else will get back to it. I took a two-year break from Module:he-verb before finishing it, and then another one-year break before improving it even more. The biggest obstacle to this module is the stress marking, for which the best solution is only possible if we wait for font support to catch up and be able to properly display cantillation marks (or metegs at least) with vowels. There's no rush. --WikiTiki89 15:13, 22 May 2017 (UTC)Reply
Okey dokey! —JohnC5 16:30, 22 May 2017 (UTC)Reply
Is the goal to use cantillation marks or metegs as a vehicle to mark stress so that the module knows where it is and can generate an appropriate transliteration, or is the goal (also) to display the marks to readers so they know where the stress is? You may have already thought of this, but perhaps the module could accept input with cantillation marks / metegs, generate the transliteration from them, and then strip them out of the Hebrew text it actually displays, until such time as font support catches up and it can stop stripping them out? - -sche (discuss) 20:12, 22 May 2017 (UTC)Reply
I did think of that, but I'm not sure whether it's a good idea. It would still require a lot of work to put in metegs wherever they belong. And we'd need to get the coverage high enough in order to be able to make the assumption that no meteg means final stress. Also, there needs to be a way to better indicate shva na vs shva nach. The best solution I've seen in print is to use a slightly larger shva for shva na, but Unicode has not included this (yet?). Basically, there are still a lot of issues. --WikiTiki89 21:08, 22 May 2017 (UTC)Reply
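
For reference, the stripping step suggested above is small. The sketch below removes cantillation accents and metegs from the text to be displayed while the marked text is still passed to the transliteration; the names strip_stress_marks and prepare are hypothetical, and it does nothing about the coverage and shva issues just raised.

```lua
-- Remove cantillation accents (U+0591..U+05AE) and meteg (U+05BD) from a
-- UTF-8 string. Each of these code points encodes as the lead byte 214
-- (0xD6) followed by one continuation byte, so a byte-level pattern suffices.
local function strip_stress_marks(text)
	return (text:gsub("\214[\145-\174\189]", ""))
end

-- Hypothetical flow: transliterate from the marked text, but display the
-- stripped text. `transliterate` is whatever function does the real work.
local function prepare(marked_text, transliterate)
	return strip_stress_marks(marked_text), transliterate(marked_text)
end

local kaf, segol, meteg = "\215\155", "\214\182", "\214\189"
print(#strip_stress_marks(kaf .. segol .. meteg))  -- byte length after stripping
```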
@Wikitiki89: Could you use a superscript, ᵊ, or subscript, ₔ, schwa for the shva na and a normal schwa, ə, for the shva nach? —JohnC5 03:29, 23 May 2017 (UTC)Reply
That wouldn't go very well with the Hebrew text (also, you got them backwards). Some siddurim put an asterisk above letters with shva na, while others use a rafe. I'm not sure whether there are any combining asterisks in Unicode and whether it looks good with most fonts. The rafe has good font support, but it technically has an entirely different meaning which is why I'm hesitant to use it for this purpose; but then again, we don't actually use it in our Hebrew entries for its actual purpose, so maybe it wouldn't hurt to use it for this purpose. It really puzzles me why Unicode chose to include a separate enlarged form for qamats qatan but not for shva na, since I've never seen anything that uses one but not the other. But since there's still some extra space after the qamats qatan in the Hebrew Unicode block, I'm still hoping they will add the shva na someday. --WikiTiki89 15:07, 23 May 2017 (UTC)
The other day, it was mentioned that Ken Lunde (who was/is part of the Unicode Editorial Committee) is responsive on Twitter if you want to float the idea past him before making a formal proposal. I see they recently began voting on a "Hebrew yod triangle" and "Arabic small low waw". - -sche (discuss) 20:08, 23 May 2017 (UTC)Reply
That sounds like a good idea. Too bad I don't tweet. --WikiTiki89 21:51, 23 May 2017 (UTC)Reply
Lunde is more of a CJK man. Maybe try Michael Everson (w:User talk:Evertype). —suzukaze (tc) 21:57, 23 May 2017 (UTC)Reply

Tokenization approach

During a recent discussion about Hebrew transliteration, I started creating several Hebrew transcription modules using a tokenization approach. I made Module:User:Erutuon/he-translit-superscript (testcases) to try out superscript-based transcription, then Module:User:Erutuon/he-translit-circumflex (testcases) to generate a traditional transcription using circumflexes, and then Module:User:Erutuon/he-translit-omit-nonconsonantal (testcases) for the transcription created by this module.

Basically, the tokenization approach breaks up the Hebrew into parts, each of which is one of the following: waw followed by holam; a letter with an optional shin or sin dot and an optional dagesh, mappiq, or shuruq dot; or any other code point, including vowel points. (First I reorder the diacritics, because the Unicode normalized order makes no sense and is a pain to deal with in a transliteration module. For instance, Unicode puts dagesh after vowel points, but it logically belongs closer to the consonant.) Every token passes through the same loop unless you skip it, and the loop can ask things like "does this have a dagesh?", "is this preceded by a letter with a sheva?", "is this at the beginning of a word?" and decide how to transliterate based on the answer.
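
A sketch of the reordering step described in the parenthesis above, so that the shin or sin dot comes first, then the dagesh, then everything else. It is written for plain Lua on raw UTF-8 bytes; the names is_mark, priority, and reorder are hypothetical, and the actual modules may do this differently (for instance with mw.ustring).

```lua
local char_pattern = "[\1-\127\194-\244][\128-\191]*"  -- one UTF-8 code point
local shin_dot, sin_dot, dagesh = "\215\129", "\215\130", "\214\188"

-- Hebrew combining marks: U+0591..U+05BD, U+05BF, U+05C1, U+05C2,
-- U+05C4, U+05C5, U+05C7.
local function is_mark(ch)
	return ch:find("^\214[\145-\189\191]$") ~= nil
		or ch:find("^\215[\129\130\132\133\135]$") ~= nil
end

local function priority(mark)
	if mark == shin_dot or mark == sin_dot then
		return 1
	elseif mark == dagesh then
		return 2
	end
	return 3
end

-- Reorder each run of marks that follows a base character.
local function reorder(text)
	local out, marks = {}, {}
	local function flush()
		table.sort(marks, function(a, b)
			if a.p ~= b.p then return a.p < b.p end
			return a.i < b.i  -- keep the original order among equals
		end)
		for _, m in ipairs(marks) do out[#out + 1] = m.ch end
		marks = {}
	end
	for ch in text:gmatch(char_pattern) do
		if is_mark(ch) then
			marks[#marks + 1] = { ch = ch, p = priority(ch), i = #marks + 1 }
		else
			flush()
			out[#out + 1] = ch
		end
	end
	flush()
	return table.concat(out)
end

-- shin + qamats + dagesh + shin dot (normalized order) as input
local input = "\215\169\214\184\214\188\215\129"
print(reorder(input) == "\215\169\215\129\214\188\214\184")
```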

No dummy characters and no worrying about whether something has already been transliterated in a previous step. The beginning or end of a word is just where the previous or next token is nil or a punctuation or whitespace character. So far I haven't needed to look at previously transliterated stuff at all. Occasionally I need to skip a token (i = i + 1), when I'm on the consonant before a furtive patah or on a vowel point before a silent letter or mater lectionis. But it's all done in one go.
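
To make the description concrete, here is a much-reduced sketch of such a tokenizer and of the kind of questions the loop can ask. It is not the code of the modules linked above: all names are hypothetical, it runs on raw UTF-8 bytes in plain Lua, it assumes the diacritics have already been reordered so that the shin/sin dot and dagesh directly follow their letter, and word boundaries only recognize ASCII whitespace and punctuation.

```lua
local char_pattern = "[\1-\127\194-\244][\128-\191]*"  -- one UTF-8 code point
local waw, holam, sheva = "\215\149", "\214\185", "\214\176"
local shin_dot, sin_dot, dagesh = "\215\129", "\215\130", "\214\188"

local function is_letter(ch)  -- U+05D0..U+05EA
	return ch:find("^\215[\144-\170]$") ~= nil
end

local function tokenize(text)
	local chars = {}
	for ch in text:gmatch(char_pattern) do chars[#chars + 1] = ch end

	local tokens, i = {}, 1
	while i <= #chars do
		local token = chars[i]
		if token == waw and chars[i + 1] == holam then
			token = token .. holam  -- waw, holam
			i = i + 1
		elseif is_letter(token) then
			if chars[i + 1] == shin_dot or chars[i + 1] == sin_dot then
				token = token .. chars[i + 1]
				i = i + 1
			end
			if chars[i + 1] == dagesh then  -- dagesh, mappiq, or shuruq dot
				token = token .. dagesh
				i = i + 1
			end
		end
		tokens[#tokens + 1] = token  -- anything else is its own token
		i = i + 1
	end
	return tokens
end

-- "Does this have a dagesh?"
local function has_dagesh(token)
	return token ~= nil and token:find(dagesh, 1, true) ~= nil
end

-- "Is this at the beginning of a word?"
local function at_word_start(tokens, i)
	local prev = tokens[i - 1]
	return prev == nil or prev:find("^[%s%p]$") ~= nil
end

-- "Is this preceded by a letter with a sheva?"
local function after_sheva(tokens, i)
	return tokens[i - 1] == sheva and tokens[i - 2] ~= nil
		and is_letter(tokens[i - 2]:match(char_pattern))
end

-- shin + shin dot + dagesh, qamats, waw + holam
local t = tokenize("\215\169\215\129\214\188\214\184\215\149\214\185")
print(#t, has_dagesh(t[1]), at_word_start(t, 1), after_sheva(t, 3))
```

A transliteration loop would then walk the token array by index, consult these predicates for the current position, and advance the index by an extra step whenever a token has to be skipped.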

None of the modules handles geresh or gershayim or contains Modern Hebrew code. I will probably convert this module to the tokenization approach of Module:User:Erutuon/he-translit-no-silent when it has at least those features, if nobody objects. I'm still uncertain on all the details about some long and short vowels and sheva, but at least the module passes more of the primary set of testcases. (At the moment it fails one testcase that this one passes.)

Overall it was easier to write the code using the tokenization approach, and it was fun to write the modules. (Our Ancient Greek transliteration module uses a tokenization approach as well.) Perhaps having modules demonstrating the various transcription systems will help people decide which transcription works best. Whichever transcription isn't chosen as an official one could still be displayed in a template somewhere in the entry, as in {{ko-IPA}}. — Eru·tuon 02:54, 13 August 2021 (UTC)Reply