Module talk:User:Isomorphyc/languages-draft

The phonetic respelling module


@Isomorphyc What would this module actually do? Can you give example inputs and outputs? I think the name phonetic_respelling_module is a bit too long, so understanding the purpose might help me come up with a better name. --CodeCat 16:47, 21 November 2016 (UTC)

@CodeCat: The intended change is: symbols -> abstract phonetic spelling -> concrete Romanisation. Where a language's editors choose to draw the line may be flexible; for example, the change to Roman letters might take place in either step. But the first step is less deterministic, and possibly a lookup, while the second step ought to be as deterministic as possible. If we are transliterating English to Greek, we might imagine cough -> cof -> κοφ (depending on the English and Greek dialect, of course). So cough -> cof is the 'phonetic respelling'. For Mandarin, I am imagining hanzi -> pinyin -> null operation. The operative contention here is that hanzi -> pinyin can never be the second step because it is a lookup, not a transliteration. Isomorphyc (talk) 17:01, 21 November 2016 (UTC)
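A minimal sketch of the two-step split described in the comment above, written in Python for illustration rather than in the Lua of the actual modules. The table contents and the names `RESPELLINGS`, `LATIN_TO_GREEK`, `respell` and `transliterate` are hypothetical; only the cough -> cof -> κοφ example comes from the comment.

```python
# Hypothetical illustration only; none of these names exist in the
# Wiktionary modules under discussion.

# Step 1: phonetic respelling. It cannot be derived from the spelling,
# so it is modelled as a lookup table.
RESPELLINGS = {"cough": "cof"}

# Step 2: transliteration. A deterministic character-by-character mapping.
LATIN_TO_GREEK = {"c": "κ", "o": "ο", "f": "φ"}


def respell(word):
    """Return the phonetic respelling, or the word unchanged if none is known."""
    return RESPELLINGS.get(word, word)


def transliterate(respelling):
    """Map each character deterministically; unmapped characters pass through."""
    return "".join(LATIN_TO_GREEK.get(ch, ch) for ch in respelling)


if __name__ == "__main__":
    print(transliterate(respell("cough")))
```

For the Mandarin case mentioned in the comment, step 1 would be the hanzi -> pinyin lookup and step 2 would simply return its input.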

Also, the module currently has to implement the function getTranslit() as in Module:th. I would prefer to change that name too, though I don't have an alternative at present, and I don't want to edit the interface of Module:th. One of the confusing things about the Thai implementation is that there are three steps: Thai spelling --(1)> phonetic respelling --(2)> abstract Romanisation --(3)> concrete Paiboon dictionary style transliteration. The function getTranslit() performs the first and the second step. Isomorphyc (talk) 17:18, 21 November 2016 (UTC)

Edit: what is also confusing about the Thai implementation is that, to take the English -> Greek example, Module:th.getTranslit will perform two steps: cough -> cof (phonetic respelling) and cof -> κοφ. Module:th-pron (of which Module:th-translit is just a wrapper) performs a third step, which in this case is κοφ -> κοφ, because the abstract and concrete forms are the same. But if the concrete form were κοϝφ, it would be th-pron which performs this last step. I hope I am stating this correctly; I really came to understand the process in the implementation only yesterday. Isomorphyc (talk) 17:27, 21 November 2016 (UTC)

What I don't understand is why this phonetic respelling needs to be a separate step. In the past, we've done it in transliteration. --CodeCat 18:00, 21 November 2016 (UTC)

@CodeCat: It would be easy to set m["th"].translit_module to the composition of the two functions (although both implementations were slightly wrong, I believe). Editors have said they want to use the Language:transliterate() function for proper transliteration in etymologies. Having a separate phonetic lookup function accessible in the Language class will also make it more feasible to implement {{th-l}} using Module:links instead of Module:th (and similarly for ko, jp, zh, etc.). User:Atitarev has expressed displeasure that Module:links cannot make adequate links for several Oriental languages, resulting in duplicate, incompatible infrastructure. I would also be happy not to see two incompatible sets of infrastructure at Wiktionary, although this would likely take more work than only this change. Isomorphyc (talk) 18:51, 21 November 2016 (UTC)

Hello. You pinged me but I'm not sure what you're asking or if I can contribute. Yes, Thai phonetic respelling is used to generate the correct Paiboon transliteration and IPA. Thai entries also show the sequence of symbols both orthographically and phonemically. E.g. see "Orthographic" and "Phonemic" in โปรแกรม (bproo-grɛm).

Oriental language modules do a great job of customised transliterations, each in a somewhat different way, but they are not integrated well with Module:links. E.g. I would have trouble automatically transliterating a Thai SoP term with [ [ ] ] [ [ ] ], even if all terms in brackets are defined. --Anatoli T. (обсудить/вклад) 21:28, 21 November 2016 (UTC)

But Thai can't be automatically transliterated anyway, because the correct transliteration can't be deduced from the spelling. How is extra code going to change that? --CodeCat 21:31, 21 November 2016 (UTC)

So as I understand it, sometimes there'd be just step 2, other times both steps. Does that not mean that we need two different transliteration functions, rather than one to do the respelling? Also, from what I've gathered, the respelling can't be predicted by a module, because it's not deterministic. --CodeCat 18:59, 21 November 2016 (UTC)

We might need two; this was Chuck Entz's idea here (excuse my summary page). While I am not sure how the parameters should work, I will make a tangible version later today. Nevertheless, clearly the default case should be to run one after the other, treating nil values as identity functions. This way the only changes in Module:languages/data(...) will be on an opt-in basis, and we only have to expose new functionality to the template or module callers as an option.
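The default described above (run one step after the other, treating missing steps as identity functions) might look roughly like the following. This is a Python sketch, not the Lua of Module:languages; `compose_steps` and its parameter names are hypothetical, and `None` stands in for Lua's nil.

```python
def compose_steps(respell=None, transliterate=None):
    """Build a function that runs respelling, then transliteration.

    A step left as None behaves as the identity function, so a language
    that defines neither step gets its text back unchanged.
    """
    def identity(text):
        return text

    first = respell if respell is not None else identity
    second = transliterate if transliterate is not None else identity

    def run(text):
        return second(first(text))

    return run
```

Under this arrangement a language opts in by supplying one or both functions; existing languages that supply only a transliteration step would behave as before.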

About respelling: what surprised me most, and what engendered my initial comment to you on my talk page, was that the respelling is accomplished by screen-scraping the Wiktionary entry for the phonemic form of the word, which is the argument to Module:th-pron. It is not so compute-intensive because of the Squid cache, but it is as creative as it is inelegant. Similarly, as hanzi -> pinyin lookup is not one-to-one (let alone deterministic), heuristics can still add a lot of value; even when multiple values are produced from a lookup table, it becomes an option to summarise them. Isomorphyc (talk) 19:20, 21 November 2016 (UTC)

Is the respelling ever used on its own for anything, or is its only purpose to provide the input to the transliteration? Because if we have parameters for both, then there doesn't seem to be any value in providing pr= when you can easily just provide tr= directly. --CodeCat 19:40, 21 November 2016 (UTC)

I have had a lot of difficulty finding good examples in Thai, I think because of the language barrier. I would offer this scenario: suppose one chose to do a hanzi -> pinyin -> IPA mapping in Mandarin. In that case, a user might like to do any of the following: 1) enter hanzi and get IPA, 2) enter IPA and get IPA, 3) enter pinyin and get IPA, or 4) enter hanzi and get pinyin. Clearly: 1) no argument, 2) tr=<IPA>; and perhaps: 3) pr=<pinyin>, 4) display=pr. Further, it is easy to generalise the output with 4) display=<pr|tr> or similar. (This is probably what I was going to implement in a few hours.) Isomorphyc (talk) 20:01, 21 November 2016 (UTC)
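The four scenarios in the comment above could be resolved along these lines. This is a Python sketch under stated assumptions: the parameter names pr=, tr= and display= come from the comment, but the `resolve` function, the one-entry lookup table, and the placeholder `transliterate` step (which only wraps its input in slashes and is not a real pinyin -> IPA conversion) are hypothetical.

```python
# Toy stand-ins; the table and the "IPA" step are placeholders, not real data.
RESPELL = {"好": "hǎo"}  # step 1: hanzi -> pinyin, modelled as a lookup


def respell(text):
    """Look up the respelling; fall back to the input when none is known."""
    return RESPELL.get(text, text)


def transliterate(pinyin):
    """Placeholder for a deterministic pinyin -> IPA mapping."""
    return "/" + pinyin + "/"


def resolve(text, pr=None, tr=None, display="tr"):
    """Pick the string to show for the given combination of parameters."""
    if pr is None:
        pr = respell(text)       # scenarios 1 and 4: derive pr from the text
    if display == "pr":
        return pr                # scenario 4: show the respelling itself
    if tr is not None:
        return tr                # scenario 2: a manual tr= is used verbatim
    return transliterate(pr)     # scenarios 1 and 3: derive tr from pr
```

In this sketch a manual tr= wins over pr= when both are given and display is left at its default; that precedence is an assumption, not something the discussion settles.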

We wouldn't want IPA in any of our standard linking templates, though. The IPA for Chinese is currently generated with a special-purpose template and module, and it should probably stay that way. Since this code we're adding is for the general-purpose code, it would be used in general-purpose templates like {{l}} and {{head}} and in the modules that implement them. So I'm trying to understand where exactly this new code would come into play and how our existing general-purpose templates would change relative to the current situation. --CodeCat 20:06, 21 November 2016 (UTC)

I am having trouble with examples because I believe we have already implemented the functionality which Wyang originally put in Module:links. It is possible we do not need the pr= parameter at all. I will probably implement it, then refine and remove if necessary (after asking for use cases with a tangible template), because people were quite enthusiastic when Chuck Entz suggested it, for reasons I do not quite understand. Here are JohnC5's examples, however, for overloading scenarios: "The Mycenaean under *h₁éḱwos and *(s)kleh₂w-, the Mycenaean and Old Persian under *tetḱ-, and the Hittite under *ǵónu, to name a few. If these are not sufficient, tell me. --John C5 02:56, 20 November 2016 (UTC)" I'm sorry I can't be more help. Isomorphyc (talk) 20:22, 21 November 2016 (UTC)

Well, you pinged User:JohnC5, so hopefully he'll come and explain it, because I'm really having trouble understanding what this is for. --CodeCat 20:25, 21 November 2016 (UTC)

I didn't mean to do that; sorry, JohnC5. I didn't quite understand it either, to be honest. I'm going to implement what I described, then I will ask the Thai editors for use cases. I would really rather do that than ask for use cases without a working template. If the use cases are not plausible, I will not be supporting the parameter. Isomorphyc (talk) 20:32, 21 November 2016 (UTC)

Howdy. What exactly would y'all like me to explain? In many dead ideographic languages (Old Persian, Hittite, Mycenaean, etc.) the transliteration of the characters may be fairly far from the phonetic/phonemic representation. For instance in Hittite, the Sumerograms may have no relation to the pronunciation at all. It would be nice to have a display of an (often reconstructed) phonetic representation, especially when showing reflexes of reconstructed languages. Does that explain what I am seeking? --John C5 00:16, 22 November 2016 (UTC)

Maybe we should make a list of things each person wants first. From what I gather, you, JohnC5, want to show a "double" transliteration after links and headwords: one giving the description of the characters themselves, and one giving a transcription of the pronunciation in some format (IPA or some other customary scheme). Am I right so far?

To support this, {{l}}, {{head}} and such would need a second parameter, one for each type of "transliteration". However, do you think that adding a second module would help to achieve this? Pure transliteration would be easy for a module to do, but could a module also determine the pronunciation transcription for the languages you mentioned? If not, then at least this negates the need for a second type of module for JohnC5's proposal. Transliteration could be optionally automated as now, transcription would always be manual.

Now we need others to state their own proposals/needs, in particular the case where either an automatic pronunciation transcription module or a respelling module would be useful/necessary. So far I see the case for having two different schemes in a link as JohnC5 proposed, but not for the second module type. --CodeCat 00:46, 22 November 2016 (UTC)

Ah yes. I should say that in all of the languages which I have mentioned, transcription would indeed require human intervention or reconstruction and could not be modularized. I was merely asking for another parameter so that I don't have to overload |tr= and sometimes |pos= when the former is unavailable. This is not to say that there might not be some language whose writing is somehow constructive enough to be automatically transcribed while dissimilar enough to require a separation of transliteration and transcription, but I am not currently pleading the case of this proposed language. --John C5 01:02, 22 November 2016 (UTC)

In the case of Thai, the |pr= parameter would be calling on Module:th-transcript (whatever the nomenclature for transcription modules is) to generate the transcription, and the rules for romanisation display could be: (1) transcription is always displayed, and (2) transliteration is not displayed, unless the call to Module:links is via Module:etymology (i.e. {{cog|th|สวัสดี}} would give สวัสดี (sà-wàt-dii [swạsɗī], “hello”)). Tibetan could have a similar dual-display system, relying on Module:bo-translit and Module:bo-transcript to generate the transliteration and transcription for a word, and |tr=- / |pr=- could be used to selectively disable one of them. Korean can also benefit; there was a recent suggestion to make the current transcription system even more transcriptive, so a transcription-transliteration dual display is also a good option. (Btw, the absence of transliterations in Korean romanisations was also the source of displeasure of the now-inactive User:KYPark: 1, 2, etc.) Wyang (talk) 01:35, 22 November 2016 (UTC)

It looks like we're coming back to the proposal of the vote that was created before: to have separate infrastructures for transcriptions and transliterations. I'm assuming that unlike JohnC5's case, for Wyang both transcription and transliteration can be generated by a module in at least some cases? The difficulty here is going to be to decide which to show where. If both parameters are provided, it's simple: show both. But {{l}} can't tell whether it's in an etymology section or not, so that idea is not going to work well. Etymologies would need to provide pr=- manually to suppress the automatically-generated transcription. --CodeCat 01:45, 22 November 2016 (UTC)

I find myself wishing more and more often these days that phab:T122934 would actually be completed/worked on/considered. That would solve your etymology section issue. --John C5 02:00, 22 November 2016 (UTC)

Both transliteration and transcription can be generated automatically for Thai, Tibetan, Korean, etc. (fairly reliably). Using separate modules to store the two functions would be my preference, since it makes the choice of romanisation display easier to implement. For Thai, normally the transcription is sufficient in links, but etymologies may warrant the display of transliteration as well. For example, จันทร์ (jan [t͡ɕạndrʻ], “moon”) would be more apparent than จันทร์ (jan, “moon”), when one talks about its relatedness to words in other languages derived from Sanskrit चन्द्र (candra). Module:etymology has a function format_derived, which may be able to be modified to pass something to Module:links, indicating that the call was from etymologies. Wyang (talk) 02:03, 22 November 2016 (UTC)

It could just pass "-" as the transcription. --CodeCat 02:48, 22 November 2016 (UTC)

Hi @CodeCat, Wyang, JohnC5:, thanks for all of this discussion, and sorry I don't know how to use pings apparently. I've tried to implement all of the features discussed in this thread here: User:Isomorphyc/Sandbox8. I don't see very much issue with the pr/tr display, or with etymologies. I used two simple rules for display of pr/tr: don't display both if they're the same, and only generate pr from the module in an etymology. It was easy to separate etymologies from other links; I just generated pr= from the module where available in Module:User:Isomorphyc/etymology-draft, but only used it in Module:User:Isomorphyc/links-draft if it was passed in as an argument. I do think the ^PHONETIC superscript is a bit verbose; I might have preferred square brackets or slashes, but didn't want to be so specific. It also creates problems for pr= arguments with multiple forms. To make this work in bo and ko, I think one only has to link the right modules in Module:User:Isomorphyc/languages-draft/data2 per language; but I haven't done this yet. This should work with all the etymology templates, but I have only tested {{cog}}.
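The two display rules stated in the paragraph above can be sketched as follows. This is a Python illustration, not the code in the draft modules; `choose_romanisations`, the `in_etymology` flag, and the `transliterate`/`transcribe` callables are hypothetical names.

```python
def choose_romanisations(term, tr=None, pr=None, in_etymology=False,
                         transliterate=None, transcribe=None):
    """Return the list of romanisations to display for a link.

    Rule 1: if tr and pr are identical, display only one of them.
    Rule 2: pr is generated by the module only in etymologies; elsewhere
            it is shown only if the caller passed it in explicitly.
    """
    # tr falls back to the automatic function when one is available.
    if tr is None and transliterate is not None:
        tr = transliterate(term)

    # Rule 2: automatic pr only when called from an etymology template.
    if pr is None and in_etymology and transcribe is not None:
        pr = transcribe(term)

    shown = []
    for value in (tr, pr):
        # Rule 1: skip empty values and exact duplicates.
        if value and value not in shown:
            shown.append(value)
    return shown
```

How the surviving values are marked up (a superscript label, square brackets, slashes) is a separate formatting decision that this sketch leaves out.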

There is a significant issue I had which might relate to my linguistic confusion, or confusion about the modules. This question is mostly for Wyang. Module:User:Isomorphyc/languages-draft exposes require("Module:th").getTranslit(lemmas, phonSpell), which performs a two-step transcription: respelling-lookup -> transliteration; secondly, it exposes require("Module:th-translit").tr(text,lang,sc), which performs the second step, transliteration. What I understood was that the mapping สวัสดี -> sà-wàt dii is transcription, whereas the mapping สวัสดี -> swạsɗī is transliteration. However, I could not make require("Module:th-translit").tr(text,lang,sc) perform the second step for me. Perhaps I am calling the function incorrectly? I had to expose a separate function, require("Module:th-pron").getCharSeqTbl(text) (which I factored out of getCharSeq(text)), to the lang object, so I routed it via require("Module:th-translit").tr1(text), exposing it through the object as require("Module:User:Isomorphyc/languages-draft").transliterate1(text, sc, module_override). Then, in require("Module:User:Isomorphyc/etymology-draft"), I simply called lang:transliterate1(...) preferentially when available over lang:transliterate(...) when working with etymologies. Obviously this is all very much undesirable. My question is: if the last few lines of output of User:Isomorphyc/Sandbox8 look reasonable, what is the way to achieve it that involves rewriting the interface of the language object the least? Is it possible to expose multiple transcription, transliteration, and phonetic spelling functions all through the module th-translit, the way I exposed two of them? I'm probably explaining this very poorly, so please let me know if this makes no sense. Ideally, I'd like to get สวัสดี -> swạsɗī out of lang:transliterate(...), as I understand this should be possible.

All of my drafts are collected at User:Isomorphyc/Sandbox/Drafts, which I have also linked off the top of User:OrphicBot for (mainly my) ease of reference. Isomorphyc (talk) 01:00, 24 November 2016 (UTC)

Thank you for your work; it looks very promising. I'm not sure I have understood everything, but the tr1 function in Module:th-translit is exactly what a Thai transliterator would be like; the tr function was not a functional transliterator function before, as it had to be called by language-data for transcription generation. Just a minor thing with the Thai output: swạsɗī is the transliteration, so "phonetic" may be misleading. Wyang (talk) 01:29, 24 November 2016 (UTC)

@Wyang: Could you give me an example of a page which uses output from require("Module:th-translit").tr(...)? I have had trouble finding an actual use case for it, but I don't think I can simply remove it and replace it with a variant of tr1? I will also try to wrap require("Module:th-pron").getTranslit() into another function in Module:th-translit to save linking an extra module, if this is all right with you. You are right about the phonetic superscript; I was partly trying to use notation which I originally had created for JohnC5's Hittite, Mycenaean Greek, and Old Persian examples. I will look into finding a better notation. Isomorphyc (talk) 02:01, 24 November 2016 (UTC)

That's fine - please do any edit you see fit. Module:th-transcript may be a better place for the code, to make its purpose clearer. The tr function in Module:th-translit can be safely removed, as long as the |tr= parameters in Thai links are renamed to |pr=. The parameter |tr= should be rarely used for Thai, since transliteration is only displayed in etymologies, and automatic transliteration should be mostly satisfactory. Wyang (talk) 02:11, 24 November 2016 (UTC)

The benefit of using one module and two functions is that the function names can be descriptive of the contract with the caller, and checking only has to be done in one place. I think it is much simpler, so I will use one module for this. The existence of multiple functions with different kinds of inter-script mappings can be defined clearly in the documentation. I'll also look into changing the argument names. If you do this, you will have to make sure the Thai editors are comfortable with this interface change. The 5000-odd edits will of course not be difficult, but it will be the last step. Isomorphyc (talk) 02:49, 24 November 2016 (UTC)

Placing them inside one module is okay with me. I agree that bot-renaming is not an urgent task. My impression is that most Thai editors have stopped providing |tr= since automatic romanisation was enabled early this year, so it should be primarily the old links that require conversion. Wyang (talk) 03:21, 24 November 2016 (UTC)