Module talk:languages/Archive 1

From Wiktionary, the free dictionary
This is an archive page that has been kept for historical purposes. The conversations on this page are no longer live.

Wrong language code

"als" should be Alemannic German not Tosk Albanian, Albanian is "sq"

No, Alemannic German is "gsw". "als" isn't the correct code for it, according to the ISO 639-3 standard. Our templates for linking to Wikimedia projects are smart enough to correct for the discrepancy, so you may not have noticed that we've been using "gsw" with the old template setup that this is replacing. Chuck Entz (talk) 12:47, 8 April 2013 (UTC)Reply
Oh, it's just that the Alemannic German Wikipedia is at als.wikipedia.org, and the Albanian Wikipedia is at sq.wikipedia.org. I thought the prefix on the web address always corresponded to the language code template.
Most of the Wikimedia language codes follow the ISO 639 standards, but there are exceptions like this one. I believe the "als" code was assigned to Alemannic German by Wikimedia while it was still an unassigned code as far as ISO 639-3 was concerned, and they didn't really think about whether it would stay that way permanently. Once you have an entire wikipedia built up around a particular code, it's not really practical to switch it when a conflict like this comes up.
As for Albanian, the situation is complicated. If I remember correctly, standard Albanian is based on the Tosk dialect, but the sq code is also an umbrella term for the Albanian language (macrolanguage?) as a whole. When they came out with a code for the Gheg dialect, it only made sense to have a corresponding code for Tosk, even if Tosk largely overlaps with standard Albanian, because sometimes you want to refer to the Tosk dialect itself and be clear that you're not talking about anything else that might be covered under "sq". Chuck Entz (talk) 05:39, 9 April 2013 (UTC)
When the Alemannic Wikipedia started up, they made a mistake. Alemannic is gsw. Albanian (sq) is actually two separate lects, Tosk (als) and Gheg (aln). —Stephen (Talk) 06:00, 9 April 2013 (UTC)Reply

Faster?

So I checked what would happen if language code templates in t-simple in water were replaced by this module, and the result is that it takes 10 seconds to do a quarter or so of the t-simples, and then it just gives up and says "The time allocated for running scripts has expired." for the rest of them. Total page served in 35-40 seconds, even though most of the t-simples didn't run. The current version of water takes ~32 seconds, even though it runs all of them. It takes 25 seconds or so if the t-simple is completely blank, so it seems that this module is quite a lot slower than just using language templates. --Yair rand (talk) 16:08, 19 February 2013 (UTC)Reply

I think you'll have to contact Tim Starling about that, because he gave a guarantee that this approach would not make things worse. —CodeCat 16:49, 19 February 2013 (UTC)Reply
I knew from the beginning that this approach would only lead to problems. Maybe the language module should only be concerned about the language names, since we don't need the other parameters too often. -- Liliana 18:36, 19 February 2013 (UTC)Reply
Yes, putting only the names in a module would be one thing to try. You could also try putting only the most common ~100 languages in the main module, and then load the rest of them on demand with one module or template per language, similar to the way it is done with templates. String scanning may be a bit faster than loading data into a table, for data that is only needed once or twice per #invoke. I might try it myself. -- Tim Starling (talk) 23:36, 19 February 2013 (UTC)Reply
Could such a thing be done so that it's transparent to the calling code? For example, a function requests the language information of "en" through a table lookup, and the back-end then imports only what is needed but leaves out the rest, and the caller doesn't actually know that this happened and just "sees" a table lookup. In other words, memoization. Is this possible/feasible? —CodeCat 23:59, 19 February 2013 (UTC)Reply
Yes, you can map table lookups to function calls, with the index metamethod.
I did a string scanning language name lookup module at Module:Languages string db. If you use it in t-simple then water is rendered with 1.6 seconds of Lua time. Maybe it could be faster still. I'm not sure how it would compare to a module with a table that only holds names, and not other metadata. Feel free to move that module to some more appropriate title if it turns out to be useful. -- Tim Starling (talk) 00:20, 20 February 2013 (UTC)Reply
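A minimal sketch of the transparent lookup asked about above, using the __index metamethod so that callers only ever see a table lookup. The names storage and load_language are hypothetical stand-ins; a real back-end would presumably load a per-language module instead of reading a local table.

```lua
-- Hypothetical per-language data; stands in for data stored elsewhere.
local storage = {
    en = { name = "English", script = "Latn" },
    ru = { name = "Russian", script = "Cyrl" },
}

local load_count = 0

-- Stand-in for an expensive load (assumption: a real loader might use require).
local function load_language(code)
    load_count = load_count + 1
    return storage[code]
end

-- Callers see an ordinary table; a missing key is fetched on first access.
local languages = setmetatable({}, {
    __index = function(t, code)
        local data = load_language(code)
        rawset(t, code, data)  -- memoize, so __index is not called again for this key
        return data
    end,
})

print(languages.en.name)
print(languages.en.script, load_count)
```

Only the languages actually requested are loaded, and each one is loaded once. Unknown codes are not cached in this sketch, so they would hit the loader on every lookup.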
I am very confused. Why would that string scanning method be faster than natively parsed Lua? Is Lua's parser so bad that it's faster to write your own parser in Lua? —CodeCat 00:28, 20 February 2013 (UTC)Reply
It's not the parser, it's the VM, but the effect is basically the same. The Lua parser produces bytecode, which we cache. Then the VM executes the bytecode to produce hashtables, and that operation isn't cached. So each #invoke causes a large hashtable to be constructed, which is slow. Writing your own parser allows you to avoid the hashtable construction. Maybe it's possible for the hashtables to be cached for the duration of the request, I'll look into it. -- Tim Starling (talk) 06:02, 20 February 2013 (UTC)Reply
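To illustrate the point about avoiding hashtable construction, here is a rough sketch of a string-scanning lookup. It is not the code of Module:Languages string db; the record layout ("code=name" lines) and the function name lookup_name are assumptions made for the example.

```lua
-- Hypothetical packed data: one "code=name" record per line,
-- with a newline before the first record and after the last.
local packed = "\nde=German\nen=English\nfr=French\nru=Russian\n"

-- Find the name for a code by scanning the string.
-- No table holding every language is ever built.
local function lookup_name(code)
    -- Fourth argument true: plain search, so the code is not treated as a pattern.
    local _, stop = packed:find("\n" .. code .. "=", 1, true)
    if not stop then
        return nil
    end
    local line_end = packed:find("\n", stop + 1, true)
    return packed:sub(stop + 1, line_end - 1)
end

print(lookup_name("en"))
```

The string literal is a single constant, so nothing has to be rebuilt entry by entry when the module is run; the cost moves to the scan, which is paid only for the codes that are actually requested.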
An ideal solution would be to cache the hashtable as soon as the module is edited and saved, and make it automatically available to all modules, but make it read-only so that they can't modify it. That is probably not currently possible, but I think it would be a good way to deal with collection of immutable data like this. —CodeCat 12:25, 20 February 2013 (UTC)Reply
Good thinking. It would be possible to allow a module to define a read-only string store, and then provide access to it only via accessor functions. As long as you don't provide any access to the underlying table references, the isolation of #invoke instances would be preserved. With an __index metamethod for an accessor function, the syntax would be something like local english = mw.import_data('Module:languages').en.name. -- Tim Starling (talk) 04:49, 21 February 2013 (UTC)Reply
import_data is not currently available though, is it? Also, it would generally be more desirable to return languages.en altogether rather than just the name, because a significant amount of queries would need both the name and the script (for widely used templates like {{l}} and {{term}}). But tables are passed as references so that would involve copying it each time, which isn't really good. It might be more robust if the script would just trigger an error if any attempt were made to modify the table, but that would need special support... something along the lines of C++'s "const". Module:languages modifies the table while it is building it, so I'm not sure how that would work out. —CodeCat 14:29, 21 February 2013 (UTC)Reply
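The "trigger an error on modification" idea can be approximated in plain Lua with a proxy table, without copying the data. This is only a sketch: read_only is a hypothetical helper, and it does not protect against rawset or against code that holds a reference to the underlying table.

```lua
-- Wrap a table in an empty proxy that forwards reads and rejects writes.
local function read_only(t)
    return setmetatable({}, {
        __index = function(_, key)
            local value = t[key]
            if type(value) == "table" then
                return read_only(value)  -- wrap nested tables as well
            end
            return value
        end,
        __newindex = function()
            error("attempt to modify read-only data", 2)
        end,
    })
end

local languages = read_only({
    en = { name = "English", scripts = { "Latn" } },
})

print(languages.en.name)
local ok = pcall(function() languages.en.name = "changed" end)
print(ok)
```

Each access to a nested table creates a small new proxy, which is cheaper than copying the language entry but is still not free.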

I saw this from a wikitech-l mailing list message, and have investigated a few things because I have the same problem with a large table of data in a module I am working on (see w:Module talk:Convert#Overview). I also hoped that some form of read-only table could be wholly cached. Some enhancements may occur for that, but meanwhile I thought I would try Tim's suggestion of packing all the data into a string and using a binary search to unpack it on demand. I have now implemented that for Module:languages (which is larger but simpler than my data), and have put my code at the test2 wiki. I might copy Module:languages to test2 then create two pages, each to invoke one of the modules 100 times. The NewPP report would show how much Lua time had been used. I might get a chance to try that in a few hours, or later. I'm posting for anyone interested. Feel free to do some benchmarking (which I haven't started). Languagesx should be a fully compatible version of languages: each returns a table, and indexing that table gives a table of information. Johnuniq (talk) 03:46, 22 February 2013 (UTC)
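One way to picture a binary search over packed data is with fixed-width records sorted by code. This is a sketch under that assumption, not the layout used in the test2 code; the widths, the rec helper and lookup_name are all hypothetical.

```lua
-- Hypothetical layout: each record is a 3-character code plus a 16-character name,
-- both space-padded, and records are sorted by code.
local CODE_W, NAME_W = 3, 16
local REC_W = CODE_W + NAME_W
local REC_FMT = "%-" .. CODE_W .. "s%-" .. NAME_W .. "s"

-- Helper used here for readability; a real module would presumably store one literal.
local function rec(code, name)
    return string.format(REC_FMT, code, name)
end

local packed = rec("de", "German") .. rec("en", "English") .. rec("fr", "French")
    .. rec("gsw", "Alemannic German") .. rec("ru", "Russian")

local function lookup_name(code)
    local key = string.format("%-" .. CODE_W .. "s", code)
    local lo, hi = 0, math.floor(#packed / REC_W) - 1  -- zero-based record indices
    while lo <= hi do
        local mid = math.floor((lo + hi) / 2)
        local start = mid * REC_W + 1
        local rec_code = packed:sub(start, start + CODE_W - 1)
        if rec_code == key then
            local name = packed:sub(start + CODE_W, start + REC_W - 1)
            return (name:gsub("%s+$", ""))  -- strip the padding
        elseif rec_code < key then
            lo = mid + 1
        else
            hi = mid - 1
        end
    end
    return nil
end

print(lookup_name("gsw"))
```

Fixed-width records make the position of record n a simple multiplication, at the cost of padding; variable-width records would need a separate offset index or a scan for delimiters.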

Results: I have now set up a side-by-side comparison of the two systems of accessing a large table. See my talk at test2. It seems (if I haven't made a blunder) that unpacking a large string is 20 times faster than using a standard table. Johnuniq (talk) 09:27, 22 February 2013 (UTC)


So, what is the problem exactly? loading or retrieving the data? or both?
In either case we could borrow the idea behind a trie (w:en:Trie). In particular, we could split Module:languages into 26 submodules: languages_a, languages_b, ..., languages_z, each of which would contain only the language codes that start with the corresponding letter.
Note that, if implemented correctly, we would get a large performance gain in both retrieving the data (better than binary search) and loading the data (we would load only about a 26th of it).--user:Dixtosa 14:15, 2 June 2013 (UTC)
There is really no reason to do that: Lua is really good at loading big data tables (especially in this case where the data would only be loaded once), so the size is really not an issue. Also, in real conditions, almost all languages will be loaded in every page, at least in the pages with translation tables. Dakdada (talk) 13:28, 3 June 2013 (UTC)Reply
That's just plain wrong. Most pages have two or less language sections and no translation tables, at least at en.wiktionary. I still think that Trie's idea is excessive, but your reasoning looks off as well. —Μετάknowledgediscuss/deeds 14:35, 3 June 2013 (UTC)Reply
My own proposal is to have a single module for 2-letter codes, 26 modules for 3-letter codes, and a module for our own custom-made codes. I don't think that would speed things up all that much, but it would be easier to actually work with the modules given their enormous size. It would also lessen the impact on the job queue somewhat whenever someone changes the module, because then only a subset of the pages needs to be updated instead of all of them. —CodeCat 14:39, 3 June 2013 (UTC)Reply
My point was that there is no performance advantage to loading several pages as it's really fast anyway. So splitting the table should only be justified by ease of use and maintenance, not performance (although it also adds some complexity in the code if the tables are loaded separately). Dakdada (talk) 16:17, 3 June 2013 (UTC)Reply
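For reference, the split proposed in this thread (one module for 2-letter codes, 26 for 3-letter codes, one for custom codes) comes down to a small dispatch on the shape of the code. The submodule titles below are illustrative only, and the sketch assumes custom codes never look like plain 2- or 3-letter codes.

```lua
-- Pick a (hypothetical) submodule title from the shape of the language code.
local function submodule_for(code)
    if code:find("^%l%l$") then
        return "languages/two"
    elseif code:find("^%l%l%l$") then
        return "languages/three/" .. code:sub(1, 1)
    else
        return "languages/custom"
    end
end

print(submodule_for("sq"))       -- languages/two
print(submodule_for("gsw"))      -- languages/three/g
print(submodule_for("qfa-und"))  -- languages/custom
```

Whether this helps is exactly what the thread disputes: it limits what an edit to one submodule invalidates, but adds a layer of indirection to every lookup.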

Comment block

The top comment block should start with [=[ and end with ]=] (each on its own line). There's no need for the leading '--', which is throwing off the syntax highlighter. --Ori.livneh (talk) 10:31, 23 February 2013 (UTC)Reply

But wouldn't it interpret it as just a string then? —CodeCat 14:45, 23 February 2013 (UTC)Reply
Comments definitely start with "--". I've no idea what "[=[" is supposed to do. SemperBlotto (talk) 14:49, 23 February 2013 (UTC)Reply
Eep, you are right. I thought a free-floating string literal would be interpreted as a comment, but this is not the case. The current syntax is correct. My bad. I'll look into fixing GeSHi. --Ori.livneh (talk) 07:45, 24 February 2013 (UTC)Reply
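The distinction settled above can be seen in a short standalone example: with the leading "--" the bracketed block is a comment, and without it the same brackets form a long string literal, which is only valid as part of an expression.

```lua
--[=[
This is a long comment. The leading "--" is what makes it a comment,
and the level-1 brackets let the text contain "]]" without ending it.
]=]

-- Without the leading "--", the same brackets delimit a long string.
-- A bare string literal is not a valid statement, so it must be used in an expression.
local text = [=[
not a comment ]] still inside
]=]

print(type(text))  -- string
```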

Edit request

m["chn"] = { ... scripts = {"Latn", "Dupl"}, ... } Vanisaac (talk) 04:34, 3 March 2013 (UTC)Reply

Added. JamesjiaoTC 05:04, 3 March 2013 (UTC)Reply
The normal language templates are still in use for now, so both should always be added in tandem. I've changed {{chn/script}} now. —CodeCat 14:32, 3 March 2013 (UTC)Reply

Standards compliance

Is it possible to add a field indicating the source of each language code, as described in WT:LANG#Language codes? It would be useful to track our standards compliance, since the use of non-standard codes in lang attributes breaks our HTML. Michael Z. 2013-06-02 22:11 z

I had been thinking of splitting the module into different submodules depending on the origin of the code. So we would split -1, -3 and home-made codes. We would need some kind of "glue" code to make the split invisible from the outside, though. —CodeCat 22:12, 2 June 2013 (UTC)Reply
Note that since this table will be loaded in virtually every page, it may not be a good idea to add a field that would not be used anywhere just for the sake of classification (simply having the codes in different pages may be enough for that). Dakdada (talk) 13:34, 3 June 2013 (UTC)Reply
I really don't think that would make much of a difference. —CodeCat 13:36, 3 June 2013 (UTC)Reply
I'm essentially wondering about memory limitations. A page like cat uses 14MB. I think the current limit is set to 20MB per page, so if we add other languages and data, we may go beyond this limit. Strictly speaking, those things would be better stored in a database, but that's unrealistic right now. Dakdada (talk) 16:32, 3 June 2013 (UTC)Reply
How big are water and a by comparison? —CodeCat 16:36, 3 June 2013 (UTC)Reply
Water is 14MB, but a is 19.84MB (this can be found in the source code of each page, near the end). I'd like to be sure that 20MB is really the limit (even if it is, it is possible to ask to use more). Dakdada (talk) 16:53, 3 June 2013 (UTC)Reply
I wonder where all of that memory is going to. The page itself isn't anywhere near as long... —CodeCat 16:59, 3 June 2013 (UTC)Reply
Has this any significant effect on page loading speed? Maybe we should remove fields like "names" which are not supposed to be used in other Lua modules and put them in another page, so we would have a short table which is for actual use and is loaded for all entries, and a bigger table containing stuff that isn't used in all entries and is invoked only when needed. The only use for the "family" and "type" fields that I can think of is in language categories.
We need a field for automated transliteration, maybe translit = {<script 1> = "<module name 1>", ...}, see Module:Avst-translit which is script-based. Another way is to have a standard module name, function name, and arguments -- text AND script -- for all transliteration modules, so we would need just a translit field with true/false value.
BTW, isn't it faster to define the table in one command? m = {aa = {names = ...}, aaa = {...}, ...}. --Z 07:46, 4 June 2013 (UTC)Reply
The source is not supposed to be used in modules, is it? If so, we should add it as a comment. --Z 07:46, 4 June 2013 (UTC)Reply

I just read the Module talk:language utilities#Generating lists and #Faster? discussions. Here's my idea: let's split the data based on its type instead. We have two kinds of data here: (1) data that is used in entries (all entries), i.e. language codes, script codes of the language, and automated transliteration modules, and (2) data that exists only for the sake of maintaining the data and is in effect metadata, like alternative names, language type, language family, title of the corresponding Wikipedia article, source of the language code, notes, and so forth, which is used in language categories, in pages like WT:LANG, in creating statistics and generally to maintain the data (i.e. Module:languages) itself. We can keep this module, and create a new one that invokes this page and adds this extra data to the "m" table.

Regarding the functions that were proposed at Module talk:language utilities#Generating lists, we should create a utility module for the new page that we create, it will be like Module:language utilities to Module:languages. Those proposed functions which are for maintenance of the data itself belong there. --Z 09:37, 4 June 2013 (UTC)Reply

Don't they say that premature optimisation is the root of all evil? Why would we optimise if there are no immediate issues? —CodeCat 11:54, 4 June 2013 (UTC)Reply
This module will eventually become very large, won't it? --Z 12:13, 4 June 2013 (UTC)
Most of the information we would want to store is already there. There are some things we will probably want to add, like whether a language uses automated transliterations. But I don't think it will become that much bigger. We could also reduce the size by removing script and family when they are the default (None and qfa-und). —CodeCat 12:16, 4 June 2013 (UTC)Reply
I'm talking about all the information that we would want to store. The data that may eventually be added is much more than you thought. --Z 12:32, 4 June 2013 (UTC)
So what is the real effect of adding a data field to the table? Does that 14MB grow by 10 kB, 100 kB, or 1 MB?
I agree that we should only split the table if there is a demonstrated benefit. If so, I suppose the language code would be the table key. It would be useful to have a utility function that flags unsynced language codes and missing data fields between the two tables.
Or instead of splitting the table and trying to keep the parts in sync, it might be better to maintain a full master table, and routinely generate a minimal subset from it for use in pages? Can Lua build a module from data, perhaps using a subst or via a javascript or bot? Michael Z. 2013-06-04 17:48 z
Yes it's definitely possible for Lua to output code. I used it in Module:nl-verb which has two output functions: one shows a table and the other shows a simple machine-readable format for bots. —CodeCat 17:51, 4 June 2013 (UTC)Reply
Would there really be anything to sync? Let me clarify: Module:languages would be something like m = {ru = {names = {"Russian"}, script = {"Cyrl"}}, sh = ..., ...} and Module:languages/extended would be something like m.ru.wikipedia = "Russian language"; m.sh.notes = "bla bla";, so there's no duplication of data that would need to be synced after every change. Regarding the effect of adding a data field, I can think of 4 more useful fields that we can add here for every language, as well as fields for scripts (needed in script categories, etc.). Moreover, we are talking about code that is almost always loaded for all entries, so even 0.1 second means quite a lot here. I as a user open >50 Wiktionary entries per day. The project has >1000 active users; multiplied by 365 days, wow, we can save >500 hours from being wasted per year. --Z 19:32, 4 June 2013 (UTC)
Re. syncing, if languages/extended were short by three rows out of 500, it would be nice for the machine to tell us which language codes are missing.
Are you saying that adding a field does add 0.1s to page load, so that there is a demonstrated benefit to splitting the table? Michael Z. 2013-06-04 23:42 z
Yes, but in practice we rarely decide to remove a language from Module:languages or add extra information to languages/extended for a language which is not already at languages, so... I thought this sync issue was not a significant disadvantage.
I was partly joking, I just meant to say we may have unnoticeable effects -- in this module particularly -- which may significantly affect the system.
I think the debate here arises because CodeCat tends to remove, or not to add, less useful information in order to keep the module short (in which case what I suggested would be premature optimization). For example, CodeCat suggested removing script and family when they are None and qfa-und, while I think we should choose the safe way, that is, considering the default as unspecified.
The full master table idea looks good BTW (if it's feasible). --Z 08:48, 5 June 2013 (UTC)Reply
It was just an idea if size is really a problem. I don't necessarily think that we should do it currently. —CodeCat 12:40, 5 June 2013 (UTC)Reply
It was not a good idea. If size is a problem, we should split the table or do anything but remove information. Anyway, this module was created to contain a wide range of data, hence there are potential problems regarding the size. If the problem happens, which I think it eventually will, I suggest splitting the table or following Michael's idea. I'm out. --Z 13:42, 5 June 2013 (UTC)
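The utility function asked for in this thread, one that flags language codes present in one table but not the other, could look roughly like the following. The sample entries and the function name unsynced are invented for the example; the core/extended split follows the shape described above.

```lua
-- Hypothetical core table, as it might be loaded for every entry.
local m = {
    ru = { names = { "Russian" }, scripts = { "Cyrl" } },
    sh = { names = { "Serbo-Croatian" } },
}

-- Hypothetical extended data, keyed by the same language codes.
local extended = {
    ru = { wikipedia = "Russian language" },
    xx = { notes = "no matching core entry" },
}

-- Return two sorted lists: codes missing from extra, and codes missing from core.
local function unsynced(core, extra)
    local missing_extra, missing_core = {}, {}
    for code in pairs(core) do
        if extra[code] == nil then
            table.insert(missing_extra, code)
        end
    end
    for code in pairs(extra) do
        if core[code] == nil then
            table.insert(missing_core, code)
        end
    end
    table.sort(missing_extra)
    table.sort(missing_core)
    return missing_extra, missing_core
end

local no_extra, no_core = unsynced(m, extended)
print(table.concat(no_extra, ","))  -- sh
print(table.concat(no_core, ","))   -- xx
```

A check like this only needs to run on a maintenance page, so it adds nothing to the cost of rendering ordinary entries.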

Sanskrit

I went to add some scripts for Sanskrit, but I couldn't find it in the edit box. Anyway, I understand the number of scripts is no longer limited to 10, so please change the scripts for m["sa"] from

scripts = {"Deva", "Beng", "Brah", "Khar", "Knda", "Latn", "Mlym", "Shrd", "Sinh", "Telu"},

to

scripts = {"Deva", "Beng", "Brah", "Gran", "Gujr", "Guru", "Khar", "Knda", "Latn", "Mlym", "Mymr", "Orya", "Shrd", "Sinh", "Taml", "Telu", "Thai"},

Also, there are two scripts, Siddhaṃ and Gupta, that don't have ISO 15924 codes yet, so we'll have to wait on them. Is it still the case that the first script listed is the default when no script is explicitly specified? —Angr 09:38, 6 June 2013 (UTC)Reply

Yes, the first script is the default. It's the same with names as well. —CodeCat 11:49, 6 June 2013 (UTC)Reply
OK. Can you make the edit? As I said, when I tried I couldn't find Sanskrit in the edit box. I don't know what's going on with that. —Angr 14:21, 6 June 2013 (UTC)Reply
The search is a bit strange with modules. The custom code editor that is loaded doesn't work right with the browser's own search, so it has its own (worse) search function. You need to click inside the edit box and then press ctrl+F. —CodeCat 14:32, 6 June 2013 (UTC)Reply
Done. - -sche (discuss) 15:15, 6 June 2013 (UTC)
Thanks, -sche. CodeCat, it wasn't just that my browser's search function didn't find it. I first tried pasting the entire content into my text editor with Ctrl-A followed by Ctrl-C, but my browser freaked out because the website was going to be able to access my clipboard so I hit Cancel. Then I scrolled down to where Sanskrit should have been, and the list went straight from one of the codes starting with ry- to one of the three-letter codes starting with sa-, but the two-letter sa wasn't there. —Angr 15:25, 6 June 2013 (UTC)Reply
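The convention confirmed in this thread, that the first listed script and the first listed name are the defaults, amounts to reading index 1 of each list. A small sketch, with a shortened script list and hypothetical helper names:

```lua
-- Shortened sample entry; the full script list is the one requested above.
local sa = {
    names = { "Sanskrit" },
    scripts = { "Deva", "Beng", "Brah" },
}

-- The default is simply the first element, if the list exists.
local function default_script(lang)
    return lang.scripts and lang.scripts[1]
end

local function canonical_name(lang)
    return lang.names and lang.names[1]
end

print(canonical_name(sa), default_script(sa))
```

Because order carries meaning here, adding a script at the front of the list changes the default, while adding it anywhere else does not.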

Adding field for automated transliteration

We need to add information about automated transliterations. The only information that we need to add is whether the manual transliteration should be overridden or not. (We can omit module names by standardizing the titles of these modules, which has already been done afaik.) What is the best way to hold this? I think we should just add a key for overriding manual transliteration with a true/false value for these languages. Not having the field (i.e. a nil value) means the language doesn't use automated transliteration. --Z 20:03, 24 June 2013 (UTC)

I was thinking something similar. Something like has_auto_transliteration = true. We could also use a similar approach with other language specific properties, like the characters that should be removed from entry names, or sort keys (which have been discussed before but ran into some technical problems), the possible genders, and so on. —CodeCat 21:20, 24 June 2013 (UTC)Reply
That field is not enough though. As I said above, there are actually three kinds of languages: (i) those that don't use (automated) transliterations, (ii) languages that use automated transliterations which work safely AND perfectly (manual transliteration must be overridden in this case), and (iii) languages whose manual transliteration should not be overridden by the automated one for some reason. The third one does not (necessarily) refer to those which don't transliterate safely (IMO, if a transliteration module can't transliterate safely, the language should be put in the first class); for example, we can safely transliterate (as opposed to transcribe) terms letter-by-letter for languages that use Semitic alphabets, the transliteration would be correct and safe and is useful when no transliteration is given, it's just not perfect since it still doesn't show vowels.
Thus, the best way is adding a key which takes the values true, false, and nil (for nil of course we can simply remove the field). The key may be something like override_manual_translit, or simply auto_translit. --Z 10:05, 25 June 2013 (UTC)Reply
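A sketch of how a template back-end might act on such a three-valued field. The field name auto_translit is the one suggested above; the function choose_translit and the sample language settings are assumptions for illustration, not agreed policy.

```lua
-- auto_translit == true  : automatic transliteration overrides manual
-- auto_translit == false : automatic is used only when no manual one is given
-- auto_translit == nil   : the language has no automatic transliteration
local function choose_translit(lang, manual, automatic)
    if lang.auto_translit == nil then
        return manual
    elseif lang.auto_translit then
        return automatic or manual  -- fall back if the automatic result is missing
    else
        return manual or automatic
    end
end

-- Illustrative settings only.
local hy = { auto_translit = true }
local ru = { auto_translit = false }
local en = {}

print(choose_translit(hy, "manual", "auto"))
print(choose_translit(ru, nil, "auto"))
```

Leaving the field out for most languages keeps the table small, since nil costs nothing to store.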
Preferences should also be taken into account. Always using autotranslit works perfectly for Armenian and Georgian, and our main editors want it that way, so it's okay to always autotransliterate these two. For Slavic Cyrillic languages, stress marks are important; Russian also has exceptions in readings, which are transliterated differently. Stress marks are occasionally inserted in the Cyrillic script in the "head" or "alt" tags. --Anatoli (обсудить/вклад) 11:03, 25 June 2013 (UTC)
What exactly would be the difference between "safe" and "unsafe" automatic transliteration? Like, what consequence (in templates and modules) can we attach to that difference? —CodeCat 11:49, 25 June 2013 (UTC)Reply
That depends on the case and each language should be discussed, as far as I understood the Russian module for example doesn't transliterate safely, I don't know if its mistakes are tolerable and we should consider that in the third class, or not to use it at all. --Z 12:01, 25 June 2013 (UTC)Reply
I don't really think this makes much sense. Either an automatic transliteration is usable, or it isn't. There shouldn't be an in-between because we don't want to show incorrect information. —CodeCat 12:19, 25 June 2013 (UTC)Reply
(after edit conflict). @CodeCat. It's better to transliterate (unsafely) привет as "privet" than no transliteration at all. If stress is indicated ("head" in entry templates or "alt" in translations), then "приве́т" would produce "privét", which would make Russian automatic transliteration 99% safe. So use it, if manual is missing.
(before edit conflict) I would put Russian and all other Slavic Cyrillic languages (except for Serbo-Croatian) in the class where the automatic transliteration is added only where manual is missing. They're "almost" safe: if stress is indicated, they are 100% safe (Belarusian, Bulgarian, Macedonian, Ukrainian) and Russian is 99% safe, if it makes sense. AFAIK, Greek is similar to Russian in that sense. For Kazakh, Kyrgyz, Bashkir, other small Cyrillic-based languages we don't have "owners" with preferences but they are safe. Tajik is safe and I'll leave it for ZxxZxxZ and Dijan to decide. Mongolian and Tatar are not quite safe in terms of letter/sound correspondence (Mongolian) and common Roman spelling (Tatar) but there are no reliable sources and active native speakers here, so can be considered safe for this exercise. The Sinhalese module is working and is quite good (thanks, Z), can consider safe. Hindi, Hebrew, Arabic, Persian are VERY unsafe, IMHO. Hindi can be improved significantly and made safe, need skills. Vocalised Arabic can be improved with good skills as well. Persian is not vocalised and Hebrew seems unsafe even with full vowelisation (more input is needed on the topic). No comments about the rest. --Anatoli (обсудить/вклад) 12:29, 25 June 2013 (UTC)Reply
I thought it was a given that manual should always override automatic. The automatic is only there to provide a default, but defaults should always be able to be overridden. —CodeCat 12:32, 25 June 2013 (UTC)Reply
I think auto-transliteration should override manual even for Slavic languages. That would prevent people from adding wrong transliterations, like here, here or here. Stress can be indicated on the word itself by using "head=" and "alt=". As for the 1% of Russian exceptions, they would not exist if we ditched our current WT:RU TR policy, which is a Wiktionary-specific hybrid of transcription and transliteration, in favor of scholarly transliteration. --Vahag (talk) 12:51, 25 June 2013 (UTC)Reply
Sorry, Vahag but I'm against such ditching for 1% of Russian exceptions. It may be useful to know how place and personal names are officially transliterated but not for dictionaries and textbooks. In any case, I'm also for "manual should always override automatic" rule for Slavic languages and word stresses are not consistently provided anyway in most cases. Macedonian word stress is 90% predictable but not Belarusian, Bulgarian, Russian and Ukrainian. --Anatoli (обсудить/вклад) 13:08, 25 June 2013 (UTC)Reply
Moreover, by overriding manual ones, if we decide to make a change in transliteration system for a language, we can easily do that for all entries by one click. --Z 12:56, 25 June 2013 (UTC)Reply
(edit conflict) I explained that above. In many cases transliteration may be quite correct, or safe, and useful but not perfect, i.e. it may not include all of the desired information, they are useful because they are better than nothing. --Z 12:34, 25 June 2013 (UTC)Reply
Example to get the idea: we have an entry like this. We are not aware of the pronunciation (and therefore the correct transcription), so we just transliterate the word letter-by-letter (this transliteration is correct, safe) so that people who can't read the script or prefer Latin would be able to read that (so it's also useful), that's a common practice used in sources as well. --Z 12:52, 25 June 2013 (UTC)Reply
I'm really not sure what can be done with "unsafe" transliterations like Arabic, Hebrew and Persian (sorry, don't know how Syriac works but I assume it's similar). Transliterating unvocalised Arabic صحة as "ṣḥ(t)" is bad, IMHO; vocalised صِحٌَةٌ "ṣiḥḥatun" is better but still incorrect (if we transliterate tāʾ marbūṭa that way). It should be "ṣiḥḥa(t)". The Arabic module still needs some work before it can be "released" even for fully vocalised texts. --Anatoli (обсудить/вклад) 13:23, 25 June 2013 (UTC)
Transliterating Arabic is the easiest one among languages which use Semitic alphabets, I think, but even for Arabic, the word must be fully vocalized, which is rarely done, most users don't use diacritics, many even don't know how diacritics should be used, they are rarely used in texts and people are not much familiar with them, lol even Arabs themselves don't know, see entries in ar:تصنيف:فَعِيْل, they shouldn't have saakin on yaa'. So.. I don't know if that module is really useful. I think a JS version would be more useful, something that can help individuals to transliterate texts.
Regarding that bug, I've really no idea about that shadda stuff and why it duplicates the next letter rather than the previous one. --Z 14:00, 25 June 2013 (UTC)

Transliteration, diacritic removal and other things

I have thought about this a bit more and I think it makes more sense to combine transliteration and diacritic-stripping functions into a single module. That way, the module can do anything it needs to to strip diacritics, while keeping Module:languages clean of any really intricate things like lists of diacritics to be stripped and lists of characters to strip them from. Diacritic stripping and transliteration are similar operations: in both cases characters are mapped to other characters in a language-specific way. We should probably also add the generation of sort keys to this module, as this is also similar to stripping diacritics.

In addition, I don't think that it is practical to have one module per language in all cases. Case in point is Module:Cyrs-translit, which supports both Old Church Slavonic and Old East Slavic, and there is only one difference between them. If we stick to a fixed naming scheme that several modules/templates currently require, then we need to split this module into Module:cu-translit and Module:orv-translit even though they'd be almost identical. So I propose that we add only one extra value to Module:languages, like this:

m["cu"] = {
    names = {"Old Church Slavonic"},
    type = "regular",
    scripts = {"Cyrs", "Glag"},
    family = "zls",
    support_module = "Cyrs-support",
}

This way, we can choose whatever we want to name the module (put Cyrs in front, or cu, as we wish). This module would then contain transliteration, diacritic removal, and sort keys all in one. This value would be optional, so modules that use it can check whether it is available by checking it for "nil". Also, the module itself would export fixed function names (tr, sort, pagename?) and those can be checked for nil individually. —CodeCat 20:54, 14 July 2013 (UTC)Reply
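A rough sketch of how calling code might use the optional support_module value and check the exported functions for nil. The table fake_modules stands in for the wiki's module loader so the example runs on its own, and the body of tr is a placeholder rather than a real transliteration.

```lua
-- Stand-in for the module system (assumption: on the wiki this would be a require call).
local fake_modules = {
    ["Cyrs-support"] = {
        -- Placeholder implementation; this sketch exports tr but no sort function.
        tr = function(text) return "tr(" .. text .. ")" end,
    },
}

local cu = { names = { "Old Church Slavonic" }, support_module = "Cyrs-support" }
local en = { names = { "English" } }  -- no support module

local function load_support(lang_data)
    local name = lang_data.support_module
    if name == nil then
        return nil
    end
    return fake_modules[name]
end

-- Returns nil when the language has no support module or no tr function.
local function transliterate(lang_data, text)
    local support = load_support(lang_data)
    if support and support.tr then
        return support.tr(text)
    end
    return nil
end

-- Falls back to the unmodified text when no sort function is exported.
local function sort_key(lang_data, text)
    local support = load_support(lang_data)
    if support and support.sort then
        return support.sort(text)
    end
    return text
end

print(transliterate(cu, "abc"), sort_key(cu, "abc"))
```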

But by using this approach, a single function will be duplicated in many modules. Transliteration is different from diacritic stripping (and sort functions), algorithm of the former varies for each language, hence we need a separate translit module for each language, but for diacritic stripping we have (and need, for now) one function, only the characters vary, and the table that contains the lists of characters is not that big and there's not even any urgent need to move it to Module:languages. --Z 11:12, 15 July 2013 (UTC)Reply
If the diacritics are all that is different, then that is really the same as how different transliteration modules currently are. Compare Module:uk-translit and Module:bg-translit, which are the same except for the list of characters. Of course sometimes there are more differences, but that is my point: we may want to have more differences in the way we strip diacritics too, that we currently haven't anticipated. It's better to allow for that flexibility now while we are still at the beginning of things, rather than to run into problems later because we made our system too restrictive. But if it pleases you we can use the original idea of listing just the characters here. What about putting the name of the transliteration module in this module? Notice how Module:Cyrs-translit takes a language code as the second parameter. If we make that mandatory, we can use one module for several languages if the difference in transliteration rules is only small. And what about sort keys? —CodeCat 12:06, 15 July 2013 (UTC)Reply
What do you think about script-based transliterating for all languages? (having modules only for scripts) --Z 12:32, 15 July 2013 (UTC)Reply
I suppose that can work, but I don't know how well. Having transliterations per script would only make sense if they all have some kind of common base that is the same for most/all of them. I think Cyrillic is probably the best place to look for examples because it's probably used for the most languages outside Latin. So there will also be the most variety. —CodeCat 14:06, 15 July 2013 (UTC)Reply
The main problem (and the most important) is Cyrillic, which differs a bit for each language, but most other scripts are transliterated very similarly or even identically (Armn, Avst) for different languages. Moreover, transliteration depends first of all on the script. This approach makes much more sense, especially for those languages that use several different scripts. --Z 14:25, 15 July 2013 (UTC)Reply
That's definitely true. The division in scripts is primary, and splitting among languages only comes as an afterthought. If you can read Russian Cyrillic, then you can also understand the majority of the letters in other Cyrillic-using languages. Of course there are some subtle differences in pronunciation, some of which affect transliteration as well (like Ukrainian G = H), but that is really a matter of pronouncing the letters, which occurs in any script, even Latin. Someone who doesn't know Dutch phonology will still see that the letter G is the letter G, even if it's pronounced /ɣ/ in Dutch. —CodeCat 15:53, 15 July 2013 (UTC)Reply
Do we really need to fix what is not broken? Cyrillic transliteration modules work quite well now. I don't know if they need to be grouped into one module. There are many more languages which are still missing, but if we add, say, Udmurt or Buryat, and later a native speaker wants to change something, they won't be able to do it if the modules become too complicated and only one or two people can maintain them, or their change will affect the wrong languages. Some modules are heavily used, some may not even be used. Just a thought. As I said, I don't want to stop you but I'm not too enthusiastic about combining all Cyrillic modules into one.
We have a number of modules that need to be created or fixed. Abugida scripts, such as Thai, Lao, Khmer, Burmese, Devanagari can also be transliterated automatically but it would take a bit more coding skill than Cyrillic or Greek. The Korean and Hindi modules are there too but they are not 100% right, so they can't be used for automatic transliteration. The transliteration rules are not too complicated for readers/learners but they have to be converted into code. Are there any volunteers to fix/create any modules other than Cyrillic? I can help with Korean, Hindi, Thai, possibly Lao and Khmer; Stephen Brown knows Khmer and Angr knows Burmese. --Anatoli (обсудить/вклад) 23:36, 15 July 2013 (UTC)Reply
I have no problem implementing transliteration modules as long as all the rules are clearly spelled out beforehand. DTLHS (talk) 23:40, 15 July 2013 (UTC)Reply
Shouldn't we generate sort key for terms in all languages using a single function that, regardless of the language, removes all diacritics and initial hyphens? --Z 20:23, 16 July 2013 (UTC)Reply
No, different languages sort differently and treat diacritics differently. I think one of the only things that all languages do is strip hyphens. —Μετάknowledgediscuss/deeds 20:48, 16 July 2013 (UTC)Reply
Yes, some languages consider diacritics as part of the sorting order while others don't. In German, ö is sorted identical to o, but in Swedish, it comes at the end of the alphabet after z, while in Hungarian, it comes after o and ó and is sorted identical to ő. —CodeCat 21:37, 16 July 2013 (UTC)Reply
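
To make the German/Swedish contrast concrete, here is a rough sketch of language-specific sort keys. The rule table and the function name are hypothetical, lowercase input is assumed, and the Swedish placeholders are just one way to push letters past "z" in plain byte order; this is not how any existing module does it.

```lua
-- Hypothetical per-language replacement rules.
local sort_rules = {
    de = { ["ö"] = "o" },  -- German: ö sorts like o
    -- Swedish: å, ä, ö follow z; "{", "|", "}" come right after "z" in ASCII.
    sv = { ["å"] = "{", ["ä"] = "|", ["ö"] = "}" },
}

local function make_sort_key(text, lang)
    local key = text
    for from, to in pairs(sort_rules[lang] or {}) do
        key = key:gsub(from, to)  -- only the first return value is kept
    end
    return key
end

print(make_sort_key("öl", "de"))  -- ol
print(make_sort_key("öl", "sv"))  -- }l
print(make_sort_key("öl", "de") < make_sort_key("zoo", "de"))  -- true
print(make_sort_key("öl", "sv") > make_sort_key("zoo", "sv"))  -- true
```

The same word ends up before "zoo" under the German rules and after it under the Swedish ones, which is why a single language-independent function would not be enough.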
I created Module:orv-translit; apparently we can't directly use Module:Cyrs-translit alongside other language-based translit modules, because its tr function is not identical to those of the xx-translit modules: it takes lang as a second parameter. --Z 21:51, 16 July 2013 (UTC)Reply
We can make that work if we add that parameter to everything that uses the transliterations. If we want to change the naming, then we'll have to do that anyway. —CodeCat 22:18, 16 July 2013 (UTC)Reply
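
A sketch of what a shared tr(text, lang) could look like when the per-language differences are small. The tables are deliberately tiny and only illustrative; the one real difference used is the Ukrainian G = H point mentioned earlier in this thread. The character pattern is an assumption for running outside the wiki, where the ustring library is not available.

```lua
-- Matches one UTF-8 encoded character (NUL excluded).
local UTF8_CHAR = "[\1-\127\194-\244][\128-\191]*"

-- Shared base mapping; illustrative subset only.
local base = { ["а"] = "a", ["г"] = "g", ["д"] = "d", ["о"] = "o", ["р"] = "r" }

-- Per-language overrides applied on top of the base mapping.
local overrides = {
    uk = { ["г"] = "h", ["ґ"] = "g" },
}

local function tr(text, lang)
    local special = overrides[lang] or {}
    -- Returning nil from the callback leaves the character unchanged.
    return (text:gsub(UTF8_CHAR, function(ch)
        return special[ch] or base[ch]
    end))
end

print(tr("город", "ru"))  -- gorod
print(tr("город", "uk"))  -- horod
```

With the language code as a mandatory second parameter, every caller uses the same signature whether or not the module behind it serves more than one language.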

Since the discussions about whether to make automated transliteration language-based or script-based went nowhere, and indeed neither approach is perfect, I suggest using both. So the module titles will be put in this page, and both the language code and the script code will always be provided to these modules. One of them will always be useless, though... --Z 23:10, 27 July 2013 (UTC)Reply

We could provide only the language code, and assume that if a language needs transliteration for several scripts, that it will have several modules. There doesn't seem to be much to gain from unifying all the scripts of, say, Sanskrit, into one module, because they would not be able to share any code anyway. —CodeCat 23:18, 27 July 2013 (UTC)Reply
Looks good to me. --Z 01:14, 28 July 2013 (UTC)Reply

Regarding sort key, is it really necessary to do any other change beside what links.prepare_title does and removing initial hyphen etc.? --Z 01:14, 28 July 2013 (UTC)Reply

Yes. Just look at how German is sorted and you will see why. —CodeCat 02:02, 28 July 2013 (UTC)Reply

not languages

According to WT:LANGTREAT, none of these lects are recognised as languages on Wiktionary, yet they are present in this module. In most cases, that is because their templates were allowed to continue to exist long after the languages were de-recognised, so that they could be transcluded by the at-that-time badly-written/-coded LANGTREAT table. Would anyone like to speak in favour of allowing any of these lects L2s, or etyl: (etymology-only) codes? Are any already using L2s in defiance of LANGTREAT? (Some may be etym-only already, due to the overly simplistic way I checked for their existence in the module.) If not, let's remove them from this module. - -sche (discuss) 20:15, 16 July 2013 (UTC)Reply

Berber, the Gaulish lects, and probably the Inuktitut lects deserve etyl:s and there are probably entries that should use them. There could conceivably be a case for the Cree lects etc, but you would know more about that. After a quick scan, only one language seems like it deserves an L2: Judeo-Persian. It's an Iranian language with Hebrew and Aramaic borrowings written in the Hebrew script, analogous to Yiddish. —Μετάknowledgediscuss/deeds 20:54, 16 July 2013 (UTC)Reply
I've taken the liberty of updating LANGTREAT re Judeo-Persian. I have to imagine its addition to that page as a Persian dialect was a simple error, given the difference in vocabulary and script that you mention, and the apparent total lack of any discussion about it. I agree re Berber, Gaulish and Inuktitut. And the Cree dialects might as well have etyl: codes. - -sche (discuss) 21:30, 16 July 2013 (UTC)Reply
Another scan reveals Bukharic, which is pretty similar to Judeo-Persian. I'm sure it deserves etyl: but unsure whether it deserves an L2. —Μετάknowledgediscuss/deeds 21:36, 16 July 2013 (UTC)Reply
Hmm. I've asked in the BP whether the two should be combined under one name. - -sche (discuss) 19:05, 17 July 2013 (UTC)Reply
What about the 43 kinds of Quechua? I'm thinking "no, they shouldn't have etyl: codes", because I can find almost no references saying that specific words derive from specific varieties of Quechua. - -sche (discuss) 21:44, 16 July 2013 (UTC)Reply
That was my reasoning as well. I've seen people refer to varieties of Quechua by country (or more commonly just as "northern Quechua" and "southern Quechua"), but never as specific as that. I think we don't need those codes. —Μετάknowledgediscuss/deeds 21:47, 16 July 2013 (UTC)Reply
I've removed all of the Quechuas except "Classical Quechua", which I've left for now because its name implies it's somehow special (and one of our entries does use it, -ta#Quechua). - -sche (discuss) 17:09, 17 July 2013 (UTC)Reply
You need to fix all the translation sections that used those codes. For example, script error at gold. —Μετάknowledgediscuss/deeds 04:21, 19 July 2013 (UTC)Reply
I've been fixing them as they trickle into Category:Pages with script errors, it's just been taking the server a while to find them all and put them in that category. Lenape keeps trickling in at a rate of a few entries per day, too. - -sche (discuss) 05:51, 19 July 2013 (UTC)Reply
Sorry, for some odd reason I thought you were neglecting it, but (as usual) I was wrong. Thanks! —Μετάknowledgediscuss/deeds 14:09, 19 July 2013 (UTC)Reply

Language grouping

Currently languages are grouped in translation sections and other places in three distinct ways:

  • Under a language that has a distinct code of its own:
    • Arabic (ar): arq, xaa, abv, shu, arz, afb, mey, acm, apc, ayl, ary, apd
    • Greek (el): grc, gmy
    • Albanian (sq): aln
    • Buryat (bua): bxr
    • French (fr): fro, frm
    • German (de): ksh, gsw
    • Lithuanian (lt): sgs
    • Low German (nds): nds-de, nds-nl
    • Mongolian (mn): cmg
    • Norwegian (no): nb, nn
    • Sardinian (sc): srn
    • Spanish (es): osp
  • Under a macro language with no code of its own:
    • Chinese: yue, dng, gan, hak, cmn, nan, cdo, wuu
    • Lenape: umu, unm
    • Sorbian: dsb, hsb
  • As a dialect or specific script where the dialect has no distinct code:
    • Albanian (sq): Tosk
    • Coptic (cop): Bohairic, Sahidic
    • Lithuanian (lt): Aukštaitian
    • Malay (ms): Rumi, Jawa
    • Sardinian (sc): Nugorese
    • Talysh (tly): Asalemi
    • Serbo-Croatian (sh): Roman, Cyrillic, Arebica
    • Kurdish (ku): Kurmanji, Sorani
    • Aramaic (arc): Hebrew, Syriac
    • Kashmiri (ks): Arabic, Devanagari
    • Romanian (ro): Latin, Cyrillic

This isn't an exhaustive list. Can this be codified and added to the language module? Any comments on this classification? DTLHS (talk) 18:52, 28 July 2013 (UTC)Reply

It can be, but there are currently no modules that might use this. What would you use it for? —CodeCat 19:02, 28 July 2013 (UTC)Reply
Potentially a translation module (but this probably will never be fast enough to actually use). This data isn't just used by modules though- anyone parsing Wiktionary needs to use it extensively. DTLHS (talk) 19:11, 28 July 2013 (UTC)Reply
Some of these groupings are disputed by the actual practice of editors, if not by formal complaints. For example, Alemannic German and Ancient Greek are both often listed on their own under 'A'. And Ruakh opposed nesting Middle High German and Old High German under German on the grounds that if we call a language 'M...', 'O...', people will look for it under 'M', 'O', not under 'G'; that argument applies equally to 'Middle French' and 'Old French'. - -sche (discuss) 19:55, 28 July 2013 (UTC)Reply
Hence the need for codification and standardization. DTLHS (talk) 20:01, 28 July 2013 (UTC)Reply
We could also move all of the hardcoded language specific rules out of MediaWiki:Gadget-TranslationAdder.js and into a readable format. DTLHS (talk) 23:20, 28 July 2013 (UTC)Reply
I would definitely support separating code from data, so that people don't need to know the intricacies of the code just to change the data. —CodeCat 23:23, 28 July 2013 (UTC)Reply
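
As an illustration of keeping the grouping as data separate from code, here is a small sketch. The field names (name, code, members) and the helper are hypothetical, and only a few of the groupings listed above are included.

```lua
-- Illustrative subset of the groupings listed above.
local groups = {
    { name = "Arabic",  code = "ar", members = {"arz", "ary", "apc"} },
    { name = "Chinese",              members = {"yue", "cmn", "wuu"} },  -- no code of its own
    { name = "Sorbian",              members = {"dsb", "hsb"} },
}

-- Build a reverse index once: member code -> group entry.
local parent_of = {}
for _, group in ipairs(groups) do
    for _, member in ipairs(group.members) do
        parent_of[member] = group
    end
end

-- Returns the heading a code would be nested under, or nil if it stands alone.
local function heading_for(code)
    local group = parent_of[code]
    return group and group.name or nil
end

print(heading_for("cmn"))  -- Chinese
print(heading_for("arz"))  -- Arabic
print(heading_for("fr"))   -- nil
```

Disputed nestings could then be settled by editing the table alone, without touching the code that reads it.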

Alphabet field

For languages that use an alphabet, it would be useful to have a field with that alphabet in the standard character order. We could then automate Category:Terms by their individual characters by language (for any character not in the standard alphabet). Also useful for sorting. DTLHS (talk) 05:16, 30 July 2013 (UTC)Reply

Done. The automation is operational on the French Wiktionary. JackPotte (talk) 22:03, 2 January 2015 (UTC)Reply
cf. Wiktionary:Beer_parlour/2015/January#Category:Terms_by_their_individual_characters_by_language_automation_in_Lua. JackPotte (talk) 16:20, 4 January 2015 (UTC)Reply
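
For illustration, a sketch of how an alphabet field could be used to find the characters of a term that fall outside the standard alphabet. The alphabets table and the function name are hypothetical, lowercase input is assumed, and this is not the code running on the French Wiktionary.

```lua
-- Matches one UTF-8 encoded character (NUL excluded).
local UTF8_CHAR = "[\1-\127\194-\244][\128-\191]*"

-- Hypothetical `alphabet` field: the letters in their standard order.
local alphabets = {
    sv = "abcdefghijklmnopqrstuvwxyzåäö",
}

-- Returns a list of the characters in `term` that are not in the alphabet.
local function unusual_characters(term, lang)
    local known = {}
    for ch in alphabets[lang]:gmatch(UTF8_CHAR) do
        known[ch] = true
    end
    local found = {}
    for ch in term:gmatch(UTF8_CHAR) do
        if not known[ch] then
            found[#found + 1] = ch
        end
    end
    return found
end

print(#unusual_characters("öl", "sv"))                  -- 0
print(table.concat(unusual_characters("café", "sv")))   -- é
```

Each character returned would correspond to one "terms spelled with ..." category for that entry.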

Sort key for Russian

How is sort key for Russian generated? I don't know what does it but words with "ё" often appear at the front. The letter should be after the Cyrillic "е" and before "ж". --Anatoli (обсудить/вклад) 23:41, 6 August 2013 (UTC)Reply

Unfortunately, the order of the letters themselves is fixed. All we can do with sort keys is make certain letters equivalent to other letters. So we could replace ё with е in the sort key, and then those two letters would be sorted together as if they were the same. —CodeCat 00:07, 7 August 2013 (UTC)Reply
I think what you suggest is better than putting words with "ё" at the front. If you can do it, please replace ё with е. --Anatoli (обсудить/вклад) 00:17, 7 August 2013 (UTC)Reply
Would it also be possible to replace "ё" with "e+" (where "+" is some character that makes all "ё"s be sorted after plain "e"s)? I'm not saying that's a good idea, necessarily, but it would seem to be a possibility. I vaguely recall such a thing being proposed for Catalan. - -sche (discuss) 00:24, 7 August 2013 (UTC)Reply
It sounds interesting. Maybe "ё" with "e" + whatever character goes straight after "я" (last letter of the Russian alphabet, with a different trick for the capital letter)? — This unsigned comment was added by Atitarev (talkcontribs).
Can someone please explain the meaning of these? Grave is not used in Russian but acute is. Should the handling of "ё" be there?
    entry_name = {
        from = {"Ѐ", "ѐ", "Ѝ", "ѝ", GRAVE, ACUTE},
        to   = {"Е", "е", "И", "и"}} }
--Anatoli (обсудить/вклад) 23:22, 7 August 2013 (UTC)Reply
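
One plausible reading of the from/to data quoted above, sketched in code: each entry in from is replaced by the entry at the same position in to, and entries with no counterpart (here GRAVE and ACUTE) are simply removed. That pairing rule is an assumption inferred from the shape of the data, and the function name is made up.

```lua
local GRAVE = "\204\128"  -- U+0300 combining grave accent, as UTF-8 bytes
local ACUTE = "\204\129"  -- U+0301 combining acute accent, as UTF-8 bytes

local entry_name = {
    from = {"Ѐ", "ѐ", "Ѝ", "ѝ", GRAVE, ACUTE},
    to   = {"Е", "е", "И", "и"},
}

-- Assumption: from[i] becomes to[i]; with no to[i], from[i] is deleted.
local function apply(text, rules)
    for i, from in ipairs(rules.from) do
        text = text:gsub(from, rules.to[i] or "")
    end
    return text
end

-- A headword written with a stress mark maps to the unaccented page name.
print(apply("за" .. ACUTE .. "мок", entry_name))  -- замок
```

Under this reading, the acute entry strips stress marks from displayed headwords, and the precomposed grave letters are folded into plain Е and И.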
I've added this sort_key for Russian but I don't know if it had any effect. E.g. in Category:Russian_adjectives you'll see two vulgar adjectives - ёбаный, ёбнутый appearing after numbers. Actually, "Ё", "ё" should be AFTER "Е" , "е". Will it work if I change the sort key to "Ея" , "ея" ("я" is the last letter of the Russian alphabet).
    sort_key = {
        from = {"Ё", "ё"},
        to   = {"Е" , "е"}}
--Anatoli (обсудить/вклад) 23:42, 15 September 2013 (UTC)Reply
That can work, but maybe you can pick something that comes after я? It doesn't have to even be a Russian letter. Also, you don't need to handle uppercase and lowercase letters, sort keys are always lowercase. —CodeCat 23:45, 15 September 2013 (UTC)Reply
Thank you. I don't know why ё (\u0451) appears BEFORE я (\u044f) or any other Cyrillic letter. Its Unicode value is higher. I was just trying to find which letters follow "я"; it's ѐ, ё, ђ, ѓ, є, ѕ, etc. --Anatoli (обсудить/вклад) 23:56, 15 September 2013 (UTC)Reply
Maybe the letter in those entries is the Latin ë? —CodeCat 00:35, 16 September 2013 (UTC)Reply
Do you mean ёбаный and ёбнутый? No, definitely not. --Anatoli (обсудить/вклад) 00:47, 16 September 2013 (UTC)Reply
@Anatoli: Re: "I don't know why ё (\u0451) appears BEFORE я (\u044f) or any other Cyrillic letter. Its Unicode value is higher": MediaWiki no longer relies solely on dumb Unicode order; rather, it now converts the strings to uppercase (in a non-language-specific best-effort way), and then relies on dumb Unicode order. (Hey, it's an improvement. Keep in mind that MediaWiki doesn't know that the titles in a given category are all in a certain language, so there's a limit to how smart it can be. Admittedly, it's still far from bumping up against that limit . . .) So the reason that 'ё' comes before the regular Cyrillic letters is that uppercase Ё is U+0401, while the regular uppercase Cyrillic letters А through Я are U+0410 through U+042F. —RuakhTALK 02:50, 8 October 2013 (UTC)Reply
Thanks. In other words, sort_key is not useful at the moment? It's supposed to make "ё" and "е" equal (which isn't 100% right but more acceptable than "ё" appearing before the rest of the alphabet). --Anatoli (обсудить/вклад) 03:25, 8 October 2013 (UTC)Reply
Re: "In other words, sort_key is not useful at the moment?": I wouldn't say that. It's just that to craft a sort-key-mapping that will do what you want, you have to understand how MediaWiki will process the result. (We'll have to adjust the sort-key-mappings every so often, as MediaWiki makes improvements. It's fortunate that we can centralize this logic, thanks to Scribunto, which means we can update it when we need; it used to be that we had to maintain sort-keys by hand (or I suppose potentially by bot), which of course can't be done piecemeal, since a sort-key system only works if all pages in a category are using it.) —RuakhTALK 06:59, 8 October 2013 (UTC)Reply
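
The ordering described above can be reproduced in a few lines. This is only a sketch: the titles are written in uppercase to mimic the state after the uppercasing step, plain string comparison stands in for code point order (UTF-8 byte order matches it, assuming the default C locale), and sort_key is a made-up helper that treats Ё as Е.

```lua
-- Hypothetical sort key: make Ё equivalent to Е (both Cyrillic).
local function sort_key(title)
    return (title:gsub("Ё", "Е"))
end

-- Without sort keys: Ё (U+0401) precedes А (U+0410), so ЁЖ comes first.
local plain = {"ЯМА", "ЁЖ", "АИСТ"}
table.sort(plain)

-- With sort keys: ЁЖ is compared as ЕЖ and lands between А and Я.
local keyed = {"ЯМА", "ЁЖ", "АИСТ"}
table.sort(keyed, function(a, b) return sort_key(a) < sort_key(b) end)

print(table.concat(plain, ", "))  -- ЁЖ, АИСТ, ЯМА
print(table.concat(keyed, ", "))  -- АИСТ, ЁЖ, ЯМА
```

This is the sense in which a sort-key mapping has to be written with the later processing in mind: the replacement must still produce the desired order after uppercasing.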

Edit request for Ainu

Could someone edit the Ainu (ain) entry from

  • scripts = {"Kana"}

to

  • scripts = {"Kana", "Latn", "Cyrl"}

I don't know that Ainu is still written in Cyrillic, but there are people in Russia who have Ainu blood. --BB12 (talk) 14:40, 1 September 2013 (UTC)Reply

Done. —CodeCat 14:43, 1 September 2013 (UTC)Reply
Script inflation ahoy... sigh. -- Liliana 17:12, 1 September 2013 (UTC)Reply
Thank you! It's not working, though (Category:Ainu language). Is there a delay? --BB12 (talk) 17:51, 1 September 2013 (UTC)Reply
Yes, usually when edits are made to templates or modules it takes some time for all pages to be updated. You can force an update by editing the page, even if you make no changes and just save; that's called a "null edit". —CodeCat 18:00, 1 September 2013 (UTC)Reply
That worked. Thank you, again! --BB12 (talk) 18:26, 1 September 2013 (UTC)Reply

Sauraseni Prakrit

Could someone please add the script as Devanagari (Deva) and the language family as Indic (inc). Also, because the name is so long, people usually simply call it Sauraseni or Shauraseni. Is there a way I can incorporate that too? DerekWinters (talk) 00:16, 8 September 2013 (UTC)Reply

Every language in the list has a primary name, and possibly one or more alternative names. The primary name is the name that gets used in entries, category names and such. It can be changed, but that would mean changing all of the categories and entries to match so it's not something we can just do on a whim. It would definitely need discussion with the community first. For now, I've added Sauraseni and Shauraseni to the list of names, and changed the script and family. —CodeCat 00:37, 8 September 2013 (UTC)Reply
Thank you. Would it be possible to use the shorter names as a redirect like in Wikipedia? As of yet, there are no entries for the language. Also, Devanagari is used for both Ardhamagadhi Prakrit (pka) and Maharashtri Prakrit (pmh). 173.66.190.134 01:43, 8 September 2013 (UTC)Reply

Ukrainian sort key

Ukrainian "ґ" (g) should follow straight after "г" (h), "є" after "е", "і" after "и", "ї" after "і". Russian sort_key (i.e. "ё" after "е") didn't achieve the desired result. Just mentioning the problem. --Anatoli (обсудить/вклад) 00:35, 8 October 2013 (UTC)Reply

Encapsulation

Why should the function not be called from outside the module? How is that going to work? —CodeCat 23:36, 3 December 2013 (UTC)Reply

See Wiktionary:Grease pit/2013/November#Proposal for a new module policy of prohibiting top level data modules. The data-blob for a given language-code has a rather arcane structure; for example:
  • it has values called sort_key and entry_name that perform string translations, except that they're not functions so the actual algorithmic logic has to reside elsewhere (namely in Module:languages proper).
  • it doesn't have a specific value for the language's canonical name, but leaves that to be inferred by a special convention that the full list of names follows.
In general, the module should contain all of the logic for interpreting its data.
RuakhTALK 16:00, 4 December 2013 (UTC)Reply
We could provide more helper functions to interpret the data returned by get_language. But the modules will still need to use get_language to get that data otherwise we would have to re-fetch it every time a different piece of the data is needed. --WikiTiki89 16:04, 4 December 2013 (UTC)Reply
I disagree. Those helper functions should be using get_language. (This is what's called "encapsulation" or "abstraction".) —RuakhTALK 19:30, 4 December 2013 (UTC)Reply
I'm sure if we do that, people will complain that we're making things too complicated again (you know who I am talking about). But I think that trying to encapsulate all possible use cases into this module really will make it too complex. Do we really want all of the category sorting and entry name logic to go in Module:languages? —CodeCat 19:38, 4 December 2013 (UTC)Reply
Currently, most of the category sorting and entry name logic is in Module:languages; but then the client of Module:languages is expected to do the work to implement that logic. It doesn't make sense. —RuakhTALK 19:59, 4 December 2013 (UTC)Reply
It's still encapsulation and abstraction. get_language returns an object. The other modules don't care what's in the object, they only use the helper functions to extract information from it. --WikiTiki89 19:40, 4 December 2013 (UTC)Reply
But do you think you can enforce that? Lua doesn't have private data members as far as I know. Could we turn the returned object into an actual OO-like thing, with access methods? —CodeCat 19:57, 4 December 2013 (UTC)Reply
It can be enforced by convention. Just like "private" fields in Python. If you've ever programmed in any form of LISP, this type of programming would be familiar to you. --WikiTiki89 20:04, 4 December 2013 (UTC)Reply
Conventions are hard to enforce on a big wiki like this. Granted, people who work with Lua usually understand a bit more, but even then there is a strong tendency towards an "if it works, what's the problem?" mindset. That same mindset actually acts as a detriment to any real improvement, because then people start to say "it worked, why make it more complicated?". So some kind of obfuscation might be beneficial to discourage improper use. We could use the Python convention of prefixing private members with _. But can we also attach access functions? —CodeCat 20:11, 4 December 2013 (UTC)Reply
Most people on this wiki learn by example, if we start out using it the right way, people will see that and copy it. Considering that the right way is easier in this case, I don't foresee any issues here. We may be able to attach access functions but the only difference that would make is foo(x) versus x:foo(). --WikiTiki89 20:26, 4 December 2013 (UTC)Reply
You mean m_languages.foo(x). So the latter is actually shorter. But it's cleaner as well, and I like object-oriented code. I just haven't found a really neat use for it in modules yet, because the way we work with data, there is a very strong split between data and functions that manipulate it, with few opportunities to treat things as "objects". This seems like a place where we can, though. —CodeCat 20:47, 4 December 2013 (UTC)Reply
I guess it's a matter of opinion. You "like object-oriented code"; I don't (well at least not in simple cases like this). One disadvantage of OOP here is that we would have to either add in the methods dynamically in the get_language function or change all the language data modules again, both of which are ugly solutions. Whereas if we use accessor functions rather than methods, we wouldn't have to waste time on such things. OOP's strengths show when you have polymorphism, and in this case we don't. --WikiTiki89 20:56, 4 December 2013 (UTC)Reply
I think OOP makes a lot of sense here. I don't think it's too bad to "add in the methods dynamically"; presumably we would just attach the appropriate metatable. (Disclaimer: I've never actually played with Lua metatables.) —RuakhTALK 21:20, 4 December 2013 (UTC)Reply
Me neither, which is why I'm so afraid of them. We could give both ways a try and then performance test them. If the difference isn't too bad we can stick with the OOP. --WikiTiki89 21:24, 4 December 2013 (UTC)Reply
@Wikitiki89: I'd be O.K. with that; it doesn't seem any better than having those functions accept a language-code and call get_language themselves, but it also doesn't seem any worse. (Oh — one area where it might be better or worse, I'm not sure which, is error handling. Is it better for the client to receive nil and decide how to handle it, or is it better for this module to try to handle it? I don't know.) But obviously that's not what CodeCat is suggesting. —RuakhTALK 19:59, 4 December 2013 (UTC)Reply
I think the client should handle nil values, but as a precaution the helper functions should also check for it and do something. --WikiTiki89 20:04, 4 December 2013 (UTC)Reply
Having thought about it further — I think you're right; as long as clients are disciplined in treating get_language's return value as an opaque object, it makes sense to have them call it directly. —RuakhTALK 21:20, 4 December 2013 (UTC)Reply
I'm not sure we would even need metatables here. We could just do something like this:
local t = {}
t._private = "I am private, please don't touch me"
t.get_private = function(self) return self._private end

Or am I missing something? —CodeCat 21:38, 4 December 2013 (UTC)Reply

A closure would work better:
function make_object()
    local t = {}
    local data = {"x", "y"}
    t.get_x = function() return data[1] end
    t.get_y = function() return data[2] end
    return t
end
Metatables may or may not be better though. I don't know much about them. --WikiTiki89 21:45, 4 December 2013 (UTC)Reply
We could also avoid using methods and just use table properties instead. The get_language function above doesn't necessarily have to return the retrieved data verbatim, it can modify it too. Would that provide enough encapsulation? —CodeCat 22:13, 4 December 2013 (UTC)Reply
I don't think I know what table properties are, but I don't think that really makes a difference. --WikiTiki89 22:18, 4 December 2013 (UTC)Reply
How do you know it doesn't make a difference if you don't know what I meant? What I mean is, we still return the data as we did before, but "rearrange" it a bit in the accessor function so that the returned data format is not strongly tied to the format it's stored in. For example, we could split .names into .name and .alt_names. —CodeCat 22:36, 4 December 2013 (UTC)Reply
I see what you mean. The thing I don't like about it is that all the work is done beforehand, before it is known what work needs to be done. For example, you'd be splitting the names data when all the client wants is the script code or something. --WikiTiki89 22:43, 4 December 2013 (UTC)Reply
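
A sketch of a middle ground between the two positions: the stored format keeps a single names list, but the split into canonical and alternative names only happens when a client actually asks for it. The wrapper and accessor names are hypothetical; the sample data is the Sauraseni Prakrit entry as described further up this page.

```lua
-- Wraps a raw data table; nothing is rearranged until an accessor is called.
local function wrap(raw)
    local lang = {}

    function lang.get_canonical_name()
        return raw.names[1]  -- convention: the first name is the canonical one
    end

    function lang.get_alt_names()
        local alt = {}
        for i = 2, #raw.names do
            alt[#alt + 1] = raw.names[i]
        end
        return alt
    end

    function lang.get_script_codes()
        return raw.scripts
    end

    return lang
end

local sauraseni = wrap({
    names = {"Sauraseni Prakrit", "Sauraseni", "Shauraseni"},
    scripts = {"Deva"},
    family = "inc",
})

print(sauraseni.get_canonical_name())                -- Sauraseni Prakrit
print(table.concat(sauraseni.get_alt_names(), ", ")) -- Sauraseni, Shauraseni
```

A client that only wants the script code never pays for the name splitting, and the stored format can change without the callers noticing.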

Metatables

Since nothing currently transcludes this module anyway, I took the liberty of modifying it to demonstrate the metatable approach: Module:languages?oldid=24143943. This is mainly for discussion: I do think that metatables are the way to go, but this wasn't me planting a flag. :-)   —RuakhTALK 23:42, 5 December 2013 (UTC)Reply

I skimmed over it and from what I see, it looks good. I will look at it in more detail later. --WikiTiki89 23:46, 5 December 2013 (UTC)Reply
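
For readers unfamiliar with metatables, a minimal sketch of the general technique. This is not the code in the linked revision; the method names and the underscore-prefixed fields are assumptions, following the "private by convention" idea from the previous section.

```lua
-- Shared method table; every language object delegates lookups to it.
local Language = {}
Language.__index = Language

function Language:get_code()
    return self._code
end

function Language:get_canonical_name()
    return self._data.names[1]
end

function Language:get_family_code()
    return self._data.family
end

-- Stand-in for the real data modules.
local data = {
    cu = { names = {"Old Church Slavonic"}, scripts = {"Cyrs", "Glag"}, family = "zls" },
}

-- Returns an object with methods attached, or nil for an unknown code.
local function get_language(code)
    local raw = data[code]
    if raw == nil then
        return nil
    end
    return setmetatable({ _code = code, _data = raw }, Language)
end

local cu = get_language("cu")
print(cu:get_canonical_name())  -- Old Church Slavonic
print(cu:get_family_code())     -- zls
print(get_language("zz"))       -- nil
```

The methods live in one table and are attached with a single setmetatable call, so the data modules themselves would not need to change.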

Returning scripts and families

If we are going to make accessor functions for all the data in the data tables, then we will presumably also want to do this for families and scripts. Assuming that we want to turn them into encapsulated objects too, there is something I wonder about. When someone asks a LangCode object for its family or script(s), should we give it the code, or should we retrieve and return the whole object? The latter is neater, but it would be less than ideal if there are use cases where someone wants only the codes, not the rest of the information. I don't know if there are such use cases; after all, the codes by themselves aren't terribly useful for anything on Wiktionary. Script codes can be used by themselves (because they are also CSS classes) but our current code retrieves the information regardless, to ensure that the code is valid before using it. —CodeCat 01:38, 6 December 2013 (UTC)Reply

It's not either-or. You can do both with two separate functions. --WikiTiki89 01:40, 6 December 2013 (UTC)Reply
True. But I think that a "get code" function would rarely be needed, so we could skip it for now and add it when we find a use for it. —CodeCat 01:49, 6 December 2013 (UTC)Reply
Or we can add it so that when someone finds a use for it, that person won't have to figure out how to add it. --WikiTiki89 02:22, 6 December 2013 (UTC)Reply
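
The "two separate functions" option might look roughly like this. The function names, the shape of the script table and the script names in it are illustrative assumptions.

```lua
-- Stand-in script data; names are illustrative.
local scripts = {
    Cyrs = { name = "Old Cyrillic" },
    Glag = { name = "Glagolitic" },
}

local cu = { scripts = {"Cyrs", "Glag"} }

-- Cheap variant: just the codes, exactly as stored.
local function get_script_codes(lang)
    return lang.scripts
end

-- Full variant: looks each code up, which also validates it.
local function get_scripts(lang)
    local result = {}
    for i, code in ipairs(lang.scripts) do
        local sc = scripts[code]
        if sc == nil then
            error("unknown script code: " .. code)
        end
        result[i] = { code = code, name = sc.name }
    end
    return result
end

print(table.concat(get_script_codes(cu), ", "))  -- Cyrs, Glag
print(get_scripts(cu)[2].name)                   -- Glagolitic
```

The full variant keeps the current behaviour of checking that a code is valid before it is used, while the cheap one is there for callers that only need a CSS class.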

Module name property

I think it might be good to add the name of the module that the language code was retrieved from as part of the properties. We could then use it for things like "edit details" links. Currently, the edit link on the language category pages like Category:English language links to Module:languages, but we'll want to change that. However, for it to work properly, there needs to be a way for calling code to retrieve the module name that the code is defined in. —CodeCat 02:16, 6 December 2013 (UTC)Reply

That shouldn't be too hard, but I wouldn't rush to do it unless we're sure we'll need it. --WikiTiki89 02:23, 6 December 2013 (UTC)Reply
I did give a case where it would be needed. Do you think we should create that link in some other way? —CodeCat 02:25, 6 December 2013 (UTC)Reply
Well for one thing, we shouldn't be encouraging people who may not know what they're doing to edit the language data. People who know what they're doing will be able to find it without a link from the language category. --WikiTiki89 02:28, 6 December 2013 (UTC)Reply
The modules are still protected aren't they? —CodeCat 02:29, 6 December 2013 (UTC)Reply
Yes, which makes the link even more useless. We can continue linking to Module:languages and at Module:languages/documentation, we should add a directory or at least a link to Category:Language data modules. --WikiTiki89 02:33, 6 December 2013 (UTC)Reply
If you really want to though, it's really easy to determine the module from the language code. No need to add more clutter to the data structure. --WikiTiki89 02:38, 6 December 2013 (UTC)Reply
How would it be determined exactly? If we ask module writers to write their own code to determine it, then it defeats the whole purpose of separating the data storage from the retrieval mechanism, because then we can't change the storage without breaking that other code. —CodeCat 02:46, 6 December 2013 (UTC)Reply
No, I mean we could write a function in this module that takes a code and returns its module's location. There is no need for it to be part of the language data structure though. If we change the storage mechanism, we'd have to change this function anyway. --WikiTiki89 02:49, 6 December 2013 (UTC)Reply
I agree with CodeCat. Whether or not we really should have an edit-link, we currently do have an edit-link, so it should point to the right place. (Though I think it should link to the view page rather than the edit page.) —RuakhTALK 07:40, 6 December 2013 (UTC)Reply
I kind of assumed it was linking to the view page, otherwise I wouldn't have suggested keeping it linking to Module:languages (whose documentation page would then be displayed with a directory). --WikiTiki89 15:30, 6 December 2013 (UTC)Reply
Oh! That makes more sense. O.K., I'm neutral. —RuakhTALK 18:02, 6 December 2013 (UTC)Reply

Langrev

I've written two functions that implement the behaviour of Template:langrev. The first searches only by canonical name, and always returns a single language if it finds it. The second searches all the names, and returns a table of results. The second function has an "exact" parameter, which tells the function to find only languages that have the given name as one of their exact names, otherwise it searches as a substring by default. Given that these functions search through the whole database of codes, should we mark them as "expensive" to limit their use? —CodeCat 22:16, 6 December 2013 (UTC)Reply
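A minimal sketch of the two lookups described above, not the module's actual code. It assumes a hypothetical `languages` table in which each entry lists its names with the canonical name first, and it returns language codes rather than language objects to stay short.

```lua
-- Hypothetical sample data; the real data modules are far larger.
local languages = {
    fr = { names = { "French" } },
    nd = { names = { "Northern Ndebele", "North Ndebele" } },
    nr = { names = { "Southern Ndebele" } },
}

-- Search by canonical name only (assumed to be the first name listed).
-- Returns a single code, or nil if nothing matches.
local function getLanguageByCanonicalName(name)
    for code, data in pairs(languages) do
        if data.names[1] == name then
            return code
        end
    end
    return nil
end

-- Search every name. By default the query matches as a substring;
-- with exact set to true it must equal one of the names in full.
local function findLanguagesByName(name, exact)
    local results = {}
    for code, data in pairs(languages) do
        for _, candidate in ipairs(data.names) do
            -- Plain find (fourth argument true) so the query is not
            -- interpreted as a Lua pattern.
            if (exact and candidate == name)
                or (not exact and candidate:find(name, 1, true)) then
                table.insert(results, code)
                break
            end
        end
    end
    table.sort(results) -- pairs() order is unspecified, so sort the output
    return results
end
```

Both functions walk the whole table on every call, which is the cost that the question about marking them "expensive" refers to.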

I would have said that we should make it subst-only, but I don't think that works with modules. Maybe we should make it bot/JS only? --WikiTiki89 22:19, 6 December 2013 (UTC)Reply
Some of our current templates use {{langrev}}, it doesn't have zero transclusions. It's used by {{langname}} to convert a secondary name to a canonical name, which in turn is used by {{ttbc}} and {{trreq}}. —CodeCat 22:21, 6 December 2013 (UTC)Reply
So maybe we should allow them to be added, but have a bot automatically fix all {{ttbc}}s and {{trreq}}s to use language codes? That is if anyone would be willing to make a bot for that. --WikiTiki89 22:24, 6 December 2013 (UTC)Reply
We may want to just get rid of {{langname}} instead. It was originally created long ago when many templates only accepted language names, not codes. It was supposed to be a stopgap measure until everything was converted over to codes. This is now complete, so we don't really need it anymore. {{trreq}} and {{ttbc}} only use it because they display the name as well as the code, so it's convenient in translation tables. And I think Kephir's bot still adds them with names instead of codes. There was some discussion a while back, I'm not sure where. —CodeCat 22:28, 6 December 2013 (UTC)Reply
See also Category:Language code is name. —CodeCat 22:31, 6 December 2013 (UTC)Reply
Right, since we don't need it anymore, we should get rid of it. --WikiTiki89 22:36, 6 December 2013 (UTC)Reply

Ready?

I've converted Module:families and Module:scripts as well, because they don't seem to have any transclusions anymore (the ones that still appear all seem to go away when I null-edit them). Is this ready to be deployed? I think I would prefer it if it could be reviewed a bit more, especially by different people so that we don't get complaints later that we tried to push it through without consensus. Then again I think that certain people will never be pleased... —CodeCat 22:11, 7 December 2013 (UTC)Reply

I think we should mention it in the WT:GP asking people to review it. --WikiTiki89 22:19, 7 December 2013 (UTC)Reply
Yes, I agree. And we should give it some time; if we sleep on it a few days, I'm sure we'll think of other features we want to add before it goes live. :-)   —RuakhTALK 03:04, 8 December 2013 (UTC)Reply
So... what happens now? —CodeCat 22:25, 26 December 2013 (UTC)Reply
Now we need to convert all modules that use Module:languages/alldata directly to use Module:languages instead. --WikiTiki89 20:02, 27 December 2013 (UTC)Reply
First I think we need to address the points raised in the GP discussion . . . —RuakhTALK 20:10, 27 December 2013 (UTC)Reply
I don't think that's something we can really address, or can we? It seems more like a clash of understanding of programming practice, a kind of "something I understand" versus "something tried and true by programmers worldwide". —CodeCat 20:33, 27 December 2013 (UTC)Reply
O.K., "point raised". I agree that we probably can't address Kephir's dislike of software engineering, but as I said in the GP discussion, I do think he's right to want a way to iterate over all language-codes. We're already aware of several use-cases: (1) the "unit-test" page that tests the data for consistency; (2) multiple modules that JSONify fragments of the data for use in JS or by bots; (3) the page that lists all the language codes; (4) the module's own getLanguageByCanonicalName and findLanguagesByName, both of which currently rely on the very Module:languages/alldata that we wanted to orphan.
Also, this wasn't actually in the GP discussion, but I think we haven't yet settled whether templates should be able to call this module directly, and if so how. (Are you happy with what's currently there?)
RuakhTALK 08:14, 28 December 2013 (UTC)Reply
  1. The consistency check can be an exception, because it's meant to look at the data directly. As long as we document that I don't think it's a problem.
  2. For JSON it depends on what data the scripts need and what they do with it. We can't encapsulate the access functions in JSON as far as I know, so we can only send data. However, there's no reason that this data couldn't be "disconnected" from the data format in the modules, so that we purposely do not guarantee that whatever data JSON receives is the same data format stored in the module. Again as long as we document it, it shouldn't be a problem. The idea of encapsulation is limiting direct access to "official" and "approved" channels, but the JSON module could be one of those approved channels.
  3. The same applies to the page that lists all codes, too. This page is meant to show all the data in an overview, so it would make sense for it to access the data directly. Not sure though.
  4. Those two functions can be rewritten fairly easily, so that's not a big problem.
  5. I think using a separate module for template access is good, because that makes it easier to track such usage. We already can do that currently with Module:language utilities, and I think we should definitely keep it separate that way. In fact, it might be a good idea to make this a general practice, separating Lua-internal access from template access into distinct modules more generally. I don't know if the current way of accessing the data from template is good enough. It works, kind of, so we could try it, but we could also wait a little more and use the old method for now. We don't have to migrate everything in one go. We want to orphan /alldata, but we don't need to orphan it all right away.
CodeCat 15:25, 28 December 2013 (UTC)Reply
Re: 1–4: Obviously everything can be an exception, but why should they be? You seem to be saying that, rather than putting that logic in Module:languages (which already needs to know where all the data-modules are anyway), it's better to copy the logic to several different places. It just makes no sense to me.
Re: #5: O.K., I'll remove the experimental function, then. (BTW, I still think Module:languages/main is a better name for the external Module:languages entry-point than Module:language utilities.)
RuakhTALK 19:41, 28 December 2013 (UTC)Reply
I didn't really say that the current name is better. But "main" is misleading and doesn't really describe it. Module:languages/templates might be better.
I do think you have a point about exceptions. But these are exceptions for a reason: they do things that normal code should never need to do. For example, the current implementation of this module doesn't allow you to check whether the language has a sort key or entry name pattern specified, because this isn't needed for the level of abstraction that's given. However, WT:LL does need to be able to check this, so the interface is not enough. Writing a whole new accessor function just for that one page is a bit overboard, so direct access might just be better.
The same applies to JSON. The interface doesn't currently export all the data that JSON might need, and I don't actually know if it would make sense to. Lua string patterns don't make sense to JavaScript for example, so entry_name and sort_key are already useless to JSON unless they can be converted before they're exported. And to convert them means to access the patterns directly, so again this means writing an accessor without ever intending to use it beyond one or two pages.
Data consistency checks might not be possible to do using the regular interface, because that very same interface relies on the data already being consistent. And that's kind of the point of doing the checking externally: it means that you don't need to build those checks into the main module, which keeps it simpler and faster. The consistency check works like an "assert" statement; you don't want it in production code, but you still want it just in case.
CodeCat 19:59, 28 December 2013 (UTC)Reply
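A minimal sketch of an external consistency check of the kind described above, reading the data table directly instead of going through the accessors. The `languages` table, the `xx` code, and the two rules checked are assumptions chosen for illustration, not the checks the real test page performs.

```lua
-- Hypothetical data with one deliberate problem:
-- two codes share the same canonical name.
local languages = {
    fr = { names = { "French" } },
    xx = { names = { "French" } },
    la = { names = { "Latin" } },
}

-- Return a list of human-readable problems instead of raising an error,
-- so that one bad entry does not hide the others.
local function checkConsistency(data)
    local problems, seen, codes = {}, {}, {}
    for code in pairs(data) do
        table.insert(codes, code)
    end
    table.sort(codes) -- fixed order, so the report is reproducible
    for _, code in ipairs(codes) do
        local names = data[code].names
        if type(names) ~= "table" or names[1] == nil then
            table.insert(problems, code .. ": no canonical name")
        elseif seen[names[1]] then
            table.insert(problems,
                code .. ": canonical name also used by " .. seen[names[1]])
        else
            seen[names[1]] = code
        end
    end
    return problems
end
```

Because the check lives outside the main module, it can be run on a test page only, which matches the "assert" analogy in the comment above.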
Re: "main": It's a traditional name for the external entry point into a programming language; cf. C/C++/Java/C#/D, Haskell, Python.
Re: Most of the rest of your comment: You seem to be forgetting about the :getRawData() method. For example, to check whether a sort key is specified, you would use langObject:getRawData().sort_key. (And your statement that "Lua string patterns don't make sense to JavaScript for example, so entry_name and sort_key are already useless to JSON unless they can be converted before they're exported" is ill-considered. Whatever "conversion" is needed can be done just as well in the client code. My translation-bot is in Perl, not JavaScript, so would not benefit from JavaScript-oriented "conversion" of the entry_name patterns. I have code in Module:User:Ruakh that JSONizes them straightforwardly — see output — and the bot has logic to interpret them.)
RuakhTALK 22:11, 28 December 2013 (UTC)Reply
If we have a getRawData method, doesn't that just open the gate wide for people to use it whenever they like? It's a bit like saying "we encapsulated it all, but if you don't like that, here's how you get around it". So if we do have such a method, we should find a way to keep track of all pages that use it, so that we can eliminate bad usage. —CodeCat 22:20, 28 December 2013 (UTC)Reply
I see that you've now added a getAllLanguages function; thank you.
Regarding your comment: Keep in mind that encapsulation and data hiding are about hiding things from code, not from people. This is a collaborative effort; anyone who really wants to circumvent the encapsulation will always be able to do so. In fact, that's what you were proposing: you were saying that anyone who wants to bypass the encapsulation can go straight to the data-modules. I was just saying that ideally, people should bypass as little encapsulation as they can. If they can get away with just bypassing the "clean" accessors, but still using Module:languages to find the data, then that's better than if they have to bypass everything.
(BTW, note that even without :getRawData(), there's still ._rawData. So I actually think the existence of :getRawData() should be fairly prominent, otherwise people will see that the module itself is using ._rawData, and they'll think that that's what they should do if they need something weird/unsupported.)
RuakhTALK 20:01, 29 December 2013 (UTC)Reply
I will bite from another angle: what are you planning to do after orphaning it? What magical possibilities will unfold that we have not had before? Keφr 15:47, 28 December 2013 (UTC)Reply
I don't think I understand your question. Why are you asking it, why is it relevant? —CodeCat 16:47, 28 December 2013 (UTC)Reply
What are you going to do next when the migration is done. You are not doing it just for the sake of rewriting things, right? I still have no idea why you want to impose this particular abstraction if some modules are going to bypass it anyway, as the above seems to imply. And how do we decide which modules should bypass it and which should not. Because I would start with the question of what are we trying to accomplish with it in the first place. Concrete benefits, not just "everyone does it". Keφr 18:06, 28 December 2013 (UTC)Reply
When the migration is done, changing individual language data modules will affect fewer pages. All other advantages are just icing on the cake. --WikiTiki89 18:45, 28 December 2013 (UTC)Reply
I am diabetic when it comes to coding. Distributing transclusions between pages might just as well be accomplished with the proxy table I wrote. Keφr 15:20, 30 December 2013 (UTC)Reply
I think it's a bit of a crystal ball situation. We don't expect the encapsulation to improve things now, but it's done as a matter of foresight. We expect that it will make changes easier in the future (which is why it's widely practiced in the first place), making our code easier to maintain. So we can't currently say what improvements it will bring us right now, but we can say with some certainty (because of the experiences of the thousands of programmers that came before us) that not doing it will bite us at some unspecified time in the future. It would be short-sighted not to plan for the future. —CodeCat 18:53, 28 December 2013 (UTC)Reply
I think that creating convoluted layers of indirection hurts maintainability more than anything else. And what kinds of changes are supposed to be easier now, exactly?
  • Completely removing a data field. Before: 1) edit users to stop using the field, 2) remove field from data. After: 1) edit users to stop using accessor functions, 2) remove accessor functions, 3) remove field from data.
  • Replacing an unspecified number of data fields with other fields: Before: 1) edit users to check for new fields, and fall back to the old ones, 2) convert old data to the new fields, 3) remove fallback from users. After: 1) rewrite accessor functions, possibly add new ones, 2) possibly rewrite users to make use of new accessor functions, if any, 3) possibly remove old accessor functions.
  • Adding a completely new data field. Before: 1) add the field to the data, existing users will ignore it, 2) create consistency tests, 3) make use of the new field. After: 1) add the field to the data, 2) create accessor methods, 3) create consistency tests, 4) make use of the new field.
  • Switching to another database. Before: 1) create a proxy table for compatibility, which gathers data from the new place and converts it to the old format, 2) write new accessor functions to use the new database with more efficiency, 3) convert users. After: 1) change accessor methods to gather data from another place, 2) possibly create new accessor methods to use the new database with more efficiency, 3) possibly convert users.
There is no difference, really. And remember that more code means more opportunity for mistakes and more cognitive burden on whoever comes to maintain the modules after us. Also, through use of modules like Module:utilities and Module:links, quite a lot of what we do with data is already abstracted away. Keφr 15:20, 30 December 2013 (UTC)Reply
How do you remove the field without breaking many pages? You have to remove it from all modules that use it, which is almost impossible if there's no way to trace it. With an accessor function, you can track it. But I think you missed the biggest difference of all: if you remove a data field, then you don't need to remove the accessor at all if you just rewrite the function to provide the data in some other way. That's the whole point of creating a break between data and access: they don't have to correspond to each other. So really your objections make no sense because they're built on the assumption that every data field has an accessor. That's certainly not the case. There is no reason that changing the data will also change the interface. That's the whole point. —CodeCat 15:33, 30 December 2013 (UTC)Reply
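A minimal sketch of the point about accessors not having to mirror the data fields. It assumes a hypothetical storage change in which a single `script` field was replaced by a `scripts` list; the `getScript` accessor is likewise hypothetical, while `getCanonicalName`, `getRawData`, `_rawData` and `getLanguageByCode` are names taken from the discussion.

```lua
-- Hypothetical raw data in the "new" storage format: the old single
-- `script` field no longer exists, only a `scripts` list.
local rawData = {
    fr = { names = { "French" }, scripts = { "Latn" } },
}

local Language = {}
Language.__index = Language

-- Hypothetical accessor whose interface predates the storage change.
-- Callers still receive one script code although the field is gone.
function Language:getScript()
    return self._rawData.scripts[1]
end

function Language:getCanonicalName()
    return self._rawData.names[1]
end

-- Escape hatch for tools that genuinely need the raw table.
function Language:getRawData()
    return self._rawData
end

local function getLanguageByCode(code)
    local data = rawData[code]
    if not data then
        return nil
    end
    return setmetatable({ _rawData = data }, Language)
end
```

Client code that only calls `getScript()` keeps working across the change; only code that reads the raw table has to be revisited.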

Camel-case vs. underscores.

The Lua standard library is neutral about camel-case vs. underscores (e.g. fooAndBar vs foo_and_bar) — no built-in function uses either one, but the Scribunto libraries consistently use camel-case. Left to my own devices, I'd prefer underscores, but given the fundamentality of Scribunto to all of our modules, I think we should follow their convention. —RuakhTALK 01:54, 6 December 2013 (UTC)Reply

I have wondered that too. Following Scribunto's lead may be more consistent, but I do think underscores look nicer. —CodeCat 01:59, 6 December 2013 (UTC)Reply
I agree with CodeCat, underscores look nicer, but consistency is also important. --WikiTiki89 02:24, 6 December 2013 (UTC)Reply

English in own data-module?

Should we put English in its own data-module? There seem to be about half a million pages that use the English details (via {{en-noun}}/{{en-verb}}/{{en-adj}} or {{head|en|...}}/{{l|en|...}}/{{context|...|lang=en}} or whatnot), so even if the languages in the 'stable' module are actually stable, there may be a benefit to putting English in a 'super-stable' module (just named .../en). But I'm not sure. (There's obviously a risk of overoptimizing, or of optimizing the wrong thing: if we get the English data absolutely stable, but are still constantly modifying all the other languages, and/or Module:languages itself, then it won't really make a difference.) —RuakhTALK 22:22, 7 December 2013 (UTC)Reply

I think it would be enough to just put it in Module:languages/stable. This is something we can always change later if it gets to be a problem. --WikiTiki89 22:57, 7 December 2013 (UTC)Reply
Yes, that's the nice thing about keeping the implementation details hidden in this module. We can always change it to fit changing needs. :) —CodeCat 22:59, 7 December 2013 (UTC)Reply
Absolutely, but I'm hoping that we can really avoid modifying this module. I mean, it has some pretty complex logic and detailed functionality, so it's obviously unreasonable to hope that it will never change, but it would be nice if changes to it were rare. —RuakhTALK 02:58, 8 December 2013 (UTC)Reply
We (I) will still be "constantly modifying all the other languages" for the foreseeable future, changing the script and family info of existing languages and adding new-to-us languages which were excluded from previous bot-imports of ISO codes because of naming conflicts, or which were simply created by the ISO sometime after our last import of codes. And when I say "the foreseeable future", I've been at this for at least a year, and other Wiktionarians had years to be at it before I started, and I doubt we're halfway done.
It seems to me the number of entries in a given language at least approximates the lower bound of the number of places where its code is used. At least, this seems true for the top nine languages here, since most of our entries in those languages contain some templates, and even if some entries don't, entries in other languages that mention those languages in templatized etymologies ("from {{etyl|la|fr}} {{term|fūbar|lang=la}}") probably more than make up the difference.
Using this logic, I'd guess that en, la and it are used on about half a million pages each, and fr and es on about a quarter million pages each. I would split each one onto its own subpage, and going forward would consider splitting any language onto its own subpage once it reached a quarter million entries. Yes, we may change Latin's sort order, but we'll still change Module:languages/data2 more often than we change Latin's sort order, so there's still a benefit to be had from pages that use la not going into the job queue just because someone added "North Ndebele" as an alt name for "Northern Ndebele". And there's a definite benefit to be had from pages that use en not going into the job queue just because someone updated Latin's sort order or Northern Ndebele's name. - -sche (discuss) 04:14, 8 December 2013 (UTC)Reply
You do realize we have Module:languages/stable (so far still empty) for widely transcluded stable languages? --WikiTiki89 04:19, 8 December 2013 (UTC)Reply
Yes, but for our most-widely-transcluded languages — even (or especially) if they're not 100% stable, as in the case of Latin — individual subpages seem better, for the reasons I outlined. - -sche (discuss) 04:29, 8 December 2013 (UTC)Reply
I'm torn. What you're saying makes sense, but one clear downside (IMHO) is that Module:languages has to know where to find each language, so if there are languages in their own modules, then Module:languages will need a hardcoded list of them. (But then, I suppose that with the current approach, Module:languages/stable is used on every page that uses Module:languages; so anything that greatly decreases the number of occurrences of Module:languages/stable is still likely to be a win, even if it means a slight increase in the frequency of editing Module:languages.) —RuakhTALK 07:51, 8 December 2013 (UTC)Reply
I'm starting to think that if we do this, then there would be no point in also having Module:languages/stable. --WikiTiki89 17:23, 8 December 2013 (UTC)Reply
Hm, quite possibly. It was conceived with good intentions but in the end would probably represent too much decentralization / putting codes in too many places, to not enough benefit. Assuming that the plan was to add major languages to /stable when they became stable, it would probably end up being updated as often as /data2. En, la, it, fr and es subpages would (by the measure I used above) cover 2 million of our 3.5 million pages. A large number of the pages that use en will also use other codes, of course, because lemma forms of English words often contain etymologies or trans tables; they will still go into the job queue when those other codes are updated. But most of our entries for English inflected forms (plurals, forms of verbs) contain no other codes, and so still benefit from en being on its own subpage. - -sche (discuss) 17:56, 8 December 2013 (UTC)Reply
Actually, the way it's set up now, changing stable will change every page. Because even for codes that are not stable, the module still checks if they're in there, which counts as a transclusion all the same. —CodeCat 18:35, 8 December 2013 (UTC)Reply
Sorry, I think you must be misreading something, because your comment does not seem very relevant. The proposal being made here is that the module would not check /stable for the top few most-widely-transcluded languages, but rather, would retrieve them directly from their dedicated modules. —RuakhTALK 20:31, 8 December 2013 (UTC)Reply
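A minimal sketch of the lookup this proposal implies: a hardcoded list of languages with dedicated modules, consulted before anything else. The function name `getDataModuleName` is hypothetical, and treating Module:languages/data2 as the single fallback for every other code is a simplifying assumption.

```lua
-- Hypothetical hardcoded list of languages that have a dedicated data module.
local dedicated = { en = true, la = true, it = true, fr = true, es = true }

-- Map a language code to the name of the data module that holds it.
local function getDataModuleName(code)
    if dedicated[code] then
        return "Module:languages/" .. code
    end
    -- Assumption: every other code lives in one shared module.
    return "Module:languages/data2"
end
```

Inside Scribunto, the result would presumably be passed to `mw.loadData`, so a page that only uses `en` would be recorded as transcluding only the English module.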
There are almost three million items in the job queue (which is almost all of our entries). Let's prioritize getting these most common languages into their own subpages. - -sche (discuss) 19:59, 11 December 2013 (UTC)Reply
That won't help at all yet. We haven't finished this module, nor started using it yet, and that's the reason our optimizations have not taken effect yet. We should prioritize getting this module up and running. --WikiTiki89 20:05, 11 December 2013 (UTC)Reply
For whatever reason - I hope it was normal operation, though it could have been some kind of special intervention - the job queue of 3+ million virtually emptied to "only" 76K in a couple of days. Although we don't want to be constantly spiking to 3MM as it seems to lead to some things residing in the queue for weeks, we clearly can tolerate occasional spiking. Routine spiking is what should be avoided. DCDuring TALK 14:56, 13 December 2013 (UTC)Reply

Calling from templates.

How will templates use this module? Do we need simple wrapper exports? Should we have a Module:languages/main for use by templates? —RuakhTALK 20:59, 11 December 2013 (UTC)Reply

There's no reason why we can't just have the function in this module. Unless we plan on changing it a lot, which I doubt. --WikiTiki89 21:12, 11 December 2013 (UTC)Reply
I've been wondering if we can make a single, universal generic entry point for templates to call modules. That way we wouldn't have to duplicate lots of functions just to make them template-accessible. Such an entry-point function should have a list of allowable functions, so that things don't become totally chaotic and hard to track, and also to prevent people from doing things we'd rather not see, but I think it can work? —CodeCat 21:26, 11 December 2013 (UTC)Reply
Maybe, but let's not experiment with this module. --WikiTiki89 21:27, 11 December 2013 (UTC)Reply
I've done something like that in a few modules, CodeCat, but it's not very elegant. The problem is that Lua has many different kinds of data-structures, which can all be nested, all passed in and out of functions, etc., whereas the entry-point will only receive a string-map, and will have to return a single string. So it only works if the problem is constrained in some way. In the present case, I suppose we could write something like
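function export.main(frame)
    local args = frame.args  -- template arguments live in frame.args, not on frame itself
    local success, ret = pcall(function ()
        local language = export[args[1]](args[2])
        -- frame.args is metatable-backed, so copy the method arguments before unpacking
        local rest, i = {}, 4
        while args[i] ~= nil do
            rest[#rest + 1] = args[i]
            i = i + 1
        end
        return language[args[3]](language, unpack(rest))
    end)
    return success and ret or args.onError or ''
end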
, allowing e.g. {{#invoke:Module:languages|main|getLanguageByCode|fr|getCanonicalName|onError=Unknown}} or {{#invoke:Module:languages|main|getLanguageByCanonicalName|French|makeSortKey|foo|onError=foo}}; so, the first argument is either getLanguageByCode or getLanguageByCanonicalName (I don't think findLanguagesByName is as amenable to being called directly from a template), the second argument is the code or name, the third is the method to call on the language object, and all subsequent arguments are the method arguments; plus, as a special case, onError=... is the wikitext to generate if anything goes wrong (e.g. if the language wasn't found).
RuakhTALK 07:06, 12 December 2013 (UTC)Reply
I suppose you're right. It was a nice idea but it's too limited to really be practical, because you would need some way to chain the functions to get sensible output that templates can use. —CodeCat 14:12, 12 December 2013 (UTC)Reply
To clarify, I think a function like this could be useful, and would cover most use cases. It's just that we can't necessarily expect it to cover all use cases. —RuakhTALK 18:46, 12 December 2013 (UTC)Reply
The only three functions that can be called from this module at all are getLanguageByCode, getLanguageByCanonicalName and findLanguagesByName. So we should consider them separately.
  • getLanguageByCode is the only function that would definitely need to be exported for templates.
  • getLanguageByCanonicalName can be used from a template, but I wonder if there's any real point to it anyway. {{langrev}} isn't used much as it is, and the uses it does still have can be gotten rid of.
  • The findLanguagesByName function returns a table, so it doesn't really make much sense in terms of templates, because there's no easy way for a template to access all the items in the table. Templates also lack any proper form of iteration, so they wouldn't even be able to iterate through all members of a table. I think we should just not provide any access to this from templates at all, to avoid people writing overly-complex solutions to a problem that is solved much more simply if only they'd write it in Lua. However, we should definitely allow JSON export, because this function would be very useful as backend support for the translation editor.
This is easy so far. The more difficult part is what to do with the Language object that getLanguageByCode returns. We could write the function to take the method to be invoked as the first parameter, so then we'd only have to look at how to pass and return the parameters and results of each of the methods.
  • getCode, getCanonicalName, getCategoryName, makeEntryName and makeSortKey take either zero parameters or a single string, and return a string too. They should be pretty straightforward to call from a template.
  • getAllNames returns a table, but it's unlikely that templates will need this anyway (even Lua modules probably wouldn't, usually) so we can just decide to not make it accessible.
  • getFamily returns a Family object. We can extend the scheme for this, by allowing an additional parameter which specifies which method to call on the Family object.
  • getScripts returns a table of Script objects, which combines the two above problems. I'm not sure which templates are currently using the script data, but I'm sure there are still quite a few, so we may not be able to get around this easily. Perhaps we can return the first script by default (which is what our templates used to do, before we had script detection), and then apply the same thing we do for getFamily to call methods on the Script object.
Ideally, we would eliminate all cases where templates currently use the script or family data. I don't know how feasible that is, though. The hardest culprit will probably be {{langcatboiler}}, which displays all the data for a given language.
So, my proposal would result in module invocations like this: {{#invoke:languages|getLanguageByCode_t|fr|getCanonicalName}} which would then return French. {{#invoke:languages|getLanguageByCode_t|fr|getFamily|getCanonicalName}} would give Romance. {{#invoke:languages|getLanguageByCode_t|fr|makeSortKey|café}} would give cafe. —CodeCat 00:55, 13 December 2013 (UTC)Reply
I've added a first attempt at getLanguageByCode_t in the module. Please give feedback? —CodeCat 01:14, 13 December 2013 (UTC)Reply
It has more code, more hard-coded logic, and less functionality than the main I posted above. (Why add so many explicit restrictions about which methods are allowed to be called?) Plus, it's more inclined to generate script errors. And I'm not sure why you're hardcodedly passing args[3] rather than using unpack. —RuakhTALK 02:21, 13 December 2013 (UTC)Reply
I wrote something I could understand and which illustrated the point I wanted to make. I don't really understand how your code works. —CodeCat 02:35, 13 December 2013 (UTC)Reply
. . . so you decided to silently ignore it? That's not very conducive to discussion and collaboration. Even if you didn't understand how my code works — and even if you're not interested in learning — you should still be able to formulate an opinion about it, because I posted examples of how it would be called, and also because, unless you have no respect for your own intelligence, you should view your own lack of understanding as some sort of argument. Most of your proposal is pretty meaningless, because it starts from square 0 and progresses to square 1, after I had already posted square 2 — which your proposal is not a response to. —RuakhTALK 06:10, 13 December 2013 (UTC)Reply
*sigh* I guess I don't have much respect for my intelligence, no. :/ I think I kind of understand what your code does now, after having written my own. Writing my own has given me something to relate it to. I do wonder what your code might do if the function returns something other than a string. It would presumably show an error. Do you think you can make it work so that, if you call getFamily, you can then either call methods of the returned Family object, or return getFamily:getCode() automatically? —CodeCat 13:33, 13 December 2013 (UTC)Reply
Good question. It wouldn't be elegant, but it could be done by making some sort of assumption; for example, we could assume that a method whose name starts with get does not take any arguments, and that any subsequent frame-arguments must be intended as subsequent method calls. (So, for example, the get in getFamily means that getLanguageByCode|fr|getFamily|getCode would be interpreted as meaning getLanguageByCode('fr'):getFamily():getCode() rather than getLanguageByCode('fr'):getFamily('getCode').) A slightly different (and mostly weaker) assumption would allow both possibilities: we would call the getter with all subsequent arguments, which it would likely ignore, and if it returned a table, then we would re-interpret the arguments as a method call. (So, for example, getLanguageByCode('fr'):getFamily('getCode'):getCode().) Either way, we're assuming that if a method returns a table, then the table is an object with getters (and not just a normal table to be indexed). —RuakhTALK 20:42, 13 December 2013 (UTC)Reply
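A minimal sketch of the first assumption described above: a method whose name starts with `get` takes no arguments, so the strings that follow it are read as further method calls. The `callChain` helper, the stand-in `Family` and `Language` objects, and the family code `roa` are all hypothetical and chosen only to make the example self-contained.

```lua
local unpack = unpack or table.unpack -- Lua 5.1 vs. later versions

-- Hypothetical stand-ins for the Family and Language objects.
local Family = {}
Family.__index = Family
function Family:getCode() return self.code end
function Family:getCanonicalName() return self.name end

local Language = {}
Language.__index = Language
function Language:getFamily() return self.family end
-- Toy sort key: only handles the one accented letter used in the example.
function Language:makeSortKey(text) return (text:gsub("é", "e")) end

local romance = setmetatable({ code = "roa", name = "Romance" }, Family)
local french = setmetatable({ family = romance }, Language)

-- Walk a list of strings. A name starting with "get" is called with no
-- arguments and the walk continues on its result; any other method
-- receives all remaining strings as its arguments and ends the walk.
local function callChain(object, args)
    local i = 1
    while args[i] do
        local name = args[i]
        local method = object[name]
        if name:sub(1, 3) == "get" then
            object = method(object)
            i = i + 1
        else
            return method(object, unpack(args, i + 1))
        end
    end
    return object
end
```

An unknown method name raises an error here, so a template entry point would still want to wrap the call in `pcall`, as in the `main` function posted earlier in this thread.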
Maybe just calling getCode automatically would be a good mid-way solution. If the module invocation returns a code, then that code can be used for a second invocation: {{#invoke:families|getFamilyByCode_t|{{#invoke:languages|getLanguageByCode_t|fr|getFamily}}|getCanonicalName}} would return "Romance". It's not so nice that it "escapes" the module and requires two invocations, but it seems simple enough. An alternative, which would make it more complicated, could be to add some kind of push/pop syntax to the module call. Something like this: {{#invoke:languages|getLanguageByCode_t|fr|getFamily|(|)|getCanonicalName}}. Here, the ( ) parameters delimit the range of parameters that should be given to the called function, so it could be distinguished from, say, {{#invoke:languages|getLanguageByCode_t|fr|makeEntryName|(|café|)}}. But by this point it's almost like we're turning our parameters into Lua parser tokens, and I don't think we actually need that kind of flexibility or complexity. —CodeCat 20:53, 13 December 2013 (UTC)
I've implemented some of your above code into the function, replacing part of what I wrote. I didn't add my ideas into it yet. What do you think of it so far? —CodeCat 03:45, 14 December 2013 (UTC)
I'm not sure. I have the minor comments that you're not using args[1] at all (my version used that for the initial language-getting-method-name, but your version has hardcoded getLanguageByCode), and that forward is not needed (you can just write pcall(lang[method], lang, unpack(args, 4))); but for the overall picture, I'm really not sure. As fun as this sort of metaprogramming can be, I still wonder if it might be better to just have a Module:languages/main with the specific functionalities we expect templates to want. Because, for example, I think one basic functionality is something like "I have the language-code, and I want [[Category:Language-name nouns]] if the language-code is recognized and an appropriate error-category otherwise". You're clearly thinking along the same lines, and your response is to make it possible for a template to test for the existence of a language-code using {{#if:...}}, but is that really what we want? Something like {{#invoke:Module:languages/main|if language exists|fr|[[Category:{name} nouns]]|[[Category:Unrecognized language codes]]}} would be much smoother IMHO. —RuakhTALK 07:43, 14 December 2013 (UTC)
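The "if language exists" idea suggested above might look something like the following. This is a rough sketch in plain Lua: the function name if_language_exists, the names table, and the {name} placeholder handling are assumptions made for illustration, and no such function is known to exist in the module.

```lua
-- Hypothetical sketch; the lookup table stands in for the real language data.
local names = { fr = "French", de = "German" }

-- Returns if_known with every {name} replaced by the language's name when
-- the code is recognized, and if_unknown unchanged otherwise.
local function if_language_exists(code, if_known, if_unknown)
	local name = names[code]
	if name == nil then
		return if_unknown
	end
	-- A function replacement keeps any "%" in the name from being treated
	-- as a capture reference; the parentheses drop gsub's match count.
	return (if_known:gsub("{name}", function() return name end))
end
```

A template would then need only one call to get either the real category or the error category, with no separate {{#if:...}} test.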

Language:makeEntryName

I'm not familiar with the new version of the module, but I guess these lines:

	text = mw.ustring.gsub(text, "^[¿¡]", "")
	text = mw.ustring.gsub(text, "[؟?!;՛՜ ՞ ՟?!।॥။၊ः་།]$", "")

should be

	text = mw.ustring.gsub(text, "^[¿¡](.)", "%1")
	text = mw.ustring.gsub(text, "(.)[؟?!;՛՜ ՞ ՟?!।॥။၊ः་།]$", "%1")

no? --Z 17:00, 13 December 2013 (UTC)

If that means what it looks like it means — that a leading [¿¡] or trailing [؟?!;՛՜ ՞ ՟?!।॥။၊ः་།] will not be removed if it's the only character in the string — then yes, sounds good. (Still not perfect — it will convert ¿? to ?, and ¿...? to ... — but it's an improvement.) —RuakhTALK 19:28, 13 December 2013 (UTC)
If Lua had an x*? feature in its regexes, then we would have been able to do:
text = mw.ustring.gsub(text, "^[¿¡](..*?)[؟?!;՛՜ ՞ ՟?!।॥။၊ः་།]$", "%1")
But I cannot think of anything that could do that in one line with the features available. --WikiTiki89 19:39, 13 December 2013 (UTC)
Would this not do it?
text = mw.ustring.gsub(text, "^[¿¡]*(.-)[؟?!;՛՜ ՞ ՟?!।॥။၊ः་།]*$", "%1")
CodeCat 19:47, 13 December 2013 (UTC)
You're right. I didn't think to look at the - "magic character". But it would have to be:
text = mw.ustring.gsub(text, "^[¿¡]*(..-)[؟?!;՛՜ ՞ ՟?!।॥။၊ः་།]*$", "%1")
So that there would have to be at least one character in the middle. --WikiTiki89 21:35, 13 December 2013 (UTC)
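The difference between the (.-) and (..-) versions above can be checked with a small sketch. It makes two assumptions: it uses plain string.gsub instead of mw.ustring.gsub so that it runs outside the wiki, and because the plain string library works on bytes rather than Unicode characters, "~" stands in for the leading [¿¡] set and "?!" for the trailing punctuation set. The function names are hypothetical.

```lua
-- ASCII stand-ins for the real character sets (assumption, see above).
local LEAD = "[~]"
local TRAIL = "[?!]"

-- (.-) version: the middle may be empty, so a lone mark is stripped away.
local function strip_loose(text)
	return (string.gsub(text, "^" .. LEAD .. "*(.-)" .. TRAIL .. "*$", "%1"))
end

-- (..-) version: the middle must contain at least one character,
-- so a string that is only a punctuation mark survives.
local function strip_strict(text)
	return (string.gsub(text, "^" .. LEAD .. "*(..-)" .. TRAIL .. "*$", "%1"))
end

print(strip_loose("~hola?"), strip_strict("~hola?"))  -- both give "hola"
print(strip_loose("?") == "", strip_strict("?"))      -- true, then "?"
```

With these stand-ins, both versions reduce "~hola?" to "hola", but only the (..-) version leaves a bare "?" intact. A string such as "~?" still becomes "?", in line with the imperfection noted earlier in the thread.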

The rollout

I think if we want to roll this out it has to be gradual. But I'm not sure if that's possible. Many templates currently pass the language code around, but it probably makes more sense if, in the new version, they pass the language object directly. A valid language object (or nil, when allowed by the function) would be assumed by each function as a matter of standard practice. Objects are passed by reference so this is cheap, and it avoids each function having to look up the data separately. Retrieving the object, and the accompanying validation, would be done by the "first" template in the chain, which would generally be the function being invoked by a template.

Of course this makes it trickier to convert the modules step by step, because you have to combine new and old usage, and it's not really clear where to "start" because modules depend on one another. Does anyone know? —CodeCat 03:19, 4 January 2014 (UTC)

I like the idea of just passing the language code around. The object shouldn't be too hard to grab, as long as each module only does so once. --WikiTiki89 03:38, 4 January 2014 (UTC)
Actually, each function would have to, unless it only intends to pass it to other functions. But it's not very practical. Look at how often the retrieve-and-check-if-valid code is duplicated in Module:links. We can avoid that if we demand a language object from the beginning. —CodeCat 03:58, 4 January 2014 (UTC)
Internally, each module can pass around the language object if they want, but externally, I'd prefer if they continued to accept the language code. If you really want, we could have them accept either one. --WikiTiki89 04:04, 4 January 2014 (UTC)
All the functions in Module:links are exported, so we can't make that assumption, all of them could potentially be called externally. I don't really see any advantage in accepting language code strings anyway. —CodeCat 04:07, 4 January 2014 (UTC)
There are plenty of advantages. It makes it easier to write modules; it means we have to change less code; etc. A very easy solution is to add the following to the beginning of getLanguageByCode:
if type(code) == "table" then return code end
Then anything that accepts a language code can also accept a language object, so switching to language objects becomes purely an optimization within modules. --WikiTiki89 04:20, 4 January 2014 (UTC)
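To illustrate the guard proposed above, here is a minimal self-contained sketch. The data table, the Language metatable, and its fields are hypothetical stand-ins for the real module data; only the type(code) == "table" check is taken from the comment itself.

```lua
-- Hypothetical stand-in for the real language data.
local data = { fr = { canonicalName = "French" } }

local Language = {}
Language.__index = Language
function Language:getCode() return self._code end
function Language:getCanonicalName() return self._data.canonicalName end

local function getLanguageByCode(code)
	-- The proposed guard: an existing language object passes straight through.
	if type(code) == "table" then return code end
	local entry = data[code]
	if entry == nil then return nil end
	return setmetatable({ _code = code, _data = entry }, Language)
end

-- A function written against codes now accepts objects as well.
local function describe(lang_or_code)
	local lang = getLanguageByCode(lang_or_code)
	return lang and lang:getCanonicalName() or "unknown"
end
```

Here describe("fr") and describe(getLanguageByCode("fr")) give the same result, and the second form skips the data lookup.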

fix in grc entry name

Ancient Greek entries remove macrons (ᾱῑῡ) in page titles, which is well and good, but breves (ᾰῐῠ) are also in usage and need to be added to Module:languages/data3/g. ObsequiousNewt (ἔβαζα|ἐτλέλεσα) 02:20, 8 August 2014 (UTC)

Edit requests

Could someone add »Glag« as a script for Old Novgorodian, and »Molise Slavic« as another name for Molise Croatian? Thanks. Vorziblix (talk) 18:52, 10 April 2016 (UTC)

Excuse the lack of a title

name = mw.ustring.gsub(name, "^[-־ـ*]+(.)",
		"%1")

Can someone remove the random new line there? -Xbony2 (talk) 12:24, 24 April 2016 (UTC)

Why? —CodeCat 14:56, 24 April 2016 (UTC)
It simply doesn't look right and it bothers me. If it was to save line space, that could be acceptable, but there are much longer lines that aren't broken down. -Xbony2 (talk) 23:48, 24 April 2016 (UTC)
It looks like it was done because right-to-left characters embedded in the string would make the appearance of the code bizarre otherwise. —suzukaze (tc) 23:52, 24 April 2016 (UTC)
@Xbony2: Is this better? --WikiTiki89 14:37, 25 April 2016 (UTC)
Yes :D -Xbony2 (talk) 18:13, 25 April 2016 (UTC)

Edit Request: rjs (Module:languages/data3/r)

Can Devanagari (Deva) and Bengali (Beng) be added as scripts for Rajbanshi? —Aryamanarora (मुझसे बात करो) 19:59, 17 April 2017 (UTC)

Done --Lo Ximiendo (talk) 20:08, 17 April 2017 (UTC)

Addition request: Samaritan Arabic (proposed code: sme)

Samaritan Hebrew and Aramaic are listed yet not Arabic. Maybe we can fix? Sorry if English is bad יבריב (talk) 14:25, 26 July 2017 (UTC)

My understanding is that there is no distinct Arabic lect spoken just by the Samaritans. —Μετάknowledgediscuss/deeds 17:43, 26 July 2017 (UTC)
It is odd. It is truly more of a form of Arabic, however using the Samaritan alphabet. Therefore I was unsure if it was truly a "lect" of Arabic or a Samaritan language in its own right. Either way I would appreciate some form of addition to Module:languages so I can expand the online database of Samaritan languages without further difficulty (Again sorry if English is bad) יבריב (talk) 22:46, 26 July 2017 (UTC)
If it is just Arabic, though, the script doesn't matter — we can add spellings in Samaritan script as long as they pass WT:ATTEST. —Μετάknowledgediscuss/deeds 03:46, 27 July 2017 (UTC)
Really? Will it even add 'Samaritan Arabic (...)' as a category - and everything? Imagine it like Urdu, but with opposite attributes. Urdu isn't Arabic but it uses the Arabic alphabet - it has its own module. Samaritan Arabic is the same, except it's Arabic but it doesn't use the Arabic alphabet. Can we get the "Urdu treatment"?
My only point of concern is I want the categories to be automatically assigned, and this is the only way I know that it can be accomplished. יבריב (talk) 14:09, 27 July 2017 (UTC)
Yes, we can have a category for it while still considering it Arabic and making the entries say Samaritan script spelling of X. I am personally unconvinced that Urdu should get a separate code, but Samaritan Arabic certainly shouldn't if it has no difference besides script. —Μετάknowledgediscuss/deeds 15:06, 27 July 2017 (UTC)