Module talk:Jpan-sortkey

What is this supposed to do?

I have a problem with the current code:

local i = tonumber(mw.getCurrentFrame():extensionTag('nowiki', ''):match'([%dA-F]+)', 16)

What is it supposed to do? I'm parsing the Wiktionary pages outside of MediaWiki with a Lua processor, and this leads to a LuaError, which I can understand, because:

mw.getCurrentFrame():extensionTag('nowiki', '')

should only produce <nowiki></nowiki> and there is nothing here to match the given hex pattern...

Please explain! Dodecaplex (talk) 17:50, 21 April 2023 (UTC)

@Dodecaplex I'm sure @Huhu9001 would be happy to explain. Theknightwho (talk) 06:20, 24 April 2023 (UTC)

@Dodecaplex: See meta:Extension:Scribunto/Lua_reference_manual#frame:preprocess:

Certain special tags written in XML-style notation, such as <pre>, <nowiki>, <gallery> and <ref>, will be replaced with "strip markers" — special strings which begin with a delete character (ASCII 127), to be replaced with HTML after they are returned from #invoke.

This means mw.getCurrentFrame():extensionTag('nowiki', '') should produce something like '\127\'"`UNIQ--nowiki-00000002-QINU`"\'\127'. If you get '<nowiki></nowiki>', that means your Lua processor does not work correctly.

The purpose is to use nowiki strip markers as global variables accessible to other modules working on the same page. That way, Japanese templates can use a sortkey stored by t:ja-kanjitab somewhere above in the page, freeing editors from the tedious work of manually entering sortkeys in every template. -- Huhu9001 (talk) 09:34, 28 April 2023 (UTC)
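As an illustration of the line under discussion, here is a minimal sketch in plain Lua (no `mw` library) of how the hex counter can be pulled out of a strip marker. The helper name `marker_index` is hypothetical, and the marker string is simply the example shape quoted above; the module's actual code may differ.

```lua
-- Sketch only: `marker_index` is a hypothetical helper, not part of the module.
-- It applies the same pattern as the quoted one-liner to a given string.
local function marker_index(text)
  -- First run of decimal digits or uppercase A-F characters.
  local hex = text:match("([%dA-F]+)")
  if not hex then
    -- Nothing to match: the unguarded one-liner would pass nil to
    -- tonumber(..., 16), which raises an error in standard Lua.
    return nil
  end
  return tonumber(hex, 16)
end

-- A marker in the shape quoted above (assumed example value).
local marker = "\127'\"`UNIQ--nowiki-00000002-QINU`\"'\127"

print(marker_index(marker))               --> 2
print(marker_index("<nowiki></nowiki>"))  --> nil
```

The second call shows why a processor that returns a literal `<nowiki></nowiki>` fails on the original line: there is no digit or A-F run to capture, so the match yields nil.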

So if I understand correctly, you are generating an empty nowiki tag so that you can get its number, assuming it reflects the number of nowiki tags that have already been generated earlier in the page transclusion... Then you get those previous ones from the parser's state using unstripNowiki... until you find one that matches your sortkey encoding pattern...
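The lookup described in this comment can be sketched roughly as follows. This is a self-contained simulation under stated assumptions, not the module's code: `strip_state` stands in for the parser's strip state, `unstrip_nowiki` stands in for Scribunto's `mw.text.unstripNoWiki`, and the `sortkey:` prefix is an invented encoding, since the real pattern is not shown in this thread.

```lua
-- Hypothetical stand-in for the parser's strip state: index -> nowiki content.
local strip_state = {
  [0] = "some unrelated text",
  [1] = "sortkey:にほんご",   -- assumed encoding, for illustration only
  [2] = "another nowiki",
  [3] = "",                   -- the freshly generated empty nowiki
}

-- Build a marker in the shape quoted earlier in the thread.
local function make_marker(i)
  return string.format("\127'\"`UNIQ--nowiki-%08X-QINU`\"'\127", i)
end

-- Stand-in for mw.text.unstripNoWiki: resolve a marker to its stored content.
local function unstrip_nowiki(marker)
  local hex = marker:match("nowiki%-(%x+)%-QINU")
  return hex and strip_state[tonumber(hex, 16)] or marker
end

-- Walk backwards from the current marker index until a stored sortkey is found.
local function find_sortkey(current_index)
  for i = current_index - 1, 0, -1 do
    local content = unstrip_nowiki(make_marker(i))
    local key = content:match("^sortkey:(.+)$")
    if key then
      return key
    end
  end
  return nil
end

print(find_sortkey(3))  --> にほんご
```

The sketch makes the hidden assumptions visible: markers must be numbered sequentially, earlier markers must still be resolvable, and the stored content must be recognisable by its pattern.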

As a CS professor, I must say that this qualifies as one of the ugliest hacks I have ever seen (and I can assure you my students are usually quite productive in this matter ;-)... It depends on so many assumptions... I understand the MediaWiki devs have been struggling to develop a better, portable MediaWiki parser for many years now... I can also tell you that this will generate an error as soon as the devs decide to transition to Lua 5.2...

So, in my setup (I parse EVERY PAGE in a Java program that simulates MediaWiki and Scribunto), I know I will not try to accommodate this and will figure out a workaround... If I may, I'll see if other mechanisms are available for this. Dodecaplex (talk) 13:33, 28 April 2023 (UTC)

@Dodecaplex: The ugliest hacks exist because they work and only they work. I wouldn't want to do this either if Japanese sortkeys were not a long-standing problem without any solution for many years. -- Huhu9001 (talk) 14:28, 28 April 2023 (UTC)

@Dodecaplex In fairness to Huhu9001, it's because Scribunto is specifically designed so that separate template/module invocations can't affect each other (unless one is specifically nested inside the other). This is a major, longstanding issue, as it causes us all kinds of issues when very complex calculations need to be repeated tens or hundreds of times for no reason other than this restriction. Because Japanese kanji sortkeys explicitly require some kind of user input (i.e. the corresponding kana), there's no way around this unless we make a huge table of Japanese sortkeys to be loaded via mw.loadData. That is effectively what happens with Chinese in Module:Hani-sortkey, but at least in that case it's possible to do it character-by-character. The other alternative is to make sure everything happens inside a single invocation, but that requires developing a wikitext parser in Lua, which is not easy, and I also imagine the devs wouldn't be totally happy with it, either...
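To make the character-by-character alternative concrete, here is a minimal sketch in plain Lua. The table contents and key format are placeholders invented for illustration (the real data in Module:Hani-sortkey is not shown in this thread), and `char_by_char_sortkey` is a hypothetical name. On-wiki, such a table would presumably be loaded with mw.loadData rather than defined inline.

```lua
-- Placeholder data: the real tables and key format are not shown in this thread.
local per_char_key = {
  ["山"] = "key-A",
  ["川"] = "key-B",
}

-- Replace each character that has an entry; leave everything else unchanged.
local function char_by_char_sortkey(text)
  -- The pattern matches one UTF-8 encoded character at a time:
  -- a lead byte followed by any continuation bytes.
  return (text:gsub("[\1-\127\194-\244][\128-\191]*", function(ch)
    return per_char_key[ch] or ch
  end))
end

print(char_by_char_sortkey("山川"))  --> key-Akey-B
print(char_by_char_sortkey("abc"))   --> abc
```

This works when each character maps to exactly one key. As noted above, Japanese kanji sortkeys depend on the reading supplied by the editor, so a per-character table alone is not enough.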

I'd be very interested to see if you can come up with something else, because it might be helpful elsewhere, too. Theknightwho (talk) 14:46, 28 April 2023 (UTC)

@Theknightwho @Huhu9001 I understand the trick and its reasons now. Thanks for the clarification. I also have performance problems in my setup due to this. I'll keep this in mind and try to find another way… (and will maybe have to implement the weird nowiki stripping stuff, as this trick may appear elsewhere, including in other Wiktionary editions). Dodecaplex (talk) 05:53, 29 April 2023 (UTC)

@Dodecaplex Just to note that you're almost certainly right that we shouldn't rely on this in the longer term, as I imagine it'll be patched as soon as the devs find out about it. There used to be a similar hack involving strip markers, which was patched out a while ago for the exact reason I mentioned. Theknightwho (talk) 17:58, 29 April 2023 (UTC)

@Dodecaplex I have rewritten the sortkey module so that it no longer relies on the strip marker hack. This has another advantage, as it also means it's possible to collate Japanese terms in lists, but it is a little more memory intensive. It also has issues handling kanji with multiple readings, though this shouldn't be too difficult to solve; it just requires working out what the best way to handle it is, as it will probably depend on how we decide to handle automatic Japanese transliterations. Theknightwho (talk) 12:36, 19 May 2023 (UTC)

Thanks! Dodecaplex (talk) 09:04, 25 May 2023 (UTC)