Wiktionary talk:Todo/Lists/Broken links to senseid and etymid

Sorting by linked page and sense ID, list existing IDs

I think the list should be ordered by linked page and sense ID, because that will make it clearer which pages need new sense IDs and which pages have |id= parameters that should be fixed. It would also be helpful to print a list of the sense IDs or etymology IDs that were actually found on the page, if that information is available from the script, because it would let people see at a glance whether an |id= parameter might be a typo or a synonym of an existing ID. Not sure how this information would best be arranged though. In the past, I've printed the data as JSON so I could figure out how to display it using Lua. — Eru·tuon 07:37, 17 January 2024 (UTC)
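A minimal sketch of the proposed ordering and JSON output, in Python. The record fields (`source`, `target`, `id`) and the `ids_on_page` mapping are hypothetical names chosen for illustration, not the actual data structures of the todo script.

```python
import json
from collections import defaultdict

def group_broken_links(records, ids_on_page):
    """Group broken links by linked page, then by the ID that was not found.

    records: iterable of dicts with hypothetical keys "source" (linking page),
    "target" (linked page) and "id" (the sense or etymology ID in the link).
    ids_on_page: dict mapping a page title to the IDs actually found on it.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for rec in records:
        grouped[rec["target"]][rec["id"]].append(rec["source"])

    report = []
    for target in sorted(grouped):
        report.append({
            "target": target,
            # Listing the existing IDs makes typos and near-synonyms visible.
            "existing_ids": sorted(ids_on_page.get(target, [])),
            "missing_ids": [
                {"id": missing, "linked_from": sorted(sources)}
                for missing, sources in sorted(grouped[target].items())
            ],
        })
    return report

# Invented sample data.
records = [
    {"source": "foo", "target": "set", "id": "colection"},
    {"source": "bar", "target": "bank", "id": "river"},
    {"source": "baz", "target": "set", "id": "colection"},
]
ids_on_page = {"set": ["collection", "put"], "bank": []}

as_json = json.dumps(group_broken_links(records, ids_on_page), ensure_ascii=False, indent=2)
```

The JSON string could then be saved to a data page and rendered by a Lua module, along the lines described above.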

@Erutuon good point about the sort order. I was in two minds about this when setting up the script, but you've convinced me it should be sorted by the linked page. Adjusted in [1].

As for outputting all the sense IDs on the given page, we could just add another column to each table with this info. Not the best way of presenting it, granted... but as you say, it would be helpful to have it there. Needs some more coding.

Also, would you be willing to be added as a collaborator on the wikt-todo Toolforge project? I know you have experience with Toolforge, so I feel like it would be easy for you to jump in and poke around if anything goes wrong - or even participate more deeply if you had the time and inclination. I'd love to have you on board either way. This, that and the other (talk) 08:33, 17 January 2024 (UTC)

Sure. It seems to be in my area of interest, because I've done so many lists in the past. — Eru·tuon 22:54, 17 January 2024 (UTC)

@Erutuon added. Welcome! This, that and the other (talk) 02:38, 18 January 2024 (UTC)

@This, that and the other: I looked at the code and noticed the ID and entry name extraction isn't complete yet. I actually have a Rust program that could help. It generates link records from link templates including the term and ID. (I use those to generate User:Erutuon/lists/bad transliteration/ar and User:Erutuon/lists/scripts in link templates for instance.) It handles a lot of link templates and their redirects, and should find every single ID in them, including ones in <id:id>.

It doesn't run makeEntryName in Module:languages on the term parameters yet, but the database generator for enwikt-translations does that for translations. I haven't published the link parser repository to the internet yet (I like how quick SourceHut is for just viewing code, but Wikimedia GitLab would allow collaboration), and I need to figure out how to compile and run Rust programs on Kubernetes (phab:T319724) before we can use it. — Eru·tuon 19:03, 18 January 2024 (UTC)
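To illustrate what a link record with a term and an ID might look like, here is a deliberately simplified Python sketch. It is not the Rust program mentioned above. The template table, the regexes and the record fields are assumptions for illustration, and the sketch ignores nested templates, template redirects and most parameters.

```python
import re

# Matches only templates with no nested braces; real wikitext needs a proper parser.
TEMPLATE_RE = re.compile(r"\{\{([^{}]*)\}\}")
INLINE_MODIFIER_RE = re.compile(r"<[^<>]*>")
INLINE_ID_RE = re.compile(r"<id:([^<>]*)>")

# Illustrative subset: template name -> True if every positional parameter
# after the language code is a term, False if only the first one is.
LINK_TEMPLATES = {"l": False, "m": False, "syn": True}

def link_records(wikitext):
    """Return one record per linked term, with the ID from <id:...> or |id=."""
    records = []
    for match in TEMPLATE_RE.finditer(wikitext):
        parts = [p.strip() for p in match.group(1).split("|")]
        name = parts[0]
        if name not in LINK_TEMPLATES:
            continue
        positional, named = [], {}
        for part in parts[1:]:
            key, sep, value = part.partition("=")
            if sep and re.fullmatch(r"\w+", key.strip()):
                named[key.strip()] = value.strip()
            else:
                positional.append(part)
        if len(positional) < 2:
            continue
        lang = positional[0]
        multi = LINK_TEMPLATES[name]
        terms = positional[1:] if multi else positional[1:2]
        for term in terms:
            inline = INLINE_ID_RE.search(term)
            if inline:
                sense_id = inline.group(1)
            else:
                # In this sketch, |id= applies only to single-term templates.
                sense_id = None if multi else named.get("id")
            records.append({
                "template": name,
                "lang": lang,
                "term": INLINE_MODIFIER_RE.sub("", term),
                "id": sense_id,
            })
    return records
```

For example, `link_records("{{l|en|bank|id=river}} {{syn|en|shore<id:land>|coast}}")` yields three records, one of which carries the inline ID `land`. The terms are still raw parameter values at this point. Converting them to page titles is the makeEntryName step.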

@Erutuon the list was regenerated this morning my time and looks much better. Thanks for your suggestions.

If you already have code to parse link templates complete with inline IDs etc, let's pool resources for sure. Handling makeEntryName will be necessary for non-English languages. I had two thoughts on how to do this:

1. Use Mod:JSON data to get the list of languages that have entry_name substitutions. Do the simple substitutions in Python code and outsource the non-trivial ones to the Wiktionary API expandtemplates module in big batches.
2. Generate this todo list from the HTML dump instead, which has parsed versions of the pages, meaning we could check for all kinds of broken anchors in a totally template-agnostic way, like word#Etymology 123.

This, that and the other (talk) 10:53, 22 January 2024 (UTC)
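A rough sketch of the first idea (simple substitutions done locally, everything else deferred), in Python. The shape of `ENTRY_NAME_DATA` and the sample entries in it are invented for illustration and are not the actual language data. The real makeEntryName supports more kinds of rules than the single `remove_diacritics` case handled here.

```python
import unicodedata

# Hypothetical per-language data, as it might look after being exported to JSON.
# The Latin entry is illustrative; "xx" stands for a language whose entry names
# are produced by a module and so cannot be handled locally.
ENTRY_NAME_DATA = {
    "la": {"remove_diacritics": "\u0304\u0306\u0308"},  # macron, breve, diaeresis
    "xx": {"module": "xx-entryname"},
}

def make_entry_name(lang, term, data=ENTRY_NAME_DATA):
    """Return the page title for a term, or None if it needs server-side handling."""
    rules = data.get(lang)
    if rules is None:
        return term  # no substitutions defined for this language
    if set(rules) - {"remove_diacritics"}:
        return None  # non-trivial rules: queue for a batched API request instead
    # Decompose so that combining marks become separate code points, drop the
    # listed marks, then recompose.
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(ch for ch in decomposed if ch not in rules["remove_diacritics"])
    return unicodedata.normalize("NFC", stripped)
```

Terms for which this returns `None` would be collected and sent for server-side expansion in large batches, as suggested above.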

@This, that and the other: A third option is to make a Lua environment to run makeEntryName. That would require setting up a Scribunto environment (at least with the global stuff that is actually used by the code we're running), extracting the necessary modules from the XML dump, saving them in an accessible format, and making them accessible from require. It would probably not be very hard, because I don't think entry names do anything very fancy (like, no parsing of entries to extract information).

For the translation database, the program grabs modules that match include and exclude lists and then makes a second pass, grabbing all the modules that were mentioned in require calls in the first batch of modules. I had to run the transliteration and entry name generating subcommand a few times and look at error messages to add more special cases to the include lists. It might not be necessary to do the last step for entry names because the entry name modules only use global functions.
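The two-pass collection could look roughly like this in Python. This is a sketch under assumptions, not the translation database program itself: `all_modules` is a hypothetical dict of module title to Lua source, and the include and exclude lists are plain sets of titles.

```python
import re

# Finds literal targets such as require("Module:foo") or require 'Module:foo'.
# Dynamically built names, e.g. require("Module:" .. name), are not detected.
REQUIRE_RE = re.compile(r"""\brequire\s*\(?\s*["'](Module:[^"']+)["']""")

def required_modules(source):
    """Return the module titles named in literal require calls."""
    return set(REQUIRE_RE.findall(source))

def collect_modules(all_modules, include, exclude=frozenset()):
    """First pass: the include list. Second pass: what the first batch requires."""
    first = {title for title in include if title in all_modules and title not in exclude}
    second = set()
    for title in first:
        second |= required_modules(all_modules[title])
    second = {title for title in second if title in all_modules and title not in exclude}
    return first | second
```

Because only literal require calls are matched and only one extra pass is made, some dependencies will be missed, which fits the need described above to add special cases to the include lists after reading error messages.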

My program puts the modules into a SQLite database (cleaner than saving them to files, which I've done before) and inserts a function into package.searchers (package.loaders in Lua 5.1) to let require load modules from the database.

While writing this I realized that's overkill. The total size of our module namespace is only about 128 MB. I made a separate Rust program that makes a database of all modules besides some that Module:languages doesn't use, such as user sandbox modules (43 MB) and module subdirectories, like Module:zh/data/dial-syn (47 MB), with a loader in Lua using lsqlite3. That seems to make makeEntryName work.
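As a sketch of the database side only, here is how the filtering and storage might look in Python with the standard `sqlite3` module. The actual program is in Rust with a Lua loader, so this is a stand-in. The table layout and the `Module:User:` prefix used to detect user sandbox modules are assumptions.

```python
import sqlite3

# Assumed prefixes for modules that Module:languages does not need.
SKIPPED_PREFIXES = (
    "Module:User:",             # user sandbox modules (assumed naming)
    "Module:zh/data/dial-syn",  # large data subdirectory named above
)

def wanted(title):
    """Return True if a module should be stored in the database."""
    return not title.startswith(SKIPPED_PREFIXES)

def build_module_db(conn, modules):
    """Store (title, source) pairs, skipping the filtered prefixes."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS module (title TEXT PRIMARY KEY, source TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO module (title, source) VALUES (?, ?)",
        ((title, source) for title, source in modules if wanted(title)),
    )
    conn.commit()

def load_module(conn, title):
    """Look up a module's source by title; None if it is not in the database.

    A Lua searcher function would run the equivalent query and compile the result.
    """
    row = conn.execute("SELECT source FROM module WHERE title = ?", (title,)).fetchone()
    return row[0] if row else None
```

Usage would be along the lines of `conn = sqlite3.connect("modules.sqlite")` followed by `build_module_db(conn, pairs)`, where `pairs` comes from the dump and the file name is arbitrary.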

In some types of lists it would be nice to have automatic transliterations, but that would be harder because some transliteration modules now rely on mw.title.new(entry):getContent(), like Module:zh-translit. ilscripto by Crowley666 provides an almost complete Scribunto environment including a Lua implementation of the PHP parts, including getting content from title objects. I still need to try using that sometime and figure out how to get the necessary pages from the dump without creating a full MediaWiki installation. mw.title.new(entry):getContent() is recorded as a transclusion, but rather than parsing templatelinks.sql to list transcluded entries, it would be simpler to just put all mainspace and module page titles and content in a SQLite database. — Eru·tuon 20:41, 25 January 2024 (UTC)
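The last idea (all mainspace and module pages in one SQLite database) could be sketched in Python as follows. The sketch assumes a pages-articles style XML dump with one revision per page, and uses namespace numbers 0 (mainspace) and 828 (Module). The table layout and function names are invented for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

WANTED_NAMESPACES = {"0", "828"}  # mainspace and Module

def local_name(tag):
    """Strip the XML namespace, which varies with the export schema version."""
    return tag.rsplit("}", 1)[-1]

def pages(dump):
    """Yield (title, text) for wanted pages, streaming through the dump."""
    for _, elem in ET.iterparse(dump, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {}
        for node in elem.iter():
            name = local_name(node.tag)
            if name in ("title", "ns", "text"):
                fields.setdefault(name, node.text or "")
        elem.clear()  # drop the page's children to limit memory use
        if fields.get("ns") in WANTED_NAMESPACES:
            yield fields["title"], fields.get("text", "")

def store_pages(conn, dump):
    """Write the wanted pages from a dump file or file object into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page (title TEXT PRIMARY KEY, content TEXT NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO page (title, content) VALUES (?, ?)", pages(dump))
    conn.commit()

def get_content(conn, title):
    """Rough stand-in for mw.title.new(title):getContent(); None if absent."""
    row = conn.execute("SELECT content FROM page WHERE title = ?", (title,)).fetchone()
    return row[0] if row else None
```

A compressed dump could be passed as a file object, for example from `bz2.open(path, "rb")`. A title-object implementation in the Lua environment could then answer content requests from this table instead of a MediaWiki installation.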