Wiktionary:Todo/Lists/technical documentation

From Wiktionary, the free dictionary

This page contains technical documentation explaining how the Todo Lists project works.

Toolforge

The Todo Lists project runs on Toolforge. The tool name is wikt-todo.

If you are a maintainer of the project, you can manage the tool by logging into a Toolforge shell using ssh (see the quick start guide) and typing become wikt-todo. All shell commands on this page will only work if you have "become" the tool account.

Updating from Git

The code for the Todo Lists project lives at https://gitlab.wikimedia.org/toolforge-repos/wikt-todo.

On Toolforge, there is a copy of this repo in the src directory under the wikt-todo tool account's home directory.

If new commits have been made to the repo, you can update the copy of the repo on Toolforge by running:

cd src
git reset --hard   # erases local changes from any mucking around or testing
git pull origin

Scheduled todo list runs

Automatic generation of the SQL and custom todo lists is currently scheduled to take place every week, at a time determined by Toolforge. The XML dump todo list script is scheduled to run every day, but the script does nothing unless a new dump file is identified.

Job scheduling is defined in the jobs.yaml file. The format is documented on the Toolforge Jobs framework help page on Wikitech.

If you make changes to the jobs.yaml file, you must reload it using:

toolforge-jobs load src/jobs.yaml

Failure emails

If you get a failure email regarding a scheduled tool run, inspect the corresponding generate-lists-type.err log file in the tool's directory, where type is the kind of todo list that failed.

"Killed" failure

If the lists are silently failing to be generated, and upon inspecting the generate-lists-type.err log file you see the word "Killed", it's likely that the job exceeded the memory limit imposed by Kubernetes. You can verify this by looking at the memory graphs at the Grafana dashboard.

The default per-job memory limit is 512 MiB. This should be more than sufficient for most purposes. If the job is running out of memory, there might be a buggy script that is trying to include almost every page on Wiktionary in its result set. You can run a job manually with increased memory by appending, say, --mem 3Gi (for 3 GiB) to the toolforge-jobs run command line below. This should allow the job to complete and you will hopefully be able to work out what is going wrong by looking at the resulting todo list.

The "Update now" system

SQL-based todo list pages have an "Update now" button. When clicked, the user is taken to https://wikt-todo.toolforge.org/updater/update/Todo list name. This page is served by the Toolforge web service documented at wikitech:Help:Toolforge/Web/Python.
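
As a rough illustration of the URL pattern above, the following sketch builds an updater link from a todo list name. The function name is made up for this example, and percent-encoding the name is an assumption; the real "Update now" button may encode the name differently.

```python
from urllib.parse import quote

# Base URL taken from the paragraph above.
UPDATER_BASE = "https://wikt-todo.toolforge.org/updater/update/"


def update_url(todo_list_name: str) -> str:
    """Return an updater URL for the given todo list name.

    Assumption: the name is percent-encoded as a single path segment.
    This helper is hypothetical and not part of the project code.
    """
    return UPDATER_BASE + quote(todo_list_name, safe="")


print(update_url("Todo list name"))
```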

The web server code is in updater_webapp.py. This is symlinked from $HOME/www/python/src/app.py, the mandatory location for Python web applications. Private configuration parameters are in $HOME/www/python/src/config.yaml.

If the website goes down, the web service can be started or restarted using:

webservice restart

Logs for the web service itself are in uwsgi.log, while logs for the generate-lists-* script runs are in ad-hoc-updates.log.

Running a todo list ad hoc from the server

To run a single todo list on an ad hoc basis, run the following command, where type is one of sql, xmldump or custom, and Todo list name is the exact name of the desired todo list:

toolforge-jobs run mytodo --command "~/pyvenv/bin/python src/generate_lists_type.py 'Todo list name'" --image python3.11

Keep an eye on the status of the job using:

toolforge-jobs list

Once the job finishes, it will no longer be present in the list. Any error output will be saved to mytodo.err in the tool account's home directory - run cat mytodo.err to view it. Regular print statement output will be saved to mytodo.out. Consider deleting these output files once done to keep the tool directory clean.
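
If you run ad hoc jobs often, the command line can be assembled programmatically. The sketch below only uses the options that appear on this page (--command, --image and --mem); the function name and its parameters are hypothetical, and it is not part of the project code.

```python
import shlex

VALID_TYPES = ("sql", "xmldump", "custom")


def build_run_command(list_type, list_name, job_name="mytodo", mem=None):
    """Build the argument list for an ad hoc toolforge-jobs run.

    Hypothetical helper: it mirrors the command shown on this page and
    adds --mem only when a value such as "3Gi" is supplied.
    """
    if list_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {VALID_TYPES}")
    # The todo list name is shell-quoted because the job runs this string
    # through a shell, just like the hand-typed command.
    inner = (
        f"~/pyvenv/bin/python src/generate_lists_{list_type}.py "
        f"{shlex.quote(list_name)}"
    )
    argv = ["toolforge-jobs", "run", job_name,
            "--command", inner, "--image", "python3.11"]
    if mem is not None:
        argv += ["--mem", mem]
    return argv


argv = build_run_command("sql", "Todo list name", mem="3Gi")
# On Toolforge, after "become wikt-todo", this could be passed to
# subprocess.run(argv, check=True).
print(argv)
```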

Types of todo lists

Currently there are three ways to generate todo lists: SQL, XML dump and custom. An HTML dump type is planned as a fourth option at some stage in the future.

SQL

SQL-based todo lists are generated by simply running an SQL query against Wiktionary's database (or, more precisely, the read-only database replicas available through Toolforge) and formatting the output.

SQL is the best option for any todo list that does not require analysis of page content (wikitext). Anything relating to:

  • basic page metadata as found at Special:PageInfo
  • links between pages as found at Special:WhatLinksHere
  • category structures and categorisation
  • whether or not a page uses a certain template (but not the template parameters or where on the page the template is used)
  • page histories
  • log entries as found at Special:Log
  • or any combination of these

can be achieved using pure SQL.

SQL queries run very quickly if written well. Most of the SQL-based todo lists take just a few seconds to generate. However, it can be challenging to write an SQL query that is both correct and fast, especially if you do not have much experience with SQL. The MediaWiki relational database structure diagram and list of special views on Toolforge are essential references.

List definitions
SQL todo lists are defined in sql/queries.py. To define a new SQL todo list, simply add a new entry to the queries dictionary.
Testing
When developing a new query, use the MariaDB SQL console on Toolforge (sql enwiktionary) to test it.
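
To give a feel for what a list definition might look like, here is a sketch of a queries dictionary entry. The shape of the dictionary (list name as key, SQL string as value), the list name and the query itself are all assumptions made for illustration; check sql/queries.py for the real structure. The table and column names come from the standard MediaWiki schema, and the column aliases use the formatting codes described later on this page.

```python
# Hypothetical entry: the real queries dictionary may be structured differently.
queries = {
    "Example: largest uncategorised entries": """
        SELECT
            CONCAT(page_namespace, '|', page_title) AS Page_NSTITLE_EDITLINK,
            page_len AS Size_RAW
        FROM page
        LEFT JOIN categorylinks ON cl_from = page_id
        WHERE page_namespace = 0
          AND page_is_redirect = 0
          AND cl_from IS NULL
        ORDER BY page_len DESC
        LIMIT 100
    """,
}

for name, sql in queries.items():
    print(name)
    print(sql)
```

A query like this can be pasted into the sql enwiktionary console first to check that it is both correct and fast.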

XML dump

XML dump-based todo lists are generated by iterating through every line of wikitext (page source code) on every page of the latest Wiktionary XML dump and running Python code to compile the resulting todo list.

These todo lists are a good choice for detecting common misuses of templates or wikitext formatting. The XML dump contains the wikitext of the latest revision of each page, along with the page ID, namespace and title, but little else.

An abstraction layer is provided so that the Python code for each todo list does not need to parse XML. For convenience, the abstraction layer keeps track of a hierarchy of section headings and, in some cases, supplies a parsed version of the line of wikitext alongside the wikitext itself. The Python script is free to run SQL queries or API requests as required.
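
The idea of tracking a hierarchy of section headings while walking through wikitext line by line can be sketched as follows. This is not the project's abstraction layer: the function name, the regular expression and the return format are assumptions chosen to show the technique.

```python
import re

# Matches wikitext headings such as "==English==" or "===Noun===".
HEADING_RE = re.compile(r"^(={1,6})\s*(.+?)\s*\1\s*$")


def lines_with_headings(wikitext):
    """Yield (line, heading_path) for every line of wikitext.

    heading_path lists the titles of the enclosing headings, outermost
    first. Hypothetical sketch, not the project's real interface.
    """
    stack = []  # (level, title) pairs for the currently open headings
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            # A new heading closes every open heading at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        yield line, [title for _, title in stack]


sample = "==English==\n===Noun===\n# A word.\n==French==\n# Un mot."
for line, path in lines_with_headings(sample):
    print(path, line)
```

A todo list script built on such a helper can then report, for example, which language section a misused template appears in.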

List definitions
XML dump todo lists are defined as Python classes in the xmldump/ directory. To define a new todo list, make a copy of the !template.py file in the relevant directory and write your code.
Testing
You can run an XML dump todo list on your local machine. Download the pages-articles or pages-meta-current XML dump and run the script locally, pointing --file at either the compressed .bz2 file or the decompressed .xml file:
python3 generate_lists_xmldump.py 'Todo list name' --dry-run --file /path/to/xml-dump.bz2

Custom

Custom todo lists simply involve running a Python script that returns a table of results. In practice, these todo lists typically run an SQL query, then perform some kind of post-processing on the query results to generate the todo list.
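
The "query, then post-process" pattern can be sketched like this. The real custom lists are Python classes, so the plain function, the filtering rule and the sample rows below are all invented for illustration; only the output shape (a list of dictionaries with formatting codes in the key names) follows this page.

```python
def build_custom_list(query_rows):
    """Turn raw (namespace, title, length) rows into todo list output.

    Hypothetical post-processing rule: keep only titles containing a
    digit, something that is awkward to express in pure SQL.
    """
    results = []
    for namespace, title, length in query_rows:
        if any(ch.isdigit() for ch in title):
            results.append({
                "Page_NSTITLE": f"{namespace}|{title}",
                "Length_RAW": length,
            })
    return results


# Made-up rows standing in for an SQL result set.
sample_rows = [(0, "dictionary", 5120), (0, "4x4", 830), (0, "K9", 410)]
print(build_custom_list(sample_rows))
```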

List definitions
Similar to the XML dump type, custom todo lists are defined as Python classes in the custom/ directory.
Testing
If your custom list involves SQL queries, you will need to test on Toolforge itself. Use the ad hoc syntax shown earlier on this page, with the addition of the --dry-run parameter to the Python command. Output will be appended to mytodo.out, so make sure to delete that file before running the todo list.

Future type: HTML dump

The Wikimedia Enterprise HTML dumps are convenient for certain kinds of analysis where the fully rendered page is required. These dumps, which contain page wikitext and categorisation information alongside the page HTML, are a superset of the XML dump for main namespace pages. However, Wikimedia Enterprise only generates HTML dumps for select namespaces; our custom namespaces, such as Reconstruction, are not included.

Todo list output

The output of each todo list is a list of dictionaries. The SQL generator converts the SQL query result set into this format. For Python-based todo lists, the code must return a list of dictionaries, where every dictionary in the list has exactly the same keys. The column names/key names need to adhere to a special format, as explained below.

This data is converted to wikitext, using a sortable table format if more than one key is present in the dictionary besides the optional SECTIONHEADING, or a bulleted list format otherwise.
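
The two rules above (identical keys in every dictionary, and table versus bulleted list depending on the number of keys) can be expressed as a short sketch. The function names are hypothetical and this is not the project's output code; it only restates the rules in executable form.

```python
def check_rows(rows):
    """Raise ValueError unless every dictionary has exactly the same keys."""
    if not rows:
        return
    expected = set(rows[0])
    for index, row in enumerate(rows):
        if set(row) != expected:
            raise ValueError(f"row {index} has keys {sorted(row)}, "
                             f"expected {sorted(expected)}")


def choose_layout(rows):
    """Return "table" or "list" following the rule described above."""
    data_keys = [key for key in rows[0] if key != "SECTIONHEADING"]
    return "table" if len(data_keys) > 1 else "list"


rows = [
    {"SECTIONHEADING": "English", "Page_PAGE": "dictionary"},
    {"SECTIONHEADING": "French", "Page_PAGE": "dictionnaire"},
]
check_rows(rows)
print(choose_layout(rows))
```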

Section headings

The todo list output can optionally be divided into sections (level 4 headings). This is achieved by adding a special SECTIONHEADING column (key) to the output.

It is critical that the output is sorted first by this section heading key! Otherwise the section headings will be uselessly repeated in random parts of the list.
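
The effect of the sort order can be seen in a small sketch. It assumes that a new heading is emitted whenever the SECTIONHEADING value changes from one row to the next, which matches the behaviour described above but is not taken from the project code; the Page_PAGE key and the sample rows are illustrative.

```python
from itertools import groupby


def emitted_headings(rows):
    """List the headings that would be written, in output order.

    Assumption: a heading is emitted each time SECTIONHEADING changes.
    """
    return [heading for heading, _ in
            groupby(rows, key=lambda row: row["SECTIONHEADING"])]


rows = [
    {"SECTIONHEADING": "English", "Page_PAGE": "cat"},
    {"SECTIONHEADING": "French", "Page_PAGE": "chat"},
    {"SECTIONHEADING": "English", "Page_PAGE": "dog"},
]

print(emitted_headings(rows))    # "English" appears twice

# Sort by section heading first, then by whatever secondary order is wanted.
rows.sort(key=lambda row: (row["SECTIONHEADING"], row["Page_PAGE"]))
print(emitted_headings(rows))    # each heading appears once
```

In an SQL-based list, the equivalent is to put the SECTIONHEADING column first in the ORDER BY clause.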

Column formatting codes

Every column name (key name) except SECTIONHEADING must contain at least one formatting code. These are written in ALL CAPS and placed after the displayed column name, set off by underscores, for example, Creation date_RAW or Page_NSTITLE_EDITLINK.

For convenience in SQL queries, underscores will be replaced by spaces in the column name itself, so Creation_date_RAW would work too.
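
The naming convention can be illustrated with a sketch that splits a column name into its displayed name and its formatting codes. This is not the code in output_formatting.py: the function name and the parsing strategy (peeling known codes off the end of the name) are assumptions. SECTIONHEADING is expected to be handled separately.

```python
# The formatting codes summarised on this page.
KNOWN_CODES = {
    "PAGE", "NSTITLE", "TALKLINK", "EDITLINK", "HISTLINK",
    "RAW", "NOWIKI", "CODE", "CODE50", "CODE100",
}


def split_column_name(column):
    """Return (displayed name, list of formatting codes) for a column name.

    Hypothetical sketch: trailing underscore-separated parts that are
    known codes are treated as codes; the remaining parts form the
    displayed name, with underscores replaced by spaces.
    """
    parts = column.split("_")
    codes = []
    while len(parts) > 1 and parts[-1] in KNOWN_CODES:
        codes.insert(0, parts.pop())
    if not codes:
        raise ValueError(f"{column!r} has no formatting code")
    return " ".join(parts), codes


for name in ("Creation date_RAW", "Creation_date_RAW", "Page_NSTITLE_EDITLINK"):
    print(name, "->", split_column_name(name))
```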

The formatting codes are defined in the format_data_value function in output_formatting.py. Here is a summary:

PAGE
Formats the data as a regular link to a Wiktionary page, including the namespace name (if any).
This formatting code is intended for XML dump scripts, where the pageTitle parameter already includes the namespace name.
NSTITLE
Formats the data as a link to a Wiktionary page. The column should contain values of the form 0|dictionary, 3|This,_that_and_the_other or 4|Requests for verification, made up of the namespace number, a pipe character, and the page title (either underscores or spaces are fine).
This formatting code is intended for SQL queries, where namespace names are not available. Generate a NSTITLE value using CONCAT(page_namespace, '|', page_title) or equivalent.
TALKLINK
Can be included after PAGE or NSTITLE to add a "talk" link. The link is only added for pages in non-talk (even-numbered) namespaces.
EDITLINK
Can be included after PAGE or NSTITLE to add an "edit" link.
HISTLINK
Can be included after PAGE or NSTITLE to add a "history" link.
RAW
Raw wikitext. Use this if you know the output won't contain special wikitext characters (a simple number for example), or if you need to substitute magic words, like {{subst:NS:...}}, or templates, like {{subst:langname|...}}, into the output.
NOWIKI
Wrap in <nowiki> tags.
CODE
Wrap in <code><nowiki> tags.
CODE50
Wrap in <code><nowiki> tags, keeping only the first 50 characters and adding ... if more characters are present.
CODE100
Wrap in <code><nowiki> tags, keeping only the first 100 characters and adding ... if more characters are present.
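
A few of the rules above are sketched below: splitting an NSTITLE value, deciding whether a "talk" link applies, and truncating for CODE50 or CODE100. The function names are hypothetical and the real format_data_value function may differ in detail, for instance in whether the trailing ... is placed inside or outside the <nowiki> tags.

```python
def parse_nstitle(value):
    """Split an NSTITLE value such as "3|This,_that_and_the_other"."""
    namespace, title = value.split("|", 1)
    # Underscores and spaces are interchangeable in page titles.
    return int(namespace), title.replace("_", " ")


def gets_talk_link(namespace):
    """Talk links only apply to non-talk (even-numbered) namespaces.

    Assumption: negative (virtual) namespaces never get a talk link.
    """
    return namespace >= 0 and namespace % 2 == 0


def format_code(value, limit=None):
    """Wrap a value in <code><nowiki> tags, optionally truncating it.

    Assumption: the "..." marker goes inside the tags.
    """
    text = str(value)
    if limit is not None and len(text) > limit:
        text = text[:limit] + "..."
    return f"<code><nowiki>{text}</nowiki></code>"


print(parse_nstitle("4|Requests for verification"))
print(gets_talk_link(0), gets_talk_link(3))
print(format_code("{{head|en|noun}}", limit=50))
```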