User:Conrad.Bot/Indexing

This page may be out of date, but it should accurately reflect the status at the time it was last updated.

Languages

On multiple pages: Hungarian, Irish, Italian, Spanish, Galician, Ancient Greek, English, Lithuanian
On one page: Mapudungun, Hiligaynon

Overview

create_indices.sh Downloads the latest XML dump from http://devtionary.info/w/dump/xmlu and then runs the following programs.
nicen.dump.awk Normalize the XML dump, removing entries I am not interested in and formatting those that I am interested in more readably.
extract_words.awk Scan through the dump and add to a list every entry that contains at least one definition that doesn't look like a "form of" definition. This step also stores any audio files it finds, and notes whether the link will need a #Language anchor because the language is not the first section on the page.
- Entries whose only definition line consists entirely of a template (except {{SI unit}} and {{given name}}) are excluded
- Definitions that start with "compound of" are excluded
- Definitions that contain variations on "X form of", where X is present, perfect, plural, singular, past historic, preterite, compound, or a word ending in "ive", are excluded.
- This is of course guesswork. If you notice words that should be in the index but aren't, or words that shouldn't be in the index but are, let me know.
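As a rough illustration, the exclusion rules above could be expressed like this. This is a Python sketch rather than the actual awk code; the function names are hypothetical, and it assumes each definition is passed in with its leading "#" already stripped. It also reads the first rule literally: an entry is dropped for being template-only just when it has exactly one definition line.

```python
import re

# Templates allowed to stand alone as a definition, per the list above.
ALLOWED_TEMPLATES = {"SI unit", "given name"}

# "X form of", where X is one of the listed words or a word ending in "ive".
FORM_OF = re.compile(
    r"\b(?:present|perfect|plural|singular|past historic|preterite|compound|\w+ive)"
    r"\s+form\s+of\b",
    re.IGNORECASE,
)

# A definition that is nothing but a single template, e.g. {{plural of|gato}}.
ONLY_TEMPLATE = re.compile(r"^\{\{([^{}|]+)(?:\|[^{}]*)?\}\}$")


def is_bare_template(definition):
    match = ONLY_TEMPLATE.match(definition.strip())
    return match is not None and match.group(1).strip() not in ALLOWED_TEMPLATES


def looks_like_form_of(definition):
    text = definition.strip()
    return text.lower().startswith("compound of") or FORM_OF.search(text) is not None


def keep_entry(definitions):
    """Return True if the entry should go into the index (hypothetical helper)."""
    if len(definitions) == 1 and is_bare_template(definitions[0]):
        return False
    return any(not looks_like_form_of(d) for d in definitions)
```

For example, keep_entry(["plural form of gato", "a feline"]) keeps the entry because one definition is a real one, while keep_entry(["{{plural of|gato}}"]) drops it.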
get_trans.py Scan through the dump, find every translation into a language that is being indexed, and add it to the lists created in step 2 (extract_words.awk).
- This looks for any line starting with "*<Language name>:"
- It discards everything in (brackets).
- It will include anything in a {{t}} template or {{l}} template.
- It will include any remaining links.
- If the entire line looks like a valid term, then it will include the whole line.
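The translation rules above might be sketched as follows. This is not the real get_trans.py. The function name, the regular expressions, and the test for "looks like a valid term" are all assumptions. The sketch also assumes the term is the second positional parameter of {{t}} and {{l}}, and it tolerates a space after the asterisk.

```python
import re

PARENS = re.compile(r"\([^()]*\)")
# Assumes the term is the second positional parameter, as in {{t|es|gato|m}}.
TEMPLATE = re.compile(r"\{\{[tl]\|[^|{}]*\|([^|{}]+)[^{}]*\}\}")
# [[term]], [[term|label]] or [[term#Section|label]]; captures the target page.
LINK = re.compile(r"\[\[([^\[\]|#]+)[^\[\]]*\]\]")
# A rough guess at "looks like a valid term": words joined by space, ' or -.
PLAIN_TERM = re.compile(r"^[^\W\d_]+(?:[ '-][^\W\d_]+)*$")


def extract_translations(line, language):
    """Return the terms found on one translation line (hypothetical helper)."""
    prefix = re.match(r"\*\s*" + re.escape(language) + r":", line)
    if prefix is None:
        return []
    # Discard everything in (brackets).
    rest = PARENS.sub("", line[prefix.end():]).strip()
    # If the whole remainder looks like a term, take it as is.
    if PLAIN_TERM.match(rest):
        return [rest]
    # Otherwise take template terms first, then any remaining links.
    terms = [t.strip() for t in TEMPLATE.findall(rest)]
    rest = TEMPLATE.sub("", rest)
    terms += [t.strip() for t in LINK.findall(rest)]
    return terms
```

With these assumptions, the line "* Spanish: {{t|es|gato|m}}, [[minino]] (colloquial)" yields gato and minino, and the bracketed note is ignored.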
get_missing.py (For some languages) scan through the current index for that language and add all words there to the list as "missing".
split_index.<language name>.pl Split the list for each language into files for each starting letter, corresponding to the list of entries on each page, and (for newly added languages) sort them, and divide them by second letter.
format_index.<language name>.pl Format the per-letter lists into wikitext (for the older few languages, the sorting and splitting by second letter happens here).
indexupload.py Upload each formatted output file

Sorting and splitting

For all languages, the strings are first normalized to lowercase. As I get round to it, I intend to rewrite the old-style ones as new style ones.

Ancient Greek

Remove all space and punctuation.
Treat any remaining non-alphabetic characters, and 𐠀ϝϻϡϙ, as 0.
Remove diacritics.
Split on first two characters.
Use el_GR.utf-8 to sort original strings.
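A minimal Python sketch of how the two-character split key for Ancient Greek could be computed from the steps above. The real work is done in Perl; the function names here are hypothetical. The sketch assumes that "punctuation" means the Unicode P* categories, and that input is first brought into NFC so that combining marks are not mistaken for non-alphabetic characters. Locale-based sorting is left out.

```python
import unicodedata

# Characters that are folded to "0" even though they are letters.
SPECIAL = set("𐠀ϝϻϡϙ")


def strip_diacritics(text):
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def split_key(word):
    """Return the (up to) two-character key used to pick a sub-list."""
    out = []
    for ch in unicodedata.normalize("NFC", word).lower():
        # Remove all space and punctuation.
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            continue
        # Treat remaining non-alphabetic and the special letters as 0.
        if ch in SPECIAL or not ch.isalpha():
            out.append("0")
        else:
            out.append(ch)
    # Remove diacritics, then split on the first two characters.
    return strip_diacritics("".join(out))[:2]
```

Under these assumptions ἄνθρωπος lands under αν, and a word beginning with ϝ lands under a key starting with 0.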

Galician

(old style)
Remove all diacritics (except ñ).
Treat non-alphabetic characters as 0.
Split on first two characters.
Sort on normalised form.
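The old-style Galician steps could look roughly like this in Python. The actual scripts are Perl; the function names are hypothetical, and the sketch assumes that spaces and punctuation count as non-alphabetic and therefore become 0.

```python
import unicodedata


def normalise_gl(word):
    """Lowercase, drop diacritics except on ñ, and map non-letters to 0."""
    out = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        if ch == "\u00f1":  # keep ñ as a letter in its own right
            out.append(ch)
            continue
        # The first code point of the decomposition is the base character.
        base = unicodedata.normalize("NFD", ch)[0]
        out.append(base if base.isalpha() else "0")
    return "".join(out)


def split_and_sort(words):
    """Group words by the first two normalised characters, sorted within groups."""
    groups = {}
    for word in words:
        key = normalise_gl(word)
        groups.setdefault(key[:2], []).append((key, word))
    return {
        prefix: [word for _, word in sorted(pairs)]
        for prefix, pairs in sorted(groups.items())
    }
```

Because the sort key is the normalised form, árbore sorts as if it were arbore, while año keeps its ñ and sorts after every an... and ar... word.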

Hungarian

(old style)
Replace á é í ó ú ő ű with a e i o u ö ü
Treat non-alphanumeric characters as 0
Split on first two (cs|gy|ly|ny|sz|ty|zs|[[:alpha:]0])
Sort on normalised form.
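The Hungarian rules treat digraphs such as cs and sz as single units when splitting. A Python sketch of that idea follows; the real code is Perl, the names are hypothetical, and [^\W\d_] stands in for [[:alpha:]]. As in the pattern above, digits other than 0 do not form a unit, so a key stops at the first such digit.

```python
import re
import unicodedata

# Fold long vowels onto their short counterparts; ö and ü are kept.
FOLD = str.maketrans("áéíóúőű", "aeiouöü")

# One sorting unit: a digraph, a single letter, or the placeholder 0.
UNIT = r"(?:cs|gy|ly|ny|sz|ty|zs|[^\W\d_]|0)"
FIRST_TWO = re.compile(UNIT + "{1,2}")


def normalise_hu(word):
    folded = unicodedata.normalize("NFC", word.lower()).translate(FOLD)
    return "".join(c if c.isalnum() else "0" for c in folded)


def split_key_hu(word):
    """Return the first two units of the normalised word."""
    match = FIRST_TWO.match(normalise_hu(word))
    return match.group(0) if match else ""
```

So csütörtök is filed under csü rather than cs, and sorting on the normalised form is then just sorted(words, key=normalise_hu).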

Irish

Remove all space and punctuation.
Treat any remaining non-alphabetic characters as 0.
Remove diacritics.
Remove any leading "an ".
Split on first two characters.
Use ga_IE.utf-8 to sort original string.

Italian

(old style)
Remove all diacritics.
Remove any leading "a ".
Treat non-alphabetic characters as 0.
Split on first two characters.
Sort on normalised form.

Spanish

Remove all space and punctuation.
Treat any remaining non-alphabetic characters as 0.
Split on first two (ñ|ll|ch|[[:alpha:]0])
Use es_ES.utf-8 to sort original string.
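For Spanish the diacritics are kept, and ll and ch count as single units when splitting. A Python sketch under the same caveats as before: the real code is Perl, the names are hypothetical, and "punctuation" is taken to mean the Unicode P* categories. The sorting helper uses the standard locale module and falls back to plain ordering if the es_ES locale is not installed. That fallback is an assumption of this sketch, not something the bot does.

```python
import locale
import re
import unicodedata

# One sorting unit: ñ, a digraph, a single letter, or the placeholder 0.
UNIT = r"(?:ñ|ll|ch|[^\W\d_]|0)"
FIRST_TWO = re.compile(UNIT + "{1,2}")


def clean_es(word):
    out = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        # Remove all space and punctuation.
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            continue
        # Anything left that is not a letter becomes 0.
        out.append(ch if ch.isalpha() else "0")
    return "".join(out)


def split_key_es(word):
    """Return the first two units of the cleaned word."""
    match = FIRST_TWO.match(clean_es(word))
    return match.group(0) if match else ""


def sort_es(words):
    # Collate the original strings with the es_ES locale if it is installed;
    # otherwise fall back to code-point order (an assumption of this sketch).
    try:
        locale.setlocale(locale.LC_COLLATE, "es_ES.UTF-8")
    except locale.Error:
        return sorted(words)
    return sorted(words, key=locale.strxfrm)
```

With this, chocolate is filed under cho and llave under lla, while calle is filed under ca because its ll is not among the first two units.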

Formatting

Currently all languages are treated about the same:

Strike through links that were added as "missing" from the indexes.
Add an {{audio-list}} for an audio file, if one was found.
Abbreviate PoS and add that in italics.
Add an * linked to any entries which contained the word.
Add #<language name> to links that were not the first on the page.
Put the lists (#-lists) into a <div class="index"></div>, separated by ===-headings and a table of contents.
- This means that the lists run horizontally, so they can change width to fill the space available to them, and users can continue scrolling downwards without having to go back up to find the next column.
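To make the formatting steps more concrete, here is a Python sketch of how one index line and one page might be assembled. The real formatting is done in Perl. The function names, the abbreviation table, the use of <s> for strikethrough, and the placement of the headings inside the div are all assumptions. The {{audio-list}} and * links are left out because their exact syntax is not described here.

```python
# Hypothetical abbreviation table; the real list is not given on this page.
POS_ABBREVIATIONS = {"noun": "n", "verb": "v", "adjective": "adj"}


def format_entry(word, language, pos=None, first_section=True, missing=False):
    """Build one #-list line for the index (hypothetical helper)."""
    if first_section:
        link = "[[" + word + "]]"
    else:
        # Point at the language section when it is not the first on the page.
        link = "[[" + word + "#" + language + "|" + word + "]]"
    if missing:
        # Strike through words that only came from the existing index.
        link = "<s>" + link + "</s>"
    line = "# " + link
    if pos:
        # Abbreviated part of speech, in italics.
        line += " ''" + POS_ABBREVIATIONS.get(pos, pos) + "''"
    return line


def format_page(sections):
    """sections is a list of (heading, list of entry lines) pairs."""
    parts = ['<div class="index">']
    for heading, lines in sections:
        parts.append("===" + heading + "===")
        parts.extend(lines)
    parts.append("</div>")
    return "\n".join(parts)
```

For instance, format_entry("chat", "French", first_section=False, missing=True) gives a struck-through link to chat#French.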