User:Conrad.Bot/Indexing
This page may be out of date, but it should accurately reflect the current status when it was updated.
Languages
- On multiple pages: Hungarian, Irish, Italian, Spanish, Galician, Ancient Greek, English, Lithuanian
- On one page: Mapudungun, Hiligaynon
Overview
create_indices.sh
Downloads the latest XML dump from http://devtionary.info/w/dump/xmlu and then runs the following programs.
nicen.dump.awk
Normalize the XML dump, removing entries I am not interested in and formatting the rest more readably.
extract_words.awk
Scan through the dump and add to a list every entry that contains at least one definition that doesn't look like a "form of" definition. This step also records any audio files it finds, and notes whether the link will need a #Language anchor because the language is not the first section on the page.
- Entries whose only definition line consists entirely of a template (except {{SI unit}} and {{given name}}) are excluded.
- Definitions that start with "compound of" are excluded.
- Definitions that contain variations on "X form of", where X is present/perfect/plural/singular/past historic/preterite/compound/a word ending in "ive", are excluded.
- This is of course guesswork; if you notice words that should be in the index but aren't, or words that shouldn't be in the index but are, let me know.
get_trans.py
Scan through the dump, collect every translation in a language that is being indexed, and add them to the lists created by extract_words.awk.
get_missing.py
(For some languages) Scan through the current index for that language and add all words there to the list as "missing".
split_index.<language name>.pl
Split the list for each language into one file per starting letter, corresponding to the list of entries on each page, and (for newly added languages) sort them and divide them by second letter.
format_index.<language name>.pl
Format the per-letter lists into wikitext (for the few older languages, the sorting and splitting by second letter happens here).
indexupload.py
Upload each formatted output file.
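The exclusion rules listed under extract_words.awk can be sketched roughly as follows. This is an illustrative Python approximation, not the actual awk script: the function names and the regular expressions are assumptions, and the real patterns are likely to differ in detail.

```python
import re

# Templates that may stand alone as a definition (from the list above).
ALLOWED_TEMPLATES = ("SI unit", "given name")

# Hypothetical pattern for "X form of" definitions.
FORM_OF = re.compile(
    r"\b(?:present|perfect|plural|singular|past historic|preterite|compound|\w+ive)"
    r"\s+form\s+of\b",
    re.IGNORECASE,
)

# A line that is nothing but a single template, e.g. {{plural of|gato}}.
ONLY_TEMPLATE = re.compile(r"^\{\{([^{}|]+)(\|[^{}]*)?\}\}$")


def is_form_of(text):
    """Return True if one definition looks like a 'form of' definition."""
    if text.lower().startswith("compound of"):
        return True
    return bool(FORM_OF.search(text))


def should_index(definitions):
    """Decide whether an entry with these definition lines goes in the index."""
    lines = [d.lstrip("#").strip() for d in definitions]
    # An entry whose only definition line is a bare template is excluded,
    # unless the template is one of the allowed ones.
    if len(lines) == 1:
        m = ONLY_TEMPLATE.match(lines[0])
        if m and m.group(1).strip() not in ALLOWED_TEMPLATES:
            return False
    # Otherwise one definition that is not a "form of" line is enough.
    return any(not is_form_of(line) for line in lines)
```

For example, should_index(["# plural form of [[gato]]"]) is false, while an entry that also has an ordinary definition line is kept.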
Sorting and splitting
For all languages, the strings are first normalized to lowercase. As I get round to it, I intend to rewrite the old-style ones as new-style ones.
Ancient Greek
- Remove all spaces and punctuation.
- Treat any remaining non-alphabetic characters and 𐠀ϝϻϡϙ as 0.
- Remove diacritics.
- Split on first two characters.
- Use el_EL.utf-8 to sort original strings.
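A minimal Python sketch of how the two-character grouping key for Ancient Greek could be computed. The real work is done by the Perl scripts; the function name grouping_key is hypothetical, and collapsing the steps into a single pass over the decomposed string is an assumption. The locale-based sort of the original strings is not shown.

```python
import unicodedata

# Characters that the list above says are filed under 0.
ARCHAIC = set("𐠀ϝϻϡϙ")


def grouping_key(word):
    """Return the normalised first two characters used to split the index."""
    # Decompose so that accents and breathings become separate combining marks.
    decomposed = unicodedata.normalize("NFD", word.lower())
    out = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category.startswith("M"):
            continue  # combining mark, i.e. a diacritic: drop it
        if category.startswith(("Z", "P")):
            continue  # space or punctuation: drop it
        if ch in ARCHAIC or not ch.isalpha():
            out.append("0")
        else:
            out.append(ch)
    return "".join(out)[:2]
```

With this sketch, ἄνθρωπος is grouped under αν and ϝέργον under 0ε.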
Galician
- (old style)
- Remove all diacritics (except ñ).
- Treat non-alphabetic characters as 0.
- Split on first two characters.
- Sort on normalised form.
Hungarian
- (old style)
- Replace á é í ó ú ő ű with a e i o u ö ü.
- Treat non-alphanumeric characters as 0.
- Split on first two (cs|gy|ly|ny|sz|ty|zs|[[:alpha:]0]).
- Sort on normalised form.
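The Hungarian split treats digraphs such as cs and sz as single letters. A rough Python sketch of that idea follows; the names normalise and first_two_units are hypothetical, and the Python character class is only an approximation of the POSIX [[:alpha:]] used in the Perl script.

```python
import re

# Fold long vowels onto their short counterparts, as in the list above.
FOLD = str.maketrans("áéíóúőű", "aeiouöü")

# One sorting unit: a digraph, a single letter, or the placeholder 0.
UNIT = re.compile(r"cs|gy|ly|ny|sz|ty|zs|[^\W\d_]|0")


def normalise(word):
    """Lowercase, fold long vowels, and map non-alphanumerics to 0."""
    word = word.lower().translate(FOLD)
    return "".join(ch if ch.isalnum() else "0" for ch in word)


def first_two_units(word):
    """Return up to two leading units of the normalised word."""
    norm = normalise(word)
    units = []
    pos = 0
    while len(units) < 2:
        m = UNIT.match(norm, pos)
        if m is None:
            break  # e.g. a digit, which the unit pattern does not cover
        units.append(m.group())
        pos = m.end()
    return units
```

So szép is filed under sz and then e, rather than under s and z.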
Irish
- Remove all spaces and punctuation.
- Treat any remaining non-alphabetic characters as 0.
- Remove diacritics.
- Remove any leading an.
- Split on first two characters.
- Use gl_GL.utf-8 to sort original string.
Italian
- (old style)
- Remove all diacritics.
- Remove any leading a.
- Treat non-alphabetic characters as 0.
- Split on first two characters.
- Sort on normalised form.
Spanish
- Remove all spaces and punctuation.
- Treat any remaining non-alphabetic characters as 0.
- Split on first two (ñ|ll|ch|[[:alpha:]0]).
- Use es_ES.utf-8 to sort original string.
Formatting
Currently all languages are treated about the same:
- Strike through links that were added as "missing" from the indexes.
- Add an {{audio-list}} for an audio file, if one was found.
- Abbreviate the PoS and add that in italics.
- Add an * linked to any entries which contained the word.
- Add #<language name> to links that were not the first on the page.
- Put the lists (#-lists) into a <div class="index"></div>, separated by ===-headings and a table of contents.
- This means that the lists run horizontally, so they can change width to fill the maximum amount of space available to them, and users can continue scrolling downwards without having to go up to find the next column.
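As an illustration only, here is a Python sketch of how a single list item might be assembled from these rules. The function name format_item, its parameters, and the exact wikitext layout are assumptions; the real format_index scripts are written in Perl and may produce something different. The {{audio-list}} call and the * back-links are left out because this page does not show their exact form.

```python
def format_item(word, language, pos_abbrev="", needs_anchor=False, missing=False):
    """Build one '#' list item for the index (hypothetical layout)."""
    if needs_anchor:
        # The language is not the first section on the page, so link to it.
        link = f"[[{word}#{language}|{word}]]"
    else:
        link = f"[[{word}]]"
    if missing:
        # Words carried over from the old index are struck through.
        link = f"<s>{link}</s>"
    line = f"# {link}"
    if pos_abbrev:
        # Abbreviated part of speech in italics.
        line += f" ''{pos_abbrev}''"
    return line
```

For instance, format_item("pie", "Spanish", pos_abbrev="n", needs_anchor=True) gives a line that links to the Spanish section of the page and appends the abbreviation in italics.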