User:WingerBot

I'm a bot created and controlled by Benwing. My source code is written in Python, using Pywikibot, mwparserfromhell, the blib library from MewBot (heavily modified), and a lot of custom code. Source code is available on GitHub: [1]

If you see me do something bad, please leave a message at User talk:Benwing2.

My purpose is to make various sorts of automated changes to entries in order to assist in the maintenance of Wiktionary.

Specifics

This section is very out of date; it originally described the operation of some of the bot's tasks.

Adding declensions

One of my tasks is automatically adding the declension of Arabic nouns and adjectives. Generally this involves taking the headword line (either {{ar-noun}}, {{ar-coll-noun}}, {{ar-sing-noun}} or {{ar-adj}}) and converting it to the corresponding declension template ({{ar-decl-noun}}, {{ar-decl-coll-noun}}, {{ar-decl-sing-noun}} or {{ar-decl-adj}}). A new declension section is added just below the definition. However, there are a lot of complexities, e.g.:

  1. If there is a manual transliteration, it needs to be appended using slash notation in the declension template, rather than as a separate param, e.g. {{ar-noun|بْرُوتُون|m|tr=brōtōn}} gets converted to {{ar-decl-noun|بْرُوتُون/brōtōn}} (see the sketch after this list). The same needs to be done for plurals, feminine forms, additional heads, etc.
  2. If the lemma begins with a definite article, the article needs to be stripped and |state=def added. The article likewise needs to be stripped from plurals, additional heads, etc.
  3. If the lemma is plural, we need to make it plural-only using something like {{ar-decl-noun|-|pl=نَاس}}.
  4. If the lemma is multiword, there are lots of complications. The declension code currently doesn't know how to handle multiword expressions with more than two words, so we skip them (there are only a few of these). Otherwise, we need to split up the two words, with the first one going into the base slot (parameter 1=, pl=, etc.) and the second going into the mod slot (mod=, modpl=, etc.). If there is a transliteration, it also needs to be split up and each part appended using slash notation. (However, in many cases the transliteration is only present to indicate whether a feminine base noun is in the indefinite or construct state. Since we have to distinguish the two in any case (see below), such transliteration is redundant and we remove it.) In addition, a major issue is that we have to determine whether the modifier is adjectival or ʾidāfa (a noun in the genitive case, with the base noun in the construct state). Sometimes we can figure this out automatically by looking for the presence of a definite article in the base noun (which cannot be present in an ʾidāfa construction) or modifier (for an adjectival modifier, the article must be present either on both base and modifier or on neither), but sometimes we need to resort to a manual list of which are which. In an ʾidāfa construction, we need to add something like |state=con|modcase=gen|modstate=def|modnumber=sg (for a definite singular ʾidāfa modifier). Note that we can't figure out the correct value of |modnumber= automatically, so we guess singular (or collective for a collective noun, singulative for a singulative noun).
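
The simple cases (items 1 and 2) can be sketched in a few lines of Python using mwparserfromhell, which I already depend on. This is a toy illustration, not the actual code: the helper name is made up, and it ignores plurals, feminine forms, multiword lemmas and everything else discussed above.

    import mwparserfromhell

    # The definite article, vocalized and plain (simplified; the real code
    # also deals with sun-letter assimilation in the transliteration).
    ARTICLES = ("اَلْ", "ال")

    def headword_to_decl(text):
        """Toy conversion of a simple {{ar-noun}} headword line into the
        corresponding {{ar-decl-noun}} call (items 1 and 2 above only)."""
        for t in mwparserfromhell.parse(text).filter_templates():
            if t.name.strip() != "ar-noun":
                continue
            lemma = str(t.get(1).value).strip()
            extra = []
            for art in ARTICLES:
                if lemma.startswith(art):
                    # Item 2: strip the article and record |state=def.
                    lemma = lemma[len(art):]
                    extra.append("state=def")
                    break
            if t.has("tr"):
                # Item 1: fold the manual transliteration in with a slash.
                lemma += "/" + str(t.get("tr").value).strip()
            return "{{ar-decl-noun|" + "|".join([lemma] + extra) + "}}"

    print(headword_to_decl("{{ar-noun|بْرُوتُون|m|tr=brōtōn}}"))
    # {{ar-decl-noun|بْرُوتُون/brōtōn}}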

Creating entries for plurals, feminines, verbal nouns, participles, and various finite verb parts

One of my tasks is creating entries for non-lemma forms of nouns (plurals), adjectives (plurals and feminines), and verbs (participles and finite verb parts), as well as entries for verbal nouns of verbs (which are lemmas in and of themselves). Suffice it to say there are a whole lot of complexities, which vary depending on which sort of entry is being created. The code that handles creating these entries is currently over 1,100 lines of Python. We need to handle various cases: e.g. creating a new page, adding an entry to an existing page that lacks an Arabic section, adding an entry to an existing Arabic section of an existing page, adding a new Etymology section when there is already more than one, and adding a new Etymology section when there is only one, which involves changing the existing Etymology section to "Etymology 1" and increasing the indent of everything else by one (this last case is sketched in code after the following list). We also need to check for existing entries for the form in question, so that if we're run more than once we don't insert duplicate entries. Some of the things we do:

  1. For plurals, we overwrite the singular lemma and plural form in existing entries if the existing ones are lacking some vowel diacritics and we have better-vocalized versions available. We also remove ʾiʿrāb (esp. the nominative indefinite ending -un), since it has now been decided to leave out such endings and indicate the appropriate ʾiʿrāb in the declension table.
  2. For verbal nouns, we convert existing entries that say ===Verbal noun=== to just ===Noun=== and convert headword templates from {{ar-verbal noun}} to {{ar-noun}}. Instead we add a definitional line saying e.g. {{ar-verbal noun of|أَكَلَ|form=I}}; this goes at the beginning if there are other definitional lines as well. This is because verbal nouns typically have additional meanings beyond their meaning as a verbal noun, and thus there is no clean separation possible between verbal nouns and regular nouns.
  3. For finite verb parts, we group different parts that have the same spelling (including diacritics) and the same conjugation form (e.g. I, II, III ...) under a single headword with different definitional entries. Examples are يَحْزُنُ (yaḥzunu), third-singular masculine non-past indicative active of both حَزَنَ (ḥazana) and حَزُنَ (ḥazuna), and يُجَارُّ (yujārru), which is both active and passive third-singular masculine non-past indicative of جَارَّ (jārra). We group differently-spelled words with the same conjugation form under a single etymology section (e.g. active indicative يَكْتُبُ (yaktubu), active subjunctive يَكْتُبَ (yaktuba), passive indicative يُكْتَبُ (yuktabu), etc.). We group words with different conjugation forms under different etymology sections even if they have the same spelling (e.g. يُجَارُ (yujāru), which is the passive third-singular masculine non-past indicative of both form I جَارَ (jāra) and form IV أَجَارَ (ʔajāra)). We also try to include verb parts of a given conjugational form with lemmas of that form, if they exist on the same page (e.g. passive كُتِبَ (kutiba) corresponding to active lemma كَتَبَ (kataba)).
  4. We try to insert participles directly before nouns and adjectives of the same spelling, in the same etymology section.
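
Of the page-structure cases mentioned above, the trickiest is adding a new Etymology section when there is only one. A rough sketch of that renumbering step in Python (the helper is hypothetical; the real code works on parsed sections and handles many more edge cases):

    import re

    HEADER_RE = re.compile(r"^(=+)\s*(.*?)\s*\1\s*$")

    def promote_single_etymology(section):
        """Rename a lone ===Etymology=== header to ===Etymology 1=== and
        push every other level-3-or-deeper header down one level, so a
        second Etymology section can be added alongside it."""
        out = []
        for line in section.splitlines():
            m = HEADER_RE.match(line)
            if m and m.group(2) == "Etymology":
                out.append("===Etymology 1===")
            elif m and len(m.group(1)) >= 3:
                # e.g. ===Noun=== becomes ====Noun====
                out.append("=" + line.strip() + "=")
            else:
                out.append(line)
        return "\n".join(out)

    print(promote_single_etymology(
        "==Arabic==\n\n===Etymology===\nFrom X.\n\n===Noun===\n# gloss"))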

Arabic vocalization

One of my tasks is the automatic vocalization of Arabic words using the Latin transliteration. This works using the tr_matching function of Module:ar-translit (or rather, the equivalent function in the corresponding Python module). This function is able to handle all sorts of different transliteration standards, since there are a lot of different ways of transliterating Arabic currently in use. (I will fix this up eventually.)
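
The idea can be illustrated with a toy matcher that handles only single-letter consonants and the three short vowels; the real function also copes with long vowels, shadda, hamza seats, multi-letter romanizations like kh, and the various competing transliteration schemes. Everything below is illustrative, not the module's actual table or logic.

    # fatḥa, kasra, ḍamma
    DIACRITICS = {"a": "\u064E", "i": "\u0650", "u": "\u064F"}
    # A tiny slice of the consonant table.
    CONSONANTS = {"ك": "k", "ت": "t", "ب": "b", "د": "d", "ر": "r",
                  "س": "s", "ل": "l", "م": "m", "ن": "n"}

    def vocalize(arabic, latin):
        """Walk the unvocalized Arabic and the Latin transliteration in
        parallel, turning each short vowel in the Latin into a diacritic
        on the preceding Arabic consonant. Returns None on any mismatch
        (matching the rule below: no complete match, no edit)."""
        out, j = [], 0
        for ch in arabic:
            if j < len(latin) and CONSONANTS.get(ch) == latin[j]:
                out.append(ch)
                j += 1
                if j < len(latin) and latin[j] in DIACRITICS:
                    out.append(DIACRITICS[latin[j]])
                    j += 1
            else:
                return None
        return "".join(out) if j == len(latin) else None

    print(vocalize("كتب", "kataba"))  # كَتَبَ (kataba)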

Some relevant facts about how I perform this task:

  • I only edit the English Wiktionary.
  • I only operate on headword-line templates. The current list of headword templates that I operate on is {{ar-adj}}, {{ar-adv}}, {{ar-coll-noun}}, {{ar-sing-noun}}, {{ar-con}}, {{ar-interj}}, {{ar-noun}}, {{ar-numeral}}, {{ar-particle}}, {{ar-prep}}, {{ar-pron}}, {{ar-proper noun}}, {{ar-verbal noun}}, {{ar-noun-pl}}, {{ar-adj-pl}}, {{ar-noun-dual}}, {{ar-adj-dual}}, and {{ar-nisba}}.
  • I operate in the following modes (see the code sketch after this list):
    1. If the first unnamed parameter is present (which is intended to contain the vocalized equivalent of the page name and serves as the |head= parameter of the underlying call to {{head}}, and in fact was formerly the |head= parameter of the higher-level templates, before a recent global change by MewBot), I add any missing diacritics in it based on |tr=.
    2. Otherwise, I vocalize the page name based on the value in |tr= and store it as the first unnamed parameter.
    3. I also add any missing vowel diacritics to |head2= based on |tr2=, |head3= based on |tr3=, etc.
    4. I also add any missing vowel diacritics to a whole series of other headword-template parameters. All of these parameters have the following format, based on a base param name, e.g. |pl=:
      1. There is Arabic text in the base parameter, e.g. |pl=, and corresponding transliterated text in the parameter whose name is the base name with tr appended, e.g. |pltr=. In this case I add missing vowel diacritics to the Arabic in e.g. |pl= based on the transliteration in |pltr=.
      2. There may similarly be one or more alternative Arabic forms in the base parameter plus a number starting from 2, with corresponding transliterations, e.g. |pl2= with transliteration |pl2tr=, |pl3= with transliteration |pl3tr=, etc. I add missing vowel diacritics to these, as above.
    5. I operate on the following base parameters: |pl=, |plobl=, |cpl=, |cplobl=, |fpl=, |fplobl=, |f=, |fobl=, |m=, |mobl=, |obl=, |el=, |sing=, |coll=, |d=, |dobl=, |pauc=, and |cons=.
  • I only add vowel diacritics. Currently I don't ever modify existing vowel diacritics. I also don't modify any consonant characters, although there is call to do so (in particular, changing initial plain ا to أ in various cases, which currently needs to be done by hand).
  • When I can't match completely, I don't do anything.
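
In code, the modes above amount to enumerating (Arabic, transliteration) parameter-name pairs on each headword template and running the matcher on each pair. A sketch, assuming a mwparserfromhell Template object t (the helper is hypothetical, and mode 2, filling in the first unnamed parameter from the page name, is elided):

    BASE_PARAMS = ["pl", "plobl", "cpl", "cplobl", "fpl", "fplobl", "f",
                   "fobl", "m", "mobl", "obl", "el", "sing", "coll", "d",
                   "dobl", "pauc", "cons"]

    def param_pairs(t):
        """Yield (arabic_param, translit_param) name pairs to vocalize on
        headword template t."""
        # Modes 1 and 3: the head itself, then head2/tr2, head3/tr3, ...
        yield "1", "tr"
        i = 2
        while t.has("head%d" % i):
            yield "head%d" % i, "tr%d" % i
            i += 1
        # Mode 4: each base parameter and its numbered alternatives,
        # e.g. pl/pltr, then pl2/pl2tr, pl3/pl3tr, ...
        for base in BASE_PARAMS:
            if t.has(base):
                yield base, base + "tr"
            i = 2
            while t.has("%s%d" % (base, i)):
                yield "%s%d" % (base, i), "%s%dtr" % (base, i)
                i += 1

For each pair, the bot would vocalize the Arabic value against the transliteration value and write the result back into the template.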

Transliteration fixup

One of my tasks is to standardize the various transliterations in use to follow the preferred transliteration at WT:About Arabic (generally the first entry in the Romanization column), and to delete redundant transliterations. This preferred transliteration is what is used when automatically transliterating from Arabic. This works as follows (see the sketch after the list):

  1. I use the same tr_matching function as is used for vocalization. This matches up the Arabic text and Latin transliteration, inserting Arabic short vowels as necessary and match-canonicalizing the transliteration using a variant of the table used for vocalization. This makes it possible, for example, to determine whether "kh" should be converted to "ḵ" or left as "kh", and to insert missing ʾ symbols at the beginning of transliterations (corresponding to hamzas). The table knows about certain cases that shouldn't be canonicalized, e.g. both ū and ō map to Arabic و but should remain distinct. ʾiʿrāb present in the Arabic but missing from the transliteration is inserted into the transliteration.
  2. If match-canonicalization succeeds, the regular tr function is then used to auto-transliterate the Arabic into Latin. If this works, this is compared against the match-canonicalized transliteration, and if they're the same, the transliteration is removed. Otherwise, the match-canonicalized transliteration replaces the original. Chief reasons why the match-canonicalization won't match the auto-transliteration are mid vowels in the match-canonicalization (ē, ō, e, o, which appear as ī, ū, i, u in the auto-transliteration) and phrase-internal tāʾ marbūṭa, which will appear as either nothing or t in the match-canonicalization but as (t) in the auto-transliteration.
  3. If match-canonicalization fails (because the Arabic and transliteration can't be matched up), a more limited type of canonicalization termed self-canonicalization is used. This applies unambiguous canonicalizations of various sorts, e.g. converting aa to ā, removing accents and replacing x with ḵ. This type of canonicalization is e.g. unable to replace kh with ḵ, because kh might stand for a k next to an h. The self-canonicalized transliteration replaces the original.
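
Put together, the decision logic is roughly the following. The signatures are hypothetical: tr_matching and tr stand for the Python ports of the Module:ar-translit functions, and self_canonicalize for the limited fallback of step 3; all three are passed in as callables for illustration.

    def fix_translit(arabic, translit, tr_matching, tr, self_canonicalize):
        """Sketch of the three-step pipeline above. Returns a pair
        (arabic, translit); translit is None when the manual
        transliteration turned out to be redundant."""
        try:
            # Step 1: match-canonicalize, vocalizing the Arabic on the way.
            vocalized, canon = tr_matching(arabic, translit)
        except Exception:
            # Step 3: the two couldn't be matched up; apply only the
            # unambiguous canonicalizations (aa -> ā, x -> ḵ, accents).
            return arabic, self_canonicalize(translit)
        # Step 2: if auto-transliteration reproduces the match-canonicalized
        # transliteration, the manual transliteration is redundant.
        if tr(vocalized) == canon:
            return vocalized, None
        return vocalized, canon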