User:Robert Ullmann/SC recovery
English Wiktionary recovery procedure for "Serbo-Croatian" entries
Objective: to restore language sections and content removed by several editors intent on merging Croatian, Bosnian, and Serbian into "Serbo-Croatian".
A report of what is to be done, and of what has been done so far, is at /report. At present it is a survey of the extensive damage that has been done to the wikt database.
Recover deleted sections
Using page history, find each of the standard language sections as it existed before being deleted, and restore it to the entry.
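As an illustration of that lookup, here is a minimal sketch that pulls past revisions straight from the live MediaWiki API with the requests library and scans them for a given language section. The full script below uses its own readapi helper instead; the function name old_sections and the revision limit here are illustrative assumptions, not part of the original tooling.

import re
import requests

API = "https://en.wiktionary.org/w/api.php"

def old_sections(title, lang="Serbian", limit=50):
    """Yield past copies of a language section, newest revision first."""
    resp = requests.get(API, params={
        "action": "query", "prop": "revisions", "rvprop": "content",
        "rvlimit": limit, "titles": title, "format": "json",
    }).json()
    # a section runs from its L2 header to the standard "----" separator (or end of page)
    pattern = re.compile(r"==%s==\n(.*?)(?:\n----|\Z)" % lang, re.S)
    for page in resp["query"]["pages"].values():
        for rev in page.get("revisions", []):
            mo = pattern.search(rev.get("*", ""))  # "*" is the legacy content key
            if mo:
                yield mo.group(1)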
Process steps and notes
This is a rough outline of the process used to implement the above; it is not complete.
- Read the XML dump and identify entries with missing language sections.
  - If an entry does not have a "Serbo-Croatian" language header, skip it.
- Get the current entry from the wikt.
- Identify each missing section.
- Look for the old section in the entry history and restore it if found (see the sketch after this list):
  - read a reasonable number of past revisions from the API, and find the most recent section for the language
  - add the section to the current entry text
  - tag the restored section if there is any indication from the SC section that it may need review (language or dialect tags, primarily)
- If the entry was modified by the above, save it with a comment explaining which sections were restored or created.
  - Tag the entry for AF to re-sort the languages and clean up the spacing.
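A minimal sketch of the restore-and-tag steps above, stripped of the pywikipedia framework that the full script in the next section depends on. The helper name restore_missing_sections and the trimmed language table are assumptions for illustration (the real script also handles Montenegrin); the section regex and the {{attention}} tag follow the same pattern as the real code.

import re

# standard languages to restore, keyed by language code
LANGS = {"bs": "Bosnian", "hr": "Croatian", "sr": "Serbian"}

def restore_missing_sections(text, old_revisions, flag_for_review=False):
    """Append the most recent old copy of each missing language section.

    text: current wikitext of the entry
    old_revisions: past revision texts, newest first
    """
    for code in sorted(LANGS):
        lang = LANGS[code]
        if ("==%s==" % lang) in text:
            continue  # section already present, nothing to restore
        pattern = re.compile(r"==%s==\n(.*?)(?:\n----|\Z)" % lang, re.S)
        for rev in old_revisions:
            mo = pattern.search(rev)
            if mo:
                section = mo.group(1)
                if flag_for_review:
                    # leave a marker so a human checks the restored section
                    section += "\n{{attention|%s|section restored, should be checked}}\n" % code
                text += "\n----\n==%s==\n%s" % (lang, section)
                break  # newest surviving copy wins
    return text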
Restoring the language sections that were improperly deleted is very straightforward; the only true error case is restoring a section that was deleted because the word actually does not exist in the given language.
Code
The code roughly follows the process above; it is a work in progress as we discover the extent of the damage done to the database and consider what should be done automatically to recover the deleted work. It is probably full of debug statements and tests, and is likely to be out of date at any given time. (;-)
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
English Wiktionary recovery procedure for "Serbo-Croatian" entries

Process:
    Read XML dump, identify entries with missing language sections.
    Re-read from live DB.
    Identify each missing section.
    Look for old section in entry history, restore it if found.
    If entry modded, save.

No command line arguments.
"""

import wikipedia
import xmlreader
import sys
import re
import xmldate
import shelve
import urllib
from time import sleep
from mwapi import *

def srep(s):
    return repr(u'' + s)[2:-1]

# section cache from reading old revs:
scache = shelve.open("screcover-cache")

import transliteration

def stlit(w):
    # does this work reasonably for Serbian? (is general Cyrillic) We only need sort keys;
    # appears to bring the entries together usefully in "ordinary" a-z order, that's all
    # we want; doing exact Cyrl-Latin for Serbian and sorting on the Latin order could
    # be done, but isn't needed; this is just the report
    lat = u''
    for c in w:
        lat += transliteration.trans(c, default = c)
    lat = lat.lower()
    # make into a sort key, defined order (by original):
    lat += ' ' + w
    return lat

def main():

    test = True
    partial = False

    resect = re.compile(r"==Serbo-Croatian==\n(.*?)\n----", re.S)

    # remove iwikis
    reiwiki = re.compile(r"^\[\[[a-z-]+:.*?\]\]\n", re.M)

    # find a def line in a section (just for report)
    redefn = re.compile(r'^#(.*)$', re.M)

    # match sections in old revs
    rerev = { }
    for c, lang in [ ('sr', 'Serbian'), ('hr', 'Croatian'),
                     ('bs', 'Bosnian'), ('zls-mon', 'Montenegrin') ]:
        rerev[c] = re.compile(r"==" + lang + "==\n(.*?)(\n----|</rev>)", re.S)

    # make sure we are logged in
    site = wikipedia.getSite()
    if not test: site.forceLogin()

    # get XML dump
    dump = xmlreader.XmlDump("en-wikt.xml")

    entries = 0
    scs = 0
    probs = 0
    fixed = 0
    replines = { }

    for entry in dump.parse():

        text = entry.text
        title = entry.title
        if ':' in title: continue
        entries += 1
        if entries % 10000 == 0:
            print "%d entries, %d scs, %d problems" % (entries, scs, probs)

        if "==Serbo-C" not in text: continue
        mo = resect.search(text + "\n----\n")
        if not mo: continue
        s = mo.group(1)
        scs += 1

        # Identify entries with missing language sections

        # we'll only want to add Serbian for Cyrillic entries
        iscyrl = False
        if ord(title[0]) >= 0x0400 and ord(title[0]) < 0x0530: iscyrl = True

        # select 5%
        if test and partial and scs % 20: continue

        # Retrieve current version
        print "%s:" % srep(title)
        # print " tlit", srep(stlit(title))

        page = wikipedia.Page(site, title)
        try:
            if not test: text = getedit(page)
        except Exception, e:
            print " some exception getting page", repr(e)
            continue

        act = ''
        old = text

        # re-isolate SC section:
        mo = resect.search(text + "\n----\n")
        if not mo:
            print " no SC section in current entry"
            continue
        s = mo.group(1)

        # report items:
        sdef = ''
        mo = redefn.search(s)
        if mo: sdef = mo.group(1).strip()

        # some ad hoc fixes
        sdef = sdef.replace("#English|", "")
        sdef = sdef.replace("{{", "{{i|")
        sdef = sdef.replace("|i|", "|")
        sdef = sdef.replace("|context|", "|")
        sdef = sdef.replace("|gloss|", "|")
        sdef = sdef.replace("|term|", "|")
        sdef = sdef.replace("|lang=sh}}", "}}")

        # then truncate
        if len(sdef) > 48: sdef = sdef[:45] + " ..."

        # should be safe now, unless truncate has split a {{ ... }}
        if "{{" in sdef and "}}" not in sdef:
            sdef = sdef.replace("{{", "<" "nowiki>{{<" "/nowiki>")
            sdef = sdef.replace("|", "<" "nowiki>|<" "/nowiki>")
        # [ don't need this: ? ]
        # sdef = sdef.replace("}}", "<" "nowiki>}}<" "/nowiki>")
        # breaks in prev strings are so that we can save this code on-wiki (!)

        recov = ''
        pres = ''
        rename = ''
        tag = ''

        # Identify each missing section
        need = { }
        if not iscyrl:
            if "==Bosnian==" not in text:
                need["bs"] = "Bosnian"
            else:
                print " Bosnian present"
                pres += ",Bosnian"
            if "==Croatian==" not in text:
                need["hr"] = "Croatian"
            else:
                print " Croatian present"
                pres += ",Croatian"
            if "==Montenegrin==" not in text:
                need["zls-mon"] = "Montenegrin"
            else:
                print " Montenegrin present"
                pres += ",Montenegrin"
        if "==Serbian==" not in text:
            need["sr"] = "Serbian"
        else:
            print " Serbian present"
            pres += ",Serbian"
        pres = pres.lstrip(',')

        # now some details: look for notes and context tags
        att = False
        # if there appear to be notes on the languages, we want to add attention templates
        st = s.replace("Serbo-Croatian", "")  # copy to prevent false match on Croatian
        if "Serbia" in st or "Bosnia" in st or "Croatia" in st: att = True  # matches "..ian" as well
        if "kavian" in s and not iscyrl: att = True  # dialect reference, probably Serbian
        if att: tag = 'tag'

        revs = ''
        for code in sorted(need):

            # Look for old section in entry history, restore it if found
            skey = code + ':' + srep(title)
            if skey in scache:
                if not scache[skey]: continue  # we've tried before, there isn't any
                oldsect = scache[skey]
            else:
                oldsect = None

            # when not cached, get revs if needed
            if not oldsect and not revs:
                ut = urllib.quote(title.encode(site.encoding()))
                while not revs:
                    print "(getting old revisions)"
                    revs = readapi(site,
                        "action=query&prop=revisions|info&rvprop=content|ids&format=xml"
                        "&titles=" + ut + "&rvlimit=100")
                    if '</api>' not in revs:
                        print "(incomplete return from api)"
                        revs = None
                        sleep(20)

            # look for most recent section for language in old revisions
            if not oldsect:
                mo = rerev[code].search(revs)
                if mo:
                    oldsect = wikipedia.unescape(mo.group(1))
                    oldsect = reiwiki.sub('', oldsect + '\n')
                # [ if we change fixups, we may have to purge cache ]
                scache[skey] = oldsect  # cache section or None

            if oldsect:
                if att:
                    oldsect += "\n{{attention|" + code + "|section restored, should be checked}}\n"
                text += "\n----\n==" + need[code] + "==\n" + oldsect
                act += 'recovered section for %s, ' % need[code]
                print " recovered %s section" % need[code]
                rt = ''
                # mark sections of special interest, might be a few other things as well
                if '=Pronunciation' in oldsect: rt = '*'
                if '=Quotations' in oldsect: rt = '*'
                if '\n#:' in oldsect or '\n#*' in oldsect: rt = '*'
                recov += "," + need[code] + rt
                continue  # did this code ;-)

            # end of loop
            continue

        recov = recov.lstrip(",")
        replines[title] = sdef + ' || ' + tag + ' || ' + pres + ' || ' + recov

        # If entry modded, save.
        if not act: continue
        act = act.strip(", ")
        probs += 1

        if test: wikipedia.showDiff(old, text)

        # add AF tag to re-sort the languages and fix all the spacing we didn't bother with
        text += "\n{{rfc-auto|SC cleanup}}\n"

        act = 'test/to be reverted presently: ' + act

        if test: continue

        try:
            page.put(text, comment=act)
        except Exception, e:
            print " some exception saving page", repr(e)

        fixed += 1

        # limit number of fixes for testing
        if fixed > 7: break

    print "%d entries, %d scs, %d problems" % (entries, scs, probs)

    # write out a report:
    rpage = wikipedia.Page(site, "User:Robert Ullmann/SC recovery/report")
    try:
        oldrep = getedit(rpage)
    except wikipedia.NoPage:
        pass

    testcom = ''
    if test:
        testcom = """''this is a test list of what would be done, no changes made'',
run from recent XML, not current DB"""
    if partial: testcom += ', only 5% of the entries checked on this test run'

    report = """
'''List of sections recovered/restored in entries with "Serbo-Croatian"'''

""" + testcom + """

Table is all entries with SC sections, the first defn line (truncated if long),
which of the standard languages are present, and which are (will be) recovered.
Entries that will need review are marked with 'tag'; they have attention tags added.

The entries are sorted in a simple way to bring the Serbian forms together for
convenience. A short bit of the first definition is shown, in case the word is not
entirely familiar.

Translations sections and tables are not yet considered.

{| class="prettytable"
! entry
! short definition
! tag
! languages present
! to be recovered
"""

    # try to sort on transliteration, to bring Serbian forms together:
    for t in sorted(replines, key = stlit):
        report += '|-\n| [[' + t + ']] || ' + replines[t] + '\n'

    report += "|}\n\n\n%d entries found\n" % scs

    try:
        rpage.put(report, comment = "writing report")
    except Exception, e:
        raise

    # done

if __name__ == "__main__":
    try:
        main()
    finally:
        scache.close()
        wikipedia.stopme()