User:Robert Ullmann/SC recovery
English Wiktionary recovery procedure for "Serbo-Croatian" entries
Objective: to restore language sections and content removed by several editors intent on merging Croatian, Bosnian, and Serbian into "Serbo-Croatian".
A report of what is to be done, and of what has been done so far, is at /report. At present it is a survey of the extensive damage that has been done to the wikt database.
Recover deleted sections
Using page history, find each of the standard language sections as it existed before being deleted, and restore it to the entry.
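As an illustration of that lookup, here is a minimal sketch that pulls past revisions straight from the live MediaWiki API with the requests library and scans them for a given language section. The full script below uses its own readapi helper instead; the function name old_sections and the revision limit here are illustrative assumptions, not part of the original tooling.

import re
import requests

API = "https://en.wiktionary.org/w/api.php"

def old_sections(title, lang="Serbian", limit=50):
    """Yield past copies of a language section, newest revision first."""
    resp = requests.get(API, params={
        "action": "query", "prop": "revisions", "rvprop": "content",
        "rvlimit": limit, "titles": title, "format": "json",
    }).json()
    # a section runs from its L2 header to the standard "----" separator (or end of page)
    pattern = re.compile(r"==%s==\n(.*?)(?:\n----|\Z)" % lang, re.S)
    for page in resp["query"]["pages"].values():
        for rev in page.get("revisions", []):
            mo = pattern.search(rev.get("*", ""))  # "*" is the legacy content key
            if mo:
                yield mo.group(1)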
Process steps and notes
This is a rough outline of the process used to implement the above; it is not complete.
- Read the XML dump and identify entries with missing language sections.
  - If an entry does not have a "Serbo-Croatian" language header, skip it.
- Get the current entry from the wikt.
- Identify each missing section.
- Look for the old section in the entry history and restore it if found (see the sketch after this list):
  - read a reasonable number of past revisions from the API, and find the most recent section for the language
  - add the section to the current entry text
  - tag the restored section if there is any indication from the SC section that it may need review (language or dialect tags, primarily)
- If the entry was modified by the above, save it with a comment explaining which sections were restored or created.
  - Tag the entry for AF to re-sort the languages and clean up the spacing.
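A minimal sketch of the restore-and-tag steps above, stripped of the pywikipedia framework that the full script in the next section depends on. The helper name restore_missing_sections and the trimmed language table are assumptions for illustration (the real script also handles Montenegrin); the section regex and the {{attention}} tag follow the same pattern as the real code.

import re

# standard languages to restore, keyed by language code
LANGS = {"bs": "Bosnian", "hr": "Croatian", "sr": "Serbian"}

def restore_missing_sections(text, old_revisions, flag_for_review=False):
    """Append the most recent old copy of each missing language section.

    text: current wikitext of the entry
    old_revisions: past revision texts, newest first
    """
    for code in sorted(LANGS):
        lang = LANGS[code]
        if ("==%s==" % lang) in text:
            continue  # section already present, nothing to restore
        pattern = re.compile(r"==%s==\n(.*?)(?:\n----|\Z)" % lang, re.S)
        for rev in old_revisions:
            mo = pattern.search(rev)
            if mo:
                section = mo.group(1)
                if flag_for_review:
                    # leave a marker so a human checks the restored section
                    section += "\n{{attention|%s|section restored, should be checked}}\n" % code
                text += "\n----\n==%s==\n%s" % (lang, section)
                break  # newest surviving copy wins
    return text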
Restoring the language sections that were improperly deleted is very straightforward; the only true error case is restoring a section that was deleted because the word actually does not exist in the given language.
Code
The code roughly follows the process above; it is a work in progress as we discover the extent of the damage done to the database and consider what should be done automatically to recover the deleted work. It is probably full of debug statements and tests, and is likely to be out of date at any given time. (;-)
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
English Wiktionary recovery procedure for "Serbo-Croatian" entries

Process:
    Read XML dump, identify entries with missing language sections.
    Re-read from live DB.
    Identify each missing section.
    Look for old section in entry history, restore it if found.
    If entry modded, save.

No command line arguments.
"""

import wikipedia
import xmlreader
import sys
import re
import xmldate
import shelve
import urllib
from time import sleep
from mwapi import *

def srep(s):
    return repr(u'' + s)[2:-1]

# section cache from reading old revs:
scache = shelve.open("screcover-cache")

import transliteration

def stlit(w):
    # does this work reasonably for Serbian? (is general Cyrillic) We only need sort keys;
    # appears to bring the entries together usefully in "ordinary" a-z order, that's all
    # we want; doing exact Cyrl-Latin for Serbian and sorting on the Latin order could
    # be done, but isn't needed; this is just the report
    lat = u''
    for c in w:
        lat += transliteration.trans(c, default = c)
    lat = lat.lower()
    # make into a sort key, defined order (by original):
    lat += ' ' + w
    return lat

def main():

    test = True
    partial = False

    resect = re.compile(r"==Serbo-Croatian==\n(.*?)\n----", re.S)

    # remove iwikis
    reiwiki = re.compile(r"^\[\[[a-z-]+:.*?\]\]\n", re.M)

    # find a def line in a section (just for report)
    redefn = re.compile(r'^#(.*)$', re.M)

    # match sections in old revs
    rerev = { }
    for c, lang in [ ('sr', 'Serbian'), ('hr', 'Croatian'),
                     ('bs', 'Bosnian'), ('zls-mon', 'Montenegrin') ]:
        rerev[c] = re.compile(r"==" + lang + "==\n(.*?)(\n----|</rev>)", re.S)

    # make sure we are logged in
    site = wikipedia.getSite()
    if not test: site.forceLogin()

    # get XML dump
    dump = xmlreader.XmlDump("en-wikt.xml")

    entries = 0
    scs = 0
    probs = 0
    fixed = 0
    replines = { }

    for entry in dump.parse():

        text = entry.text
        title = entry.title
        if ':' in title: continue
        entries += 1
        if entries % 10000 == 0:
            print "%d entries, %d scs, %d problems" % (entries, scs, probs)

        if "==Serbo-C" not in text: continue
        mo = resect.search(text + "\n----\n")
        if not mo: continue
        s = mo.group(1)
        scs += 1

        # Identify entries with missing language sections

        # we'll only want to add Serbian for Cyrillic entries
        iscyrl = False
        if ord(title[0]) >= 0x0400 and ord(title[0]) < 0x0530: iscyrl = True

        # select 5%
        if test and partial and scs % 20: continue

        # Retrieve current version
        print "%s:" % srep(title)
        # print " tlit", srep(stlit(title))

        page = wikipedia.Page(site, title)
        try:
            if not test: text = getedit(page)
        except Exception, e:
            print " some exception getting page", repr(e)
            continue

        act = ''
        old = text

        # re-isolate SC section:
        mo = resect.search(text + "\n----\n")
        if not mo:
            print " no SC section in current entry"
            continue
        s = mo.group(1)

        # report items:
        sdef = ''
        mo = redefn.search(s)
        if mo: sdef = mo.group(1).strip()

        # some ad hoc fixes
        sdef = sdef.replace("#English|", "")
        sdef = sdef.replace("{{", "{{i|")
        sdef = sdef.replace("|i|", "|")
        sdef = sdef.replace("|context|", "|")
        sdef = sdef.replace("|gloss|", "|")
        sdef = sdef.replace("|term|", "|")
        sdef = sdef.replace("|lang=sh}}", "}}")

        # then truncate
        if len(sdef) > 48: sdef = sdef[:45] + " ..."

        # should be safe now, unless truncate has split a {{ ... }}
        if "{{" in sdef and "}}" not in sdef:
            sdef = sdef.replace("{{", "<" "nowiki>{{<" "/nowiki>")
            sdef = sdef.replace("|", "<" "nowiki>|<" "/nowiki>")
        # [ don't need this: ? ]
        # sdef = sdef.replace("}}", "<" "nowiki>}}<" "/nowiki>")
        # breaks in prev strings are so that we can save this code on-wiki (!)

        recov = ''
        pres = ''
        rename = ''
        tag = ''

        # Identify each missing section
        need = { }
        if not iscyrl:
            if "==Bosnian==" not in text:
                need["bs"] = "Bosnian"
            else:
                print " Bosnian present"
                pres += ",Bosnian"
            if "==Croatian==" not in text:
                need["hr"] = "Croatian"
            else:
                print " Croatian present"
                pres += ",Croatian"
            if "==Montenegrin==" not in text:
                need["zls-mon"] = "Montenegrin"
            else:
                print " Montenegrin present"
                pres += ",Montenegrin"
        if "==Serbian==" not in text:
            need["sr"] = "Serbian"
        else:
            print " Serbian present"
            pres += ",Serbian"
        pres = pres.lstrip(',')

        # now some details: look for notes and context tags
        att = False
        # if there appear to be notes on the languages, we want to add attention templates
        st = s.replace("Serbo-Croatian", "")  # copy to prevent false match on Croatian
        if "Serbia" in st or "Bosnia" in st or "Croatia" in st: att = True  # matches "..ian" as well
        if "kavian" in s and not iscyrl: att = True  # dialect reference, probably Serbian
        if att: tag = 'tag'

        revs = ''
        for code in sorted(need):

            # Look for old section in entry history, restore it if found
            skey = code + ':' + srep(title)
            if skey in scache:
                if not scache[skey]: continue  # we've tried before, there isn't any
                oldsect = scache[skey]
            else:
                oldsect = None

            # when not cached, get revs if needed
            if not oldsect and not revs:
                ut = urllib.quote(title.encode(site.encoding()))
                while not revs:
                    print "(getting old revisions)"
                    revs = readapi(site,
                        "action=query&prop=revisions|info&rvprop=content|ids&format=xml"
                        "&titles=" + ut + "&rvlimit=100")
                    if '</api>' not in revs:
                        print "(incomplete return from api)"
                        revs = None
                        sleep(20)

            # look for most recent section for language in old revisions
            if not oldsect:
                mo = rerev[code].search(revs)
                if mo:
                    oldsect = wikipedia.unescape(mo.group(1))
                    oldsect = reiwiki.sub('', oldsect + '\n')
                # [ if we change fixups, we may have to purge cache ]
                scache[skey] = oldsect  # cache section or None

            if oldsect:
                if att:
                    oldsect += "\n{{attention|" + code + "|section restored, should be checked}}\n"
                text += "\n----\n==" + need[code] + "==\n" + oldsect
                act += 'recovered section for %s, ' % need[code]
                print " recovered %s section" % need[code]
                rt = ''
                # mark sections of special interest, might be a few other things as well
                if '=Pronunciation' in oldsect: rt = '*'
                if '=Quotations' in oldsect: rt = '*'
                if '\n#:' in oldsect or '\n#*' in oldsect: rt = '*'
                recov += "," + need[code] + rt
                continue  # did this code ;-)

            # end of loop
            continue

        recov = recov.lstrip(",")
        replines[title] = sdef + ' || ' + tag + ' || ' + pres + ' || ' + recov

        # If entry modded, save.
        if not act: continue
        act = act.strip(", ")
        probs += 1

        if test: wikipedia.showDiff(old, text)

        # add AF tag to re-sort the languages and fix all the spacing we didn't bother with
        text += "\n{{rfc-auto|SC cleanup}}\n"

        act = 'test/to be reverted presently: ' + act

        if test: continue

        try:
            page.put(text, comment=act)
        except Exception, e:
            print " some exception saving page", repr(e)

        fixed += 1

        # limit number of fixes for testing
        if fixed > 7: break

    print "%d entries, %d scs, %d problems" % (entries, scs, probs)

    # write out a report:
    rpage = wikipedia.Page(site, "User:Robert Ullmann/SC recovery/report")
    try:
        oldrep = getedit(rpage)
    except wikipedia.NoPage:
        pass

    testcom = ''
    if test:
        testcom = """''this is a test list of what would be done, no changes made'',
run from recent XML, not current DB"""
    if partial: testcom += ', only 5% of the entries checked on this test run'

    report = """
'''List of sections recovered/restored in entries with "Serbo-Croatian"'''

""" + testcom + """

Table is all entries with SC sections, the first defn line (truncated if long),
which of the standard languages are present, and which are (will be) recovered.
Entries that will need review are marked with 'tag'; they have attention tags added.

The entries are sorted in a simple way to bring the Serbian forms together for
convenience. A short bit of the first definition is shown, in case the word is not
entirely familiar.

Translations sections and tables are not yet considered.

{| class="prettytable"
! entry
! short definition
! tag
! languages present
! to be recovered
"""

    # try to sort on transliteration, to bring Serbian forms together:
    for t in sorted(replines, key = stlit):
        report += '|-\n| [[' + t + ']] || ' + replines[t] + '\n'

    report += "|}\n\n\n%d entries found\n" % scs

    try:
        rpage.put(report, comment = "writing report")
    except Exception, e:
        raise

    # done

if __name__ == "__main__":
    try:
        main()
    finally:
        scache.close()
        wikipedia.stopme()