User:Scsbot

From Wiktionary, the free dictionary

This is a bot account associated with User:Scs.

This account does not have the bot bit set. (That is, it has not received formal approval as an official Wiktionary bot.) I use this account for testing, for semiautomated (with manual oversight) editing, and as a container for this bot's source code and documentation. Approved bots using the same framework are User:ScsRhymeBot and User:ScsHdrRewrBot.

Description

This is a wiki edit bot written as a shell script. If you're not familiar with shell scripting, none of the rest of this may mean much. If you're familiar enough with shell scripting to say, "A web edit bot as a shell script? That's preposterous!", read on.

First of all, it's not all written in shell, of course. Most of the real work is done by various C-written programs, some of them stock Unix apps, some of them custom tools of my own design. As I am fond of saying, shell programming involves "finely-honed but relentlessly general-purpose tools, combined perhaps with one or two new, special-purpose (and small) C programs, operating on plain text files, held together with appropriate dabs of shell script glue."

Of course, at a low level, editing a wiki via its web interface involves fetching an HTML page, filling out a form (typically including an HTML textarea which is the edit box), and submitting that form. And that's exactly what this bot script (or any wiki bot) does.

HTML fetching is done by a tool of mine called httpget, which is basically a command-line callable interface to the HTTP protocol. httpget is similar to cURL, wget, and lynx -dump -source. It has two features which are important for this application: (1) it can post data, and (2) it can maintain cookies across calls (just as a real web browser would maintain cookies across page fetches during a session).
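
As a rough illustration of those two features, here is a minimal sketch that uses curl as a stand-in for httpget (httpget's own options are not shown on this page). The function names, the cookie file name, and the DRY_RUN switch are all hypothetical, introduced only for this example.

```shell
#!/bin/sh
# Stand-in for httpget built on curl. This is an assumption for
# illustration, not httpget's actual interface.
COOKIE_JAR=${COOKIE_JAR:-./cookies.txt}   # hypothetical cookie file

# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# fetch_page URL: GET a page, sending and saving session cookies.
fetch_page() {
    run curl -s -b "$COOKIE_JAR" -c "$COOKIE_JAR" "$1"
}

# post_form URL NAME=VALUE...: POST url-encoded fields with the same cookies.
post_form() {
    url=$1
    shift
    # Rewrite each NAME=VALUE argument as "--data-urlencode NAME=VALUE".
    for field in "$@"; do
        set -- "$@" --data-urlencode "$field"
        shift
    done
    run curl -s -b "$COOKIE_JAR" -c "$COOKIE_JAR" "$@" "$url"
}
```

Because both functions read and write the same cookie file, a login performed with post_form carries over to later fetch_page calls, much as it would in a browser session. Setting DRY_RUN=1 prints the curl command instead of running it.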

Having fetched a wiki edit page, the script next prepares the edit form within it. This is done by a set of auxiliary shell scripts with names like formsetup and formselect. These auxiliary form-handling scripts are relatively "thin"; most of their work is performed by another C-written program, xmlsed, which is an as-yet rudimentary implementation of a planned general-purpose SGML/HTML/XML manipulator. xmlsed can do things like analyze the structure of an HTML/XML document, fetch the contents of certain fields (that is, the content between requested <x>...</x> tags), etc.
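
To make the extraction step concrete, here is a minimal sketch that pulls the contents of the edit box out of a fetched page using plain sed rather than xmlsed (whose interface is not shown here). It assumes the page contains exactly one textarea and that the opening and closing tags sit on different lines; the function name extract_textarea is hypothetical.

```shell
#!/bin/sh
# extract_textarea FILE: print the text between <textarea ...> and
# </textarea>, decoding the few HTML entities an edit box normally uses.
# Assumes one textarea whose open and close tags are on different lines.
extract_textarea() {
    sed -n '/<textarea[^>]*>/,/<\/textarea>/p' "$1" |
    sed -e 's/.*<textarea[^>]*>//' \
        -e '/^<\/textarea>/d' \
        -e 's/<\/textarea>.*//' |
    sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&quot;/"/g' -e 's/&amp;/\&/g'
}
```

The entity for the ampersand is decoded last, so that text such as `&amp;lt;` comes out as `&lt;` rather than `<`. A real structure-aware tool would not need the line-boundary assumption made here.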

The meat of the editing then takes place on one form element: the HTML textarea that serves as the edit box. By this stage in the processing, the form-handling scripts have extracted the contents of the edit box (which is of course the to-be-edited wikitext) into a separate temporary file. Further processing will be done on that file, in The Unix Way ("Everything's a file").

The editing of the text box takes place in four distinct steps:

  1. A check script looks at the text to make sure it's a candidate for editing. This script might see whether the contemplated edit has already been made and therefore doesn't need doing. For example, if we're using this bot to add Rhymes: links to pronunciation sections, the check script might check to see whether there's already a Rhymes: link. Also, the check script might make sure that the page text is in the canonical form expected by the edit script (below).
  2. An edit script then performs the actual edits. This edit script might do just about anything, based on the task at hand. It might be a simple invocation of an off-the-shelf Unix tool, such as sed. Or it might involve all sorts of ad-hoc processing to look for ==Section headers== so as to add new text in the right section.
  3. Optionally, the script double-checks that the differences between the old and the edited text are as expected. If this step is performed, the bot literally presses the "Show changes" button (well, not literally, but effectively), and analyzes the result. (Currently the "analysis" is simpleminded, involving just a count of the number of deleted and inserted lines.)
  4. Optionally, a check script double-checks that the edited text was edited as expected, without any unexpected changes. If this step is performed, the bot presses the "Show preview" button, and passes the result to a task-specific check script.
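
The first three steps can be sketched on a local file as follows. This is only an illustration under stated assumptions, not the wikised source: the breadcrumb text, the function names, and the use of mv in place of form submission are all hypothetical, and the preview check (step 4) is left out because it needs a round trip to the wiki.

```shell
#!/bin/sh
# Sketch of the check / edit / verify-diff steps on a local file.
# The breadcrumb text and all function names are hypothetical.
CRUMB='[[Rhymes]] : [[Rhymes:English|English]]'   # placeholder line

# Step 1: succeed only if the page still needs the edit.
check_page() {
    ! grep -qF "$CRUMB" "$1"
}

# Step 2: write the edited text to a new file (here: prepend one line).
edit_page() {
    { printf '%s\n' "$CRUMB"; cat "$1"; } > "$2"
}

# Step 3: verify_diff OLD NEW DELETED INSERTED
# Count deleted and inserted lines; accept only the expected counts.
verify_diff() {
    del=$(diff "$1" "$2" | grep -c '^<' || true)
    ins=$(diff "$1" "$2" | grep -c '^>' || true)
    [ "$del" -eq "$3" ] && [ "$ins" -eq "$4" ]
}

# Driver: a failed check leaves the page untouched.
process_page() {
    page=$1
    edited=$page.new
    check_page "$page" || { echo "skip: already done"; return 0; }
    edit_page "$page" "$edited"
    if verify_diff "$page" "$edited" 0 1; then
        mv "$edited" "$page"    # stands in for submitting the form
        echo "edited"
    else
        rm -f "$edited"
        echo "rejected: unexpected diff"
        return 1
    fi
}
```

The point of the design is that the edit script never writes to the page directly. Its output is only accepted after an independent comparison against the original, so a buggy edit script produces a rejection rather than a bad edit.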

Finally, another of the form-handling scripts, formsubmit, invokes httpget again to actually submit the edited text.

There are also various other checks along the way to make sure that the page being edited exists at all, that the bot user is not blocked, that the edit did not result in an edit conflict, etc.

Because of the extra double- (and triple-) check steps, this bot should be trustworthy and reliable. If anything unexpected happens, either because a particular page doesn't have the form that an edit script is written to expect, or because bugs in an edit script or the bot architecture goof up the page text somehow, one of the checks should catch the mistake, and prevent the faulty edit from being committed. (I've been testing this bot on my home test wiki, and the double-checks have already caught lots of unexpected and/or "impossible" situations. In other words, although I can't prove that the script doesn't have bugs, I am confident that if it does have any more bugs, the double-checks will catch them.)

It must be noted that this editing framework is generic and relatively low-level. There is no built-in knowledge of wiki markup specifics, let alone Wiktionary editing conventions. There is no high-level way to say, for example, "edit the Pronunciation section"; an edit script which wishes to do so must know how to find the pronunciation section itself, typically by looking for the ===Pronunciation=== header. (In time I'll probably develop boilerplate or idioms to abstract some of these low-level details away, perhaps sufficiently so that eventually there will be a high-level way to say "edit the Pronunciation section".)
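
For example, an edit script that wants to add a line to the Pronunciation section has to locate the section itself. The following is one possible sketch using awk, not a piece of the actual bot: the function name add_to_section is hypothetical, and it assumes that the header appears exactly once, on a line of its own, and that any later line beginning with `==` ends the section.

```shell
#!/bin/sh
# add_to_section FILE HEADER TEXT: print FILE with TEXT appended at the end
# of the section that starts at the exact HEADER line. Blank lines that
# precede the next header stay after the inserted text.
# Exit status is nonzero if HEADER was not found.
add_to_section() {
    awk -v header="$2" -v text="$3" '
        function emit_blanks(   i) {
            for (i = 0; i < blanks; i++) print ""
            blanks = 0
        }
        insec && /^$/  { blanks++; next }          # hold back blank lines
        insec && /^==/ { print text; insec = 0 }   # next header ends section
        { emit_blanks(); print }
        $0 == header   { insec = 1; found = 1 }
        END { if (insec) print text; emit_blanks(); exit !found }
    ' "$1"
}
```

A call such as `add_to_section page.txt '===Pronunciation===' '* Rhymes: ...'` writes the edited text to standard output. The nonzero exit status when the header is missing gives a check script something to test before the edit is accepted.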

Furthermore, this bot has no built-in machinery for discovering on its own which edits it should make. It always works from an explicit list, externally supplied. (I have various other scripts which help me build the work lists that the bot then chews on. Those auxiliary scripts aren't documented here yet.)
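
Driving an edit from such a list might look like the sketch below. The list format (one page title per line, with blank lines and `#` comments skipped) and the function name run_worklist are assumptions made for this example; the actual work-list format is not documented here.

```shell
#!/bin/sh
# run_worklist LISTFILE COMMAND...: run COMMAND once per page title in
# LISTFILE, passing the title as the last argument, and report a tally.
run_worklist() {
    list=$1
    shift
    ok=0
    failed=0
    while IFS= read -r title; do
        case $title in
            ''|'#'*) continue ;;      # skip blank lines and comments
        esac
        # Redirect stdin so COMMAND cannot consume the rest of the list.
        if "$@" "$title" < /dev/null; then
            ok=$((ok + 1))
        else
            failed=$((failed + 1))
            echo "FAILED: $title" >&2
        fi
    done < "$list"
    echo "$ok ok, $failed failed"
}
```

Failures are logged and counted rather than stopping the run, so one problem page does not block the rest of the list.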

So far this bot has been used (on a limited scale) for two tasks:

  1. add an appropriate Rhymes: link to the Pronunciation section of a page without one
  2. add a "breadcrumb" link trail to the rhymes pages (see e.g. Rhymes:English:-ɛmɪt and Rhymes:English:-eɪzɪŋ).

It has performed several hundred of each of these so far, under my supervision. (I manually double-checked each of its first few dozen edits, and I've spot-checked the remainder.)

Source code, utilities & tools, and additional documentation

The bot proper is this shell script, "wikised"; you can also read its more detailed documentation.

Actual examples of specific editing scripts (namely, the two rhyme-related ones used so far).

Documentation and source code link for httpget.

Documentation and source code link for xmlsed.

Documentation and source links for auxiliary form-handling scripts.

Besides httpget and xmlsed, there are several other custom tools I tend to use in scripts such as wikised. (Examples are "line" and "column".) You can find documentation and source code for all of them at http://www.eskimo.com/~scs/src/.