Wiktionary:Beer parlour/2006/November

From Wiktionary, the free dictionary
This is an archive page that has been kept for historical purposes. The conversations on this page are no longer live.

Multilingual statistics

The latest statistics comparing various Wiktionary projects are now posted on Meta. The Vietnamese Wiktionary (208527 entries) has surpassed French (208214 entries), and is now the second largest project after English. The most amusing statistic is the Yiddish Wiktionary, with the second largest number of admins and the highest Administrator-to-entry ratio (29 admins to 165 entries). --EncycloPetey 00:11, 1 November 2006 (UTC)

I'm not too impressed with the Vietnamese wikt in comparison; almost all the entries are generated by 'bot, with machine-translated definitions (quality unknown to me ;-). It certainly would be very useful to a speaker of Vietnamese, but compared to what the en.wikt does for entries? ("Piedbot" does do some cool things: if you add a picture to an entry here, it will presently appear on the corresponding entry there ...) If you look at the stats, note there is an average of ~1.5 edits per entry; that will tell you something. Robert Ullmann 12:19, 1 November 2006 (UTC)
I thought it might have been something like that. This tends to confirm something else I've been suspecting: entrycountitis is potentially just as serious a problem as editcountitis.
I urge people to focus on the quality of our project, not the quantity of entries. Ideally, just don't even look at the entry count at all, let alone start comparing it to sister projects. In an ideal world, the list at http://meta.wikimedia.org/wiki/Wiktionary would be sorted alphabetically. —scs 12:44, 1 November 2006 (UTC)
Alphabetically?? Does English come before or after Ελληνικά? (I'm thinking more of a slowly spinning globe or something, but) as to your point, it's a good one. DAVilla 13:25, 1 November 2006 (UTC)
A logo representation of a statistics table? Oh my. --Connel MacKenzie 17:01, 1 November 2006 (UTC)
Number games are not completely irrelevant, but do tend to be over-emphasized. I think the Vietnamese Wiktionary jump may indicate something may be amiss there, but if the contributors at vi.wikt: are happy with it, more power to them. The statistics provide a means for measuring progress - in theory, all language Wiktionaries will eventually have the same number of entries (approximately.) So, "France has more than English" statements should be dismissed, as they miss the point. --Connel MacKenzie 17:01, 1 November 2006 (UTC)
And when all said and done, it'd really make more sense counting the number of language headers than the number of pages with a link... :) \Mike 08:19, 2 November 2006 (UTC)
But subtracting those where the only POS is "verb form", I think. And where the only definition is "plural of", or "comparative/superlative of". And...
(There really is no end to this; it's an impossible inquiry. And it reminds us of the old question: "How many words are there in the English [or any other] language?" To which the answer, of course, is either "infinitely many" or "it's a meaningless question".) —scs 13:23, 2 November 2006 (UTC)
I don't want to leave the impression that I was being negative about the Vietnamese wikt's approach; I was merely pointing out that the numbers aren't comparable.
I've noted a number of the wikts other than en do a lot of importing entries (from en in particular). This is good in making a lot of words accessible to the native speakers of that wikt's language. For example, I can transport entries in Japanese, Mandarin, Min Nan etc. from here to the rw.wikt, and have reasonably good entries for the trouble of translating one definition line into Kinyarwanda (it does take some knowledge of all the languages involved to do it well). The fr wikt is also used as a source, but it also copies from en: the Han character ("sinogram") entries in the fr wikt were copied from ours, reformatted in a manner very similar to what I am doing here. (But they didn't fix "Morobashi", they have 17.673 references there ...) ko and ru also seem to have copied some of those entries. And so on.
If you look at the zh (Chinese/Mandarin) wikt, you will note that they have entries for zh:father, zh:père, zh:padre, zh:vader, zh:tata, and so on all of which are (correctly) defined as 父亲, but zh:父亲 doesn't exist. You would think it would? It seems that some of the wikts are much more about language interface to the data from (e.g.) the en wikt than they are about defining their own language? Robert Ullmann 16:03, 2 November 2006 (UTC)
I wouldn't suggest that we model our activities after zh:, but I think what they've done is admirable. On some level, that approach is possibly more useful than describing the basic terms that everyone already knows. But I think your observation is a reflection of individual contributors' interests, more than any policy thing. We have just as varied interests among our contributors - some people are good at different tasks. Some are best focusing on pronunciations, others putting all their efforts into etymologies, others that verify uses, others that use their programming skills to import other sources, others that fight vandalism, others that translate for specific languages. Thankfully, we are finding ways to have all of these talented people work together in a coherent manner. Seeing that cross over to other language Wiktionaries is a very positive development. I am heartened and encouraged, by seeing such efforts working. If the vi: Wiktionary streamlines the automation techniques for importing entries, they may very well pave the way for automated correction of those entries later, too. The statistics will diminish in importance as people realize that eventually, all wikt:'s will have (approximately) the same number of entries. --Connel MacKenzie 18:01, 6 November 2006 (UTC)

template

moved to WT:GP#Template:simple past of. --Connel MacKenzie 18:49, 1 November 2006 (UTC)

logo vote concluded

See m:Wiktionary/logo for the result. Someone should remove the "new logo for all Wiktionaries is being voted on at meta" banner. —scs 00:36, 2 November 2006 (UTC)

Wow...that is a horrible choice...I hope we don't adopt it here. - [The]DaveRoss 00:48, 2 November 2006 (UTC)
I think it looks pretty good. —Stephen 06:04, 2 November 2006 (UTC)
I think repeating my disgust at the "voting" procedure would be redundant, right? --Connel MacKenzie 08:01, 2 November 2006 (UTC)
Nice preterition. :-) —scs 13:15, 2 November 2006 (UTC)
I don't know about "have to", but why push the issue? Yes, we should adopt it here. Or rather, we should adapt it here. Any ideas on what could be changed? No reason why the W has to be tilted, and I've always said I wanted it closer to the viewer. I personally don't like the happy face character, and I don't care if it really is Japanese. Color scheme is okay by me, but it doesn't look authentic. No grain for the pattern and no depth to the carving. Maybe someone could construct them out of wood and simply photograph it? Or at least model it in 3D. Please! DAVilla 12:56, 3 November 2006 (UTC)[reply]
You wouldn’t see the Japanese happy face unless you were visiting the Japanese wiktionary. The Japanese wiktionary won’t do you much good unless you can read Japanese at least a little...and if you can read Japanese, the シ is not a happy face, but merely the letter shi. —Stephen 17:09, 3 November 2006 (UTC)
And shi is the reading of 四 (four) which is considered unlucky because it is also the reading of 死 (death) ... some happy face ... (do note though that 四 and 死 are read in hiragana し, not katakana シ) Robert Ullmann 20:38, 6 November 2006 (UTC)
Doesn't look "authentic"? Authentic what? Oh you mean those familiar things? Don't even mention that, or Hasbro's trademark lawyers will put this logo out of its misery in a New York minute. Robert Ullmann 20:28, 6 November 2006 (UTC)

Of course, the whole procedure was designed to carefully skip the question of whether a new logo was wanted at all. It is therefore completely meaningless. If we are to have a new logo for the wikts, in my opinion it requires a vote with at least 2000 support votes and little opposition. Robert Ullmann 15:38, 2 November 2006 (UTC)

It was possible to vote for keeping the present logo. Ncik 10:51, 3 November 2006 (UTC)
No, it really wasn't. The description of the alleged "approval voting" system was wrong: it said to vote for the one you wanted, whereas in approval voting you vote for all of the ones you want. The description of the current logo was wrong (it showed bogus reduced sizes e.g. for the favicon: look at your URL bar right now and you'll see that is a W, not a smashed version of the full logo). Most importantly, the notices didn't say "if you don't want a new logo, go vote for the existing one". The argument that it was "possible" is completely bogus. Robert Ullmann 11:28, 3 November 2006 (UTC)
FWIW, there were I think three rounds of voting, and in the first two, you could vote for as many logos as you wanted. And in at least two of the rounds, there were people voting for keeping the existing logo, but no, that section wasn't prominent. Whether that possibility received little attention because it was hidden, or few people favored it, I couldn't say. —scs 14:02, 3 November 2006 (UTC)
The current logo was on the table when I voted, but I, like most others, liked and voted for the one that eventually won. There have, by the way, been several notices here in Wiktionary to go there and vote. It was well publicized. —Stephen 17:13, 3 November 2006 (UTC)
It is also worthy of mention that, being generous, 20 members of the en.wiktionary community voted for the winning logo. We have over 20,000 registered users, and an active community of over 1000. 20 votes is so far from anything which could be called a consensus that it would be pretty wrong to adopt the logo. I also wonder where the outrage over the present logo was voiced that caused us to try and change it? It wasn't actually a Wiktionarian who proposed the vote... - [The]DaveRoss 21:55, 3 November 2006 (UTC)

It is ironic to note that the meta: effort to generate a new Wiktionary logo was the direct result of conversation(s) here on the en.wikt: Beer Parlour. But by the second round (which had already eliminated the current logo as a choice) the remaining logos offered only one single improvement: the elimination of the incorrect "Queen's" pronunciation...but in all other regards were soundly inferior. It is no wonder that despite dozens of notices, so little participation from here was felt on meta:. The various cutoffs were uniformly ill-timed. The German Wiktionary has an icon on their front page that, to me, still seems vastly superior to any of the meta: logos. But due to "meta: irregularities" never got the opportunity to even be considered. --Connel MacKenzie 17:47, 6 November 2006 (UTC)

Characters or words?

I can't figure out how to find out if this has been discussed before. I've posted this content at the Information Desk, without getting any replies:

I had a look at a few Chinese and Japanese words today and I am rather disturbed by the conceptual muddling that seems to be taking place.

I am well aware that in traditional Oriental lexicography each Chinese character gets an entry of its own, followed by a list of 'compound words' in which the character occurs. (This kind of dictionary is called a Kanwa Jiten in Japanese). But this particular conception of Chinese characters clashes with a second, word-based approach to lexicography. For instance:

The word 贵 guì in Chinese means 'expensive, dear'. In Japanese the equivalent word is 高い takai.
The character 贵 guì (traditional form: 貴) is used in all languages that use Chinese characters, but the actual usage of the character differs somewhat according to the language. In Chinese, 貴 is (1) a word meaning expensive; (2) it also occurs in various character compounds with meanings like 'expensive, valuable' (e.g. 贵公司, your esteemed company). In Japanese, on the other hand, 貴 is not used on its own as a word meaning 'dear', although it is used in character compounds with the meaning 'costly' or 'expensive'. (By the way, this does not include 貴公司, which is purely Chinese; the equivalent Japanese word is actually 御社).

I do really feel that the rampant mixing of character and word-based approaches leads to confusion. The two functions need to be kept separate somehow. That is, the Chinese word 贵 needs to be listed at its own entry meaning 'expensive'. And the Chinese character 貴/贵 needs to have its own listing according to the tradition of Oriental lexicography.

A link from the English word 'expensive' or 'dear' should lead to the word 贵 in Chinese and the word 高い in Japanese, etc. In the current situation, 'expensive' leads directly to the character 贵, and there is no independent entry for the Chinese word 贵.

Has any thought been given to this problem in Wikipedia?

30 October 2006: Bathrobe (Wikipedia user name)

Well, this is the wiktionary, not wikipedia. The character and the word would appear on the same page; just as words in the same or different languages appear on the same page if they have the same spelling. The links in expensive are intended to link to the words 贵 and 高い. It is just that no-one has added the word entry to the page yet.
Look at for what a fairly complete page looks like. Robert Ullmann 15:31, 2 November 2006 (UTC)[reply]
Yes, I use these one-character pages all the time and I don’t think they are confusing at all. For the basic CJK character, the "common" meaning should be the classical Chinese sense. Then below the CJK character, there are (or should be) separate entries for Modern Mandarin, Japanese, Korean, Cantonese, etc. It’s the same thing that we do with letters of alphabets, such as Cyrillic в, which is also a word as well as an abbreviation in a number of languages. —Stephen 17:24, 3 November 2006 (UTC)
To my knowledge, we are attempting something at Wiktionary that has not been done before with respect to Chinese characters. We are (slowly, and in fits and starts) in the process of documenting everything important about any single Chinese character (etymology, usage in every language and dialect modern and ancient, encoding information, indexing information, example sentences etc) all on the same page! We have spent the last month or so perfecting a sound prototype of what such a page might look like (we used for the prototype). We now have a handful of contributors who have a very good sense of what needs to be done. Some of it can be done with computer technology (reformatting from existing online resources etc). It will take much longer to put in the information that a computer cannot supply. This will involve language experts typing in the information by hand over a period of many years. The length of time involved probably sounds strange in our age of instant data access. Not to worry though, I believe that building wiktionary will work a lot like a financial investment. At first, you start out with a small amount of money which does not grow very fast. However, as time progresses and the amount of money grows, compound interest makes the investment grow at an ever more rapid pace (we hope :). I still hold out hope that something like this will happen with Wiktionary. It may take another several years, but I believe that one day Wiktionary will acquire a critical mass of talented contributors. At that point, we should be able to make much more rapid progress than is currently possible. The key thing is for the current crop of steady contributors to lay down a solid foundation that can be built upon when that time comes. This type of work has been proceeding apace for the last several months.
In conclusion, you may see a lot of pages on Wiktionary that are poorly formatted or contain little or no useful information. But that does not mean that those pages will always be that way. The power of Wiktionary is that someone can eventually come along and fix all that.

A-cai 03:45, 4 November 2006 (UTC)

Verification and Google

To what extent can Google (as opposed to Google Books) be used as a source for attestation?

There are occasions when a word or phrase is well known to you but may not have made it into the dictionaries for various reasons. You may know well that the word or phrase has been around for considerably over a year.

Under these circumstances is it/should it be permissible to simply cite 'Google' as a reference if Google provides at least several hundred different sites that use the target term in a consistent and intelligent manner?

The specific example that brings this to mind is an entry I added for "sheek kebab". This was tagged as 'rfv', but although I know beyond reasonable doubt that the term exists on many thousands of restaurant menus and is known to millions of people, I was also pretty sure, even before I looked, that it would not be in the OED.

Googling the term produces around 11,000 hits and dropping into a few on the first ten pages shows that the expression is widely used in a consistent manner. There must come a point where the sheer number of Google hits indicates that a term is legitimate and has had currency for a reasonable amount of time. Moglex 17:30 (approx), 2 November 2006 (UTC)

I added rfv, as I've never seen that spelling, to refer to that concept. Perhaps it is common in different regions. At any rate, yes, you can comment on WT:RFV "clearly widespread use" if that is clear beyond a doubt. (I.e., if it was just a poor request for verification.) Considering that I see only four hits here, I'd say the term's spelling/validity is dubious. Please don't forget that most short (5-30 character) strings of random characters get more "straight" google hits than that. Citations that are found in print sources are included on such entries to show that the term really is used with that meaning. Simply pointing at google is insufficient, particularly when such results change so dramatically from day to day. --Connel MacKenzie 18:08, 2 November 2006 (UTC)
I fully appreciate that "found in Google" does not equate to "suitable for inclusion". What I am saying is that there must surely be a point at which a simple Google (not books) search that returns x thousand entries (or x hundred entries from suitably different sources) would have to indicate widespread usage. As I said, it is also necessary to dip into a few of the pages returned to ensure that the sense is consistent and that you are not just being duped by, for example, some tosh that is quoted in a lot of blogs. This was just an example, but the simple search returns ATTOW 12,100 hits from sites around the world, all indicating consistent use of the term. I wasn't complaining about the particular rfv, but it is a good example of a term that is not in OED/MW, but is in widespread use according to Google (and the contributor's foreknowledge which in this case is rather hard to prove). Moglex 18:33, 2 November 2006 (UTC)
Well, I don't think we've found a suitable number of "straight" google hits, to date. At one time, a comparative measure was suggested, but that too, was rejected (in light of numerous examples that demonstrated that technique's inadequacy.) By the way, I was suggesting that 11,000 google hits (not the four books.google hits) is comparable to searching for random strings of characters. Lastly, I wish to emphasize that I think this is a regional spelling difference; all the more reason to have solid print citations in the entry. --Connel MacKenzie 18:47, 2 November 2006 (UTC)
"I was suggesting that 11,000 google hits (not the four books.google hits) is comparable to searching for random strings of characters"  ????
Have you actually tried searching for random strings of characters? Unless your chosen string just happens to coincide with a real word you'll get vastly fewer than 11000. For example "sheex" = ~500 "sheec" = ~ 370.
Also, as I pointed out, it requires a little intelligent investigation of the resulting pages. In this case the first hit is from a Nepalese site, the second a fairly cosmopolitan site, third British. There is even a site from the Napa Valley on the first page of results. So it's not a 'local' spelling variation.
But I don't want to get bogged down on this particular example. I had a similar problem with a requested entry "No harm, no foul" that I had been aware of for some time and which returned ~276000 hits, again, upon investigation, indicating that the term is widely used.
It just seems a pity that it may become pointless adding quite valid entries where Google shows many thousands of sites using the term in a reasonably consistent way, but where no one can provide concrete evidence because of the way the term has been used. — This unsigned comment was added by Moglex (talk • contribs).
But that is precisely why citations are preferred. (BTW, some silly examples: [1] [2] [3].) I do not see what you are getting at with your last point. Some terms are harder to filter out for, but adding related terms or more common forms often helps narrow it down. And there are several contributors who will sometimes militantly defend an entry listed on RFV, by providing the citations that are desired initially, for truly uncommon entries. In my opinion, we allow far too many "urban dictionary-isms" as it is; we should be seeking ways to strengthen our criteria, not ways to weaken it further. (I acknowledge that mine is an unpopular opinion.) --Connel MacKenzie 19:35, 2 November 2006 (UTC)[reply]
OK, maybe I'm getting a rather distorted view of the whole procedure because I seem to be specialising in Indian food at the moment, and that is a minefield with the freely transliterated names and recipe differences as the dishes are copied around the world. The main gist of my question remains, though. If there are a large number of Google hits from a wide variety of well produced and disparate sites, should this not be enough to attest the widespread usage of a term?

As anybody who's spent any time working on word lists and dictionaries has discovered, trying to decide which "words" are and aren't worthy of inclusion is a thorny, thorny problem.

There are precious few hard-and-fast rules, here or anywhere. A good deal of old-fashioned judgement is required. And you can't place too many limits on your list of "approved" or "official" citation sources, because there are plenty of real words that never show up in print, or are never spoken, or are never used on web pages that google can find, or whatever.

Me, I'd say that regular google searches can be a useful component of the decision on whether to include a word. When we say that we "can't come up with an absolute threshold", it doesn't mean that we can't use google searches at all, but rather, that there's no one, hard, objective cut-off or threshold above which a word is "definitely a word worthy of inclusion" and below which it's "obviously a typo or keyboard banging or urbandictionary cruft".

On the one hand, you can (as Connel demonstrated) get huge numbers of hits for seemingly random or garbage searches. And on the other hand, you can get remarkably few hits for obscure but bona-fide terms. My current favorite is "oilcan dent", a fine old term which I learned while doing autobody work with my father, but which must be somewhat obsolete, because it gets all of seven google hits today, which is getting down into the googlewhack range.

Certainly, too, as Moglex aptly noted, you have to look at what sorts of things the google-found pages are saying, not just how many of them there are.

(As to the specific example: if you'd asked me yesterday, I would have said that sheek kebab was "obviously" a typo for shish kebab. But now, after looking at the google hits, I have to say, it's clearly a real name of a real dish. Learn something new every day.) —scs 21:02, 2 November 2006 (UTC)

I thought that it was "seekh kebab". Try googling that. Leevi 01:57, 3 November 2006 (UTC)

The problem with these Indian foods is that the names, as they are written, are transliterated from one of the 18 official languages of India (or possibly a local dialect). This transliteration happens/happened every time an Indian restaurant owner or Indian cookery writer wrote down the name of a dish. There are no rules for doing that and some transliterators put more effort into trying to preserve the aspirated sounds of the original word (hence the magic appearing/disappearing 'h's). There are various other sounds that do not occur in spoken English (in Hindi, there are four ways to sound 'd' and 't', for example). Have a look at 'papadom' to see where this leads us. Moglex 09:21, 3 November 2006 (UTC)

Ah, yes. Food names are interesting that way; you're exactly right about the "every restaurant owner" phenomenon. (There are examples from the Middle East, too: is it hummus, humus, houmous, hommus, hummous, or hamos?) And, transliteration issues aside, food names are one place where traditional dictionaries undercount human vocabulary, so it's good if we explore them.
I couldn't believe your mention of "18 official languages of India" at first, but I now see that, if anything, there are more. —scs 15:37, 3 November 2006 (UTC)
Some of the confusion is due to cultural and culinary differences between GB and the U.S., and much of the confusion is due to the fact that kebabs are popular with numerous peoples speaking different languages. Shish (şiş) is Turkish for a skewer, and seekh (sikh) is Persian for a spit. Same thing, different language. And of course there is also the problem of transliteration, and most words can be transliterated in several different ways (shish, sheesh, chiche, Schisch, and so on). —Stephen 17:36, 3 November 2006 (UTC)
When including a word in a dictionary, most people probably use the following criteria:
  1. is it in common use?
  2. was it in common use at one time?
  3. if it is an obscure word, does it hold some significance worth noting?
Ok, maybe the above represents my own personal criteria, but I don't think it would be too far off from the criteria that other people use. Google does a fairly good job at answering the first question (most of the time), but the results need to be verified by someone who knows the language and how to interpret the hits coming back from google. So, Google is a valuable tool, but it is only a part of the story. The rest involves documented usages (many of which can be found with the aid of google), and verification by native speakers. Since Wiktionary is comprised of anonymous contributors, a cited usage is probably always the most reliable (ex. it was in book X or movie Y etc).

A-cai 03:57, 4 November 2006 (UTC)

Other CJKV languages

While clicking along formatting part of the Han character entries, the remainder of the cruft in the entries was bugging me more and more.

I also have had some discussion with Connel about making sure there are definition lines in each language section. (See his talk page if you are curious.) I decided to match as much as possible of the non-standard formatting (endless bullets and boldface) and fix all the languages in the entries. And add a definition-needed line in lieu of actual definitions.

See for an example of the result.

The "standard" entries from NanshuBot are easy to handle, the problems of course come from variations people thought were cute or something. The only data dropped is the so-called "Penkyamp" romanization of Cantonese, which apparently was some "original research" someone was trying to promote on the 'pedia. (It was apparently deleted from the English 'pedia, and stashed on a user page: w:zh-yue:User:拼音/Penkyamp English, if you are not familiar with the characters, "拼音" is "pinyin", and presumably "penkyamp" in "penkyamp" :-) I haven't seen it filled in anywhere anyway.

There are categories to collect all the definitions needed. (Well, I haven't created the cat pages, but you can still look at Category:Mandarin definitions needed for example.) I don't expect the cats to be useful except as a way of tracking what is left out there, i.e. not as cleanup lists or some such. They will be quite large for a while. Robert Ullmann 23:34, 4 November 2006 (UTC)[reply]

Here's a thorny one that I just worked on. Sometimes, I wish I could have derived terms/usage notes/see also sections for each definition of the term. For example, Mandarin roughly corresponds to Min Nan , but only for one special sense (the one where you use it as a helping verb with an object of the sentence). Also, I want to show examples of derived terms that are separated out by meaning (ex. when it is used as a noun that means handle, one of the derived terms is 把柄. But when it is used as a verb that means to guard, one of the derived terms is 把守). The current format that we have for definitions seems too rigid at times to illustrate these things. Any ideas?

A-cai 02:20, 5 November 2006 (UTC)

We do this routinely with Translations and somewhat less commonly with Synonyms, Antonyms etc. Use a gloss for the sense in the same way:

See also

special helping verb
This would be especially useful for usage notes, although short ones can also be in-lined. We don't say this specifically in WT:ELE but it is getting fairly big already. I ran my code on this entry, and then had to combine the two Mandarin sections ... ;-) Robert Ullmann 12:03, 5 November 2006 (UTC)[reply]

I know everyone is real busy (and reading the opus below ;-)), but I would really like some feedback? Robert Ullmann 20:11, 5 November 2006 (UTC)

:-) I think the "Translation style gloss disambiguation" is the most appropriate solution for this. --Connel MacKenzie 17:33, 6 November 2006 (UTC)

The only potential problem I see is if for some reason a character is not used in a particular language, but gets an entry created under that language header with a "definition request". Otherwise, I think the situation can be handled with See also or Usage notes. I'm just not versed enough in the problems inherent in the characters to say much more. --EncycloPetey 04:32, 7 November 2006 (UTC)

Languages are only getting a section if there is information in the entry. So where there is (e.g.) an eumhun reading, it gets a Korean section properly formatted. The only dubious part of this is Cantonese, where quite a large number of entries have a Yale romanization for Cantonese, but nothing else. Some (yet to be determined) number of these might go away on individual edits, but there is no way to fix this automatically. Robert Ullmann 11:38, 8 November 2006 (UTC)

ingenuitive and the wholesale breakdown of the RFV process

Throughout en.wiktionary's history, there have been various skirmishes about "correctness."

There has been one faction, particularly the older (in wiki-time) contributors, that has constantly sought for more stringent verification and better weeding out of nonsense. These are the contributors who wish to see en.wiktionary become a respectable (and eventually, a respected) resource.

On the opposite end, there has been another faction, particularly younger students, who have fought tooth and nail to include every last possible snippet of gamer-jargon, nonces, hypothetical number names, random key sequences, intentionally ridiculous sexual slang, urbandictionary-isms, 'fuck' in 8,000+ languages, baby-talk and dozens of other things you won't find in any serious dictionary.

Caught in the middle, are others, who understand that English language learners need some idioms, some set phrases, but by and large, need/want only the absolutely most common slang terms (most of which, you can find in other respectable on-line sources.)

Other factions include those who wish to include industry jargon. Or computing jargon. Or legal jargon. Or specific legal terminology. Or medical terminology. Or medical jargon. Or computer programming language instructions. Or Wiki* jargon. Or dictionary (especially Webster's-1913) jargon.

Others caught in the middle are those exposing deficiencies in other resources (such as the OED, m-w.com, etc.) where the conflicting or incorrect information found there can be obviously or demonstrably false. These rare cases leave such contributors no choice but to reluctantly side with the inclusive factions.

Still others wish to see Wiktionary be the learning tool of choice for all languages. Others wish for it to be a translation engine, capable of translating any valid word combination, from any language, to English. (And of course, all other language Wiktionaries are supposed to just translate the entries we have on en.wikt, right?)

Separate also, are those who wish to enter regional terms of local (or very local) interest. They too, are left no choice but to side with the most rabid inclusive philosophies.

By and large, no one person fits entirely into any one of the above generalizations. Most contributors consider each term on its own. Those less moderate in their opinions, can seem rather bizarre, being rabidly prescriptivist one minute, horribly inclusive the next. (We shan't mention any names, for Connel's sake.)


Some of the more dramatic battles in the English Wiktionary's history have created seemingly useful things: the list of made-up words or request for deletion of garbage. I personally have witnessed the "gaming of the system" with regard to each of these. The LOP was a place to sequester garbage that doesn't belong in any dictionary. It has been overwhelmed. The RFV process was from the inclusive faction, as an attempt to define what a neologism truly meant. But the qualifications that worked moderately well when RFV started have now been flooded by an astonishing volume of garbage, due to our over-reliance on books.google.com. The books are increasingly including texts from questionable sources. And the numeric increase in the number of books available has led to the inclusion of many more terms now, that clearly were ridiculous garbage when the RFV 'process' was defined.

It is clear to me that the current system of requiring three citations has failed. It is no longer (and probably never truly was) able to weed out the chaff.

The biggest problem I see, is that the [delete] button doesn't "work." That is, a deleted entry, if used by enough gamers, will return quite rapidly. For the worst cases, "redirecting & protecting" an entry skews the altavista/yahoo/google search results, causing the most outrageous terms to now point directly to en.wiktionary! Within a month, dozens of mirrors skew the results further, making the previously unheard-of (and unusable) term now seem legitimate!

On the other hand, the complaints about the LOP seem quite valid: citations can't be sorted/ collected/ amassed for new words in a comprehensible manner with all of them stuck into that one tiny corner, with a maximum one-line each. The etymologies of these precious items of humor are lost.

Part of the problem is the impossible nature of "verification." The volume is simply too enormous for the current volunteers to address rationally.

  1. I would like to see a wholesale overturning of the current WT:RFV process.
  2. I would prefer to see the main namespace include only terms that can be found in other dictionaries.
    1. I would like that list of dictionaries to be finite.
    2. Such a check can be automated.
  3. I would like to see a "Neologisms:" namespace added for things I (and other dictionaries) consider garbage.
    1. I would like to see terms in the "Neologisms:" namespace for ten years, either promoted to "real word" status or moved to an "Obscure:", "Archaic:" or "Obsolete:" namespace.
  4. I would like to see a "Technical:" namespace added to cover jargon, technical combinations and similar items.

Note: I do not know who originally proposed the idea of using separate namespaces. I think the "multi-level Wiktionary" was one other attempt at the concept, even though it has been suggested before, in slightly different forms.

I do not, at this time, know precisely what is to be done with multi-word entries. Clearly, they collectively have just as many issues as I have spelled out for single-word entries above.

I do not, at this time, know precisely what is to be done about the languages we include. Clearly, they collectively have many more issues, than just what I've spelled out above. It would make more sense to limit the number of languages en.wiktionary includes to: 1) English (all dialects), 2) any language that has ever had (at any individual point in time) over 10,000,000 native speakers. Including each dialect should be limited to the respective Wiktionary for a particular language. Languages that are too obscure should probably be sequestered somewhere else, or graciously declined for the next 5-10 years, until the main effort is over.

Presently, en.wiktionary is simply trying to satisfy too many conflicting desires, meeting none of them adequately.

Does anyone else agree that there is a problem here? Does anyone have a better idea, than moving the "cruft" to other namespaces? With a software request to exclude the new namespaces from search engine queries, some of the problems (the main ones, I think) would be addressed. Then we could allow every last "urbandictionar-ism" to be entered, leaving redirects behind in the main namespace. This would go a long way towards relieving the current (overwhelming) burden from sysops, bringing the expected activities back down to a reasonable level. Giving the nonsense entries a place to sit and rot, would greatly reduce the cause of much of the vandalism we get today (almost half, by my estimate.)

Apologies for being long-winded. Comments greatly appreciated. --Connel MacKenzie 04:39, 5 November 2006 (UTC)[reply]

Well, just one point for the moment: the respective Wiktionaries for different languages. Those pages are completely useless to most English-speakers who want to look up a foreign word. You have to have an excellent command of the language in question in order to make sense of the pages. Besides that, they simply do not address many of the things an English speaker might wish to know about, because it’s second nature for the natives for whom the pages were written. —Stephen 06:44, 5 November 2006 (UTC)[reply]
Agreed. I do not know what to do about the languages we include. It still seems reasonable to limit the languages to a finite list, for the short (5-10 year) time frame. But I agree that such a restriction is unlikely to fly. That's why the topic at hand is a couple new namespaces for garbage terms. --Connel MacKenzie 06:50, 5 November 2006 (UTC)[reply]
Firstly, thanks for this piece. It contains a lot of background and enumerates some of the problems of which a newcomer such as myself may not be aware.
I was thinking about one aspect of the language 'problem' this morning, viz: the 'clutter' of, sometimes, over a dozen translations and how these, when interpolated between parts of speech, can make the system very awkward to use. There is, of course, a very simple way to avoid that problem, namely allowing each user to set in their preferences which languages they wish to see and then omitting any 'translation' or foreign headwords.
The idea of limiting words to those already found in a specific subset of dictionaries seems a good one. Other words could then be classified according to their speciality (or 'newness'), and a classification could either be automatically included according to a preference entry, or offered when a search is done on the word. (e.g. Search 'sheek' and be offered, maybe, an entry from cuisine and one from gaming). The system could then actually keep track of how many times each word was then selected by a unique IP address and, if certain criteria were met after a number of years, allow the word into the main body. Moglex 09:11, 5 November 2006 (UTC)[reply]
If nobody ever looks up a word, then what does it matter how we mark it? If enough people look up a word then you'd assume it's legit. Why go through all the effort only to mark a few words that hardly anybody looks at? DAVilla 09:45, 5 November 2006 (UTC)[reply]
I do not understand what you are saying here. Perhaps I should clarify my 'proposal' with an example.
Suppose I add an entry 'wiggin' and suggest that it means 'telling off'. Someone suggests it is suspect and a check of the nominated dictionaries draws a blank. Nonetheless, some contributors defend the word and Google indicates that there are a non trivial number of independent usages. So the word is included in a special area marked, for example, 'neologism' and/or 'speciality:cuisine'. Then, when anyone searches for the word they will be told it is not in the main dictionary but they can access it as one of the noted classifications (which may well, if there are more than one, point to quite different entries). If they accept one of these options the fact is noted, and after a suitable period the access profile for the word is examined, and if it is acceptable the word is entered into the main dictionary. Thus if there were a million accesses for a month after entry but only 10 per month three years later, the word would be considered obsolete whereas if there were a consistent 1000 accesses per month it would be considered to be in common usage. As an online resource we should use the technology available to the fullest extent. The sheer number of people accessing things like Wiki's or Google provides research data that it would have been impossible for a dictionary compiler to acquire 50, or even 20 years ago. Moglex 10:11, 5 November 2006 (UTC)[reply]
Thanks to Connel for marking ingenuitive as non-standard, by the way. It seems, whether the majority of us like it or not, these words are going to force the preservation of their own existence in some form in this dictionary. In fact whoever named this topic couldn't have picked a better example since the three quotations are all from printed works, and the old arguments that being online makes something any more crufty than not, fall flat on their face. So recognizing that we have to put up with trash to some degree is the first reconciliatory step, at least until we ask how to do it. Pushing such attested words off to the protologism page, for instance, isn't what most of us have in mind for their accommodation, although there's no reason we couldn't be that strict if the community desired it.
Because wikis are edited by the common denominator, perhaps the most we could expect of this free dictionary is the generic, copycat, and non-ingenuitive before treading on original research. That would be simple enough to patrol and no less objective. We could do a survey of dictionaries to find what criteria are used for inclusion, select our own minimal criteria that a dictionary must meet, and then require that every word entered be listed in one of those approved dictionaries, throwing out the possibility of our own fact-finding without making any but the most inclusive irate. The jargon crowd and the local crowd would have to show that technical and colloquial resources met the criteria. If we relied so heavily on dictionaries, deficiencies in these resources could still be exposed with an RFV process that focuses on double checking rather than screening, having a weaker (perhaps as weak as the current) criteria that would be hard to fail and more trusting of the authority.
The hardest for me to resolve are the foreign-language learner and translation crowd, but partly because it hasn't been resolved under the current process either (e.g. last summer). So I don't see any logical breakdown or conflicting interest in pushing that issue off for later.
But... and here is the rebuttal. Actually, there isn't any. If that's what we want to do then so be it. In fact I think it's important that we distinguish non-standard terms, and if an inline contextual tag isn't a bright enough flag to do it then please, by all means, find another. I don't see any reason to raise ingenuitive to the ranks of (don't laugh) irregardless. I will submit, without knowing who the original poster is, that this is, psychologically, a reaction to the growth of Wiktionary and the inability for any one individual to patrol it, even a cyborg creature like we all imagine certain unnamed top contributors would have to be, but that isn't a counter-argument.
The only thing I can say for certain is that a separate namespace, for neologisms as I've argued before for protologisms, would be extremely difficult to maintain. On the one hand you have to make sure that a definition isn't duplicated in the main namespace when it exists as a neologism. On the other hand it would be possible to have definitions that were legitimate and neologisms for the same word that weren't. In such cases the same problems we presently have will emerge, with common yet unattested definitions that don't belong in the main space pushing their way in. So while we might want to make a greater distinction for neologisms, a separate namespace may not be the right way to do it. DAVilla 09:22, 5 November 2006 (UTC)[reply]
Regarding ingenuitive's three print citations, as I stated above, they are all citations from dubious sources, due to the deluge of bad publishing that google now reports.
Regarding "extremely difficult to maintain": how would that be any more difficult than it is now? Shunting the bad entries to a separate namespace (or back) is a simple move. In the case of vandalism, a sysop would simply protect the main namespace redirect. Someone visiting the site, looking up a term would be directed to the namespaced entry automatically (via that same redirect, protected or not.) But the nonsense of deleting entries (and the residual orphan talk pages) would go away. I would think that some of our most avid "inclusionists" would be strongly in favor of such an approach. At the same time, it would provide a method of addressing entries that we all know, do not belong in Wiktionary. And yes, I did start this thread. I am not a cyborg, but I cannot speak for SB. RobotGMwikt and TheDaveBot, of course, are. --Connel MacKenzie 10:59, 5 November 2006 (UTC)[reply]
Not a cyborg? Then we could only deduce that you are pure silicon. No, all in jest of course! In reality I know you've just got a great system in place where others like myself fuddle around with the clumsy online interface. I mean, my IExplorer takes half a minute just to open a new window!
If you don't think a separate namespace is a problem then despite my personal opinion I guess I couldn't really argue against it. I'd like to know though where a query is directed for a word that is in both namespaces, presumably to the main NS:0? Then how would users know about the existence of the page by the same title in the neologism namespace?
You claim that this can be handled automatically, but what would a bot do if somebody added a term that already existed in the other space? Automatic deletion? Pre-empted by automatic protection and redirection? I'd like to know more specifically how you'd set it all up, in the general case moreso than for special cases of vandalism. DAVilla 15:44, 6 November 2006 (UTC)[reply]
Fair enough. I'd add the namespace(s) as the result of a vote here on en.wiktionary.org. Then when something is obviously non-standard, or fails RFV (etc.) it would be moved to the appropriate namespace (whatever namespaces are voted on.) The redirect would be left behind, perhaps protected in the case of repeat vandalism.
For specific meanings, the entry would be split. Using the current {{see}} method would provide the initial "disambiguation" but if that proves to be inadequate, can certainly be supplemented, for the more extreme cases.
For someone re-adding a "split out" meaning in the main namespace, a rollback would suffice. If added back several times (number TBD...perhaps 1, perhaps 3) additional "disambiguation" could be added.
Again, this is more of a request for better ideas than an immediate proposal. What better way can you think of, that would actually 1) reduce readditions, 2) reduce resultant vandalism, 3) leave the main namespace pristine - looking like a real dictionary?
--Connel MacKenzie 17:31, 6 November 2006 (UTC)[reply]
Another quick point while waiting for my brain to get around to a more reasoned reply. You brand all three of the sources of the ingenuitive citations as "dubious". This includes one university press as well as Architectural Press, a branch of Elsevier, arguably the most prestigious scientific publisher in the world. I'm not going to argue for the perfection of every word either of them has ever printed, but this makes me wonder exactly what "dubious source" means to you (besides "disagrees with the last thing I said" -- I'm taking on faith that you mean more than that). It definitely points to the inherent problems with "tightening" the requirements for citations. Deciding what's dubious after-the-fact, based on whether they give us the answer we want to hear, will quickly reduce our set of acceptable sources to zero, which will doom us to making our decisions on gut instinct alone, which (since we all have different guts) will doom us to never-ending and never-endable arguments over thousands and thousands of words. Still trying to see where the improvement is here. Keffy 07:05, 7 November 2006 (UTC)[reply]

Is it really a breakdown?

How big a problem is ingenuitive, anyway?

I mean, yes, clearly the "word" is illiterate garbage. Clearly Sue Roaf, Cristian Macavei, and Charlotte Gilman all have tin ears and poor vocabularies. (For me, that word connotes disingenuity, not ingenuity.) But that's not the point.

No one reading our ingenuitive page is going to get the impression that we're awash in urbandictionary cruft. The word is clearly shown to be nonstandard, but also (for whatever reason) to be in some level of actual print circulation. If I happened to be a reader of Ecohouse 2, A Series of Robbing Thieves, or The Yellow Wall-paper, and was scratching my head over this word, and I happened to look it up on en.wiktionary, I'd be grateful for the enlightenment.

Even if ingenuitive were truly unworthy of inclusion here, I'd be inclined to view it as an aberration, not a harbinger of disaster. None of our urbandictionary warriors or penis synonym contributors are going to take the time to find three print citations, or to enumerate which other dictionaries (if any) contain a term. So I'm not sensing any floodgates about to open.

Defining good criteria for inclusion is tricky, no question. Often, it's sorely tempting to say, "I can't define what is and isn't worthy, but I know it when I see it." But of course that's subjective and unworkable, so having objective criteria -- such as our three-print-citations threshold -- is very valuable. The question is, what do you do when you get taken by surprise, when the objective criteria end up making a decision that's at odds with your subjective opinion? Sometimes it means that the objective criteria are broken and need fixing, but sometimes it means that the objective criteria are working perfectly correctly, and are demonstrating their superiority to our subjective individual opinions. We're allowed to be taken by surprise by these things, from time to time.

scs 15:01, 5 November 2006 (UTC)[reply]

The breakdowns I see in the system are different than Connel's, but basically result in the same situation. I also am not a fan of the current system, it requires a tremendous amount of effort, it is over-used, it favors trends and techie-neologisms and excludes words which have been in use, but specific or low-internet-currency use for many years. There are some major holes and hurdles in our CFI and our RFV process, not the least of which is the "objectivity" vs "subjectivity" mentioned above. Often times I think we are too hard-line descriptivist, and words which don't really matter get included...it is trivial to get a word mentioned in 20 blogs, and all of a sudden it has 200,000 google results due to mirrors and lists and discussion about how "word x" now links to "georgebush.com" or whatever, the internet is too easily manipulated for google results to mean much in and of themselves. This means that in order to find meaning someone has to read those results, a tedious and often unfruitful activity, taking hours away from the few people who actually do produce quality content en masse.
I am of the opinion that we should make the CFI more stringent, three cites over 1 year is pretty trivial, but I don't think we should move to the "do the other guys have it?" litmus, as this will miss some words that we do want... I don't have the solutions, but I am glad this discussion is happening and I agree that there are problems with the current system. - [The]DaveRoss 17:53, 5 November 2006 (UTC)[reply]
Apart from relying on dictionaries, I don't know how to make the system any more stringent without also making it either more burdensome on those verifying terms or more focused on a certain type of discourse, like say printed books exclusively, ignoring street terms, no matter how common, that usually show up at best only as mention in linguistic papers. As far as the most extremely idealistic NPOV is concerned, durably archived is the only criterion for judgement. And anyways restricting the type of source wouldn't even make a difference for this word. So let's think about it, but for now I really don't see where any other good ideas could come from. Except... hmm... maybe weighting the sources based on how old they are, which has foundation in the idea that older sources are more difficult to preserve, and probably represent a greater number of the same use? But that could potentially involve some funky math if it didn't have arbitrary cutoffs. DAVilla 15:44, 6 November 2006 (UTC)[reply]
A quick response on the "it's not a problem" side -- maybe more later when I have the time.
While I don't see ESL learners as the main, and certainly not the only, target audience for Wiktionary, even they have the right to expect:
  • to look up words that they have a chance of coming across in their life.
  • to find all the information we have on that word, without needing wizard-level skills in Wiktionary arcana such as namespaces.
  • to be warned when a word is significantly rarer than a more common alternative, which they should probably use instead.
Our current system, with appropriate Usage Notes, can accomplish all three. As far as I can tell from my quick reading, Connel's proposal accomplishes none of them. He might be able to convince me otherwise, but it'll be uphill all the way. Keffy 16:54, 6 November 2006 (UTC)[reply]
That's a good list. There's a fourth which might be added:
  • to not be overwhelmed by volumes of specialized arcana when doing approximate searches, or looking at lists of synonyms, rhymes, category members and other lists, etc.
(I'm not saying that we shouldn't have "volumes of specialized arcana", merely that we might need to worry about overwhelming users with all of it at once.) —scs 17:15, 6 November 2006 (UTC)[reply]
My proposal was to leave redirects behind in the main namespace...and was a request for better ideas.
Doing so would simply display "Neologism:____" instead of "____" word as the heading.
Doing so would simply eliminate the overhead related to deleting things, that we do now. It also would eliminate the overhead of re-deleting cruft. It would do so without weakening any of our existing criteria!
--Connel MacKenzie 17:04, 6 November 2006 (UTC)[reply]
Oh! I missed the redirect part. That's a very interesting idea.
I do believe (as always) that we could use better tagging, and it's not clear to me that namespaces are the best way to achieve such tagging, though they do have the undeniable virtue of existing today. —scs 17:15, 6 November 2006 (UTC)[reply]
Scs, to address your closing paragraph: I have witnessed the erosion of the RFV process. As I said a dozen times, I think this is mainly due to the dramatic increase in texts available from books.google.com - a successful gaming of the system, as this example does demonstrate.
MOST IMPORTANT: You looked at the version of ingenuitive after I corrected it! Take a look at what it appeared as, when it passed RFV! --Connel MacKenzie 17:09, 6 November 2006 (UTC)[reply]
Sidestepping the whole ingenuitive issue, since we're discussing automation here, perhaps we could just provide a means to rank words by citation history; a longer citation history would mean that the word is more prominent, and more likely to be correct. This system could be further applied to individual senses within the definitions. I believe I've seen some pages with a timeline template that accomplishes something similar to this (but this wouldn't necessarily be ranked by time.) Each page could have a relative frequency ranking displayed. This still wouldn't address the issue of words used primarily in a non-written context...
On the idea of indexing our entries against a selection of other print/online dictionaries, I can't see how the copyright holders would allow that to happen (unless we relied solely on public domain texts.) Also, we would become (more) dependent on the functioning of external resources; what would we do if wiktionary is unable to contact them? --Jeffqyzt 03:03, 7 November 2006 (UTC)[reply]
Well, any better ideas then? The concept of linking other dictionaries has been floated as a copyright prevention technique in the past. But indeed, this would be quite different and I fear your assessment is correct - using just their lists of words would present some difficulties. --Connel MacKenzie 07:31, 7 November 2006 (UTC)[reply]
Hmmm I haven't fully had time to think through the implications, but I kind of like the Neologism: namespace idea (Aside: would singular or plural be the right choice?) with redirects left behind. This would also allow us to use a 'bot to tag the whole mess and create a category for the terminally confused. On the other hand, I don't like the idea of a Jargon: namespace.
Deciding what's technical and what's not gets trickier than you might think -- particularly when you consider that many technical jargon words from mathematics, chemistry, botany, and physics appear in the OED and Webster's. I would have a really tough time trying to define the boundary line, particularly when some of the entries will exist as both a common word and as jargon (such as commute, head, and string). I prefer to handle these cases as we've done before -- by simply tagging the definition with an in-line category to note the specialty field.
I very much dislike having to check against other dictionaries. One of the greatest strengths we've had over other dictionaries is our inclusion of printed or otherwise published terms that don't yet appear in any major dictionary. Taking away that strength would be foolish. I would rather see a more clever replacement for the "3 citations" rule put into effect. I can't think of a single simple criterion, but can think of a three-pronged approach. How about requiring a word to (1) have three independent citations with at least one year between any pair of citations, or (2) print publication in a newspaper, magazine, or book printed before 2000, or (3) publication in a "major" newspaper or journal (list to be defined). I'm not entirely happy with my formulated criteria, but they're easy to implement and more stringent than the current criterion. --EncycloPetey 04:23, 7 November 2006 (UTC)[reply]
I must be missing something. You have "or" a couple times above, making it less stringent, right? Perhaps if the print citations were required to be before the web-publishing era, i.e. before ~2000? --Connel MacKenzie 07:31, 7 November 2006 (UTC)[reply]
I will go away and think, but at present I tend to the "Is it really a breakdown?" view. But, two quick points:
If we revise CFI, we should consider that some types of word are more difficult to find than others -- for example, there are perhaps 1,000,000 people who work in/are associated with the British building industry, and at least 100,000 who occasionally hear (though may not always understand) most of the technical terms used in it. But it is very hard to find some of the words in a dictionary -- they are defined mainly by their use in text books, and are then used glibly in specifications, etc. Those people would benefit from a freely accessible dictionary, but it is hard to find our preferred "publicly accessible" documents to use as cites. Again, much well-known literature uses archaic words, while again the number of readily-findable independent cites may be small. For these types of word, finding three cites is perhaps sufficient. But for internet slang, 1,000,000 Google hits might seem appropriate!
Similarly, we need to be aware of the non-English languages in this Wikt. For obsolete languages, and those present day languages which are rarely written, even three "good" cites may be impossible, while the word may still be worth inclusion. --Enginear 20:33, 7 November 2006 (UTC)[reply]
    • Despite everything I said above about ingenuitive, I won't lift a finger to support "frictive". And though I'm not a fine-tooth-comb policy-reader (i.e. someone who is could probably find a way to prove me wrong), it seems to me there's an adequately strong case against "frictive" as a misspelling for fricative: (1) We don't list misspellings as "words", except perhaps in the case of extremely widespread misspellings; (2) three print citations is proof that a word may be included, not a mandate that a random string of characters must be included; (3) there's no hard-and-fast rule for how common a misspelling must be before we'll include it, but it's probably more than "three print citations". —scs 02:39, 11 November 2006 (UTC)[reply]
Thank you. So you agree then, that something isn't working quite right with CFI/RFV at this point in time? --Connel MacKenzie 03:05, 11 November 2006 (UTC)[reply]
I agree, if it's the case that well-meaning, clueful people are being defeated at this point in time in their attempt to delete the bogus, frictive-as-misspelling-of-fricative sense, defeated by ninnies waving copies of the defective CFI policy. —scs 03:21, 11 November 2006 (UTC)[reply]
How does your insult help the current situation?
Just to be clear: you think we should include the obscure meaning, but not identify it as a likely misspelling? --Connel MacKenzie 03:35, 11 November 2006 (UTC)[reply]
Um, what insult? Wait, you didn't think I was calling you a ninny, did you? (I certainly wasn't.) —scs 03:50, 11 November 2006 (UTC)[reply]
Whomever. (Presumably, the OED editors, who force searches for 'frictive' to their entry for 'fricative'.) Was it not you who recently took time off due to a lack of civility?
So, again, which is more important, in your opinion? Identifying the obscure "friction" possible meaning, or the likely misspelling of the linguistics meaning? With the current CFI/RFV system both will probably remain, in some form or another. --Connel MacKenzie 04:05, 11 November 2006 (UTC)[reply]
Clearly, "frictive (adj.) of, relating to, or caused by friction" should remain.
Whether a note mentioning the misspelling should be included is harder to say.
(Whether this fact -- that it's hard to say whether a note about the misspelling should be included -- represents a breakdown in CFI/RFV is debatable; I tend to think not.)
If a note about the misspelling is included, I'd say it should not be a sense listing, but rather a "See also" link (e.g., "See also:fricative (possible misspelling)"), as I've seen in some of our other entries, though I forget where.
So what are you arguing for? I thought you were trying to get something deleted, but now I'm pretty confused. —scs 04:19, 11 November 2006 (UTC)[reply]
Such a "see also" link would be very inconsistent. If you look back (quite a ways now) you'll see that my proposal would be to put entries like this in another namespace (presumably "Neologisms:" or perhaps "Obscure:" or "Misspellings:") with a redirect from the main namespace. In fact, if you read what I actually wrote above, you'll understand...
I'm sure I'm being very stupid, but there's only one thing that's clear to me in this subthread, and it's that the more I read of what you wrote, the less I actually understand. :-\ —scs 05:08, 11 November 2006 (UTC)[reply]
...that I was proposing that much less should be deleted. Exactly the opposite, in fact. On the other hand, I feel it is a disservice to our readers to assert that this word is likely what they intended. That is why I was asking for ideas and comments on how to improve our criteria (for items that do belong in the main namespace.) Would the entry have disappeared if I hadn't drawn attention to it? Hard to guess.
As a sidenote, I really like your suggestion that 'misspellings' should require more than three citations. I wonder if that would work for alternate spellings, as well? Can you think of a more suitable number, or some other metric to use as a guideline, for those cases? --Connel MacKenzie 04:59, 11 November 2006 (UTC)[reply]
Okay, thing 1: Sorry for misunderstanding what you were on about, but at first I really did get the impression that you were bemoaning the existence of our entry for frictive in the misspelling-of-fricative sense, and wishing it could be unapologetically deleted, and fearing that a too-strict reading of our CFI might necessitate its retention. I thought that was the problem you were worried about, and when you asked me to agree that there was one, I said, "if it's the case that well-meaning, clueful people are being defeated by ninnies waving copies of the defective CFI policy". That is, if it was the case that (as I assumed) you, Connel, were being clueful and attempting to delete the bogus sense entry, but that other, unspecified "ninnies" were thwarting you by stubbornly pointing at three (bogus) print citations and saying "Nyah, nyah, we have to / get to keep it", then yes, I would have agreed there was a problem. But in fact, my reading of the WT:RFV thread was that the objectionable sense had every indication of being deleted, and that there were no "ninnies" lobbying for its retention, and that the system was working pretty well. (Incidentally, that's also why I had no qualms about making what might have seemed to be an uncivil remark, since there wasn't in fact anyone acting like a "ninny" for me to insult.) But then at some point you seemed to be agitating for the sense's retention in, er, some sense, and I got badly confused.
Thing 2: A "See also" link may not be the right way to handle the "Did you mean?" hint, after all; I see that the precedent for that sort of information tends to be a Usage Note. See, for example, our pages on lose/loose, or principle/principal, or capitol/capital, or its/it's. So my current suggestion is to delete the sense at frictive that says "class of consonant sounds having a sibilant, hissing, or buzzing quality" and replace it with a Usage Note saying "Not to be confused with fricative" or the equivalent.
Thing 3: It seems to me that the right (objective) metric for including a misspelling or a mistaken usage (and perhaps also an alternative spelling) would involve a percentage rather than an absolute count. So if a google search comes up with 1000 hits for a misspelling or mistaken usage, that certainly counts as "common" if it compares to 3000 hits for the correct usage, but not if there are 3,000,000 hits for the correct usage. But I'm not sure what the appropriate percentage cutoff would be. (Also, this hypothetical percentage-based metric doesn't tell the whole story; there are valid subjective criteria as well.) —scs 15:39, 12 November 2006 (UTC)[reply]
re: Thing 1: If en.wiktionary.org were mine I would unapologetically delete it. But it is not, so meh.
re: Thing 2: That sounds like a very reasonable approach.
re: Thing 3: Percentages offer good guidelines, but are susceptible to strange statistical anomalies (such as auto-corrections skewing web search results.) I think comparative percentages might fare better. That is, for words spelled correctly that have more web hits than a word such as "offer", 1% would indicate a valid misspelling. For terms about as common as "skiboard" a 10% threshold. But I confess I haven't delved deeply into that sort of analysis in a while. The problem with setting numeric limits is that the numbers change (both too rapidly, as well as by orders of magnitude from year to year.) --Connel MacKenzie 18:54, 12 November 2006 (UTC)[reply]
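The arithmetic behind Thing 3 and the reply to it can be sketched in a few lines of Python. This is only an illustration: the function names and the `common_word_hits` parameter are hypothetical, the hit counts are the made-up figures from the comments above rather than real search results, and collapsing the "offer"/"skiboard" suggestion into two tiers is a simplifying assumption.

```python
def misspelling_ratio(misspelled_hits: int, correct_hits: int) -> float:
    """Hits for the misspelling as a fraction of hits for the correct spelling."""
    if correct_hits <= 0:
        raise ValueError("correct_hits must be positive")
    return misspelled_hits / correct_hits


def tiered_threshold(correct_hits: int, common_word_hits: int) -> float:
    """Assumed two-tier cutoff: 1% for very common words, 10% for everything else.

    common_word_hits stands in for the hit count of a benchmark word
    such as "offer"; its value is hypothetical.
    """
    return 0.01 if correct_hits >= common_word_hits else 0.10


def is_common_misspelling(misspelled_hits: int, correct_hits: int,
                          common_word_hits: int) -> bool:
    """True when the ratio reaches the cutoff for the word's frequency tier."""
    ratio = misspelling_ratio(misspelled_hits, correct_hits)
    return ratio >= tiered_threshold(correct_hits, common_word_hits)


# The two cases from Thing 3: the same 1000 hits, very different ratios.
print(misspelling_ratio(1000, 3000))
print(misspelling_ratio(1000, 3_000_000))
```

As noted in the reply, any fixed numbers here would drift as search indexes grow, so the thresholds are parameters rather than constants.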
I think this is an interesting issue, and I don't know enough to address most of the elements. However, I completely disagree with the notion of requiring a term to be listed in another, presumably more respectable, dictionary before it can be included in Wiktionary. In the last couple of days, for example, I have found several alternative spellings thoroughly attested by quotes in the OED, but not actually listed in that dictionary (try full-text searches in the OED online for pryvely, pryuely, prively, and priuely). I would be astonished if the same were not true for unusual words and in other dictionaries. Furthermore, the editorial processes at most dictionaries are extremely slow. It could easily be over a decade between the time a word comes into common use and the time it is listed in a printed dictionary. --Dfeuer 23:11, 13 November 2006 (UTC)[reply]
Well, the notion of using other dictionaries was already discarded, obliterated and destroyed by someone pointing out that systemic use like that would have potential copyright-violation issues. I am open to better ideas.
Could someone work similar magic to waked as was done to frictive? Not exactly the same issues, but very similar "fix" is needed, right? --Connel MacKenzie 22:17, 16 November 2006 (UTC)[reply]

Is it really a breakdown? redux

Now that frictive has essentially passed RFV, I'd like to assert that there is something fundamentally wrong in our approach. Apparently it is a word, albeit a rare one. But out of context, it certainly seems like a typical misspelling. Since fricative is not a common term outside of linguistics, it would probably be wrong to call it a "common" misspelling. Someone looking up the term to see if they misspelled it will now (eventually) be directed to the correct entry/spelling. But I fear the one individual case worked out only because an unusual amount of attention was given to it.

None of this explains why all other dictionaries choose not to list it at all. Just because it is obsolete? Too narrow a context? Too bizarre? What? --Connel MacKenzie 18:29, 22 November 2006 (UTC)[reply]

Is it not even in the OED? That's very surprising, if so.
Me, I don't think there's any overarching lesson to be drawn here, other than the well-known one that dictionary inclusion can be a difficult process. No matter how comprehensive our guidelines are, the process will always be somewhat subjective, with tricky words needing attention on a case-by-case basis, subject to care and good judgement.
There are, of course, lots of "words" that don't appear in all dictionaries, or that may not appear in any dictionary. Some are slang, some are abstrusely technical, some are obsolete, some are always spoken (never written), some are just too new. If a candidate word doesn't appear in other big dictionaries, that may be surprising, and it should certainly cause us to check our citations more carefully, but it's in no way a damning piece of evidence against including the word at all.
So, is there something fundamentally wrong with our approach? Me, I don't think so. Our process may not be perfect, and we certainly don't have any kind of magic objective crank we can turn that will automatically pop out a "yes" or "no" answer for any putative word we're considering, but all in all I think it's about the best we can do.
(If the process is looking broken to you, do you have any avenues in mind to pursue in trying to fix it, or any way to narrow down what the difficulty is? I'm sorry to sound like a doubting Thomas, but I'm still not really seeing the essential problem. In particular, how could or should frictive have been handled differently?)
scs 19:40, 22 November 2006 (UTC)[reply]
No, it's not in the OED, not even searching the full text. Yet it has at least 90 BGC hits, more than some other words that "we all know" are used. In this case, the answer to why we have it when no other dictionaries do is: in parts, particularly where a lot of attention has been given to a word, we're better than any other dictionary.
The thing which is of greater concern re CFI is that, if anyone chose to champion the misspelling of fricative, I believe they could find enough good cites to pass CFI. In general, I think the principle is valid -- that if we accept a rare word with x good cites, then if x books contain the same misprint of another word, it is worth adding as a (non-standard spelling) definition. The likelihood of someone wanting to check the meaning of each word is similar.
The main question is, what should x be for words which are currently in use in English (or Spanish or Chinese) and should it be higher than y which we would use for obsolete words, or z for words in Adanque or Twe; and what do we do for proto-Indo-whatsit and indeed those present languages which are "never" written except in dictionaries.
The subsidiary question is, say we set x at 10, or 20, do we actually require all those cites to be added (even with an online source, they take me several minutes each, by the time the context has been checked to confirm the correct meaning is being used, and then all the info has been copied across) or do we agree a short cut (perhaps only as a short term measure). --Enginear 20:25, 22 November 2006 (UTC)[reply]
The notion of using straight numeric counts was rejected out of hand, in preference to having cites on the page that demonstrate the meaning conveyed. --Connel MacKenzie 19:51, 24 November 2006 (UTC)[reply]
Yes, that's logical, but the corollary is that if we set a requirement of more than about 5 cites for current English words, the RFV process may break down because no one can be bothered to cite them, so they either get nodded through without any at all (as often happens even now) or get binned incorrectly. --Enginear 00:33, 25 November 2006 (UTC)[reply]

One of the cleanup activities on my to-do lists is to reformat the old-style "*Common misspelling of..." forms to the more recently accepted format. However, due to the tremendous increase in books available on b.g.c., pretty much every term that was formerly unambiguously a misspelling can now, today (about one year later), easily pass WT:CFI as an alternate spelling. See Special:Whatlinkshere/Template:misspelling of.

BTW, what is the present accepted format (I've been meaning to ask for a while) --Enginear 00:33, 25 November 2006 (UTC)[reply]
{{misspelling of|foobar}} looks pretty good to me. —scs 04:11, 25 November 2006 (UTC)[reply]

Now, misspellings are thorny; no doubt. Sometimes they legitimately are regional spellings. Other times they reflect linguistic changes that arguably, we should be tracking. But other times, they are just common misspellings. I don't see any part of WT:CFI that sufficiently addresses the problem for genuine, common misspellings. Obviously, the books from the large publishing houses have each of these errors numerous times. Does this mean that b.g.c. and amazon full-text search are no longer reliable? Or does it mean that we should require >10,000 books.google.com hits for something to be considered a word? If percentages were to be used, what would be decent cutoffs? A "common misspelling" would be something that occurs more than 0.5% as often as the main spelling? Anything over 10% would be an alternate spelling? (Regional spellings, of course, make it much more complicated than just those cut-offs.) --Connel MacKenzie 08:11, 24 November 2006 (UTC)[reply]
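The cutoffs asked about above could be expressed as a small classifier. This is a minimal sketch, assuming the 0.5% and 10% figures from the question; the function name and labels are hypothetical, and regional spellings are deliberately ignored, as the comment itself warns they complicate matters.

```python
def classify_spelling(variant_hits: int, main_hits: int,
                      misspelling_cutoff: float = 0.005,
                      alternate_cutoff: float = 0.10) -> str:
    """Label a variant by how often it occurs relative to the main spelling."""
    if main_hits <= 0:
        raise ValueError("main_hits must be positive")
    ratio = variant_hits / main_hits
    if ratio > alternate_cutoff:
        return "alternate spelling"
    if ratio > misspelling_cutoff:
        return "common misspelling"
    return "below both cutoffs"


# Illustrative counts only, not real search results.
print(classify_spelling(200, 1000))
print(classify_spelling(20, 1000))
print(classify_spelling(1, 1000))
```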

I believe there are two main reasons for wanting to look up a misspelled word:
  • Because you don't know what it means (perhaps knowing the correctly spelled word but not recognizing the target word as a misspelling of it, or perhaps not knowing the correctly spelled word either)
  • Because you know what the intended word means, but can't remember how to spell it (I've just taken 4 tries to find corollary).
For either case, I don't see the logic of %s, certainly not as the only criterion. If a common word is misspelled in a certain way 0.01% of the time, it may still cause as many people to puzzle and look it up as a rare word (used 0.001% as often) which is misspelled 10% of the time. There may be some connection with how similar the words are, and a knowledge of common misspellings -- few people who know necessary would doubt that neccessary or even neccesary had the same meaning, but I was once confused, at least for a few seconds, by the typo het; also, I only had to look up corollary with the different permutations of r, rr, l, and ll, since I knew the other letters almost for certain. But I don't think % of misspellings is relevant in itself, more the absolute number of times a spelling is written. --Enginear 00:33, 25 November 2006 (UTC)[reply]
I would distinguish between "common misspellings" and mere "misspellings".
  • Common misspellings might find their way into print, and readers who don't realize they're misspellings might want to look them up, so including them is arguably okay. And I do believe that a relative (percentage) measurement is a decent way of deciding whether a particular misspelling is "common".
  • Mere misspellings, however, are infinite in number, and while it's true that users might need help finding the correct spelling, it's not clear that including a dictionary entry for every possible misspelling is the best way of giving them this help.
I may not have explained clearly enough. I am trying to say that we should measure common by the number of times a misspelling gets into print, not by the percentage of times the intended word is misspelled. To take an example, the noun sense under etymology 3 of nope barely meets CFI (in fact it may not meet it -- it is very hard to find cites for it because of the preponderance of the Interjection use, and I gave up after finding two). It is obsolete, but had a common (in % terms) misspelling an ope. The misspelling was perhaps 50% as common as the correct word. However, since that meaning of nope is on the borderline of CFI, ope clearly fails, and should not be included in spite of being a 50% misspelling. By contrast, neccessary has at least 5,000 b.g.c. hits (compared with at least 30,000,000 for necessary). That makes it a misspelling at a rate of roughly 0.02%, yet it logically has a greater claim to be in the dictionary because it is in print perhaps 1,000 times as often as ope with the meaning given. (I won't be adding it myself, but might support it if someone else added it and it was then RFDed.)
Incidentally, frictive is borderline CFI as a misspelling of fricative (I noticed about five b.g.c. cites on pages referring to speech sounds, but my knowledge of the subject is inadequate for me to say whether fricative was meant or whether the word was being used correctly.) --Enginear 16:08, 25 November 2006 (UTC)[reply]
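The difference between ranking by percentage and ranking by absolute count can be seen with the two examples from the comment above. In this sketch the neccessary/necessary figures are the lower bounds quoted there; the counts for ope are hypothetical placeholders chosen only to match the "perhaps 50%" and "perhaps 1,000 times" estimates, and the function and variable names are likewise invented for illustration.

```python
def share_of_intended_word(misspelled_hits: int, correct_hits: int) -> float:
    """Misspelling hits as a fraction of hits for the intended word."""
    return misspelled_hits / correct_hits


candidates = {
    # Lower-bound b.g.c. figures quoted in the discussion.
    "neccessary": {"misspelled_hits": 5_000, "correct_hits": 30_000_000},
    # Hypothetical placeholder counts, not measured values.
    "ope": {"misspelled_hits": 5, "correct_hits": 10},
}

by_percentage = sorted(
    candidates,
    key=lambda w: share_of_intended_word(**candidates[w]),
    reverse=True,
)
by_absolute_count = sorted(
    candidates,
    key=lambda w: candidates[w]["misspelled_hits"],
    reverse=True,
)

print(by_percentage)
print(by_absolute_count)
```

The two orderings come out reversed, which is the point being argued: a percentage metric favours ope, while a count of appearances in print favours neccessary.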
(Everything else being equal I'd rather we didn't list either kind of misspelling, because it would be nice if Special:Allpages generated a word list, not a mixture of words and misspelled words. But of course everything else isn't equal, and for now I agree that our inclusion of common misspellings is, on balance, appropriate.) —scs 04:11, 25 November 2006 (UTC)[reply]
(I agree. The best solution, IMO, would be to have a did you mean in the search engine. Obviously, one way of achieving that would be for the search engine to refer to a list of words which had been cited as misspelling of nnnn. Perhaps, if others agree, the misspellings could be moved to a List of Misspellings in anticipation.) --Enginear 16:08, 25 November 2006 (UTC)[reply]
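The "did you mean" idea amounts to a lookup against a separate list of attested misspellings. The following is a minimal sketch under that assumption; the names `MISSPELLINGS` and `did_you_mean` are hypothetical and do not describe any existing MediaWiki search feature.

```python
from typing import Optional

# Hypothetical "List of Misspellings": attested misspelling -> intended word.
MISSPELLINGS = {
    "neccessary": "necessary",
    "neccesary": "necessary",
}


def did_you_mean(query: str, entries: set[str],
                 misspellings: dict[str, str]) -> Optional[str]:
    """Return a suggested spelling, or None if the query is already an entry
    or is not a known misspelling."""
    word = query.strip()
    if word in entries:
        return None
    return misspellings.get(word)


entries = {"necessary", "corollary"}
print(did_you_mean("neccessary", entries, MISSPELLINGS))
```

Keeping the misspellings in their own list, as suggested, would leave the main entry list free of them while still guiding the reader to the right page.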

Breakdown of rfv?

As Connel pointed out, the main difficulty with the rfv process at present is that it is entirely subjective. That said, I am not certain that I agree that rfv has broken down. Of the CFI pages, it appears to be the only one working at present - even if it is working imperfectly. Hence the comparative increase in volume on the rfv page. What I have found over the months that I have been working on the page is that the basic cruft has been eliminated and the words now being nominated need a fair amount of research for verification. Very often I don't have the time or the expertise to make such decisions and rely on contributors whom I trust (as in ingenuitive where I relied on the work put in by other contributors).

Hippietrail brought this subject up earlier this year and proposed a two tier dictionary, with a user preference button which would allow users to see less conventional material, and a second button which would eliminate the less recognised words from view. I think this proposal was not fully discussed and I personally feel that it has merit. (How it would be implemented is another debate.)
  • My own personal view is that the 3 cites is a good rule of thumb and I have changed my mind that print citations are "better" than google citations. - Certainly google cites can easily be checked, and it is usually easy to spot self reflective google hits. I think we should rather encourage the 3 cite policy.
  • I disagree that we should rely primarily on dictionaries. This will eliminate new words, eg. chav, lolicon, vuvuzela and the like. I recently failed a word from Oxford shown as used once in the 1700's and was possibly a spelling mistake then.
  • I detest the protologism pages. In my view either a word is or it isn't. However, the recent suggestion of having protologisms time-lined to a year and then reviewing the entries there is a good one.
  • I think that one of the ways we can focus rfv is to call the original contributor to account. If a word is rfv'd, then at the time it is referred, a note should be sent to the original contributor's talk page, requesting him to cite the article. If he fails to do so, then it can be presumed that the word is questionable. [I appreciate the flaws in this argument, and am not going to enumerate them] - the presumption can be overridden.
  • With technical words, it is more important to cite than with ordinary words as the majority will not have the expertise to verify. The current tag of {law} or {engineering} or {geology} should raise the flag that this is a technical word.
  • With offensive, scatological, sexual and racist words, the words should be presumed to be failed unless the citations are at the page.

These are a few thoughts, but I hope will contribute to a policy decision in due course. Andrew massyn 06:22, 11 November 2006 (UTC)[reply]

Thank you - great comments. A quick note: print citations are strongly preferred as they can be checked. "Durably archived" Internet resources cannot be guaranteed to remain in public view indefinitely, even if they could somehow be guaranteed to truly be durably archived. Books on the other hand, can be verified, usually quite easily. Internet resources that are not durably archived should not be considered at all - most don't remain available for a single year.
One other note: the idea of implementing some way of keeping garbage entries is a compromise to reduce vandalism and re-entry of said garbage, first and foremost. (If you recall, the entire RFV process exists only because of vandalism. It was an attempt at trying to find a fair way to stem the tide.) --Connel MacKenzie 17:03, 11 November 2006 (UTC)[reply]

Greek Verbs

Sorry to be of bother again, but as was aforementioned, it has been difficult to get a clear grasp on Wiktionary policy regarding Ancient Greek for the fact that there exists little to none of it. My question today is dealing with verb citation. I know for Latin, verbs are listed by their infinitive form. In Ancient Greek, however, the infinitive is not among the principal parts. I found pre-existing terms to be listed by their first person, singular, present, active, indicative forms (first principal part). To get to the point, I was hoping to confirm that listing by the first principal part is correct. For reference: φεύγω (a page I created) and βάλλω (a page to which I made a minor edit, but hope to further improve!)

In regards to basic policy towards Ancient Greek (and other foreign languages in general; I would likewise hope to fix the presentation of Hawaiian), how does one go about in its origination? Thank you all for your time. Medellia 02:55, 7 November 2006 (UTC) (reposted Medellia 04:49, 7 November 2006 (UTC))[reply]

Well, I don't know anything about Ancient Greek, but it would seem to me that these sorts of decisions do vary depending on the language in question. Some people here like to pretend that the infinitive in English is the base form, and that's just really awkward. DAVilla 08:55, 9 November 2006 (UTC)[reply]


An inclusivist's counter-manifesto

As promised (threatened), here's a rant in response to Connel's masterful rant on the RFV process, the criteria for inclusion, and much, much else. Connel apologized for going on for so long. I would too, but I wouldn't be sincere. :-)

Since I'm apparently not a moderate, I hope I'll at least count as a moderately respectable proponent of rabid inclusivism, being neither young, nor a student, nor a gamer, nor even any good at video games (though I once lasted a whole two minutes on Ms. Pacman), nor obsessed with discussing sex with ninety thousand different synonyms, nor any of the other things Connel mentioned. At least I assume I'd count as a rabid inclusionist, seeing how I found myself agreeing simultaneously with every single group of moderates he mentioned. (Idioms? Yup, need those. Legal terms? Check. Medical terms? Check. Regional terms? Check. Words from puny languages like Swedish, Finnish, Norwegian, Danish, Catalan? Check.) Last week, I'd have guessed that agreeing with every moderate would make me too a moderate, but I'm no good at this newfangled new math stuff.

The first part is a specific response to Connel. Then I'll get abstract and downright florid.

  • On being a "respectable" dictionary:
    Respect isn't a gas floating free through the atmosphere. Let's be specific. Whose respect are we trying to get?
    Would it be the end of the world if certain outsiders started berating us with descriptions like "a non-word deluge", "debased verbal currency", "a dismaying assortment of the questionable, the perverse, the unworthy, and the downright outrageous", perhaps even "disgraced by just that sort of ignorance and unfairness which cheap journalism in a mass culture would substitute for scholarship"?
    Those are actually all quotes about the Webster's Third in the early 1960s, made by people outraged at its horrendous crimes, such as including slang and accurately describing the use of ain't. If that's the kind of disrespect you're worried about, then bring it on. I'd be honoured to be disrespected in such company. Pandering to the kind of linguistic ultra-conservatives who hated W3 is a no-win game. They can't even agree with each other. There's no possible way to make all of them happy at the same time, so why try?
    Which leaves us with trying to get respect from the same kinds of people who would respect the W3 and the OED...
  • On being a "serious" dictionary:
    I couldn't agree more. Wiktionary should be, and should be seen to be, a serious dictionary. (I'm not talking about specialized abridged dictionaries here: the mini-dictionary for ESL learners, the mid-size dictionary for highschool students, the pocket-sized dictionary for soldiers and aid workers. I'm talking about comprehensive dictionaries. If that's not our goal, then what's the point?)
    This needs elaboration, since most of your argument is based on the assumption that no serious dictionary would dream of including certain words. There's really no delicate way of saying this, but here goes... You're deeply confused about the way serious modern dictionaries work, especially in the English-speaking world. Their inclusion policies are almost identical to our CFI. I'm not kidding. Put them all in a pile and ask a random passer-by to pick which one comes from that flaky Internet bunch, they couldn't do it. Maybe one dictionary wants five citations instead of three; maybe one wants a spread of three years instead of one. But they also have considerably more resources to use in finding those five citations than our feeble books.google.
    Serious dictionary editors would gladly include words from gamer slang if they didn't have to trade a piece of Romantic poetry or a South American bird for each one (or else add $50,000 to their production costs). Often, they'll make the trade anyway. I can't emphasize this enough: If you can't find one of our RFV-passed words in a recent dictionary, it's not because no serious dictionary would ever include it. It's due solely to the glacial pace of print publishing and/or the budgetary implications of producing multi-thousand-page books. You're misinterpreting brute economic constraints as principled selectivity.
    I know a few dictionary editors and professional lexicographers. I've listened to them moan about the practical difficulties of their jobs and how they wish things could be instead. Most of them would kill to be as free from physical and economic constraints as we are (if they could figure out how to do that without going bankrupt, of course). And you want to throw that freedom away in order to be more like how they wish they weren't?
  • On the varieties of "cruft":
    You often tar many kinds of words with the same brush. I'll tease them out here, though my ultimate response to each one is "tag it appropriately and let it be."
    • non-standard / minority dialect forms:
      There's no justification at all for not including them, with appropriate tagging.
    • technical terms:
      What's the problem? Who exactly is being misled by our current format into thinking that reverse transcriptase is a great word to drop into a casual conversation about the price of gas or what we all should do this weekend? And if such a person is so dense as to ignore the obvious technical nature of the definition and often an explicit discipline tag, how is redirecting them into a Technical: namespace going to cure their denseness? (I'm all in favour of more consistent tagging, but if we haven't even been able to do tagging properly, trying to do a namespace segregation will be nothing but an unholy mess.)
    • gamer jargon, etc.:
      Yes, our readers should clearly understand that these words are relevant for only a small percentage of the population. But as long as the words are properly tagged, exactly what are we losing by including them? Not the "respect" of the Webster's-haters -- we never had that anyway and we never will. A few K of Wikimedia's disk space (less than pretty much any recent Commons photo)? The joy of waging a never-ending 24/7 battle against the hordes who want them in? Mmm, fair trade.
  • On the need to concentrate on ordinary words:
    Okay, so the inclusion of "cruft" doesn't detract from our reputation, but it doesn't do much to add to it either. To enhance reputation, we need great entries for words like die, hubris, and crestfallen, and too many of our entries for words like that are, well, un-great.
    Yeah, we need more and better articles like die, hubris, and crestfallen. So here's a radical thought: Why don't we here spend our time actually creating and improving those articles, instead of trying to delete the articles we don't care about and getting into arguments about it?
  • On the onerousness of the current RFV process:
    Yeah, it's gotten out of hand. Through my rose-coloured inclusionist glasses, it seems obvious that things would improve a lot if people who disliked a word communicated the reasons for their dislike through tagging and usage notes rather than through attempts to delete. Here's my proposed rule-of-thumb, though I'm not naive enough to think it'll catch on:
    • RFV an article only if you have sincere doubts about whether the word actually is used as a word by anybody at all. If your problem isn't with whether, but with who uses the word, or with how, or with where, or with when, or with how many, or with how silly it sounds -- in short, if your doubt isn't about whether it is used, but about whether it ought to be used -- then don't waste people's time RFV'ing it. Tag it appropriately. If you don't feel competent to tag a particular word yourself, take it to the Tea Room and let everyone who's interested hash out the most appropriate tags and usage notes.
  • On prioritizing editors' labour to more important things
    Unless you win a lottery and start paying us, it's not gonna happen. The knowledge, interests, and enthusiasms of volunteers cannot be switched around at will like trains in a railyard. (Don't worry, I'm not even remotely tempted to use exic*rnt here.)
    Deleting the words that "younger students" are interested in contributing will not show them the error of their ways and make them want to work on hubris instead. Telling a Swedish speaker that he's obviously stretched too thin to contribute Swedish words for the next decade will not make him gladly contribute French instead. Telling a retired lawyer that we really don't need the legal terms she knows will not make her want to wikify Shakespeare quotes instead.
    The only editor's time you have any control over is your own. I won't ignore my own advice and try to convince you to switch your enthusiasm from getting rid of cruft to wikifying Shakespeare quotes. Mainly because it's obviously not your enthusiasm -- fighting this endless battle is clearly driving you crazy. Even if you don't believe me that serious dictionaries don't obsess about cruft, for the sake of your blood pressure I beg you to pretend you do. Show Aiken-like wisdom in response to battles you can't win: rather than nuking the battlefield, declare victory and go home. (Home means words like hubris, not non-Wiktionary stuff, right? :-) )
    • Of all the misplaced comments, the preceding paragraph is the only one that got my gourd. You think I'm looking for this crap? I methodically work through the cleanup lists that I generate after each XML dump...some words do not merit my (or anyone's) time to clean them up - those are the ones I nominate in the current system. Again, the "namespaces" proposal is an idea for reducing that, by giving the non-vandalism stuff a place to live. Stuffing it all in the main namespace routinely meets astronomically harsh resistance.
    • Now, since you are so eager to tell me how I should spend my time, perhaps I should recommend you take a look at the cleanup efforts for the Wikisaurus namespace...and spend your time there, properly sub-page classifying the 5,000+ synonyms for 'penis' in a coherent manner. :-) A couple hours of cleaning/sorting a single Wikisaurus page has cured several rabid inclusionists, so far. --Connel MacKenzie 07:35, 8 November 2006 (UTC)[reply]
      Nice try, but it won't work to cure me. :-) I have no urge to impose logical order for the sake of logical order -- only for the benefit of end-users. As far as I can see, the current Wikisaurus entries already perfectly meet every need that I can imagine of those users who want to look at a list of synonyms for penis. Keffy 04:47, 12 November 2006 (UTC)[reply]
I find that to be a pretty incredible statement. Wikisaurus is supposed to be a thesaurus listing synonyms. Someone referring to a thesaurus does expect synonyms, not nonsense. --Connel MacKenzie 07:22, 13 November 2006 (UTC)[reply]
The one I really like is Notary Public. Andrew massyn 18:02, 17 November 2006 (UTC)[reply]

And now a more positive statement of my starry-eyed inclusivism, framed as...

The End-User's Bill of Rights

Eventually, people who use Wiktionary should have:

  • the right to find an entry for any word they might encounter in their lives and need to look up.
    I see this as so self-evidently the job of a comprehensive dictionary that I frankly can't anticipate what objections someone might have to it. It's an ideal we'll never achieve, and we can have honest arguments about our priorities in the less-than-ideal meantime. But if somebody truly thinks that even the perfect Wiktionary should still give users the Nogomatch blue screen of death for words they've actually encountered, I have no idea how to respond to that.
    I see the "three citations" criterion in this light, rather than as an arbitrary standard of worthiness. Each occurrence of a word that we can find in a durable medium increases the probability that some end-user, sometime in the future, will find that or some other occurrence of the word and want to look it up. That's why we need to include it, not because the third citation magically made it real.
    Note that this interpretation will not open the floodgates to protologisms. The proselytizing protologist can indeed use their creation in a durable medium, but if they want other people to start using it too they're pretty much forced to define it themselves -- which automatically removes the need for anyone to look it up in Wiktionary.
  • the right not to need psychic powers to look up words successfully.
    A graceful way to handle wrong or variant spellings entered in the search box is definitely not a short-term goal. It will take lots of careful planning to do right. In the meantime, we can do a better job at making compounds and idioms easier to find. If a user reads "the dust bunnies under my bed" in a newspaper article, they shouldn't have to guess they're dealing with a compound rather than with an unfamiliar sense of bunny -- the compound should be linked to from both bunny (which it is) and dust (which it isn't yet). If they do guess it's a compound, they shouldn't have to guess whether we've spelled it dust bunny or dust-bunny or dustbunny.
  • the right to have definitions written at an appropriate level for the concept being defined.
    No need to get Longman-esque and try to define reverse transcriptase using only the vocabulary of an 8-year-old. But, honestly, if an ESL learner has forgotten or never learned what a second is (as a unit of time), chances are pretty darn good they won't do any better understanding a definition that begins "The SI unit of time, defined as the duration of 9,192,631,770 periods of radiation corresponding to the transition between two hyperfine levels of caesium-133 in a ground state at a temperature of 0 K when moving through space at zero (0) meters per second (or not moving)". (Which, I just noticed, actually contains the word it's supposed to be defining, in the same sense. Hmm.)
  • the right to be warned of any relevant social aspects of how the word is used.
    This includes any limitations on its use according to geographical region, historical period, social group, subject field, level of formality, and so on. But it also includes information about attitudes that various social groups might have about the word, including but not limited to the usual prescriptivist judgments of "correctness".
    Many people (not always the same people) will think less of you if you use kids to refer to children, mankind to refer to our species, gun to refer to an AK-47, boat to refer to the Titanic, rightsizing to refer to massive layoffs, and so on, ad nauseam. This is important information to know about those words -- way more important than their etymologies -- and it deserves to be in the article. If we can narrow down in the usage notes who the "many people" are, without killing each other off, so much the better. If we can't, we can leave it at "many people". (I promise to refrain from calling some of them "pompous self-righteous snobs" for at least as long as PaulG can refrain from calling them "authorities" or "careful speakers".)
    This stand is more conservative than Webster's Third, less conservative than the original OED, about the same as learners' dictionaries like Longman's and Cobuild. It's the best achievable compromise between descriptivism and prescriptivism. It's as acceptable as possible to both honest descriptivists (who see describing as a scientific methodology, not as a facade for a political agenda) and honest prescriptivists (who take helping learners speak appropriately for the situation as a genuine goal, not as a facade for a political agenda). It's also pretty much what we already do, so I won't belabour the point. But directly relevant to the current discussion...
    If you passionately believe end-users should never use a certain word (for whatever reason, even if you don't think of yourself as prescriptivist), the only way of influencing them is to add social information to an article. Censorship is simply not a sane response. You can't send a message to non-psychic end-users through the absence of an entry -- there are far too many other reasons why your pet peeve might be absent, they have no way of guessing that the actual reason is your concern for their well-being.

And while I'm at it, how about a shorter one...

The Word's Bill of Rights

Every word (or potential word considered for inclusion) should have:

  • the right to be judged on its own merits (i.e., whether it's actually used), not on the merits of the people who use it.
    I'll spare you all my rant on this 'til the next time I'm provoked.
  • the right to be judged on its own merits, not on the merits of the editor who created the article.
    You'd think this would go without saying. But too many exchanges on the RFV page go like...
    Respected Wiktionarian #1: "I don't think that's a word."
    Respected Wiktionarian #2: "Actually, I use it all the time and I've heard it three times this week alone."
    Respected Wiktionarian #3: "No. The guy who created the article is a known moron. Deleted."
    That shouldn't happen.

Whew! End of rant. No more for month.

— This unsigned comment was added by Keffy (talk • contribs).

Yes, that was me. (Hangs head in embarrassment at managing to type all those keystrokes and then forgetting four piddly tildes.) Keffy 04:47, 12 November 2006 (UTC)[reply]
Considering the "personal attack" nature of your rant, I shall kindly ask you to go back and (for the first time?) read what I wrote. Why do you think I'm proposing a proper way of categorizing all these things, that currently are axed left and right? --Connel MacKenzie 06:48, 8 November 2006 (UTC)[reply]
Well, I really don't want to get any deeper into this debate because it looks like it's starting to get personal. I don't know if Keffy intended to direct his comments as he did, but it's probably better to just step back from them and let what's been said rest. The majority of the above I thought was pretty insightful, whether it was a direct response to the question that Connel raised or not, and despite some parts that may have gone over the line, or maybe just needed to be unloaded off someone's chest. The humor of "curing" inclusionists I thought was a witty way to handle that, at least if it gets a chuckle out of Keffy, and if not please don't take offense. I won't be so bold as to say an apology is in order, but I do think it needs to be clear that you both value each other's contributions since I know you both do. Wherever you stand on ingenuitive, you still have to agree that there really is a lot of cruft that gets dumped here, and Connel shouldn't be blamed for trying to divert the floodgates, and certainly not for reinforcing the dam. At the same time we do need people eager to spend hours on a single word to make it as complete as they can; in fact we need many, and this is where I blame the inclusivists, not Keffy but myself in particular, for having not properly marked this very word, and therefore giving up my right to any claim of quality over quantity. DAVilla 11:56, 8 November 2006 (UTC)[reply]
Ooops. Sorry, I forgot how far out of context this is now. Yes, Keffy and I talked briefly (bantered) on IRC regarding this - and yes I very much value his opinions on this matter. In most regards, we seem to be aiming at similar goals in this conversation. If it looks like I was actually taking offense, then I've worded it too ambiguously. Likewise, I very much appreciate DAVilla's contributions, particularly when his "inclusiveness" exceeds (or momentarily offends) my own delicate sensibilities. This year has seen a dramatic increase in the number of contributors; these growing pains should be expected, as each aspect of maintenance is overwhelmed. --Connel MacKenzie 16:55, 8 November 2006 (UTC)[reply]

Wow. Keffy (if that was indeed you), that was a very nice rant. I don't know what anybody's talking about if they're seriously calling it a "personal attack" -- it wasn't; it was just a passionate but reasoned defense of a school of thought which I, at least, agree with 1000%. If Connel seems to get his back up, don't worry: as anyone who's been around here longer than about a month knows, Connel seems to get his back up all the time, but he's an okay guy, really, and just as passionate and dedicated.

There is, I think, one counter-inclusionist point that's worth addressing. (I tossed it in up above, a few days ago, but I'll repeat it here.) Some will say -- I have said -- that there's no downside to including any possible term; that people who are looking for it will be glad to find it, and that people who aren't looking for it won't find it and so won't see it and so won't even know it's there. But once we're fully inclusive of every specialized technical term, and every specialized medical term, and every specialized legal term, and all 47 declined forms of every Spanish verb, and every properly-tagged-as-deprecated schoolboy penis synonym, there are a few aspects of our user interface that can get overwhelming. When a user doesn't quite know how to spell a word, and uses http://en.wiktionary.org/wiki/Special:Allpages/pref to do a prefix-restricted match on all our entries, or uses the "fuzzy match" search feature we don't have yet, or performs some other step which performs some mechanical generation of some subset of our content, the results can be overwhelming (if not offensive). So in the fullness of time we'll want mechanisms which permit initial search results to be filtered in various ways (with, of course, user-selectable preferences to customize, and convenient "broaden search" buttons to temporarily override). —scs 18:22, 8 November 2006 (UTC)[reply]

Sheesh, real life can raise its head at the most inopportune times. Anyways...
I'm gratified to read the entirely appropriate comments by everyone above. I'd like to thank Scs for his 1000% support, DAVilla for his deft peacemaking attempts (you could even have been bolder), and especially Connel for obviously making a greater attempt than my clumsy prose deserved to distinguish between the parts that were specifically about his comments and those that were about something related to something related to something that I was reminded of by one of his comments.
To make it clear to all bystanders, I meant nothing personal against Connel. My entire argument (and I wish I'd been able to make it this succinctly earlier) was:
  • It's simply not the case that no serious dictionary would ever include words like ingenuitive et al. So there's no need to jealously guard the purity of main namespace against words like these in order to be more like serious dictionaries. So there's no need for policy changes designed to preserve namespace purity.
That said, one aspect of Connel's proposal is slowly growing on me. Many dictionary publishers have huge databases of citations, including those for words that haven't yet met their CFI. We don't. One of our editors could find one citation for a word this year, someone else could find a different citation next year, and someone else a third citation two years from now. Despite that, the word could never get created, since no single editor had a critical mass of citations -- or it might even fail RFV three times. Something like Connel's namespace idea could give us an analogue of the major dictionaries' system for remembering these citations. It could also remove the artificial one-month deadline for RFVing (since a word deserves to be judged on its own merits, not on whether the people who might have found evidence for it had enough free time that month and happened to read the right three magazine articles).
So I might be convinced of part of Connel's namespace proposal, if it's clearly understood by everyone that this would be a holding-place for words in the process of being verified, not a ghetto for permanent second-class citizens (which does seem to be how Connel often presents it), that it's not a dumping ground for words that someone morally disapproves of (and I explicitly don't mean Connel here), and that once a word accumulates enough evidence to meet the CFI it automatically gets moved to the main namespace, without waiting ten years, without further argument about how dubious the sources are.
Scs made great points about swamping the end user and the potential for false hits. (I knew I was forgetting something when typing the bill of rights.) It might be worth starting one of those policy think-tank page thingies to hash out how an ideal user interface might eventually look and work, so that there'll be something concrete to ask the developers for someday. More short-term, the customization options that already exist via Javascript/stylesheets could be better publicized, including on the Help pages. Keffy 04:47, 12 November 2006 (UTC)[reply]
Building some kind of comprehensive Citations database (even if it is a copycat move :-) ) is a wonderful idea. I wonder if it makes most sense to accumulate citations in /Citations subpages (as many of our entries already do today), or to use some new Citations: namespace, or to invent some other new mechanism. (Or perhaps the right approach, especially for dubious words, might involve a combination of these techniques, e.g. a separate namespace, as Connel has already suggested, for protologisms, "suspect" words, and other terms we aren't sure about yet, but with /Citations subpages for each page, exactly paralleling the citations subpages for real, main-namespace words.) —scs 15:56, 12 November 2006 (UTC)[reply]
Keffy, I frame it as namespaces for "cruft" because I spend too much time patrolling Special:Recentchanges. Therefore, I can say with absolute certainty that 90+% of such a new namespace would be filled with things I (and others) personally think do not merit dictionary entries. Because it is more than 90%, I frame it more as anti-vandalism than as a type of research repository. What one finds therein depends on the search that one does. So for the "minority valid" new terms, yes, the new namespace would accomplish what you say, nicely.
Disclaimer: The whole "/Citations" tab idea is Hippietrail's. I had very little to do with that. (Others had proposed it; he implemented it.) I do not think of it as a "copycat" move at all - rather, it is a requirement of any serious dictionary. The concept has certainly been implemented in different ways - probably as many different ways as there are dictionaries.
I think now that general understanding is being reached, a separate "think tank" might detract from the progress already made. Perhaps a policy proposal page would be better? (Erm, wouldn't that require the vote for new namespaces that I originally proposed?) --Connel MacKenzie 18:35, 12 November 2006 (UTC)[reply]
I don't think this solves the alleged problem of ingenuitive, but we could have a Citations: namespace that would operate quite differently from the main. First of all there would be little distinction between hyphenated/spaced or inflected forms. The easiest way to handle this would be with redirects, which means the space could only be used for English entries (or at least that the scope of such a space as conceived would only encompass the context of a single language). Corner cases will evolve additional mechanics. The talk pages would be used for failed RFV entries. There would be little to no formatting for the definitions, if provided, if we decide that they should be allowed, and I'm leaning towards not. Besides a citations section like the current /Citations subpages, there would be a section for references, including word lists and slang dictionaries, that for many terms would be the only claim they have for their existence. However I don't think we'd be allowed to quote the definitions provided without essentially and potentially incorporating the entirety of another work. Avoiding such issues, and avoiding the cruft as well, is a reason for not allowing definitions (aside from summary headings for disambiguation), which would entirely distinguish this idea from the solution that Connel had in mind. However, if definitions were provided, there would be no reason to format them or even require a part of speech, as the POS of e.g. necroposting isn't clear even with citations. They would still have to be patrolled for advertising, personal attacks, etc. which is why a dumping ground isn't something I would really want to see anywhere on Wiktionary. DAVilla 06:31, 13 November 2006 (UTC)[reply]
I dislike the idea of creating a namespace that would become confused with subpages already in place on Wiktionary. The citations in a dictionary should be closely tied to the entry (or at least the lemma for each word). The citations are even more closely tied to a particular definition, in order to support the existence of the definition in usage. It further provides demonstrable evidence of standard usage and grammar. Creating a new Citations: namespace cross-linked with every conceivable form and spelling of each entry will dissociate the citations from their support and use function. Whatever we're talking about creating, it should not be called Citations. --EncycloPetey 00:06, 14 November 2006 (UTC)[reply]
I don't think anyone can blame me for expressing my frustration at this point. The notion of creating a "Citations:" namespace is silly. The subpage solution is in place and does work for that scenario already, with or without the new namespaces. I think that may have been mentioned only to confuse the issue at hand: where to file away BS entries. Apparently, the deletion log is the only acceptable place. --Connel MacKenzie 20:11, 16 November 2006 (UTC)[reply]
Please show me how the [current Citations:] subpages handle neologisms. The namespace would first of all provide a location for words that don't exist. It doesn't make sense to have a subpage of a non-existent article. Secondly I proposed further that including e.g. references to slang dictionaries would allow for many of the more useful neologisms for which even a single proper citation could not be found. It's the mention not use problem, and I have seen a few surprising terms described in very upright and proper linguistic papers. Although all you have to add is a definition section and it becomes exactly what you've proposed, I do have to agree that in the end the deletion log is usually preferable. DAVilla 17:39, 17 November 2006 (UTC) [edited][reply]
I'm sorry, what subpages? Someone creates a junk entry, it gets moved to Neologisms:. Citations are added as a subpage in that namespace. Did I misunderstand your question, or are you misunderstanding what I'm getting at? --Connel MacKenzie 08:44, 18 November 2006 (UTC)[reply]
  • In summary: Andrew massyn, TheDaveRoss and I support the possibility of using Hippietrail's "multi-level Wiktionary" concept. While DAVilla, Keffy, Scs, Enginear, EncycloPetey, Moglex, Jeffqyzt and Dfeuer wish to maintain the current system of sending failures to Special:Log/delete while at the same time filling the main namespace with entries like ingenuitive. I suppose if it came to a vote, I'd be even more disgusted. So, I'm glad we discussed it first. Thank you all for your time. --Connel MacKenzie 06:14, 23 December 2006 (UTC)[reply]
  • I must have written even less clearly than usual -- I strongly support the "multi-level Wikt", and also a small main name space (if that is considered different), and also increased use of tags/glosses and also the ability to hide etymologies/quotations/translations. I consider the present platform OK at present, but beginning to be strained, and all of the above would be improvements which will become more important as we grow. --Enginear 18:49, 24 December 2006 (UTC)[reply]

Lemmata for verbs in Classical Languages

This discussion is opened here, but should be continued on the Wiktionary page About Latin.

I've been dreading opening this can of worms. On the English Wiktionary, we currently use the infinitive form of Latin verbs as the lemma (main entry). Presumably this decision was made because that is how most dictionaries for most European languages select their headwords for verbs, and in particular the dictionaries for Romance languages do this (esp. Spanish, French, and Italian). However, Latin (and Greek) dictionaries do not use the infinitive as the headword; they instead use the first person singular present active indicative (first principal part) as the headword. Part of the reason for this is that it is the first form of any verb that is memorized by students learning conjugation tables. It is also due in part to the fact that the infinitive in Latin often does not behave like a verb at all, but as a verbal noun.

So, should we switch to using the first principal part as the lemma form of Latin verbs (and possibly Ancient Greek)? There's an added reason to do this, and it is the reason I've finally broached the topic. I've just checked Victiōnārium and discovered that they're using the first principal part as the lemma form for their verbs. That means that since we currently use the infinitive as the lemma, we can't cross-link lemma forms between the two Wiktionary projects. Most of their entries don't even have the infinitive form entered yet. Even when they are entered, it will mean that lemmata between the two projects will each be connected to non-lemmata on the other edition.

Since the Latin Wiktionary and all major dictionaries and textbooks use the first principal parts of Latin verbs as the lemma form, I propose that we switch to doing the same. --EncycloPetey 01:30, 10 November 2006 (UTC)[reply]

Chinese POS Template's rs option

I understand the need for this option in han characters (zi), but I do not think this is proper with han words (ci).

The ci becomes a word needing attention; I do not think a ci entry needs that.
-- Hiòng-êng 03:00, 11 November 2006 (UTC)[reply]

The rs value is needed for both. This is to ensure that the word (regardless of whether it is comprised of one or more characters) is properly sorted in all of the relevant categories (by radical sort order when required). There are many things that may not make sense if you think of Wiktionary as a small project with only a few thousand words. Sooner or later, we will end up with categories that contain 1000's of words (both zi and ci). You'll be glad that we insisted on this kind of discipline early on. As far as the word needing attention category is concerned, this merely earmarks the entry for someone (me ;) to come along later and fill in any missing or improperly formatted information. It is a good thing, and nothing to be overly concerned with.

A-cai 03:34, 11 November 2006 (UTC)[reply]

Okay. What would be used to process this rs? Would a bot look for zi05 or something and create an Index page or something? Or would one just use the search engine to find the index?
-- Hiòng-êng 03:22, 17 November 2006 (UTC)[reply]

Also, to clarify things, in the ci, would the radical be used in both zi or on the first zi only?
-- Hiòng-êng 03:22, 17 November 2006 (UTC)[reply]

I'm not sure I understand your first question. The answer to your second question is that the rs is for the head character (the first zi). ex. The rs for 字典 would be rs=子03.

A-cai 12:17, 17 November 2006 (UTC)[reply]
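A minimal sketch of how an rs value could drive the sort order A-cai describes, offered only as an illustration and not as the template's actual mechanism. The function names are hypothetical, and ordering radicals by code point is a simplifying assumption rather than true radical order.

```python
import re

def parse_rs(rs):
    """Split an rs value such as '子03' into (radical, additional strokes).

    Hypothetical helper: the real POS templates may treat rs differently.
    """
    match = re.fullmatch(r"(\D+)(\d+)", rs)
    if match is None:
        raise ValueError(f"unrecognised rs value: {rs!r}")
    return match.group(1), int(match.group(2))

def sort_by_rs(entries):
    """Order (word, rs) pairs by the rs value of the head character.

    Assumption: radicals compare by code point, which only approximates
    a proper radical ordering.
    """
    return sorted(entries, key=lambda pair: parse_rs(pair[1]))

# The rs for a ci comes from its first zi, e.g. ("字典", "子03").
```

Parsing the stroke count as a number means an unpadded value such as 子3 would sort the same way as 子03.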

Some 'bots that might be useful.

Having split my time here between the interesting stuff (writing definitions) and the boring grunt work of adding verb forms and older plurals that no 'bot has caught, I thought of a few tasks that could usefully be performed by a 'bot.

A couple of these are:

  • Adding entries for parts of verbs that are not already present when one or more of the parts is.
Parts being: infinitive, third-person singular (TPS), present participle and simple past/past participle.
  • Adding entries for adverbs where the corresponding adjective is already in place.
  • Adding entries for adjectives where the corresponding adverb is already in place.
  • Adding various other entries where one part is already present e.g. if condemn is present, condemnation should also be.

The robot, as I have designed it, works from a command file prepared from a database dump, and I would review all the additions (which would include a check with the OED to make sure the words actually exist - failures requiring further research). There is no reason why the command file could not be posted on a special page for review by anyone who's interested.

I would also like to confirm (as the Wiktionary guidelines seem slightly different to the Wikipedia ones), that it is permissible to run a 'bot under the 'bot account name for 10-100 edits before formally submitting the 'bot to a vote? Moglex 16:36, 11 November 2006 (UTC)[reply]

The Bot policy is still in limbo. It is customary to create separate bots for each of those tasks (please don't get me started about how silly that is.) For now, there seems to be some agreement that multiple bot tasks can sometimes be combined under a single account. For yours, since the tasks are inherently similar (sharing the majority of their code) I'd hope this community agrees that you could use one single account.
It's of no consequence. The way the code is written it would configure itself on startup and run under whichever account is appropriate for the task selected by the operator. It would appear here as several 'bots but in reality they would be multiple instances of a single program. Moglex 18:30, 11 November 2006 (UTC)[reply]
Limited tests (2-10 edits) can be run from your account, yes. Generally, it is better when discussed/submitted for vote first. Otherwise, you risk alienating much of the community here. --Connel MacKenzie 16:50, 11 November 2006 (UTC)[reply]
That is what I thought and what I'd read on the relevant 'pedia page, but it is flatly contradicted by the Wiktionary 'bot page WT:BOT which states (process) that you should do a test of 10-100 edits under the bot account before asking for a vote. I don't mind either way. I'd just like to "do the right thing". Moglex 18:30, 11 November 2006 (UTC)[reply]
Good catch - I'll try and correct that wording (if no one beats me to it) soon. --Connel MacKenzie 17:47, 12 November 2006 (UTC)[reply]
The problem with bot-generated entries is that there is a danger they will be oversimplistic or even wrong. For example, although many adverbs can be defined as "in an adjective manner", many cannot - see vanishingly as an example. Each generated definition would still need to be patrolled to see if it made sense. SemperBlotto 16:56, 11 November 2006 (UTC)[reply]
Yes, I'd noticed that :-). That's why there is human moderation. And indeed, the moderation could, as I stated above, be extended to all interested parties.

By the way, how did you get OED's official permission to use their resources to build your bot? I thought they had copyright on all their texts. --Connel MacKenzie 17:09, 11 November 2006 (UTC)[reply]

I naively assumed that when I bought the CD I had also purchased a licence that allowed me to look things up on it. Note that I said "I would review all the additions (which would include a check with the OED to make sure the words actually exist)". Can't see that they would have anything to complain about. Moglex 18:30, 11 November 2006 (UTC)[reply]
That doesn't mean you can use their list for verification of a GFDL resource...that would be releasing their list under the GFDL, in essence, right? Perhaps if WMF lawyers say it is ok, I'd remove my objection. But it certainly seems shaky, at best. You're not talking about generating a list and using OED as the final verification, but rather using their list as the primary automated filter. For that, I'm pretty sure they would object. --Connel MacKenzie 17:47, 12 November 2006 (UTC)[reply]
It sounded to me like he was talking about "generating a list and using the OED for verification"! And I tend to agree with Moglex, I can't see what OED would complain about, if that's all he does. —scs 19:57, 12 November 2006 (UTC)[reply]
You don't see the copyright exposure of copying/matching their list? --Connel MacKenzie 20:26, 12 November 2006 (UTC)[reply]
Copying is different from matching, and it's very different from the kind of matching he's talking about. (But the fact that matching is innocuous is one of the reasons that copyrighting a mere list at all is dubious.) —scs 20:32, 12 November 2006 (UTC)[reply]
By "dubious" you mean to suggest that OED's copyright is not valid? Why should we needlessly expose Wiktionary to costly copyright lawsuits? I'm sure they'd love to sue WMF to oblivion; they are already stymied by short, individual dictionary definitions not being covered. And when Wiktionary is usable, it will represent a very real threat to their existence. The only compelling reason I've heard yet, for linking (or even mentioning) other dictionaries is to show that we don't violate copyright. Opening the door to an army of IP-lawyers for this one bot is monstrously short-sighted. --Connel MacKenzie 20:42, 12 November 2006 (UTC)[reply]
Why, oh why, must you take everything (even an offhand remark in a parenthetical) and carry it off to some hyperbolic extreme?
I did not say, nor even imply, that the copyright on the OED is not valid. That would be a preposterous statement, and it is preposterous of you to suggest that I might believe it. (If I weren't such an equanimous soul, I'd be offended.)
What I did say is that the copyrightability of a mere list of words (which of course the OED is not) is dubious. And if you think about it for a moment, that's an unassailable statement.
But even if you disagree, even if you can prove (in the face of all the people who doubt it) that there's no doubt that a mere list of words is copyrightable, you still have no complaint with me, because I did not suggest using the OED (or any potentially copyrighted list) as a positive source for new entries here! —scs 22:19, 12 November 2006 (UTC)[reply]
If you were to run a stub-bot on Wikipedia, that created a stub for every Britannica headword, would Britannica consider that a copyright violation? Please be reasonable. --Connel MacKenzie 20:29, 12 November 2006 (UTC)[reply]
Is that what Moglex is talking about doing? That's not the way I read his posts. (But no, of course that kind of stub-bot would not be reasonable.) —scs 20:34, 12 November 2006 (UTC)[reply]

Just to verify the process that I'm using. From a dump of the WD database I create a matrix with columns for infinitive, TPS, PPl and Past. The rows are obviously the different verbs. If an entry has only the infinitive, or is missing the infinitive I look it up in the OED. If both the participles or the infinitive and one participle exist there, that row is deemed probably correct and goes into the command file unmarked. Otherwise it goes into the command file with a mark to indicate it needs to be looked at more carefully. I then review the command file and if the 'bot is approved, will place any putative command file on the 'bots page for (say) a week for anyone else who cares to review it to do so (with an announcement of that fact in, I suppose, the tea room).
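The triage described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Moglex's actual bot: the `VerbRow` type, the `in_dump` field, the `|`-separated command-file line and the `? ` mark are all hypothetical, "the participles" is read as the PPl and Past columns, and `reference` stands for whatever word list is used for the first-pass check.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class VerbRow:
    """One row of the matrix: candidate spellings for each verb form."""
    infinitive: str
    tps: str    # third-person singular
    ppl: str    # present participle
    past: str   # past tense / past participle
    in_dump: FrozenSet[str]  # names of the columns already present in the dump


def needs_lookup(row: VerbRow) -> bool:
    """True if the dump has only the infinitive, or lacks the infinitive."""
    only_infinitive = row.in_dump == frozenset({"infinitive"})
    return only_infinitive or "infinitive" not in row.in_dump


def probably_correct(row: VerbRow, reference: Set[str]) -> bool:
    """Both participles found, or the infinitive plus one participle found."""
    inf_ok = row.infinitive in reference
    ppl_ok = row.ppl in reference
    past_ok = row.past in reference
    return (ppl_ok and past_ok) or (inf_ok and (ppl_ok or past_ok))


def command_line(row: VerbRow, reference: Set[str]) -> str:
    """Build a (hypothetical) command-file line, marked with '? ' if doubtful."""
    fields = "|".join([row.infinitive, row.tps, row.ppl, row.past])
    if needs_lookup(row) and not probably_correct(row, reference):
        return "? " + fields  # flag for careful manual review
    return fields
```

Marked lines would then be the ones a human reviewer looks at before anything is posted for wider review.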

There is no way that any 'bot I propose will use any other non-Wiki work as anything other than a first pass check. I'm not sure about the copyright implications - it doesn't seem logical that the actual word list could be copyright since, presumably, most literate people learned at least a few words from noticing interesting diversions whilst looking something up in a dictionary. If you asked a group of people each to learn 20 new words that weren't in the Wiki, how could you complain if they went to another dictionary to find them? If you then asked them to add any interesting words they knew to WD, you'd be bound to end up with a lot from other dictionaries.

There does, however, seem to be little need to worry about such shenanigans for the foreseeable future as there will be plenty of people adding to WD from their own store of knowledge.

And, if I did want a whole bunch of words that weren't included, I'd download project Gutenberg, sort that lot and check that against the OED. New words would be missing but I'd bet you'd get most of the Obs, Rare. and Arch. Moglex 20:55, 12 November 2006 (UTC)[reply]

Thank you for explaining your approach. I would not like to see you run that bot here; it clearly would be violating the OED's copyright.
Which 'bot? The POS bot or the 'Gutenberg' bot? Or both? Are you a judge? They are usually the only people who get to state that x is 'clearly' infringing y's copyright. Moglex 22:23, 12 November 2006 (UTC)[reply]
I do refresh my analysis of Project Gutenberg regularly, updated after each XML dump of en.wiktionary.org. (Yes, I filter out non-English texts.)
??? confused ??? What for? If it's to add new words, why do you need to do it more than once? Moglex 22:23, 12 November 2006 (UTC)[reply]
I'm not uploading useless stubs, I'm uploading lists of wanted entries. --Connel MacKenzie 22:35, 12 November 2006 (UTC)[reply]
No, you've still lost me here. We already know what words are (wanted)/(linked but not present). Where exactly does an analysis of Gutenberg come in? Or are you saying that you use a list of the words found in Gutenberg (including, presumably, mis-scans and spoellink errors) to check the validity of wanted/linked words in the Wiktionary project?
I think what he's saying is that each iteration of User:Connel_MacKenzie/Gutenberg starts out as a list of redlinks. When he regenerates it, it's not to pick up any new words, but simply to get rid of the redlinks that have turned blue since the last iteration. —scs 02:07, 13 November 2006 (UTC)[reply]
If you were to run my list against the OED, that very much would constitute a direct copyright violation. (What you or I think about the copyright laws is irrelevant.) --Connel MacKenzie 21:08, 12 November 2006 (UTC)[reply]
Those two sentences are contradictory. You have given your opinion as to what would constitute a copyright infringement and then stated that your opinion is irrelevant. More on why I completely, 100% disagree with that opinion later. --Moglex 22:07, 12 November 2006 (UTC)[reply]
You are right that those two sentences are contradictory - I should have worded it much more harshly (but I was trying to be nice.) There is no reason that I can see for Wiktionary to be exposed, just because you personally do not respect copyright law. --Connel MacKenzie 22:35, 12 November 2006 (UTC)[reply]
Please tone it down with the accusations. There is no evidence that Moglex does not respect copyright law. He (and I) are merely questioning your interpretation of one small facet of copyright law. —scs 22:41, 12 November 2006 (UTC)[reply]
I do not fail to respect copyright law (or, more accurately, I do not fail to respect copyright), it's just that your view of the restrictions imposed by copyright law is about 250% more draconian than mine.
Connel, is that Original Research, or can you cite a source to support this "clearly would be violating" claim? If I have a list of words A which I and my friends came up with somehow, and a copyrighted list of words B which I bought from OUP, it is no copyright violation for me to discover that subset of list A which is not in list B (perhaps using the equivalent of the Unix command "comm -23 A B"). Mechanically discovering and then doing something with the entirety of the other subset, namely those words in list B but not in A, yes, that operation would be suspect. —scs 21:45, 12 November 2006 (UTC)[reply]
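For readers unfamiliar with `comm`, the two operations being contrasted here are the two one-sided differences between a pair of word lists. A minimal sketch in Python, assuming each list is simply a collection of headwords (the function names and the sample words are made up for illustration):

```python
from typing import Iterable, List


def only_in_a(a: Iterable[str], b: Iterable[str]) -> List[str]:
    """Rough equivalent of `comm -23 A B`: words in list A that are not in list B."""
    return sorted(set(a) - set(b))


def only_in_b(a: Iterable[str], b: Iterable[str]) -> List[str]:
    """Rough equivalent of `comm -13 A B`: words in list B that are not in list A."""
    return sorted(set(b) - set(a))


# Hypothetical sample data: A is our own list, B is the purchased list.
list_a = ["flummox", "grok", "zyzzyva"]
list_b = ["flummox", "zebra"]

ours_not_theirs = only_in_a(list_a, list_b)   # the case described as no violation
theirs_not_ours = only_in_b(list_a, list_b)   # the case described as suspect
```

Unlike `comm`, which expects both inputs to be sorted, the set-based version does not care about input order; it sorts only the result. The first call reveals nothing from list B beyond the absence of some words; the second call outputs words taken from list B itself, which is where the concern lies.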
How would any court distinguish one from the other? BOTH use the reference set as the reference set. Do you have any case law you can refer me to, showing that? Let's keep things in perspective here: it is an uphill climb from here, to show that Moglex's proposal might not be a copyright violation, not the other way around! --Connel MacKenzie 22:35, 12 November 2006 (UTC)[reply]
How would any court become involved in the matter? What would the allegation of infringement be?
  1. "Defendant decided, after consulting plaintiff's electronic dictionary, to compose his own definition of a word he knew of, even though the word was not in plaintiff's dictionary."
  2. "Defendant decided, after consulting plaintiff's electronic dictionary, to compose his own definition of a word he knew of, after confirming that the word was in plaintiff's dictionary."
  3. "Defendant's dictionary does not contain a word which is also not in plaintiff's dictionary."
I submit that all of these allegations are prima facie absurd.
The only way copyright law might enter in here is if Moglex first had to type the OED's list of words into his computer so that he could perform the double-checking lookups. That would run afoul of the standard "no part of this publication may be stored in an information retrieval system" clause. But since they sold it to him on a CD, it's pretty clear it's intended to be stored and used via an information retrieval system, so that point is moot.
I thought I had posted this earlier, but perhaps it didn't stick (or I posted it in the wrong place). I doubt that Moglex was sold anything except a piece of plastic and a licence to make use of the data on it. One fundamental point is whether the licence allows what he proposes. I suspect it prohibits wholesale copying of lists of words complete with definitions. I doubt that it prohibits batch checking of whether entries exist, which is arguably what Moglex intends, but it might. --Enginear 19:05, 13 November 2006 (UTC)[reply]
My reading of the licence is that there is no such prohibition, but I would be most grateful if others could check to make sure I haven't missed anything. If there is a problem I would go with the suggestion made by scs and use a list from an OOC work. The difference in efficacy would be negligible. Moglex 20:40, 13 November 2006 (UTC)[reply]
Now, I will grant that we're close to the edge here. Doing something mechanical with an electronic copy of a copyrighted work should cause alarm bells to go off. The fact that (to refer to an earlier example of mine) the invocation "comm -23 A B" is fine while the invocation "comm -13 A B" is suspicious, is suspicious. However, I maintain that what Moglex is talking about doing is comfortably on the right, fair side of the line.
You said, "How would any court distinguish one from the other? BOTH use the reference set as the reference set." But the court doesn't have to distinguish that, because the court doesn't care how I use anything. Copyright law doesn't say how I may or may not use a copyrighted work. What it does say is that I may not make unauthorized copies, or republish or retransmit, the copyrighted work. But nobody is talking about making any copies of, let alone republishing, the OED's list of words.
You can say that it's perilous to be contemplating any of this if we're this close to the line. But I put it to you, Connel: how many dictionaries do you own? Do you have any commercial, electronic dictionaries on your computer? Burn them all now! A court might hold that your use of them is not fair use, and there's no way the project can afford that kind of legal exposure. No, wait: even if you burn them, there will still be information you got from them rattling around in your head. I guess you're forever tainted, then, and won't be able to work on the project ever again.
(Is this an absurd argument? Yes and no. If you believe that what Moglex is talking about doing is so perilous, if you believe he's putting the Wikimedia Foundation at grave risk of a ruinously expensive copyright lawsuit, then I have to say that you and I and everyone else here who's ever consulted a commercial dictionary also is.)
scs 23:08, 12 November 2006 (UTC)[reply]
Or, to put it much more simply: Suppose that Moglex does what he's proposing to do. State for us please, as clearly and concisely as possible, the hypothetical complaint which the OED might bring against Wiktionary for copyright infringement. What would they claim that Wiktionary had done, and which excerpts from Wiktionary might they present as evidence of infringement? —scs 23:15, 12 November 2006 (UTC)[reply]

---

Well, in the first place, in civil law the standard is not, as in criminal law, "innocent until proven guilty"; it is "on the balance of probabilities".
On the balance of probabilities, provided you can show that there is positive evidence that the primary list was extant before any access to the list that is considered to be violated by breach of copyright, then I suspect (but suspect very strongly) that any court would deny breach of copyright.
There is a massive amount of evidence that Wiktionary is the independent work of a large number of people (and that evidence would be given more weight by Connel's determined and entirely laudable stance in protecting other dictionary producers' copyright).
I continue to be quite unconvinced that using an extant dictionary to verify the existence of words that it can be clearly and unambiguously demonstrated were originated elsewhere could, under the wildest dreams of the most avaricious lawyer, be deemed a breach of copyright.
The process of RFV's (and, indeed, the avoidance of RFV's by quoting one or more dictionary references), relies very heavily on the use of two dictionaries: OED2 and MW. Why the fact that a robot may be involved makes a scrap of difference, I cannot imagine. There are one hell of a lot of words in the Wiktionary that are there because they can be verified as extant by a reference to the OED, and a fair number whose definitions, I would warrant, are little more than the OED definitions with the word order rearranged. Moglex 23:20, 12 November 2006 (UTC)[reply]

You buy a dictionary for three reasons (perhaps four if you want to impress your guests by having all thirteen volumes of the OED on your coffee table).

  1. To find the meanings of words.
  2. To find the correct spellings of words.
  3. To ensure that a word you've encountered but are unsure of actually exists.

How on earth could the OED mount a case against anyone for using a dictionary for one of the very purposes for which it was bought? Moglex 23:40, 12 November 2006 (UTC)[reply]

Summary for the war-weary, and some oil for troubled waters

The discussion above between Moglex, Connel, and me has gotten rather heated and rather tangled. Here is a summary (from my perspective) of what the issues are and are not.

No one is proposing to copy any entries from the OED. That would be wrong.

No one is proposing to mechanically add entries to Wiktionary's corpus based on words which are in the OED but are not yet in Wiktionary. That, too, would be wrong.

No one is even proposing to take a list of candidate words not yet in Wiktionary, and to compare it to OED's list of words in order to weed out (or scrutinize more carefully) those candidates which might be garbage. That's the case we've been debating so heatedly, which Connel has been criticizing and Moglex and I have been defending. The arguments and counterarguments we've been slinging may or may not be accurate, but it hardly matters, because even this (relatively benign) case is not what Moglex is proposing to do.

Here's what (as I understand it) Moglex is proposing to do. If we've got an entry flummox but (say) no participle flummoxing, Moglex has a bot that is prepared to mechanically generate and add pages for the missing verb forms. (This is obviously a mindless and mechanical process.) What he's proposing to do is to double-check his bot's intermediate output against the OED. If the bot comes up with (or if Wiktionary already has) the same spelling of the same verb form, all is fine. But if there's a discrepancy, it gets kicked out for someone to look at by hand. (Not to copy anything from the OED -- rather, for someone to look at by hand.)

In other words, Moglex is not even using the OED to double-check the addition of new verb stems which Wiktionary doesn't yet contain. All he's doing is double-checking the mechanical generation of declined verb forms for stems which Wiktionary already contains. So if there's a copyright violation here (which I really don't think there is), it would have to involve the OED claiming to hold a copyright on the past tenses and participles of English verbs, or on a collection of past tenses and participles.

But now for the concession/capitulation/compromise. Even though I hold that there's nothing wrong with what Moglex is proposing, if he (and I) get tired of spending all this time and energy arguing with Connel about it, there are two or three copyright-unencumbered (or at least far less copyright-encumbered) word lists I can think of which Moglex might choose to use instead. These might be a little older and a little less comprehensive than the OED, meaning that using them could lead to more entries being kicked out for manual attention. But on the other hand, these lists are already in a more generic, low-level format, meaning that using them might even be easier. The lists I have in mind are (1) the 1934 Webster's Second word list, (2) Grady Ward's "moby" lexicon, and (3) the Princeton Wordnet. All of these are readily available on the net. (Moglex, I can provide pointers if you're interested but can't find them.)

scs 03:07, 13 November 2006 (UTC)[reply]

Yes, I was thinking that, in the worst case, that would be a potential solution. Then the words that were set for greater manual scrutiny would be examined by various editors, some of whom, no doubt, would look it up in the OED and MW.
I really think that there needs to be a greater involvement from others, and I'm rather disappointed that no one else has thrown their hat into the ring so far.
It seems to me that an extremely important principle is at stake here: Is it permissible to look up a word in a copyright work, of any type, to verify its existence?
If the answer to that question is 'no', then I suspect that a very large number of words in the wiktionary constitute a breach of copyright. If the answer is 'yes' then there is no problem.
It's fairly obvious what the views of the three protagonists here are. We really do need to hear from other people.
Just as a reminder, there has been a lot posted in a short span of time, most of it over a weekend (Moglex's first post about the 'bot proposal was only 11 November; at this writing, it is 13 November.) If you want more community involvement, you might want to slow down a tad. Many people can only contribute a small amount of time at a given stretch, and reading through such a large discussion (which this has become) can take a significant portion of that allocation. Remember that this is still a much smaller community than Wikipedia. Now, as to the question being asked: "is what I plan to do going to expose Wiki to legal action," you must get a qualified legal opinion if there is a question. That said, here is my take. This proposal sounds similar to what has been done for the Wiktionary:Requested_articles:English/DictList. That page lists words, and links to external dictionaries. --Jeffqyzt 17:59, 13 November 2006 (UTC)[reply]
Jeffqyzt, I'd like to note that the request for deletion discussions of those copyvio lists have gone conspicuously missing. I'm not sure if I'm simply looking in the wrong place (as a result of those pages moving around a lot) or what.
The point that has been repeatedly munged, above, is that what I've seen proposed, is republishing of the OED's list (in a highly modified form.) I do not see how anyone can assert that doing such a thing is not a violation.
To those that would suggest that "I love OED" or am acting in "OED's defense" I must carefully assert the opposite. My comments here are to protect the significant investment of time, money and effort that I have made to en.wiktionary.org. I do not wish to see it undermined by a careless attempt to be helpful. The guideline has always been, that you cannot use other dictionaries' lists. (Note too, the suggestion of using Webster's 1934 is also a copyright violation. That's why we've always stuck to Webster's 1913! Likewise, Wordnet with its conflicting license, cannot be imported.) --Connel MacKenzie 21:28, 13 November 2006 (UTC)[reply]

Well, I came in right on the tail end of this one! Frankly I'm not sure I understand the purpose of checking anything against the OED in the first place - it does seem that there are plenty of non-controversial public domain resources against which something could be checked, if all we're checking is whether such a word exists. Anyway, as to the principal problem, I have put on my IP lawyer hat and decided that even I don't know the answer to that one. If I wrote a dictionary and put in everything that I thought was a word, and then I had some questions about whether a few of them actually were words, I'm quite confident that I could look those few up in any dictionary without fear of running afoul of any copyright law, as there's no actual copying involved. Infringement requires actual copying. Now, could I program a bot to look up the words for me? If the bot does not need to copy the original list to do so, then I don't see why not. Still no actual copying involved. bd2412 T 23:43, 13 November 2006 (UTC)[reply]

I am taking the advice of scs and withdrawing my proposal to use the OED to check for the existence of a word. In the first place there aren't that many words that require that check, and in the second, as has been pointed out, there are non-copyright lists available.

Thinking about a possible copyright case has brought some amusement:

  • Barrister: "Your honour, in this case I appear for the Oxford University Press. We are bringing a breach of copyright action against a Mr Moglex and the Wikipedia Foundation on the basis that there are non-existent words that do not appear in our dictionary and do not appear in the defendant's dictionary either. We maintain this is a clear breach of our copyright."
  • Pause
  • Judge: "I ... erm ... see, Mr Kafka. And what would these disputed words be?"
  • Barrister: "We don't know, they never published them either"

A revised proposal will follow.

Pragmatism rulz! Moglex 08:24, 14 November 2006 (UTC)[reply]

I believe that list is getting far too long and therefore suggest that each letter-sublist gets divided into their own subpages. You agree, don't you? Can somebody reorganize them into subpages? (Or can I do it myself if no one else has the time?) --Takanatsu the Frippant 21:54, 12 November 2006 (UTC)[reply]

Can it be that cardinal numeral is a normal term in GB? EncycloPetey is in the process of changing our many cardinal number pages and categories over to cardinal numeral. In the U.S., at least, cardinal numeral is very unusual, and cardinal number is what we use. I see that Wikipedia also has pages for w:cardinal number, but not w:cardinal numeral. To me, cardinal numeral sounds illiterate. Likewise ordinal numeral. Just now googling, I find about 12,000 hits for cardinal numerals, but almost 300,000 for cardinal numbers. —Stephen 17:29, 15 November 2006 (UTC)[reply]

I assume that is because there are more sites discussing the numbers behind the numeral designations than sites discussing the grammar of numerals in language. --EncycloPetey 20:27, 16 November 2006 (UTC)[reply]
I have never seen cardinal numeral except in discussion on this site. Nor have I ever heard of the numeral of the beast. However, I see that in OED, while numeral (noun) definition 1a is A word denoting or expressing a number, and definition 1a in OED for number (noun) is The precise sum or aggregate of a collection of individual things or persons; the quantity or amount, definition 2a includes the words Something which graphically or symbolically represents a numerical quantity, as a word, figure, or group of these; a numeral.
The full text of OED 2+ online edition includes 56 instances of cardinal number and 21 instances of cardinal numeral. Also, it defines cardinal number (at cardinal (adjective) 3., unchanged from OED 2), but not cardinal numeral. In the language of 1989, the full text of OED 2 online edition includes 46 instances of cardinal number and 13 instances of cardinal numeral. Again, it defines cardinal number, but not cardinal numeral. So in 1989, 78% of mentions were for number while by 2006, this has dropped to 73%.
So it seems that both descriptions are acceptable in the UK, and I see that the relevant aspects of our own definitions of number and numeral are very similar to the OED's, so either usage is consistent with our current definitions. Cardinal number has the advantage of familiarity, but might possibly be ambiguous, while cardinal numeral can only mean one thing. It may be that the OED is slowly deprecating cardinal number, but at their present rate it would take them about 250 years to complete the task. It seems EncycloPetey is moving more quickly. I have no strong feelings on the matter. --Enginear 23:16, 15 November 2006 (UTC)[reply]
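The percentages and the "about 250 years" estimate above can be checked with a few lines of arithmetic. This is a tongue-in-cheek linear extrapolation in the same spirit as the original remark, assuming the counts quoted (46 and 13 for 1989, 56 and 21 for 2006) and a constant rate of decline; the variable names are made up for illustration.

```python
def share(number_hits: int, numeral_hits: int) -> float:
    """Percentage of mentions that use 'cardinal number'."""
    return 100 * number_hits / (number_hits + numeral_hits)


share_1989 = share(46, 13)   # OED 2 (1989) full-text counts quoted above
share_2006 = share(56, 21)   # OED 2+ online (2006) full-text counts quoted above

years_elapsed = 2006 - 1989
points_lost_per_year = (share_1989 - share_2006) / years_elapsed

# Years until the share of 'cardinal number' would reach zero at that rate.
years_to_zero = share_2006 / points_lost_per_year

# The same estimate using the rounded figures of 78% and 73%.
years_to_zero_rounded = 73 / ((78 - 73) / years_elapsed)
```

With the unrounded shares the extrapolation comes out a little under 240 years; with the rounded 78% and 73% it comes out a little under 250, which matches the figure given in the comment.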
What is the other meaning of cardinal number? I have never encountered a case of ambiguity with it. On the other hand, cardinal numeral means something completely different to me. Cardinal means basic, principal, fundamental ... and cardinal numeral means the ten fundamental numerals: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. The Greek cardinal numerals are: Α, Β, Γ, Δ, Ε, Σ, Ζ, Η, Θ, Ι; and so on. The cardinal number 1012 is composed of the cardinal numerals 0, 1, and 2. In any case, I don’t believe American usage is changing in that direction. —Stephen 23:53, 15 November 2006 (UTC)[reply]
No, 0, 1, ... 9 are digits. So 1012 is not composed of cardinal numerals, it is a cardinal numeral used to represent a cardinal number.
Yes, they are digits. They are also numerals. They are also numbers. They are cardinal numbers. 1012 is composed of the basic numerals, which is exactly what "cardinal numeral" means. Cardinal means basic, fundamental. 1012 is a number, composed of four numerals. It is a cardinal number composed of four cardinal numerals. —Stephen 19:30, 17 November 2006 (UTC)[reply]
There is a cardinal numeral designated by the numeral 1012, written with four digits. You cannot see cardinal numbers because they are abstract quantities. The numerals are the encoding that represents them. --EncycloPetey 20:38, 17 November 2006 (UTC)[reply]
There is no other meaning of cardinal number I'm aware of. The issue is that number in linguistics refers to a feature of the grammar of nouns, verbs, and adjectives (and sometimes to other POS as well). The ambiguity is in the linguistics of headers we use, not in the definitions section. The problem is that English is one of the few European languages that has not traditionally recognized "numeral" as a part of speech, and so its usage as such is not as widespread outside linguistics circles. You will find it in grammars written in English for other languages. --EncycloPetey 22:34, 16 November 2006 (UTC)[reply]
I have studied and worked as a linguist my whole life. One of the senses of number in linguistics refers to a feature of nouns, verbs, adjectives, etc. Other senses of number mean other things. Cardinal number is the term for the fundamental numbers (not grammatical number), and cardinal numeral is illiterate. I have advanced degrees in several languages including Spanish, German, and Russian. There is NO ambiguity with cardinal number; cardinal numeral is illiterate. It is precisely the same situation in Spanish: Spanish for cardinal number is número cardinal ... and Spanish for grammatical number is número. German for cardinal number is Grundzahl. German for even number is gerade Zahl (even numeral is as illiterate as cardinal numeral). German for numeral is Zahlzeichen or Ziffer. Russian for cardinal number is количественное числительное; Russian for numeral is имя числительное. There is no problem whatsoever with cardinal number, which is perfect, idiomatic English and the correct term in linguistics. I have read hundreds of grammars and I have never encountered cardinal numeral in my life. Apparently it is a Britishism. —Stephen 19:30, 17 November 2006 (UTC)[reply]
Since we're waving our trainings around...I am a professional mathematician. The grammars I cite below tend to disagree with you. Perhaps you are not as widely read as you believe. Feel free to look up the texts I have quoted below, many of which use the term. Many of the authors are American (as am I), so this is not a Britishism. Consider the quotes below from the various American authors and publishers. --EncycloPetey 20:38, 17 November 2006 (UTC)[reply]
To clarify, I'm not aware that British usage is changing either, and nor could I think of any ambiguity. But maybe someone knows otherwise. What does M-W say? --Enginear 01:23, 16 November 2006 (UTC)[reply]
M-W lists cardinal number, has nothing on cardinal numeral. American Heritage also lists "cardinal number: NOUN: A number, such as 3 or 11 or 412, used in counting to indicate quantity but not order." The American Heritage has nothing on cardinal numeral. —Stephen 02:08, 16 November 2006 (UTC)[reply]
The issue is one of definition of the grammatical category of the label versus definition of the members of that category. There is a number called "5" or "five". The number in question is a cardinal number, but the symbols and words used to designate that number are numerals. So the number of stars here: ***** is a cardinal number, and that number is designated by the numeral "five" or "5". In like fashion, a cat (as a creature) is an animal, but "cat" (as a word) is a noun. You would not begin the definition of "cat" as "a noun with whiskers..." and you would not give "animal" as the part of speech. "Numeral" is the term grammarians use for the part of speech for words such as "five", "fifth", "half", "twice" and some grammarians also include indefinite numerals like "some" and "many" in the category too. Did you read any of the discussion on the WT:POS talk page, which is clearly linked from the vote that happened? --EncycloPetey 20:23, 16 November 2006 (UTC)[reply]
It seems clear that you have not made your case convincingly enough (especially in comparison to other dictionaries as described above.) On the other hand, people have expressed reasonable objections to the change. --Connel MacKenzie 20:33, 16 November 2006 (UTC)[reply]
Most words have multiple meanings, and this is never a problem with either number or cardinal number. It is not a problem in English, Spanish, German, French, Russian, or any other language that I can read or write. The Random House lists 42 different senses for number. In spite of this, it is never misunderstood by linguists, and it is never confused with cardinal number which has one and only one meaning. "Cardinal numerals", OTOH, just means the "fundamental numerals", and that is different. —Stephen 19:30, 17 November 2006 (UTC)[reply]
No, the case is that one person has expressed an objection that it doesn't "sound right" to them. This does not form the basis of a "reasonable" objection, since it is not based on solid reasoning. Many points of fine, correct English grammar don't "sound right" to people because usage and prescribed grammar differ on a number of points. That said, please note that in the discussion on the WT:POS talk page, I state that I'm limiting any changes I make for now (to a few English pages, Italian, and Latin) in order to try out my ideas. I have already come across some sticking points that would need working out before I go further than that. I've also noted that many people in voting said that Numeral alone would appeal to them more than using a qualifier such as "Cardinal" or "Ordinal". While I can't envision how that would work, someone might propose a workable format for using that. If someone does, then I'm all for using it, since it would solve some of the difficulties I've mentioned. --EncycloPetey 22:34, 16 November 2006 (UTC)[reply]
I'm in favour of having "Numeral" as a POS header and indicate cardinality and ordinality at the beginning of each definition. This is necessary because Roman numerals such as II, VI and LXII can be both cardinal numbers ("chapter VI") and ordinal numbers ("Pope John Paul II"). I also suspect that in dates ("20 September 1990") Arabic numbers ("20" in this case) can be ordinal numbers without requiring a suffix (-st, -nd or -th). Ncik 01:30, 17 November 2006 (UTC)[reply]
EP, my complaint is the several people complaining above! Evidently, not all were even aware of the vote, but that notwithstanding, you claim to have garnered support that you don't actually have. The comparison to other dictionaries is not a valid complaint? Oh my. --Connel MacKenzie 01:32, 17 November 2006 (UTC)[reply]
Er, I see only one person making a complaint above, namely Stephen. You have commented above, and so has Enginear. You are apparently using several in a sense that we haven't defined here yet. No, the comparison to other dictionaries is not valid. We're not talking about the English definition of anything, we're discussing grammar and parts of speech. The definition of cardinal number is to be expected, since it's used in mathematics to refer to the abstract concepts represented in writing by the numerals. I will provide numerous examples of the use of numeral and cardinal numeral from the grammars in my library. I don't have time to do so at this moment, but I'll present the meticulous evidence I have later today, since that seems to be necessary to cut through all the hand-waving. --EncycloPetey 16:52, 17 November 2006 (UTC)[reply]
That is flatly false. Read the above comments, please! --Connel MacKenzie 09:12, 18 November 2006 (UTC)[reply]
Could you be specific? I maintain that the argument made from comparison to other dictionaries is not validly structured to reach the conclusions that were claimed. You will also see below that I did present meticulous findings cited from published works. I also see no more than two people complaining above. The word several is defined on our own entry to be understood to refer to more than two. Enginear has not complained; he has presented some findings and concluded that "cardinal number" is not as precise as "cardinal numeral" though the former may be more familiar to the ear. --EncycloPetey 22:50, 20 November 2006 (UTC)[reply]
The data: I have pulled various grammars off my library shelves and looked in each one for the section on numerals as a part of speech. I have about thirty grammars in total, covering Dutch, Latin, Romanian, Hungarian, and a variety of Baltic and Slavic languages. Most of these volumes title their section on this part of speech "Numerals"; only two of my grammars label the section "Numbers". Of the latter, one is a slender paperback volume primarily concerned with Hungarian verbs with the other parts of speech seeming to have been added as an afterthought. The other is a volume entitled "Beginning Croatian and Serbian", published in the 1950s out of Pennsylvania State University, and therefore one of my oldest grammars in English. The remainder of the books title their section "Numerals" and use that term in the index as well.
Specific citations: Here are cited quotations and information from various specific grammars. Anything included within double quotation marks (" ") is quoted. Anything indented is merely a longer quotation. Items in single quotes (' ') are merely being identified as terms.
  • S. Moore & T. A. Knott, The Elements of Old English (Ann Arbor:George Wahr, 1942)
    This is one of the oldest grammars written in English that I have for any language. In the chapter on morphology, between "Declension-Adjectives" and "Adverbs", there is a section "Declension-Numerals". Although this volume uses the phrase "cardinal and ordinal numbers" at the start of the section, on page 163 it says:
    371. ān, one, which is sometimes a numeral and sometimes an adjective...
    So this early volume is already using the term "numeral" for this part of speech.
    • So it is archaic, regarding a dead language. Indeed. --Connel MacKenzie 09:12, 18 November 2006 (UTC)[reply]
      A usage in 1942 is not archaic if it's still one of the most highly used books on the subject of Old-English. The book is describing Old English, not written in Old English! ;) --EncycloPetey 00:54, 28 November 2006 (UTC)[reply]
      The Wiktionary-specific meaning of "archaic" is 50-99 years old..."obsolete" being 100 or more years old. So yes, 1942 is evidence that the term is "archaic" (in this Wiktionary context.) But then, that generally is in reference to citations, not secondary sources. And I believe you were citing that that text uses the word "Numeral", right? So {{archaic}} is certainly appropriate. Completely beside the point, Old English is still a dead language. Or maybe not beside the point: a text on that topic is certainly more apt to use archaic-styled language for its descriptions. Bah, that is beside the point. 1942 is archaic, for Wiktionary. Likewise 1924, etc. --Connel MacKenzie 08:30, 30 November 2006 (UTC)[reply]
  • E. Kruisinga, A Grammar of Modern Dutch (London: George Allen & Unwin Ltd, 1924)
    The part of speech chapters in this volume are titled (in sequence): "Verbs", "Nouns", "Adjectives", "Pronouns", "Numerals". The chapter on numerals describes "cardinals and ordinals" as the two subgroups. Neither the term 'cardinal number' nor 'cardinal numeral' appears; only "cardinals" is used.
  • E. Cristo-Loveanu, The Romanian Language (self published, 1962); the author taught at Columbia University
    Chapter XV of this book is titled "The Numerals, Numeralele". The major sections are "Cardinal numerals" (p84), "Ordinal numerals" (p85), and "Indefinite numerals" (p91). This provides attested evidence that these terms have been in use in English for more than 40 years.
    • "Publish or perish" comes to mind. What exactly does "self published" mean here? --Connel MacKenzie 09:12, 18 November 2006 (UTC)[reply]
      Exactly what it says: "published by the author." You can see a description of the book's contents at UCLA's Language Materials Project. The book was published as a thick hard-bound volume less than a year before her death. She may have been in a bit of a rush; I don't know the full circumstances of her health in the year before her death. I can say it's more thorough and more user-friendly than most grammars I've worked with. --EncycloPetey 00:54, 28 November 2006 (UTC)[reply]
  • Bruce Donaldson. Dutch: A Comprehensive Grammar (London & New York:Routledge, 1997)
    The chapters on parts of speech are titled (in sequence): "Nouns", "Pronouns", "Adjectives", "Adverbs", "Verbs", "Conjunctions", "Prepositions", "Numerals". The term "numerals" is translated at the outset of the chapter into Dutch as "telwoorden", making it clear through the suffix -woorden that this term refers to words, and not to numeric symbols. The subsections are titled "cardinal numbers" and "ordinal numbers", but the word 'number' is not used without the qualifier 'cardinal' or 'ordinal'.
  • Dana Bielec, Polish: An Essential Grammar (London & New York: Routledge, 1998)
    The chapters on parts of speech are titled (in sequence): "Verbs", "Nouns", "Pronouns", "Adjectives", "Adverbs", "Prepositions", "Conjunctions", "Numerals", "Interjections", "Particles". The chapter on numerals lists as subcategories "cardinal numbers" and "ordinal numbers", as well as "indefinite", "collective", and "fractions". While this book and the one before both use 'numbers' for the specifics and 'numerals' for the category, note that both originate from the same publisher, and so will be subject to the same editorial process. If they (like Stephen) have their editors checking the dictionary for correct usage, the terms may have been modified over what the authors had originally. In any event, the Routledge books use "Numerals" for the part of speech and "cardinal number" for the specific examples.
  • G. Stilman, L. Stilman, & W. E. Harkins, Introductory Russian Grammar, 2nd ed. (New York: John Wiley & Sons, 1972)
    Page 239:
    It has been seen that the numerals two, three, and four take the genitive singular...
    Compound numerals are formed as in English (but without a hyphen)...the numerals 5 and above govern the genitive plural...
    The last digit (i.e. the units digit) of the compound numeral determines the number and case of the noun.
    That final sentence would be confusing if there weren't a clear distinction between the use of 'numeral' and 'number' in the grammar being discussed, since both terms are being used in the same sentence in very, very different senses from what Stephen has suggested.
  • William W. Derbyshire, A Basic Reference Grammar of Slovene (Columbus, Ohio: Slavica Publishers, Inc, 1993)
    The sections on parts of speech are (in sequence): "Nouns", "Adjectives", "Adverbs", "Pronouns", "Numerals", "Prepositions", "Verbs".
    (page 54): The two main categories of numerals in Slovene are: cardinal and ordinal.
    (page 55): In comparing cardinal and ordinal numerals, the student will note that ordinals from 'fifth' onward are formed by the addition of the suffix -i to the cardinal.
    (page 56): The cardinal numeral 'one' occurs in the singular and is declined like bogàt (p. 33).
    It is clear from these many uses, published in America, that the usage of 'cardinal numeral' is neither a Britishism nor illiterate as Stephen has claimed.
  • F. M. Wheelock, Wheelock's Latin, 6th ed. revised (New York: Harper Resources, 2005)
    From pages 97 and 98, in the section on "Numerals":
    (page 97): The commonest numerals in Latin, as in English, are the "cardinals"...and the "ordinals"...In Latin most cardinal numerals through 100 are indeclinable adjectives.
    (page 98): The cardinals indicating the hundreds from 200 through 900 are declined like plural adjectives of the first and second declensions.
    (page 98): Mīlle, 1,000, is an indeclinable adjective in the singular, but in the plural it functions as a neuter i-stem noun of the third declension.
It is clear that Wheelock is fine with using the term 'cardinal numeral', and I've never heard this premier Latin textbook author described as "illiterate" by anyone before. It is also clear that the cardinals include numbers written with more than a single digit, as opposed to the definition favored by Stephen. --EncycloPetey 19:19, 17 November 2006 (UTC)[reply]
Numeral is completely different from "cardinal numeral". It is correct to say that the part of speech for 1, 2, etc., is numeral, but not cardinal numeral. They are cardinal numbers. —Stephen 19:36, 17 November 2006 (UTC)[reply]
Then by all means feel free to write the many authors I have cited and tell them you know better than they do. Again, the concepts represented by those symbols are cardinal numbers certainly, but the symbols themselves are cardinal numerals. --EncycloPetey 20:38, 17 November 2006 (UTC)[reply]
I will gladly write the OED and the writers of the 6000 articles containing "cardinal numeral" if you will first convince the Random House Dictionary, the American Heritage Dictionary, and the Merriam-Webster, as well as the writers of the Wikipedia articles and all of the 300,000 pages that use cardinal number that they are wrong and should be using your idea instead. Keep me informed of all their comments. —Stephen 20:51, 17 November 2006 (UTC)[reply]
Please look at the contents of the Wikipedia article on cardinal numbers, and not just the title. The article is about mathematics, so it is using the term 'cardinal number' correctly. The article is not about the grammar of the cardinal numerals; it is about the mathematics of the cardinal numbers. I don't need to tell anyone on wikipedia that they're wrong because in fact they are using the term correctly in applying it to the mathematical concept. In fact, wikipedia has an article on English-language numerals, which deals with the word and symbolic representation of large numeric values. There is also an article entitled Slovenian numerals, with sections on cardinal numerals, ordinal numerals and so on.
Likewise, there is nothing amiss in any of the dictionaries you've cited because they also use the term correctly. When the numeral 'five' is defined as "the cardinal number between four and six", they are referring to the concept, not the word itself. The grammars I have quoted above are discussing the word "five", rather than the concept the word labels. I simply don't understand why you're having such a hard time grasping this fundamental distinction between the label and the thing to which the label applies. A cat is an animal, but "cat" is a noun. Thus, five is a cardinal number, but "five" is a cardinal numeral. --EncycloPetey 22:42, 17 November 2006 (UTC)[reply]
Addendum: How exactly did you get a figure of 300,000 pages on Wikipedia that use the term "cardinal number"? I did my own google search for that exact phrase limited to wikipedia and came up with only 364. There's more than a little discrepancy between what you claim and reality. --EncycloPetey 22:54, 17 November 2006 (UTC)[reply]
Gee, [4] yields: Results 1-99 of 338201 so any accusations of number games :-) seem misplaced. The exact phrase doesn't get 6,000, but Results 1-99 of 5359 so there may have been some amount of exaggeration, but not significant.
Perhaps a step back, towards civility would help here. This does not need to carry such an adversarial tone, especially between several of the best contributors here. Ncik had a possibly helpful suggestion above that seems to have been lost in the shuffle. If the heading is reduced to "Number" and cardinality/ordinality is identified elsewhere, the controversy may decrease here, right? --Connel MacKenzie 09:12, 18 November 2006 (UTC)[reply]
Do you see the difference between what Stephen claimed to have searched for and what you searched for? Stephen spoke of "300,000 pages that use cardinal number". With the phrase connected by wiki-link to cardinal number, he is explicitly saying that the noun phrase was searched for and is used on those pages. What he should have said is that 300,000 pages on Wikipedia have both the word number and the word cardinal, which is a completely off-topic issue. A page that discussed the "number of cardinals electing the pope" would be counted in that search. It is a deliberate misrepresentation of the facts intended to support an unsupportable position. And, no, the heading could not be reduced to "Number", because that term is even more fraught with ambiguity than "Cardinal number". See the opening sentences of the Wikipedia article on Number 0 to see the care that was put into the wording distinguishing the discussion of the symbol (numeral) versus the number itself.
Not at all! I did not search for < cardinal number >, I searched for < "cardinal number" > (with quote marks). In fact, if I search for either singular or plural, < "cardinal number" | "cardinal numbers" >, I get 500,000 hits([5]). —Stephen 17:38, 21 November 2006 (UTC)[reply]
Ah! You were searching the entire Internet, then. The context of your statement led me (and apparently Connel) to believe you had searched Wikipedia only. --EncycloPetey 23:04, 21 November 2006 (UTC)[reply]
In the very first paragraph of this discussion, I said I had found the hits by googling. Whenever I use the verb "google", I mean the Internet. —Stephen 23:55, 21 November 2006 (UTC)[reply]


Do you understand the difficulty with using "number" in this situation? Consider Stephen's statement (above): "The cardinal number 1012 is composed of the cardinal numerals 0, 1, and 2." If we break that into its component statements, what he's said is: "1012 is a cardinal number. This cardinal number is composed of the cardinal numerals 0, 1, and 2." But this pair of sentences is using "cardinal number" in two different senses. It's difficult to see this for precisely the reason that "number" has several very different meanings. In this case, the two meanings are both the quantity (which is abstract) and the written form. An analogous statement would be "Great Britain is a country. This country is composed of the words Great and Britain." This pair of statements exactly parallels the "number" statements before, but does not make sense because "country" does not have the same dual definition that "number" has. If we're having this much difficulty with the definition of "number" then it is a poor choice for a header, particularly when non-English speakers will also rely on headers to assist them. --EncycloPetey 20:34, 20 November 2006 (UTC)[reply]
Or indeed a heading of "Numeral"? --Enginear 14:31, 18 November 2006 (UTC)[reply]
Using "Numeral" might work. There are some format issues I'm not sure how we would handle, but it would solve some of the difficulties I've discovered in my format experimentation. We could use an inline (cardinal) or (cardinal num) that would sidestep the issue. --EncycloPetey 20:34, 20 November 2006 (UTC)[reply]
As with the precedent of using inline (transitive) and (intransitive) for verbs. --Enginear 13:39, 21 November 2006 (UTC)[reply]
Exactly. Or as we use (countable) and (uncountable) for nouns. Which reminds me, we should probably consider using (predicable) and (nonpredicable) for adjectives. --EncycloPetey 23:04, 21 November 2006 (UTC)[reply]
Please start that as a separate topic.
<blink> Did we just reach consensus on a workable solution? --Connel MacKenzie 19:48, 24 November 2006 (UTC)[reply]
Using cardinal and ordinal on the definition line is just right. But I really think the header should be Number, it is simpler and more familiar, and "numeral" is often used to distinguish digits from words: when you write a cheque, you are instructed to write the amounts in words and in numerals e.g. [6]. As to non-native speakers of English, "simple:number" is going to be a word they almost certainly know, while numeral isn't. Neither is going to be precise as a header, but if we were going to have that kind of precision in headers we would be using Transitive verb form etc. as headers. I'll also note that the non-native speakers adding things to the en.wikt have used "Number" frequently, that should say something about its understandability. (And, as was pointed out, having "Numeral"/(cardinal) is nonsense, there is no such thing. So we'd still need Cardinal number ...) Robert Ullmann 06:28, 28 November 2006 (UTC)[reply]
...Except that that's a British distinction. In the US we often use "number" to distinguish digits from words. In school you learn your "letters and numbers". The term "number" refers to the numeric form quite often rather than a word written with letters. The familiarity argument is neither here nor there. We expect them to use Adjective, Interjection, and Initialism, which are less likely to be familiar than any term they're familiar with anyway. The advantage of "Numeral" is that it is cognate with the name used for the part of speech term used in most European languages. See the Italian (uno) or Portuguese (um) Wiktionaries for examples. While each defines the term as a number (numero, número), the part of speech header uses numeral (numerale, numeral). The problems with "number" are (1) it has far more possible definitions in English, making it difficult to decide which meaning is intended, (2) it functions in grammar as a feature of inflection (singular, plural, and even dual), (3) it is used in the US to distinguish numeric forms and digits from words, and (4) it is not cognate with the part of speech in other languages. The non-English Users adding "Number" are probably just mimicking what they see already in place. If we were using "Numeral", that's what they'd mimic. The only argument anyone has provided against cardinal numeral is that it "doesn't sound right" or is "illiterate" without any backing up of these claims. I have provided published sources (see cites above) that use the term, so it clearly does exist in the literature, so there is such a thing. As such, the use of number presents many more difficulties than the use of numeral. --EncycloPetey 00:23, 30 November 2006 (UTC)[reply]
It is hardly a British distinction, I am a native speaker of American English, I've only been speaking British for the last 3 years in Kenya. We have been using Cardinal number and Ordinal number, it seems perfectly reasonable to change this to Number with (cardinal) and (ordinal) tags. I don't understand why you are so intent on introducing "Numeral" de novo, going so far as to dig out obscure citations from grammars of foreign languages to try to justify a term that isn't in (e.g.) M-W. (And, note, wasn't in the wikt until you introduced it two weeks ago.) Robert Ullmann 07:50, 30 November 2006 (UTC)[reply]
Yes, and as a person trained in mathematics, I have used the term "cardinal number" for years. In the proper context, it is a very important and widespread term. The problem is that the term means something rather different in mathematics, as the content of the Wikipedia article clearly shows. That article includes no information whatsoever on parts of speech, only on the cardinality of sets. Consider that mathematicians include aleph-null as a cardinal number. However, neither the symbol nor the word for this cardinal number functions as the part of speech that three and hundred do. The category "Number" will include many items that similarly do not function as such a part of speech, such as e, pi, i, and φ. These are all numbers and should be so defined, but their function in English speech and writing is purely as nouns, not as numerals. Therefore, using "Number" as the header blurs a very important distinction between two different logical categories, those of mathematics and grammar.
Why are you so set against using "Numeral" with (cardinal) and (ordinal) tags? I can go to my bookshelf and pull off grammars that have a chapter titled "Numerals" with subsections on "Cardinal numbers" or with subsections entitled simply "Cardinals". If this is good enough for major publishers of language grammars like Routledge, then why does it irk you so much?
I am most certainly not introducing "Numeral" de novo. That would mean it was not already in use, but in fact it is in use and has been for quite some time. Do your own search to see that the header "Numeral" already appears in many articles where it was added by non-English speakers as a POS header for foreign word entries. The header is already in use and I am merely advocating a standardization to that header, since it is used in English and is cognate with the header term used on many other Wiktionaries, such as Italian, Portuguese, Spanish, and Romanian.
What is obscure about Wheelock's Latin? What is obscure about grammars I can find on the shelf at the local bookstore? What is obscure about the Italian and Portuguese Wiktionaries? While these may be books about foreign languages (why did you italicize "foreign language"?), the grammars themselves are written in English. The fact that a book has content about another language and uses the term is support for using that term here, since Wiktionary includes words from many languages. "Numeral" is the correct English term for this part of speech as found in many languages. We simply don't recognize numeral as frequently in English as a part of speech in its own right, which is probably why it seldom appears in dictionaries.
The fact that a term is missing from a major dictionary is not as strong a case as you seem to believe. I am all the time finding that common terms in my vocabulary are missing from major dictionaries, for whatever reason. For example, I discovered last week that brown noser is not in the AHD (4th ed). Neither is initialism in the AHD, but that doesn't stop us from using it as a header. In particular, many technical terms are missing from such dictionaries such as the specialized grammatical and lexical terms needed in Wiktionary. Perhaps the greatest advantage cardinal numeral has over cardinal number is that the former term has only a single possible definition whereas the latter has three (we currently do not have "cardinal number" defined to include the grammatical sense, only its two numeric senses as used on Wiktionary). The same is true of numeral over number. More foreign users will be comfortable with "Numeral" because it's the usage and distinction they're already familiar with.
Please explain how the age of a Wiktionary entry has any relevance whatsoever to this discussion. I can go to the list of New Pages and find many valid words that were only entered in the past few hours. The age of an entry on Wiktionary has no relevance at all to this line of discussion. When I find an entry is missing, I add it. We only recently had CJKV added as an entry, but that term is thrown around in discussions here all the time. The age of the entry is meaningless. --EncycloPetey 23:14, 30 November 2006 (UTC)[reply]

Why are plurals highlighted in the headword repeater?

It's rather peculiar. You go to 'cloud', see a link (plural clouds) which takes you to a page where the relevant part just reiterates that 'clouds' is the plural of 'cloud', and links straight back to 'cloud'.

Sometimes there may be an alternative sense for the plural form, but the relevant definition is always (perhaps I should say 'almost' - there may be the odd exception) just an exact restatement of the information encompassing the link that got you there. Moglex 12:00, 16 November 2006 (UTC)[reply]

The plurals usually are a barebones entry so that when people search for a word that happens to be a plural such as clouds they are informed that it is such. They are then directed by link to the main cloud article. This reduces the amount of duplicate information on pages and makes it easier when, for instance, there are alternative definitions for clouds, to pick them out from its plural meaning.--Williamsayers79 12:04, 16 November 2006 (UTC)[reply]
Yes, I understand that part.
What I'm asking is why the headword repeater in 'clouds' has the plural entry set as a link so that you can go to the barebones definition of the plural that does nothing but link you back to 'cloud'.
Maybe 'clouds' is not such a good example as there is another sense on that page. The plural link for 'orange', on the other hand does precisely nothing except link you straight back to where you came from.
A plural entry can have information about the pronunciation, and it will be another raison d'être for the link to it in addition to the alternative sense. --Tohru 13:09, 16 November 2006 (UTC)[reply]
  • Sign your posts here, on this discussion page, please.
That's another thing: Why can't the software add a signature automatically? It knows who edited. It can do a diff. One tries to remember but in 16 years of using the 'net this is the first time I've ever encountered anything that won't automatically add a sig so it takes a bit of getting used to. Moglex 20:05, 17 November 2006 (UTC)[reply]
The entry for oranges exists for several reasons:
  1. As a navigation aid (as said above)
  2. As a disambiguation entry (for words that exist with the same spelling in other languages)
  3. As a synonym/translation/pronunciation stub placeholder (particularly helpful for plurals ending in sibilants) for someone to approach in the future.
  4. To inform English-learners what the proper plural is.
  5. For 'inflection' line consistency.
If en.wiktionary.org were written using dictionary software, instead of encyclopedia software, we'd probably have fewer silly work-arounds. As it is, the entries are largely entered by 'bot. Because we do not, much effort has been wasted on defining the 'perfect' inflection line to everyone's mutual dissatisfaction.
The various rules for "stemming" are different for each language. So a software approach to look up a term on the English Wiktionary cannot assume English (since we have all words in all languages.) Because words in other languages can share spelling, the placeholder entries are used.
All that said, the WM software has evolved (and continues to evolve) and is now much more capable of addressing these problems, than when this scheme was devised. I think the approach could use some review. --Connel MacKenzie 19:45, 16 November 2006 (UTC)[reply]
In fact some folks have started murmuring about the issue in other languages. See some bits of discussion at Wiktionary_talk:About_Latin#Lemmata for verbs. There are lots of complicated but highly specialized offshoots tied to this problem when you consider inflected languages. --EncycloPetey 23:11, 17 November 2006 (UTC)[reply]

Rfvfailed

Following a discussion with Connel, it seems that saving the rfvfailed discussions kept at their talk page is causing problems in the present format. I have accordingly made an Appendix:rfvfailed, where the discussions can be kept. Rfvpassed and RfvResult remain as is i.e. on the talk page of the article. Andrew massyn 18:36, 17 November 2006 (UTC)[reply]

WT:RFVA is where I thought they were supposed to be put. The format used in 2005 worked reasonably. --Connel MacKenzie 09:25, 18 November 2006 (UTC)[reply]
They are still put there. The distinction is that the deleted words at WT:RFVA are in date order whereas the deleted words at Appendix:rfvfailed are in alphabetic order. Andrew massyn 16:28, 18 November 2006 (UTC)[reply]
Well, sort of. The page WT:RFVA (not its sub-pages) is where I'd like to see the terms wikilinked for quick review. When it was working well, the items that would reappear on that list could be checked and thwacked quickly and easily, raising no objections from any quarter. Having the discussions archived on the sub-pages is still needed, of course. But the parent-page index is incredibly useful. --Connel MacKenzie 19:45, 24 November 2006 (UTC)[reply]
I actually agree. The problem is stripping out the failed words. Any suggestions for a quick way of doing it? Andrew massyn 19:55, 24 November 2006 (UTC)[reply]
Nope. 'Bot request, perhaps...but I don't have time for yet-another-task myself, nor time to spec out the requirements for Werdna. --Connel MacKenzie 19:57, 24 November 2006 (UTC)[reply]

Why not Usenet?

I know I'm going to regret asking this, but why on earth use this unbelievably clunky form of discussion software rather than the vastly more elegant solution of Usenet?

It's extraordinarily difficult to see who's added what to which discussion and absurdly time consuming.

If Usenet were to be adopted it would yield the following benefits:

  • Less strain on the servers as they would not have to serve the whole current discussion page every single time someone just wanted to look at a one line addition.
  • Less strain on the servers and less wasted user time as no need to scan the page (or histories/changes) to see if the discussions you're interested in have material added.
  • Newsreader software forms its own archive.
  • With decent newsreader software you can search the whole archive.
  • Properly threaded discussions.
  • Ability to mark discussions that are of no interest so you don't keep getting bothered by them.
  • Hierarchical group structure.
  • Google keeps an archive of everything on Usenet so new users would be able to scan the entire group or set of groups of a particular branch.

Moglex 20:23, 17 November 2006 (UTC)[reply]

Probably because we wouldn't have any power over it ourselves. But that kind of structure I agree would work a lot better than what we're using. It's one of the big differences between the way Wiktionary and Wikipedia are run. Entries here are simply too sparse to collect discussion on individual talk pages. DAVilla 15:12, 21 November 2006 (UTC)[reply]

Romani versus Romany

We have both a Category:Romani language and a Category:Romany language. Although both spellings are correct, it would be best to consolidate these to a single category and single spelling. The article on Wikipedia is titled Romani language, so we might wish to prefer that spelling as well. Sounds like a nice little weekend-killer for someone to work on, unless our 'bot policy permits working this kind of fix. --EncycloPetey 01:38, 18 November 2006 (UTC)[reply]

Revised 'bot proposal

Now that it has become apparent that the OED is not even needed as a filter, I'll revise the proposal for what I'll now call the part of speech robot.

The purpose of the robot and its associate command software is (at this stage) to find as many word forms as possible that share a root with the verbs in the Wiktionary. These words are generated automatically and initially verified using public domain word lists. It uses this to add missing words together with a concordance for the related terms entry.
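For readers who want a concrete picture of the "generate, then verify against a word list" step, here is a minimal sketch in Python. It is not the proposed robot's actual code: the suffix rules, the function names `candidate_forms` and `verified_forms`, and the tiny word list are all assumptions made purely for illustration.

```python
# Hypothetical sketch of the generate-and-verify step; the suffix rules
# and names below are illustrative assumptions, not the robot's real logic.

def candidate_forms(infinitive):
    """Mechanically generate naive English forms sharing a root with a verb."""
    # Drop a final 'e' before vowel-initial suffixes (bake -> baking).
    stem = infinitive[:-1] if infinitive.endswith("e") else infinitive
    # Sibilant endings take '-es' for the third person singular.
    sibilant = infinitive.endswith(("s", "x", "z", "ch", "sh"))
    third_person = infinitive + ("es" if sibilant else "s")
    return {third_person, stem + "ing", stem + "ed", stem + "er"}


def verified_forms(infinitive, word_list):
    """Keep only the generated candidates that appear in the word list."""
    return sorted(f for f in candidate_forms(infinitive) if f in word_list)


# A stand-in for a public domain word list.
WORDS = {"walk", "walks", "walked", "walking", "walker",
         "bake", "bakes", "baked", "baking", "baker",
         "go", "goes", "going"}

print(verified_forms("walk", WORDS))
print(verified_forms("go", WORDS))
```

Note that for "go" the naive rules produce "gos", "goed" and "goer", so only "going" survives the filter and "goes" is never suggested at all. Gaps and false positives of this kind are presumably why the proposal keeps a human operator in the loop.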

This initial command file is fed into another piece of software that allows its operator to:

  1. Check that the words suggested:
    • do actually share a root with the given infinitive.
    • are correctly formed.
    • have sensible definitions where these are mechanically generated.
  2. Add examples for those parts of speech that require them
  3. Add any words that are missing.
  4. Note if the third person singular is also a plural.

Once all this has been done a further command file is generated that feeds the robot (for the moment it is being used to feed a manual assistance program)

The robot then:

  1. Adds words that are not present
  2. Adds related terms to all extant members of each family where none exists
    Shouldn't these be derived terms in some cases? DAVilla 03:04, 20 November 2006 (UTC)[reply]
  3. Merges the generated concordance with any current related terms entries for extant words.

+ Add any anagrams it finds, whilst it's in the area.
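Step 3 of the robot's list (merging the generated concordance with existing related terms) could be pictured roughly as follows. This is only a sketch under stated assumptions: the function name `merge_related_terms` is hypothetical, entries are treated as plain lists of strings, and no real wiki markup is parsed or written.

```python
# Hypothetical sketch of merging a generated concordance into an entry's
# existing related terms; real entries would need wikitext handling.

def merge_related_terms(existing, generated, headword):
    """Return existing terms in their original order, followed by any
    generated terms not already listed (never the headword itself)."""
    merged = list(existing)
    seen = set(existing)
    for term in sorted(generated):
        if term != headword and term not in seen:
            merged.append(term)
            seen.add(term)
    return merged


print(merge_related_terms(["walker"], {"walk", "walks", "walker", "walking"}, "walk"))
```

Keeping the existing terms first and untouched is a deliberate choice in this sketch, so that hand-written lists are only ever extended and never reordered by the merge.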

The proposed robot's account is MogBot-POS

I would not envisage this robot needing to operate particularly quickly given the extensive human moderation involved in the production of its command file. Moglex 17:46, 18 November 2006 (UTC)[reply]

As the original proposal has now been up for over a week and nobody has made any objections other than the copyright issue which has now been resolved (by no longer using the OED to automatically check the existence of words), I propose to run the software in its non-robotic 'assist' mode for a couple of dozen infinitives. This will give anyone who's interested a chance to comment on the format of the added words and sections.
It would seem sensible to do this work under the account set up for the robot so that it is obvious which work is being done according to the devised protocol (as opposed to any other edits I might happen to do).
If there are no objections I will start this on Tuesday. Moglex 20:24, 19 November 2006 (UTC)[reply]
It will be very helpful to see the examples. One thing: lose the anagrams. They are just noise that anyone can generate; I have no idea why we even reference them. "MogBot" is a good (very good!) name, but why the -POS? You'll just need yet another account if you have another project. Oh, and the thought occurred to me: there is no reason why you can't check against the 1913 Webster's if that is at all useful. Robert Ullmann 17:46, 21 November 2006 (UTC)[reply]
The name MogBot-POS is more descriptive, follows the earlier convention, and allows for separation of tasks (which may be needed for a large undertaking.)
TheDaveRoss suggested a change to the bot policy which has never been fully accepted (nor voted on) nor have all the details ever been worked out. His proposal was for established trusted users, e.g. over 500 main namespace edits. In that situation, perhaps the generic name might apply. --Connel MacKenzie 18:27, 21 November 2006 (UTC)[reply]
I have no objections to a preliminary test being done from the User:MogBot-POS account. Please show us what you mean. --Connel MacKenzie 18:27, 21 November 2006 (UTC)[reply]


The POS was for 'Parts Of Speech'. I used that because I had read that each robot is (or at least was) only allowed to do one thing. I'm quite happy to use the same name for this and any other tasks that may get approved. As to the anagrams I saw that some words had them so I assumed it was a policy that eventually all words for which anagrams could be listed would have them listed. I agree that it's a very odd thing to have in a dictionary and will happily omit them.
I do have a problem with the concordance that I should have foreseen but didn't. I had planned to place it either where there was a current 'related terms' entry or right at the end of the 'English' section. However, I now notice that some words have different related entries categories for different parts of speech. So now I'm not sure where to put the concordance. Nowhere is perfect. Next to 'Etymology' seems to make most sense (and these words are rather more closely integrated than the majority of the 'related terms'). In fact, a separate heading such as (but almost certainly not) 'common root' would seem to be an idea.
I shall do a single infinitive and its associate words in that style so that anyone interested can see and comment. Moglex 18:59, 21 November 2006 (UTC)[reply]


The first verb 'access' has been processed so that people can see and comment on the format for the concordance entry. (The reason for the new 'common root' heading tested in these samples is explained above (although I dislike that actual expression - I'm sure there's a better one)). The problem of the concordance relating to more than one entry on the page and thus needing to go above the definitions in the hierarchy is more intractable. It would, however, seem to be sensible to place this section together with 'etymology' to which it is inextricably related. This unsigned comment was added by Moglex (talk • contribs).

STOP! You've never even read WT:ELE? One test is quite different from the bombardment you are doing now. --Connel MacKenzie 20:32, 23 November 2006 (UTC)[reply]
???? What bombardment?
I did a single test, as advertised, using the infinitive 'access'. The test is now complete and awaits discussion from interested parties.
Moglex 20:38, 23 November 2006 (UTC)[reply]
  1. Sign your entries here! Still.
  2. Wikilink your example edits, if you want them to be approved.
  3. re: Heading ===Common root=== HELL NO.
  4. Accension is not and should not be stuffed under the same entry as accessing.
  5. Wikilink to the diff of edits you have automated, so the entry being cleaned up will not appear as your proposed/automated edits. E.g. http://en.wiktionary.org/w/index.php?title=accessing&diff=1814348&oldid=776924
  6. You are exceeding the remit of the stated task. Bot tasks are meant to be granular so that only the portions that need to be rolled back can be. Right now, all your edits seem 1/2 good, 1/2 bad (therefore all these tests have to be rolled back now. Thanks.)

Why did you avoid proper categorization that results from using the correct templates? Why did you avoid using the correct templates? What is the point of entering them wrong? --Connel MacKenzie 20:44, 23 November 2006 (UTC)[reply]

Words/forms(++) bot added:

  1. concuss ++
  2. commentate ++
  3. came across ++
  4. comes
  5. coarsen ++
  6. co-educate ++
  7. bodge (misspelling of botch!) ++
  8. anaesthetise (not labelled regional) ++
  9. allegorise (Weird spelling not labelled regional) ++
  10. air-freight (multi-word forms don't get inflection entries, according to Ec.) ++
  11. act out (multi-word...) ++
  12. incentivise (not labelled regional) ++
  13. accessability (weirdness with 'common root')
  14. accessibleness (cr)
  15. accessive (not labelled either regional or obsolete, whichever it is)
  16. accessibly (cr)
  17. accessability (not labelled as non-standard/notaword)
  18. accessor (accesss???)
  19. accession (cr, probable copyvio)
  20. accessibility (cr)
  21. accessibleness (cr)
  22. accessable (not labelled non-standard/notaword/misspelling, related terms listed are related only by the fact that they all start with "A".)
  23. accessible (cr)
  24. accessed/accessing (cr)

I don't see a single entry that conforms to en.wiktionary.org formatting guidelines. Perhaps you should focus on making it past the initial learning curve before automating things that each need considerable cleanup from others here. --Connel MacKenzie 00:53, 24 November 2006 (UTC)[reply]

How to tag them...

Certain terms such as massa, youseselves and other terms which are highly dialectical need a word to describe them. youseselves is labeled as "non-standard", which I don't think captures the essence of its usage at all. - [The]DaveRoss 17:40, 19 November 2006 (UTC)[reply]

What do you suggest? A ===Usage notes=== section should be able to address each, right? --Connel MacKenzie 18:54, 24 November 2006 (UTC)[reply]

Origins of words

Most words have origins- should we add that to the Wiktionary criteria, or is it in criteria already? WikiNerd 04:37, 20 November 2006 (UTC)[reply]

Could you please rephrase your question? I fail to see what the ===Etymology=== heading has to do with the criteria for inclusion. --Connel MacKenzie 05:18, 20 November 2006 (UTC)[reply]
Is etymology part of Wiktionary criteria? WikiNerd 08:17, 22 November 2006 (UTC)[reply]
I must be having a bad day - I fail to see how it could be added. Some words have unknown etymologies. But worse, you would have a missing etymology be a valid excuse to delete an entry? --Connel MacKenzie 08:53, 22 November 2006 (UTC)[reply]
This is one of the sections we'd like for the ideal entry, but it shouldn't be a requirement for creating a new entry. --EncycloPetey 23:57, 27 November 2006 (UTC)[reply]

Is this an anonymous robot?

Special:Contributions/146.145.99.210 ?

It just removed something from the RFV list.

Are anonymous robots allowed on Wiktionary?

This is the second (suspected) one in two days.

No, anonymous robots are not allowed, they are blocked. This was not a robot edit, just someone who doesn't understand how to count to three. --Connel MacKenzie 16:44, 20 November 2006 (UTC)[reply]

Bite the newbie?

As a rogue admin, I'm clueless how a copyvio vandal should be dealt with best. Short dictionary definitions cannot be protected by copyright individually, but systemic use of copyright sources of course can. Likewise, all etymology information is generally held to be much more complicated than simple definitions, therefore does not offer the same copyvio exemption.

What, then, am I supposed to say to Special:Contributions/Quadell as the copyvios from www.wordorigins.com come in? Just block as any vandal? Or is this a genuine (misguided) attempt at helping? (If so, how is this situation handled, best?) --Connel MacKenzie 19:50, 20 November 2006 (UTC)[reply]

This may be a stupid question, but have you spoken to him/her? The penultimate question in your piece above suggests otherwise.
Someone asked me a while back if I was writing my own definitions. I didn't take offence and just reassured him that I was. Obviously in this case the contributions are not original but what the motivation is will remain unclear without input from the miscreant. Moglex 20:08, 20 November 2006 (UTC)[reply]
I agree. The first step is to mention the problem to the person and politely inquire whether they were aware of copyright restrictions on what we put on the web. Some wiki users are quite young, and may not be aware that it's an issue. An explanation may be all that is required, and the copyvio offender might even make the effort to rectify the copyright violations, which would save others the work of doing so. --EncycloPetey 22:37, 20 November 2006 (UTC)[reply]
Thank you; fine suggestions indeed. I could use some help wording that 'politely', as I seem to be having difficulty with that today. --Connel MacKenzie 14:30, 21 November 2006 (UTC)[reply]
Do you think this problem is one we'll encounter often? If so, then it might be worth the trouble to draft a standard bit of boilerplate, and have a small group work together to make sure it's politely worded. --EncycloPetey 22:59, 21 November 2006 (UTC)[reply]
I recall seeing some variety of etymology problem (such as this) about once every other month or so. It may be that my recollection is retroactively optimistic, or that many I simply don't see. So, I guess it depends on how you're defining 'often' in this context. It certainly is a problem with a complicated explanation. So yes, I'd like to see a template for it. --Connel MacKenzie 09:24, 23 November 2006 (UTC)[reply]

QC for Han characters

I've put together a quality control check for the Han character formatting. It is Python code, completely independent of the format 'bot (AWB ruleset). It generates exception reports for each row of the Unified Han blocks. (Connel has code that looks for exceptions and generates todo lists, that is overwhelmed by the NanshuBot entries (and excludes them); I put this together to cover them, looking for similar exceptions, plus a number of things specific to these entries.) See User:Robert Ullmann/Han.

The present set of reports is run on the 3 November XML, covering 21,309 character entries; mostly before the 'bot had gotten to them (it is up around 7,000 now). So most of the entries don't report the details the QC code looks for; once it finds a bad level 2 (language) header it suppresses most of the checks; else the report for each entry would be a page. (The NanshuBot entries not only have non-standard headers, they have no standard headers at all.)

The remaining exceptions will be of much more interest when we get another XML dump and I can repeat the run. Robert Ullmann 20:50, 20 November 2006 (UTC)[reply]
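
A rough Python sketch of the header check described above, offered as an illustration rather than the actual QC code (which is at User:Robert Ullmann/Han). The allowed-header sets, the function name and the report wording are assumptions. It shows the suppression behaviour: when an entry has a bad level 2 (language) header, only that is reported and the finer checks are skipped.

```python
import re

# Level 2 headers look like ==Language==, level 3 like ===Section===.
LEVEL2 = re.compile(r"^==([^=\n]+)==[ \t]*$", re.MULTILINE)
LEVEL3 = re.compile(r"^===([^=\n]+)===[ \t]*$", re.MULTILINE)

# Hypothetical allow-lists, for illustration only.
ALLOWED_LEVEL2 = {"Translingual", "Mandarin", "Cantonese", "Japanese", "Korean", "Vietnamese"}
ALLOWED_LEVEL3 = {"Han character", "Hanzi", "Kanji", "Hanja", "Usage notes", "References"}


def check_entry(title, wikitext):
    """Return a list of exception strings for one entry (empty if nothing is flagged)."""
    languages = [h.strip() for h in LEVEL2.findall(wikitext)]
    if not languages:
        return [f"{title}: no level 2 (language) header"]
    bad = [h for h in languages if h not in ALLOWED_LEVEL2]
    if bad:
        # Suppress the detailed checks; otherwise the report would run to a page.
        return [f"{title}: bad level 2 header {h!r}" for h in bad]
    sections = [h.strip() for h in LEVEL3.findall(wikitext)]
    return [f"{title}: nonstandard level 3 header {h!r}"
            for h in sections if h not in ALLOWED_LEVEL3]


sample = "==Translingual==\n===Han character===\ntext\n==Korean==\n===Hanja===\n===Usage note===\n"
print(check_entry("X", sample))
```

With these assumed lists, a "Usage note" header (singular) is flagged only when the language headers are already in order.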

It is very refreshing to see a new take on this. I am very glad you have done this analysis separately. I am absolutely delighted to see someone who understands CJKV taking this part over! --Connel MacKenzie 14:33, 21 November 2006 (UTC)[reply]
I note that there is no entry for CJKV. Should there be? --EncycloPetey 00:35, 22 November 2006 (UTC)[reply]
We don't? Hmmm. w:CJKV helps. Perhaps we should have CKJV as well. --Connel MacKenzie 08:59, 22 November 2006 (UTC)[reply]

In one of the bot rules, it was generating a "Usage note" header, (from "Other Info"), that gets caught by the QC report (and has been fixed of course.) The QC report also identifies 457 entries that have bad Yale romanizations for Korean. (It isn't clear how Nanshu got these wrong.) They have "y" on the wrong side of the vowel in some cases. And there are an unknown number that were wrong, but have since been fixed. So the bad news is that there are 457 of them left, the good news is that I have Python code that appears to identify them (and the correct form) without errors. Robert Ullmann 18:46, 24 November 2006 (UTC)[reply]

ELE advocates "phrasal noun"

I've started looking at making some of the necessary additions and revisions to the WT:ELE. One problem in particular caught my eye as something that could (and should!) be fixed quickly. The ELE says (in the section on the article core):

This is basically a level 3 header but may be a level 4 or higher when multiple etymologies or pronunciations are a factor. This header most often shows the part of speech, but is not restricted to "parts of speech" in the traditional sense. Many other descriptors like "Proper noun", "Idiom", "Abbreviation", "Phrasal noun", "Prefix", etc. (emphasis added)

The header "Phrasal noun" is no longer considered a standard header, and should be replaced. We might use "Symbol" as a good example of a non-POS descriptor. Also, that final sentence needs a predicate. Note that I did not clip out the end! Since the ELE is technically an official style guide now, I suppose we need to have a proposed replacement paragraph and then put it to a quick vote. I don't see this particular issue as controversial, though. --EncycloPetey 00:33, 22 November 2006 (UTC)[reply]

Since no one has objected, I have replaced "Phrasal noun" with "Symbol" as an example. --EncycloPetey 23:59, 27 November 2006 (UTC)[reply]

Language templates and Etymologies

I would like to clarify what uses the templates on Wiktionary:Index to templates/languages actually have. Some of us have been using them in Etymology sections and consequently these have then been modified to include the relevant Derivation category also, e.g. {{ang}} has been used in quite a few Etymologies now and has been modified to include category:Old English derivations. Recently I've gone over to using the so-called "Webster 1913 abbreviations" (see Wiktionary:Etymology#Etymology language templates) to avoid issues like entries turning up in category:Old English derivations for example.

I know I'm guilty of causing some of this havoc and have started to rectify the issue by looking at the What links here pages of the templates in question and replacing the general template with the corresponding one on Wiktionary:Etymology#Etymology language templates.

So, after the ramble, am I OK to continue with this fixing up and should we place a notice on Wiktionary:Index to templates/languages prohibiting use in Etymologies to automatically categorise words? --Williamsayers79 13:33, 22 November 2006 (UTC)[reply]

Better than prohibiting them, would be to add a reminder notice there indicating the correct templates to use for Etymologies. Unfortunately, most people inclined to enter that information will get knee-deep into it, before reading those instructions. So periodic rechecking for them is advisable. At the moment, I have other priorities. Overall, your approach sounds helpful and good to me. --Connel MacKenzie 17:21, 22 November 2006 (UTC)[reply]
Sounds good, but what are those templates used for anyway? Never mind, I figured it out! --EncycloPetey 00:02, 28 November 2006 (UTC)[reply]

Multiple sockmasters in Primetime's Comcast subnet

On Wikipedia, it has been shown that there is more than one sockmaster in the IP range Primetime lurks around on. See w:Wikipedia:Requests for checkuser/Case/Primetime for evidence. The other sockmaster discovered on the same Comcast subnet is Balthazarduju. Please be careful when using CheckUser to hunt for Primetime socks. Jesse Viviano 23:00, 22 November 2006 (UTC)[reply]

Pronunciation guidelines

There are extensive charts about IPA and SAMPA representation of English and other sounds, but what is the guideline about stress symbols (ˈ) and syllable breaks (.)? Most French entries seem to have syllable breaks, but no stress symbols. English entries seem to have stress symbols but no syllable breaks. Is there a consensus about this, which should be included? henne 12:00, 23 November 2006 (UTC)[reply]

Stress is less important in French than in English: in French, syllables are generally evenly stressed, except the last syllable is stressed at the end of a group of words. I don't believe that the syllable break symbol is used much, at least not in my French-English dictionary. Poccil 22:17, 24 November 2006 (UTC)[reply]
Adding syllable breaks to English words isn't easy. Consonants at the end of a word tend to be tied into the next word, e.g. red ink is pronounced "re-dink", which is pretty much impossible to convey to an English-only speaker. Hence the confusion between "read-i-ly" or "rea-di-ly" or "rea-dil-y". I'm no authority but I believe the middle one would be British English, well-enunciated, and the last American, employing a dark L. But I wouldn't be surprised if most dictionaries listed the first. DAVilla 23:48, 24 November 2006 (UTC)[reply]
So if I understand correctly, you are saying: you can add both, but it is difficult. I do not really agree French has no stress. Less explicit perhaps, but it is there. For example, in avancer, obviously the last syllable is stressed, whereas in avance, it is the second. This is the last in both cases, but I am unsure whether this is a rule. henne 17:55, 25 November 2006 (UTC)[reply]
Stress in French is not phonemic, and in practice it is almost even anyway. avancer is certainly not stressed on the final syllable. French often sounds like it has final-syllable stress to English-speakers (I don't know whether you are one or not) because its equal stress pattern sounds strange when you're used to the penultimate stress of most English words. That's why when French words are adopted into English they often take final stress, especially in America (e.g. cliché, passé etc.). Widsith 14:48, 27 November 2006 (UTC)[reply]
This is new for me. You could have looked on my user page to see that I am Dutch speaking, French being, along with English, secondary languages. Ah, I wished everybody used the {{Babel}} template on their user page. Connel, maybe this is something to include in your standard welcome talk?
Anyway, Ok, so French has no stress you claim. Then indeed, marking syllables makes sense, but is it useful? Should syllables always be marked? I often add pronunciation for Dutch entries, which has primary stress, but no secondary. Should I add it there? Should I add syllable markers for non-stressed syllables? henne 12:33, 30 November 2006 (UTC)[reply]
Although I use {{welcome}}, {{pediawelcome}} and {{welcomeip}}, I try to let other people play around with the wording. Whenever I mess with them I get too many complaints. Please be bold! --Connel MacKenzie 23:40, 30 November 2006 (UTC)[reply]
I had thought French does have stress (in comparison to Japanese for instance) but that it wasn't part of the pronunciation of the word since the stress pattern can change without changing the meaning of the word. As you say, it's not phonemic. I remember a French friend saying some simple sentence that none of us got because of this, and we laughed at ourselves when we figured out why.
That doesn't mean avancer and avance, when said alone, aren't stressed at the end. Of course, not speaking French, I really wouldn't know. I've never even been able to figure out if the final R is pronounced or not. DAVilla 14:11, 30 November 2006 (UTC)[reply]
Some linguists treat the . as its own phoneme, since it can affect the meaning of a word in some cases. This makes it important. However, for most non-linguists it's difficult to locate that phoneme correctly in spoken English, and that's why I generally don't mark them when I add IPA to English entries. For Latin, however, I routinely include them. For English, stress is much easier for a native speaker to hear, so I add those in English (as well as Latin), though there are cases where it's difficult to decide whether there is secondary stress or not, particularly in compound words where there is even stress on two different syllables. --EncycloPetey 00:09, 28 November 2006 (UTC)[reply]
This would lead me to not include them, unless necessary, e.g. in Czech syllable-forming r or l (I can’t think of a word now)? henne 12:33, 30 November 2006 (UTC)[reply]

Use of assistant

As per the guidelines I'm writing this to let anyone who is interested know that I will be using an assistant program to add new verbs. Sadly, it can't produce the definitions so I'm still having to write those myself.

It will not add the POS concordance entry for these verbs until that has been discussed.

Moglex 18:50, 23 November 2006 (UTC)[reply]

More "numbers" nonsense

The WT:LOP page has been usurped by the sock-puppeteer responsible for the "numbers" vandalism; it is unclear what (if anything) should remain of the "reorganizing" done to LOP and LOP by Topic. Anyone feel like clearing it out? Fark = Shoof = Googe = Looj etc. --Connel MacKenzie 21:27, 23 November 2006 (UTC)[reply]

How is adding terms for numbers to the WT:LOP page vandalism? It is no more vandalism than adding any other kind of neologism to the page. Shoof 21:50, 23 November 2006 (UTC)[reply]
It is nonsense from one person (you) but it is vandalism when you use sockpuppets to give some sort of validity to them. It is vandalism when you are actively disruptive here "promoting" the terms you've made up. It is also disruptive when your nonsense causes the little itty bitty page of made up terms that no one else wants here, to span what is now an entire section of Wiktionary. --Connel MacKenzie 22:08, 23 November 2006 (UTC)[reply]
I have not made up most of those terms; rather, I got them from Jonathan Bowers' home page, http://www.polytope.net/hedrondude/home.htm. Anyway, why have you blocked my main account? If anything, block the sockpuppets, but don't block my main account User:Shoof. I have made most of the additions to Wiktionary with that account, and so it shouldn't be blocked. If I can't use that account, what account can I use? Do I need to create a new account and stick to it? 4.235.114.246 22:26, 23 November 2006 (UTC)[reply]
Good point. Would another sysop please review the blocks I've made. If you agree with Shoof that he can conduct himself as a valued, trusted contributor, then please unblock that account. A list of other accounts that have been used in the past would be helpful towards allaying my concerns. --Connel MacKenzie 22:34, 23 November 2006 (UTC)[reply]
I will edit from my IP until my account is unblocked. I'm restoring the neologisms that I've added to the wt:lop page as it's not vandalism to add them. 4.235.120.220 00:21, 24 November 2006 (UTC)[reply]
Evading a block (wasn't that a request for review on my previous post here?) is a reason to expand the current block(s). Be patient - I imagine many of the other sysops are currently busy with turkey dinners. --Connel MacKenzie 01:04, 24 November 2006 (UTC)[reply]
You may wish to notice that nowhere on his web site does it indicate that the list is released under the GFDL. So you've been violating his copyright (implicit!) by releasing his inventions under a new license. Brilliant. --Connel MacKenzie 01:04, 24 November 2006 (UTC)[reply]
That's the problem with many of the protologisms that people add. People often tend to copy things from Urban Dictionary, which is certainly not released under the GFDL.
That's yet another reason they are deleted as they are found. --Connel MacKenzie 01:23, 24 November 2006 (UTC)[reply]
I don't see why this user should have all sockpuppets blocked. If allowed to edit, he might as well be allowed to use one of them as a legitimate account, and by choice why not? I have shortened the timeframe for the block on User:Shoof to one week. The user should be informed that he will not be allowed to edit Wiktionary during that time, even under an anonymous account or any other name. The block is in place because of probable violation of copyright and minor disruption through sockpuppetry. After that time the user is again invited to edit, but asked to use a single account and respect policies, especially those applicable here. Any questions may be resolved through mail. DAVilla 20:04, 24 November 2006 (UTC)[reply]
Thank you for that sysop action. I think that is very reasonable. --Connel MacKenzie 20:07, 24 November 2006 (UTC)[reply]

Robot proposal, Connel's objections: (1) Use of robot and first word list

There follows a list of words that Connel MacKenzie has accused me of adding by the use of a robot.

In the first place, they were not added by a robot. They were added by me using an assistant program. As it was not making what I would consider 'a large series of edits', in as much as each root required a definition and an example to be hand written, and each derivative needed its example checked, I did not run in a separate account. Looking at the rate of word addition, I added ~32 words in 1 hour 34 minutes from 20:14 to 21:48, a rate of around 1 word every three minutes - well under what can be added entirely unassisted.
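
For anyone who wants to check the rate quoted above, a short Python calculation follows. It uses only the figures given in the post (about 32 words between 20:14 and 21:48); the variable names are illustrative.

```python
from datetime import datetime

# Figures taken from the post: ~32 words added between 20:14 and 21:48.
start = datetime.strptime("20:14", "%H:%M")
end = datetime.strptime("21:48", "%H:%M")
words_added = 32

minutes = (end - start).total_seconds() / 60
minutes_per_word = minutes / words_added

print(f"{minutes:.0f} minutes, {minutes_per_word:.1f} minutes per word")
# prints "94 minutes, 2.9 minutes per word"
```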

At that rate I don't think I contravened either the letter or spirit of the rules. With the single exception of 'concusss' the only difference between using the assistant and doing it by hand is that it was actually slower using the assistant because of the manual checking that was done after each addition. The only reason 'concusss' was not picked up was the arrival of Connel's rather excited demand that I stop.

Just to be quite clear: There has not been one single word I have added that has not been added as a result of a direct, manual keypress by myself after reviewing the entry concerned.

This is the list of words. I'll deal with them below.

Words/forms(++) bot added:

  1. concuss ++
  2. commentate ++
  3. came across ++
  4. comes
  5. coarsen ++
  6. co-educate ++
  7. bodge (misspelling of botch!) ++
  8. anaesthetise (not labelled regional) ++
  9. allegorise (Weird spelling not labelled regional) ++
  10. air-freight (multi-word forms don't get inflection entries, according to Ec.) ++
  11. act out (multi-word...) ++
  12. incentivise (not labelled regional) ++
  13. accessability (weirdness with 'common root')
  14. accessibleness (cr)
  15. accessive (not labelled either regional or obsolete, whichever it is)
  16. accessibly (cr)
  17. accessability (not labelled as non-standard/notaword)
  18. accessor (accesss???)
  19. accession (cr, probable copyvio)
  20. accessibility (cr)
  21. accessibleness (cr)
  22. accessable (not labelled non-standard/notaword/misspelling, related terms listed are related only by the fact that they all start with "A".)
  23. accessible (cr)
  24. accessed/accessing (cr)

I'll ignore those where there is no specific complaint.

  • 7: 'bodge'. No, this is not a misspelling of 'botch'. It appears in the Moby word list and is a word I have been very familiar with all my life. It may well be 'regional', but a problem that any editor will have is to be aware that a word with which he or she is personally familiar is not used across the entire English speaking world. Rather than research every word in the dictionaries of every English speaking country it would seem a more pragmatic solution to allow others to comment on the word if it bothers them. I note very many American usages that are not marked as regional and do not incur Connel's wrath, but more of that later.
  • 8, 9 and 12: 'anaesthetise', 'allegorise' and 'incentivise'. All I can say to that is that the objection seems to be that they are not labelled with a region. Well, check out, just picking two at random, 'socialize' and 'extemporize'. Neither of these is marked as a regional spelling. In fact I don't remember a single instance of seeing a verb with an 'ize' marked as 'regional' although they clearly are. A more valid complaint, perhaps, would be that the alternate spelling was not included, and that is an enhancement I will be pleased to make to any further verbs I add - by whatever means.
To continue on the theme, I would like to point out that AFAIA Wiktionary is supposed to be an international dictionary with no region taking precedence. Connel in particular, although I'm sure he'd deny it (and I'm not even sure he's aware of it), would seem to be working on the underlying assumption that American English is in some way the basic language, and anything else is a variant. Of course, he may simply be affected by the problem I mentioned above, i.e. that he is unaware that certain variants belong to the region: USA.
  • 10 and 11: 'air-freight' and 'act out' (multi-word forms don't get inflection entries, according to Ec.) I don't know what 'Ec' is. One of the problems of working here is that there is information scattered all over the place with no single point of access for people wanting to get to grips with the many style and content conventions. This has been mentioned recently. At the very least there should be a page that points to all the relevant pages that a newcomer will need if they are to become useful and productive as quickly as possible. Things such as 'how to edit', 'how to format', 'entry templates', 'other templates', 'inclusion criteria', 'WT:ELE' and the mysterious 'Ec' of which I seem to have fallen foul (I've tried 'WT:EC' and 'WT:Ec' but to no avail).
Not having seen 'Ec' I have to rely on logic to determine what to do and I can't see any logical reason to deny language learners the relevant inflections. This (denying) would seem to be an ideal way to ensure that people using Wiktionary as a prime source for learning English will try to use constructions such as 'act outing' or 'airing-freight'. Of course, if that's the consensus opinion, I'll abide by it.

The complaints about words with the root 'access' are dealt with separately as these relate to the proposed robot.

There is also a separate comment about the use of templates.

As to the comment:
"I don't see a single entry that conforms to en.wiktionary.org formatting guidelines. Perhaps you should focus on making it past the initial learning curve before automating things that each need considerable cleanup from others here." --Connel MacKenzie 00:53, 24 November 2006 (UTC)

I have already added several hundred words here. Some have elicited comments and these have been taken on board. The majority have been accepted without comment. I don't actually accept that I've contravened the guidelines, at least not those I've encountered, but if I have I will certainly correct the errors.

Robot proposal, Connel's objections: (2) Words with the root: 'access'

Here is the list of words with (or not) the common root 'access' with Connel's objections:

  1. accessability (weirdness with 'common root')
  2. accessibleness (cr)
  3. accessive (not labelled either regional or obsolete, whichever it is)
  4. accessibly (cr)
  5. accessability (not labelled as non-standard/notaword)
  6. accessor (accesss???)
  7. accession (cr, probable copyvio)
  8. accessibility (cr)
  9. accessibleness (cr)
  10. accessable (not labelled non-standard/notaword/misspelling, related terms listed are related only by the fact that they all start with "A".)
  11. accessible (cr)
  12. accessed/accessing (cr)

Starting with 2, 4, 8, 9, 11 and 12.

Here his objection is the 'common root' entry. That was one of the main purposes of the test, it was stated in advance (together with the reason for its placement), and the test did exactly what it said it would do.
I was aware that it would cause discussion - that was the intention, and that the naming would certainly change, the placement might well have to change, and that the whole idea might need to be scrapped. That would involve the waste of a great deal of effort but I think it was worth the risk potentially to provide a useful feature for the dictionary.
The name is neither here nor there. I was certain that whatever I used would be changed. As to why 'Related terms' is unsuitable: 'Related terms' is placed underneath definitions. As such it cannot help but be underneath a POS. However, the concordance for words with a common root is not (there may be the very rare exception) dependent on the POS. It has more in common with 'Etymology', and that was why it was placed either directly under etymology or where etymology would be if it existed.
As I said, this will require discussion and if it is felt that placing the concordance above the definitions is not desirable, then I fear the project will fail since it would otherwise involve adding it separately to each POS as part of, or the whole of, 'Related terms', which would look daft.


Now, 1, 5 and 10 (1 and 5 being the same word ???).

accessability: 1,020,000 Google hits, 1790 Google book hits.
accessable: 1,220,000 Google hits, 1760 Google books hits.
Thus both these words satisfy the inclusion criteria by a country mile.
On the matter of 10: 'related terms listed are related only by the fact that they all start with "A"', if Connel had actually done the most minute piece of research he would have discovered that the word is simply an alternative spelling of 'accessible' and is thus clearly related to the other words in the list by having the same root.

Coming to 6:

'accessor' with 1,190,000 Google hits, 12,700 Google book hits
Again, this word satisfies the inclusion criteria by a country mile. In fairness to Connel, I must point out that it does seem to be almost entirely used in a computer science technical environment, although with exactly the meaning you would expect it to have using an intelligent deduction. In fairness to myself, I would also point out that as I have been working with computers for many years the word seemed perfectly natural. Interestingly, however, there is no such word as accesser, so what you call someone who accesses other than an accessor, I don't know.

and 7:

'accession'. An interesting one. It appears to have its original etymology in 'accede' but OED says that it 'has partly occupied the ground of the earlier ACCESS', and MW online mentions 'access' in meaning 6. In fact, the reason I included it was possibly based on a misunderstanding of its origin in the usage I know it best: a library or museum 'accession' number. Since you can usually access a book or artifact from its accession number, the root seemed to fit. I would warrant that the overwhelming majority of 'related terms' and suchlike in wiktionary have not been included after copious research by the editor to ensure an apparently obvious meaning was correct.

Exactly where the accusation of copyright violations comes from is a complete mystery to me.

Finally, we have 3:

'accessive': 238 Google book hits with several on the first page being from the last two decades so it's hardly obsolete. And if we dare to take a peek in the OED without the witchfinder general screaming 'copyright violation' at us, we see that the etymology clearly supports the inclusion of the word in the list of those with access at their root.

Robot proposal, Connel's objections: (3) Use of templates and summary

On the subject of use of templates, Connel asserts that I haven't read WT:ELE.

I have.

It has this to say on the matter of templates:

"Further forms can be given in parentheses. For a noun this will simply be word (words). For an adjective this will appear as hard (harder, hardest). For a verb you may use end (ends, ending, ended). Templates are available at inflection templates for those who prefer this technique." and "the inflection word itself (using the correct Part of Speech template or the word in bold letters),"

Note well there is absolutely no suggestion whatsoever that template use is compulsory, or even preferred.

So calling an entry 'wrong' when it does not fail to follow the guidelines as written is both wrong in itself and extremely unfair.

Asserting 'I don't see a single entry that conforms to en.wiktionary.org formatting guidelines' actually needs a lot more backup. Looking at the list of unfounded complaints I am highly doubtful if the majority of entries actually fail to follow guidelines that exist in written form (at least anywhere that I've seen).

Obviously I'm prepared to be enlightened on these matters.


In summary, Connel added a very long list of objections which make it look as if I had been systematically wrecking the dictionary.

On closer examination I find that there are a couple of points relating to wikifying links which are well taken but would have been far more useful had they been expounded before the test. There was plenty of notice.

There was one major error (concusss) that I did not catch because I was interrupted by Connel's stop request and dealing with that.

There may be a couple of minor errors, and an expansion to the formatting (===Alternative spelling===) would seem appropriate. It is also a given that 'Common root' needs to be discussed and a consensus reached before it or its replacement are ever used again.

Apart from that, the accusations and implications seem either spurious or plain wrong, relating either to words that easily pass the CFI or mysterious 'guidelines' that I cannot find.

I haven't spent the last couple of weeks writing thousands of lines of code with the intention of upsetting anybody. That would be a certain way to ensure I was just wasting my time. On the other hand, when one person treats a suggestion in the guidelines that one technique is possible as a mandatory requirement, rendering anything that does not use said technique 'in violation of the guidelines', I cannot see that as anything other than a ridiculous attempt to impose his or her own particular viewpoint when it is not the general consensus.

If a thing must be done a certain way then at the very least the guidelines should use the word 'should'. It would be preferable if rules whose breach was likely to cause people to be accused of being in violation of same were called what they are: rules.

Welcome to Wiktionary, Moglex, where many of the written guidelines are significantly out of date, where we have no agreed-upon process for updating them, and where the de facto guidelines are those which the regulars assert were consensus some months or years ago.
There's been a lot of discussion about those inflection templates, with (typically) no definitive record of what the outcome was. I'm not surprised to hear that WT:ELE is at odds with today's version of consensus on their use.
In that case, Connel should have been apologising to me, for trying to impose a standard that he has had every chance of conveying to myself and other newcomers but has singularly failed to do, and thus wasting my, and others', time. Instead he goes off on an almost hysterical and ill-researched rant and makes snide, snotty suggestions that people should try 'making it past the initial learning curve'. Moglex 17:03, 24 November 2006 (UTC)[reply]
Connel always comes down like a ton of bricks on anyone doing something he doesn't immediately and completely agree with; don't take that too personally. With that said, though, what you did in your first test was certainly well beyond the bounds of what I, also, thought you were intending. I thought you were fleshing out only the three or four declined verb forms; I had no idea you were planning on trying to tie together every derived form sharing the same stem (or sharing the first few letters). That's a much more problematic task, from several respects. —scs 15:40, 24 November 2006 (UTC)[reply]
Well, it certainly isn't trying to tie together all the words with the same few letters. It is trying to find words that derive from the same stem as the root of the infinitive. However, it does not do this with a completely unintelligent approach. It looks for all the words that tend to be appendable to the root of a verb. Obviously 's', 'ed' and 'ing' are extremely likely candidates. 'able', 'ible', 'ably', 'ibly', 'ability' and 'ibility' are also common. The chances of finding spurious words, whilst not insignificant, are not actually that high since people tend not to form words by adding those endings to verb roots unless they intend the words to mean what you'd expect them to mean. No one is going to invent a better widget and call it a 'goibility' (probably).
If you look at my analysis of the list that Connel, in his haste, tried to ridicule, there really isn't anything on it that shouldn't be on it. That is the reason that the command list is heavily moderated and that 'if' the robot ever gets approval, it will be processing tens of roots per day (unless I have a lot of spare time), rather than going through the whole list in a few days.
Every word would be scrutinised with the same diligence as would a word I'd remembered or seen in print and decided to enter. In fact, the robot part of the task could well be considered a bit of a red herring. All the robot part of the system is doing is ensuring that I can spend my time writing definitions and examples and researching words that I would otherwise spend waiting for the browser to finish an update and that entries are in a 100% consistent style. That means more words with better quality entries and research which, I would have thought, could only be a good thing.
It is true that the original proposal was just for the three verb endings, but after the brouhaha about the copyright situation had died down, and as a result of much poring over word lists I realised that it could, with human supervision, go much further and provide an even better resource for people trying to learn English or improve their language skills. The revised proposal clearly stated in the second line: "The purpose of the robot and its associate command software is (at this stage) to find as many word forms as possible that share a root with the verbs in the Wictionary." Which is why I maintain that I did nothing that was not in the proposal. Moglex 17:03, 24 November 2006 (UTC)[reply]
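For readers trying to picture the approach described in the comment above, here is a minimal sketch in Python. It is not Moglex's actual code, which is not shown anywhere in this thread; the function name, the sample word set and the plain string concatenation are all assumptions made for illustration. Only the list of endings comes from the comment itself.

```python
# Endings named in the comment above; everything else here is hypothetical.
SUFFIXES = ["s", "ed", "ing", "able", "ible", "ably", "ibly", "ability", "ibility"]


def candidate_forms(root, known_words):
    """Return root+suffix forms found in known_words, in suffix order.

    The result is only a list of candidates for a human to review;
    nothing is added to the dictionary automatically.
    """
    candidates = [root + suffix for suffix in SUFFIXES]
    return [word for word in candidates if word in known_words]


# A made-up word list standing in for whatever source the real tool used.
known = {"accessed", "accessing", "accessible", "accessibly",
         "accessibility", "accession"}

print(candidate_forms("access", known))
```

Note that plain concatenation would propose 'accesss' for the 's' ending, which no word list should contain, and that 'accession' is never proposed because 'ion' is not among the endings. Both points illustrate why the comment stresses human supervision of the candidate list.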
  • Interesting three-part treatise in response to my initial critique. It remains clear that you have no interest in listening to suggestions so I probably should not waste any more of my time on you. Your robotic (what did you call them? Assisted?) edits all arrive in batches, so yes, fall under the purview of WT:BOT; meaning that yes, you are running unauthorized 'bot code here. Note to Scs: please review all the above and note that only one of my initial specific complaints has been rectified (concusss.) Moglex, your impatience does not promote your POV, nor do your (and Scs') personal attacks. I suggest you spend more time learning your way around here, before "wasting" time on coding unwanted "solutions." The shortcut WT:EC was deleted, it used to redirect to User talk:Eclecticology. In closing, I think it is very unlikely your 'bot efforts will gain approval this year. --Connel MacKenzie 17:27, 24 November 2006 (UTC)[reply]
[edit conflict]
Connel, I'm not trying to take sides here. I sympathized with Moglex on some aspects of the current situation, and criticized him on others.
You accuse him of being impatient and not listening, but really, he'd be justified in thinking the same things about you.
(And I'm sorry if this, or my earlier comment, seems like a personal attack. It's not intended as such. But when the interpersonal issues get in the way of the smooth functioning of the project, they can't just be ignored.) —scs 18:22, 24 November 2006 (UTC)[reply]
I understand your wish to mediate the situation by being supportive of his efforts, but I do think you went way too far in that support.
He proceeded with (semi) automated edits without approval, far outside the remit of what he said he'd do. So, I called him on it. It is not like you were about to.
He obviously is incapable of reading what I wrote neutrally. So you all can try to work with him - good luck. Look at some of his outrageous comments... en-us? Yeah. Right. Most of the time I'm the only one actively pushing for proper regional delineation. But arrogant Brits think that en-uk is the only English. Great way to push a stereotype! (BTW, isn't it the venerated British tome OED that recommends the -ize forms?) Or his comments that inflection templates are my personal imaginative preference? How many of those templates have I even written? ANY? And the comments about 'formats not of his personal preference'? Wow. Haven't I been the single most rabid supporter of approved 'bot activities here? Haven't I been the one to work extensively with getting each of our other 'bot operators approved? Wasn't I the one that spent years building trust in the small group of 'bot operators? Well, good luck with him, again. If he calms down, he may be able to contribute constructively without getting blocked. --Connel MacKenzie 18:40, 24 November 2006 (UTC) AKA Ton Of Bricks.[reply]
  • Interesting response from someone whose 'initial critique' was rushed, ill researched and arrogant.
  • I have spent a great deal of time providing extensive evidence that you were, in the main, talking complete twaddle.
  • You make no attempt to back up the bizarre and ill-founded accusations you have made, as indeed you cannot because the evidence that you are wrong is plain for all to see on Google, Google books and extant dictionaries.
  • Instead you become pompous and overbearing and try to bludgeon your viewpoint across, taking it upon yourself to predetermine what I understood should be decisions undertaken as a consensus of users here.
  • You whine that only one of your complaints has been addressed. That is because it's the only one that actually made sense. Most of the others are either demonstrably spurious, have insufficient detail to enable action to be taken, or are, as far as I can see just your own personal likes and dislikes.
  • It is now becoming clear to me that you consider this project to be your own personal fiefdom, and that you seem to be taking on the role of de-facto chief administrator.
  • Your almost surreal contempt for other users here is amply demonstrated by your nonchalant admission that some text that you accused me of ignoring does not even currently exist at the point by which you referenced it. Even had I found it by some weird chance its layout and content would not lead one to believe that it was the repository for some odd rule about inflections and multi word verbs.
  • Your utter cheek in bemoaning 'attacks' made by scs and myself after your own diatribe is breathtaking.
  • In closing, I would ask you to either put up or shut up. Either back up your complaints and refute some of the evidence that I've provided that they were ill considered and wrong, or withdraw them.
  • I trust that other, more reasonable, people here will recognize the old, old technique used by someone who has made rash and ill considered statements and been called on them that they are "not going to waste any more time" on the matter to be exactly what it is: a device to avoid admitting the inadequacy of thought and consideration that went into the initial statements.
But, Moglex, it does no good to attack him like this, either. He puts huge amounts of time into the project, and is a respected contributor. He's not completely wrong here, just partly wrong. If you go at each others' throats, you both lose, we all lose. —scs 18:28, 24 November 2006 (UTC)[reply]
Moglex, I would have commented earlier, but ran out of lunch hour. It is certainly true that not all our "rules" are properly documented yet (and a few are actually stated incorrectly in ELE and its related pages; when errors are pointed out, they usually get corrected quickly). This is partly because we are a younger wiki than 'pedia, using software that was not designed for a dictionary. We are also attempting to become a dictionary of a type which does not yet exist, so we have to experiment. As we do so, we mutate, and generally the fittest mutations survive longterm. When this is noticed, the ELE is changed to match the latest thinking.
While I would not defend Connel's rudeness, I will certainly support him against any equally rude incomer. He is respected by most of us as one of the most hard-working and passionate supporters of our cause, and also one of the most knowledgable about why decisions were made in past years. We have seen some good work from you in the month you have been editing, but you shouldn't expect to implement major changes in our formatting and methods without discussion. You accused Connel of treating this as his fiefdom (perhaps your perception, but not shared by most of us), and yet you appeared to be treating it as yours.
I suggest that the intemperate nature of Connel's response was partly for the following reasons (with which I agree):
  • Like Connel and Scs, I had no idea that you intended to make so many changes all at once in the trial we had agreed to. Perhaps 3 or 4 words would have been appropriate for discussion purposes, possibly the 13 you unleashed in 11 minutes from 11:27 on 23 Nov might not have aroused too much adverse comment. But at that point you should have stopped until we had had a good chance -- several days, since some of us cannot be online every day -- to comment. Certainly, no approval was given to continue with a further 49, often at a rate of 3 per minute, later in the day. It could be politely called "unauthorised bot use". Personally, I would call it "bullying". It certainly was not what was agreed.
There seems to be a certain amount of confusion here, (which I will admit is probably my fault) concerning what software is being used/doing.
There is one very large suite of software that pulls all the words sharing the root of a verb together, generates a concordance, then allows human review and the addition of missing words. That is the piece of software that produced the 13 entries with the root 'access'. If you look at the section 'Revised 'bot proposal', where the test was proposed, you will see that it did not do anything other than what it said it would do. Maybe, on reflection, I should have picked a root that would not generate so many words, but I really didn't think that 13 words out of over 300,000 would be a problem. Also, in the original proposal I suggested a dozen infinitives but when I realised that there were far more words sharing verb roots than just the inflected verbs I reduced the test to a single infinitive specifically so that it didn't change too many words before adequate discussion.
Thanks for the clarification, which is better late than never. --Enginear 17:43, 25 November 2006 (UTC)[reply]
I don't really understand why you say 'better late than never'. The bot proposal stated what the bot would do and that was what the bot did. The note about the assistant stated what the assistant would do and that was what the assistant did. Both of these were present before the associated program ran. Moglex 18:45, 25 November 2006 (UTC)[reply]
There is a second, much simpler piece of software that I wrote for adding verbs. The only way what it does differs from using a browser in the normal manner is that it adds all four 'standard' parts (infinitive, TPS, present participle and simple past) in a group after the definitions and examples have been written and checked. It does automate some of the generation process, but each word is separately dispatched to be added, and its main purposes are improved consistency, and saving the operator from having to twiddle his or her thumbs whilst the browser shovels heaps of html back and forth. Had I been furtive I could have told it to output the words at around one per minute and the result would have been no different to doing exactly the same thing manually. It would just mean I could go and make a cup of tea whilst it caught up. BTW, despite Connel's disingenuous comment "(what did you call them? Assisted?)", these types of programme are well known and have their own rules (which I followed) on the wikipedia bot page. Moglex 09:33, 25 November 2006 (UTC)[reply]
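As a rough illustration of the four 'standard' parts mentioned in the comment above, the following Python sketch derives them for a regular verb. It is an assumption-laden toy, not the assistant program under discussion: the function name and the spelling rules are hypothetical, and irregular verbs, consonant doubling and verbs ending in '-y' or '-ee' are deliberately not handled, so every result would still need checking by the operator.

```python
def standard_parts(infinitive):
    """Naive forms for a regular verb (hypothetical helper, for illustration).

    Not handled: irregular verbs, consonant doubling (stop -> stopped),
    and verbs ending in -y or -ee (try, see).
    """
    # Sibilant endings take 'es' in the third-person singular.
    if infinitive.endswith(("s", "x", "z", "ch", "sh")):
        third = infinitive + "es"
    else:
        third = infinitive + "s"
    # Drop a final silent 'e' before adding 'ing' or 'ed'.
    stem = infinitive[:-1] if infinitive.endswith("e") else infinitive
    return {
        "infinitive": infinitive,
        "third-person singular": third,
        "present participle": stem + "ing",
        "simple past": stem + "ed",
    }


for verb in ("access", "bodge"):
    print(standard_parts(verb))
```

Under these rules 'access' yields accesses, accessing and accessed, and 'bodge' yields bodges, bodging and bodged.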
That may be the major part of the problem. This is NOT Wikipedia. The bot rules are very different here. --Enginear 17:43, 25 November 2006 (UTC)[reply]
There is a wiktionary bot page which covers a subset of the rules needed. There is a wikipedia page about bots to which you can actually navigate via the help system. It says things about the difference between bots and assistants. The wiktionary page is silent about assistants so it seemed reasonable to assume that there was no intent to override the information on the wikipedia page. This view is supported by the fact that the wiktionary page requests that you read the wikipedia page. The wiktionary page also suggests that you do a test of 10-100 edits before you even mention anything in the BP, so I was vastly more circumspect than the stated policy requires.
Note also that the page says the bot operator should undo: "and any edits which consensus decides were unwanted" not "any edits that Connel MacKenzie decides to take a dislike to". Moglex 18:45, 25 November 2006 (UTC)[reply]
  • Connel has been at the forefront of a group pushing for the community to trust experienced bot runners, rather than making them seek approval for every single task, eg one bot for adding pres parts, one for adding 3pps, etc, in the face of much distrust. At present, that is not formally approved even for those (like him and unlike you) with a history of immediately reverting any bot actions which are complained about. And yet you used one. Actions such as yours quickly destroy hard-built trust.
Bearing in mind what I wrote above about the two separate programs, I cannot 'revert' the main verb additions because they are new entries and I have no way of deleting them. To be honest, I wasn't even aware that it was down to me to do the revertions (ignoring the fact that I can't actually do revertions - I don't have a button for that), as all the relevant page has to say is this: " Any complaints made about the bot during the trial period require that the bot be stopped immediately, and the issue should be resolved at Wikipedia talk:Bots." The robot trial was a fixed length and had stopped hours before any complaint, and I stopped using the assistant immediately I saw the message.
A test search of the Wiktionary site failed to find this. --Enginear 17:43, 25 November 2006 (UTC)[reply]
Well it wouldn't, would it. The clue is in the 'Wikipedia talk:Bots'. However, as I have quoted, the wiktionary page only requires that edits are reverted when a consensus has been reached that they are unwelcome. I would consider a consensus to require several people to agree exactly what is wrong and how they would like the problem resolved, bearing in mind that I can neither revert nor delete. (Clearly if there was anything that contravened written guidelines I would immediately hand edit it to comply.) Moglex 18:45, 25 November 2006 (UTC)[reply]
  • Major mutations are dangerous and need to have a consensus before action, otherwise major damage can occur, or the wiki could even "die" -- if you piss us all off enough. Connel, like many of us, thinks hard about the pros and cons of any proposed change, comparing it with previous proposals and considering possible outcomes. I have seen that in discussion. But, when ideas are hammered at him too fast to consider, he is easily riled (as indeed am I). You have put forward a lot of ideas. Some, or even all may be appropriate. But those who have been here long enough to know "where the bodies are buried" need time to reflect and comment. For example, DAVilla's suggestion below, following on from both your thoughts and an earlier (unresolved) discussion re formatting/placement of Derived terms leads to a thought of moving Languages to level 1 headers. We don't use them. Why? I don't know, but those knowing our history will enlighten us soon.
I read that level one headers are used for headwords. As to the 'ideas coming too fast', I fear that this again comes down to confusion between the two programs. There was one test of the proposed robotic task, after which I was, and still am, fully expecting discussions about the topics raised to carry on, possibly for weeks, before consensus is reached, changes are made and the actual use of the robot can be put to a vote. The other, smaller assistant program really shouldn't be a problem, especially now that I have changed it to use the inflection template and incorporate the 'alternative heading' entry. It never did add the 'common root' concordance entries since I knew that would require lengthy discussion, which is why I was completely confused by Connel's lumping both the 'common root' edits and the new verb additions together. Obviously, in the light of the fuss caused, I will now explain it fully and put it to a vote before using it again. Moglex 09:33, 25 November 2006 (UTC)[reply]
Finally, I would comment that Connel had the grace (and perspicacity) to say that his comments were quick and preliminary.
Looking at what appeared to me to be little more than a rant, his comments about it being quick and preliminary looked more like a threat that this was just the beginning and he was going to see what else he could find to nail me! If that was a mistake on my part then it is a great pity because that is what engendered a lot of the heat in the response. Moglex 09:33, 25 November 2006 (UTC)[reply]
Some mistakes are therefore understandable. It is helpful to the wiki to be able to share such comments without fear of a flaming response (though I agree that his tone encouraged an angry reply). However, your considered response also contains errors. The two which were immediately apparent to me are:
  • Bodge is indeed a dialect word. It is one I use frequently (I prefer the sound to botch) and I am pleased to see it added, but I only discovered it when I lived in the South Midlands. It is not in general use throughout the UK (and I see that OED calls it "obsolete or dialect").
Well, I was born and live and work part of the time in London and part of the time in the South East, and the word has been familiar to and used by me all my life, so if you know it from the Midlands, OED is wrong about it (currently) being 'dialect' and it's certainly wrong about it being obsolete. I'll agree, however, that it is regional. Moglex 09:33, 25 November 2006 (UTC)[reply]
  • While I would love you to have been right re -ize (I hate the usage) you are really showing ignorance of "proper" British English if you suggest that it is an Americanism.
Let me interject here. I didn't say it was an Americanism, I said it (more particularly its usage) was regional. The point I was making was that Connel was saying that the 'ise' entry should have been marked regional whereas as far as I can see from other entries the standard is for each version to have an 'alternative spelling' entry. I'm well aware that 'ize' is the original English - indeed, in an early episode of Morse, the eponymous inspector actually called someone illiterate for using the 'ise' ending :-) Moglex 09:33, 25 November 2006 (UTC)[reply]
In the last few months, there has been a growing acceptance (but not yet a consensus, let alone a formal policy) that we should use more glosses. In particular, I think that most of us want to show, wherever (and to the extent that) it is known, which regions a word is used in. There either is, or soon will be, a vote in progress on what to call words that are used in the UK and most Commonwealth countries, but not in the US. --Enginear 17:43, 25 November 2006 (UTC)[reply]
Since OED2 was published in 1989, the OED has disagreed with you (it caused an outcry at the time, but they had the data to back up their assertion, with cites throughout the modern English period). As they state "But the suffix itself, whatever the element to which it is added, is in its origin the Gr. -ιζειν, L. -izāre; and, as the pronunciation is also with z, there is no reason why in English the special French spelling should be followed, in opposition to that which is at once etymological and phonetic. In this Dictionary the termination is uniformly written -ize." So we can only be grateful that here we allow entries for both (and if you produce a bot to add all missing -ise entries for extant -ize words I want to be the first to support it).
A proposal is on the way, but I first want to make absolutely sure that there is a consensus on how it should be handled. Some entries use a redirect and some have separate entries for each version. Obviously a robot couldn't care less what it does but it makes sense to ensure that a consensus actually exists before trying to get one approved. Moglex 09:33, 25 November 2006 (UTC)[reply]
This issue is discussed every few months, and it seems to me that, at least for the last year or so, the following ideas have won out:
  • Redirects are definitely deprecated; the arguments which lead to this are complex and, to me, non-intuitive, but they are logical; the only reason some words still have redirects is that no one has got around to correcting them -- would you like to?
  • A simple entry giving "==English==/===PoS===/[inflection line]/# Alternative spelling of nnnn" is OK, but at least some of us feel more information should be added, eg definitions, cites for that spelling, etc, and
  • A full entry, to the same standard as (or better than) the original, is also acceptable; the data storage objections are balanced by other advantages, eg usability.
  • Now let's see who disagrees with that assessment! --Enginear 17:43, 25 November 2006 (UTC)[reply]
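The 'simple entry' layout in the second bullet above can be sketched as a small Python helper that builds the wikitext for an alternative-spelling stub. This is only an illustration under stated assumptions: the function name is hypothetical, the inflection line is written as the bare headword in bold rather than with any template, and the exact wording of the definition line is copied from the bullet rather than from any approved format.

```python
def alternative_spelling_stub(word, pos, main_spelling):
    """Build wikitext for a minimal alternative-spelling entry (hypothetical).

    Layout follows the bullet above:
    ==English== / ===PoS=== / [inflection line] / # Alternative spelling of nnnn
    """
    lines = [
        "==English==",
        "",
        f"==={pos}===",
        f"'''{word}'''",  # inflection line: headword in bold, no template assumed
        "",
        f"# Alternative spelling of [[{main_spelling}]]",
    ]
    return "\n".join(lines)


print(alternative_spelling_stub("realise", "Verb", "realize"))
```

A fuller entry, as the third bullet describes, would add definitions and cites for that spelling below the stub line.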
So please, accept your own fallibility, just as Connel frequently accepts his own, and if you feel you are superior to him, demonstrate it by being more patient and more polite than he is...at least until you have gained his experience and good standing in the community. --Enginear 23:47, 24 November 2006 (UTC)[reply]
I'm well aware of my own fallibility (concusss, for example), and also that there are severe limitations to what I know. Particularly about this project. I hope that up to this point I have taken all advice and criticism offered in good part and incorporated it into my work here.
But, I'm afraid that if there's one thing that gets me really, really, really angry it is being castigated for following the procedures, as presented, when the person or group doing the castigating are well aware that the procedures in question are being presented incorrectly. In those circumstances it is the person/group who are in the wrong and the fault is entirely theirs. If that opinion makes me arrogant, then I'm afraid that I'm arrogant, but I consider that to be no more than natural justice. Too many people have suffered in far more serious realms than a dictionary group because of such shenanigans for me to change my view on that. Moglex 09:33, 25 November 2006 (UTC)[reply]
The good news is that no one is killed or is imprisoned because of our wiki shenanigans! If it upsets you so much, perhaps you have the motivation the rest of us seem to lack (except in our pet areas), to sound out community opinion and negotiate written policies. But there's no point in trying unless you feel you can consult patiently until you reach consensus. --Enginear 17:43, 25 November 2006 (UTC)[reply]
One comment about the distinction between "bots" vs. "assistants" (which was mentioned several times in the preceding long thread). Connel and I have discussed this issue. Connel believes that a highly-automated assistant acts enough like a bot to be regulated as one. I believe that neither the Wikipedia nor Wiktionary bot policies (as currently written) apply to automated assistants. I'm not sure anyone else has weighed in with an opinion. I don't think there's any consensus as yet; the use of automated assistants is definitely a gray area. (See also w:Wikipedia:AutoWikiBrowser.) —scs 19:31, 25 November 2006 (UTC)[reply]
Enginear, above, suggests that I might try and sound out opinion and negotiate written policy. As I have fallen foul of Connel's personal opinion (not supported by any written documentation that I've noticed) about what constitutes a robot, I think I'll have a try.

On inflections

To clarify, the templates are in fact preferred in the inflection line. You will not be chastised for replacing the template with hard code if you do not know how to amend the inflection line for corrections or additions. However, there is certainly a problem in making such changes in batch.

Thank you for that information. Once I know what the rules are, I can follow them! Moglex 19:54, 24 November 2006 (UTC)[reply]

As we list them, inflections include only the nearest forms which are of the same part of speech. Other forms go under ====Derived terms==== (which are currently lumped with compound terms such as compound nouns or phrasal verbs that I would like to push down to a lower ====Compound terms==== section). Derived terms are defined as those terms of the same language that include the word in the etymology... although there is no need to relist the inflections. With a few adjustments your bot code could significantly improve the presence of these on Wiktionary. DAVilla 19:32, 24 November 2006 (UTC)[reply]

What about the fact that anything that goes under a POS header is automatically subordinate to that POS whereas the derived terms are effectively subordinate to the etymology? I'd originally planned on putting them right at the bottom until I realised that this would make them appear to refer only to the last POS on pages with more than one.
BTW, thank you for the constructive comments. Moglex 19:54, 24 November 2006 (UTC)[reply]
The user who instructed me on this point, if you see my talk, was Paul G. I don't understand the reasoning either. The ELE states that such terms, if the relevant POS is not easily identified, can be listed at level 3. You have my support if you want to consider them subordinate to the etymology, probably better at level 2 without a divider, and language at level 1. However it will likely take some time for such a change to come around. We'd need to see some examples—probably any common word with several parts of speech will do—and have a vote. I tend to just add them to whatever's already there, and nobody complains, but that doesn't address your question. DAVilla 20:28, 24 November 2006 (UTC)[reply]
I share some of the same feelings you do about positioning Derived terms. Theoretically, any derived term should list the term in question as its etymological root, but in many cases this relationship is difficult to demonstrate and is assumed (often incorrectly). My opinion is that if a term is derived, and so listed, then the etymology of the parent term must say so, with the appropriate part of speech indicated. If this can't be reliably done (which it often can't), then the derived term should be given at the same level as the POS header, towards the end of the entry since it's a list of other words, not the one presented in the entry. --EncycloPetey 00:19, 28 November 2006 (UTC)[reply]

archiving this page

Does anyone have any special way they like to archive this page? It was getting waayy too big (705k), so I summarily archived the rest of August, and all of September. I may do the first half of October, also. —scs 19:56, 24 November 2006 (UTC)[reply]

Perhaps we could start a campaign for brevity!  :-)
I've historically built the topic indexes right after doing an archive (just as you did) but with Werdnabot here now, I've been trying to determine the best way to convert to using that method. The main stumbling block I've seen is that Werdnabot archives based on the last comment in a section, not when it was started. The other bug with Werdnabot archives is that the intermediate links aren't recognized (but that admittedly is a much smaller problem.)
I'm not certain we can smoothly switch over to Werdnabot archiving just yet. But the sooner the glitches are worked out, the better. --Connel MacKenzie 20:03, 24 November 2006 (UTC)[reply]
What's wrong with waiting for some time after the last comment? An active discussion should be kept active, and a long inactive one is as resolved as it's going to be. Or do you mean that it actually labels the archive with the wrong date?
And what's an intermediate link? DAVilla 20:46, 24 November 2006 (UTC)[reply]
When Scs cut-n-pasted September to the archive, the entries went into the archive in the order they were created. The subtle difference with Werdnabot is that as a conversation ages, it gets dropped into the current month's archive, in the order that people stopped replying to it.
I think I could live with that. Preserving the order might be possible as well with code modification, sorting by first message date. DAVilla 21:16, 24 November 2006 (UTC)[reply]
The intermediate link stuff is actually not applicable to WT:BPA, so please forget I said anything about it.
What I had been seriously considering was tagging this page to archive to a particular page with a 90 day duration, letting 6 hours pass (for the next archive run) then setting it to the next month, with a 60 day cutoff, then six hours later setting it to /{{CURRENTMONTH}} with 30 days. Seems kind of silly now. And yet, Werdnabot seems to be inactive right now... --Connel MacKenzie 20:54, 24 November 2006 (UTC)[reply]
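For anyone who wants to see the difference in concrete terms, here is a minimal sketch of the "first date" behaviour suggested above: threads are selected for archiving by inactivity, but ordered and filed by the date they were started. This is not Werdnabot's code; the `Thread` class, the function names and the 30-day default are assumptions made up purely for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Month names spelled out so the result does not depend on the system locale.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

@dataclass
class Thread:
    """Hypothetical record for one discussion section."""
    title: str
    first_comment: date
    last_comment: date

def threads_to_archive(threads, today, inactive_days=30):
    """Pick threads with no comment for `inactive_days`, ordered by start date."""
    cutoff = today - timedelta(days=inactive_days)
    stale = [t for t in threads if t.last_comment <= cutoff]
    return sorted(stale, key=lambda t: t.first_comment)

def archive_page_for(thread):
    """File a thread under the month it was started, not the month it went quiet."""
    started = thread.first_comment
    return f"{started.year}/{MONTHS[started.month - 1]}"
```

Sorting on `last_comment` instead of `first_comment` would reproduce the ordering described earlier in this thread, where conversations land in the archive in the order people stopped replying.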
We should also take a look at whatever they're currently doing on the Wikipedia reference desks. I think everything somehow starts out in a transcluded daily subpage right away, or something, so that links keep working even after things have been archived. —scs 21:25, 24 November 2006 (UTC)[reply]
We could, you know, just create the "archive" page for each new month in advance and transclude it into WT:BP ... then "archiving" doesn't involve moving anything, just removing the transclude and leaving the link in "Beer Parlour Archives" box. You think? ;-) (yes, if someone uses the + tab it will land on the main page, but that can easily be moved whenever anyone pleases.) Robert Ullmann 21:46, 24 November 2006 (UTC)[reply]
Well, once Werdnabot is turned on, that won't be necessary, right? Disclaimer: as evidenced by how I set up WT:VOTE, yes, I am a big fan of the subpage concept. But the last thing someone needs when they are looking for help, is a maze of where to post things...which is one reason we end up getting occasional Wikipedia complaints...the WP maze is too hard for some newcomers.
If we have a generic JavaScript thing to add new sections to a subpage, then yes, I could also see that as a workable work-around. Which avenue should we be pursuing here? I am partial to keeping some amount of backwards compatibility with the old archives (obviously.) --Connel MacKenzie 22:43, 24 November 2006 (UTC)[reply]
In the WP, they just have the + tab work normally, and then scoop that into the archive file when desired. (Presumably with some automation, but we don't even need that.) Consider, if this page right now included Sept and Oct, and most of Nov except maybe the last section or three added, it would look the same, but work much better. And completely compatible with what we have been doing. Robert Ullmann 01:57, 25 November 2006 (UTC)[reply]
Tried this, section editing works as it should when the section is in the month page. Not so much if it is still here ... (which is an actual bug) will think about it ;-) Robert Ullmann 13:47, 25 November 2006 (UTC)[reply]
OK works now you can see at the top of this page in edit mode. We can now shovel large blocks into the archive pages as desired while keeping 2-3 months "current". We do have to have a L1 header at the break, which is why the (continued) header. (can't have a L1 section start in a subpage and end on the main page or vice-versa). And is completely compatible with the current method. (I do like werdnabot, but it needs some work; needs to learn how to archive under the first date, and put it on that page, not on the archive page (CURRENTMONTH or whatever) being used at the time of archiving!) Robert Ullmann 15:34, 25 November 2006 (UTC)[reply]
I should point out that the + tab works perfectly normally. Newbies won't need to know anything about what is going on; adding a section with + or section editing the last existing section are unchanged. Robert Ullmann 15:48, 25 November 2006 (UTC)[reply]

Um, that was pretty radical. But the page isn't too big now ... Robert Ullmann 16:56, 25 November 2006 (UTC)[reply]

That was a little too bold. How do you undo that? --Connel MacKenzie 23:11, 25 November 2006 (UTC)[reply]
You'll note that some objected to this format (history no longer displays what was edited.) --Connel MacKenzie 23:31, 25 November 2006 (UTC)[reply]

Undoing that was painful (browser lockups while restoring over 10,000 edits.) Please don't be bold in this process. --Connel MacKenzie 00:54, 26 November 2006 (UTC)[reply]

I'm sorry.
I don't understand what you did to try to unwind it, it certainly didn't require moving page histories or edits. I would have happily put it back; the only history that would be off is a half-dozen or so edits in the last few hours that would be attached to Nov instead of this page. (And now the first part of Oct is duplicated?) Coulda just asked me to put it back? I am very sorry. Robert Ullmann 01:05, 26 November 2006 (UTC)[reply]
The main point is not to be rash about archiving here. Wanna be bold? Please be bold on a "lesser" page (like the experiments still ongoing for WT:GP!) Or on your own talk page and then discuss it/showcase it. But WT:BP is a little too busy for that. (BTW, I did catch and correct my error with October, a few moments before you posted.) --Connel MacKenzie 01:09, 26 November 2006 (UTC)[reply]
Well, but I did - on the subpage, where you didn't/couldn't notice the reply. Anyway, it probably was hard to tell what was "being bold" vs. what Scs did (which was following the previous archiving conventions.) I did learn from your experiment that we definitely do not want to go that route. Said another way, my objections to the subpage method are now crystal clear...and I know I am not the only one that tracks this page by history.
Knowing that the subpage technique is unworkable, makes the Werdnabot option look much better now. Perhaps if it gets turned on now (with 30 days of inactivity causing it to archive to the current month's subpage) only October 2006 would be "funny." Actually, I'll just nag Werdna some more (currently on IRC) for the 'first date' enhancement. If that isn't feasible, we can return to "Plan B" right? --Connel MacKenzie 02:29, 26 November 2006 (UTC)[reply]
Robert, sorry for goading you into what we now see was an ill-advised experiment. I described how I thought the Wikipedia Reference Desk archiving worked, and it seemed plausible, but it looks like I was wrong. The content doesn't go onto a subpage immediately; rather, it gets copied to a subpage after a day or so. The previous two days' subpages are transcluded back onto the main reference desk page. This works well for the fast-paced reference desk conversations (which usually only last a day or two), but clearly represents too fast a turnover for our typically more extended Beer Parlour conversations. I don't know if the same scheme, with a different time interval (maybe based on weeks rather than days?) would work for us, but it might.
The Reference Desk archiving is performed by w:User:RefDeskBot, owned by w:User:Martinp23.
It's also worth thinking a bit more about the page history issue. I know exactly what Connel's complaining about, because he's right, he's not the only one who tracks pages like this by history, and I noticed the problem immediately, too. But, the biggest problem there was not the subpage scheme at all, but rather the sudden move of a bunch of live threads to the brand-new subpage. Once we were in steady state with a subpage archiving scheme, it might be decently workable. On the one hand, if you were interested in every thread you might have to check 2-3 pages' history instead of just one, but on the other hand, if you were only interested in one or two threads, you could see their history a bit more clearly, undiluted by all the other days' threads you didn't care about. (And in fact, for this to work best, brand-new threads would start on their starting day's subthread, so that they'd never get moved, so maybe the idea wasn't so ill-advised after all.)
scs 22:10, 26 November 2006 (UTC)[reply]
It's possible. :-) Am I the only one who thinks that talk pages for discussion pages are useless? (For me, at least, it's a good thing this discussion isn't taking place on the talk page, because if it were, I'd never notice it.) —scs 22:13, 26 November 2006 (UTC)[reply]
Well, drat and darn. It is one of only two topics that actually should be on the discussion's talk page.  :-)
Shall I proceed with turning on Werdnabot archiving in the "broken" mode? That is, archive anything inactive for 14 days to the current month's archive sub-page. This will look a little weird for November/December 2006, but will be consistent thereafter. As long as it runs within the next six days, that is. --Connel MacKenzie 20:55, 23 December 2006 (UTC)[reply]

OK, fugedabouwdit. I'll do November and December in the old-style archive, then after tomorrow (starting the new year) I'll turn on WerdnaBot at 14 days. That way, January 2007 will have about a three week gap, but archives from that point forward will be consistent. (Plus, it is easier to explain the distinction at the year boundary.) --Connel MacKenzie 09:48, 31 December 2006 (UTC)[reply]

Request for comments for verb entry assistant

This is a very simple program that assists in the entry of verbs.

The only way that anything it produces differs from normal, manual, additions is that the infinitive and inflections appear together within about a minute rather than being spaced out.

Purpose: The purpose of the assistant is to enable the operator to add the infinitive, TPS, past and present participle without quadruplicating the work involved. It can also prepare the alternative set of words for a verb with an ise/ize ending (although this feature is currently disabled until I can be absolutely sure what the consensus on that is).

Benefits: There are two benefits. One is that the reduction in duplicated effort can be put to good use in better research and preparing examples; the other is that the additions will be 100% consistent since the assistant does not forget things such as the emboldening of the headword in the example, and other niceties. For the operator it reduces the time wasted staring at the screen waiting for things to happen.

Operation: When given an infinitive, the program generates the inflections and presents them for approval. Once they are approved or corrected, the operator can then type in the definition and example lines. One click wikification is possible here. Once this has been done, the program generates the correct definitions for the inflections and attempts to generate correct example lines. These are then presented to the operator for approval/correction and once all is well the generated text is formatted and dispatched for addition to the dictionary. Apart from the spacing of the additions there is absolutely no difference between using this program and doing the same job manually, except, as mentioned, it does not forget things like emboldening, capitalisation and full stops.

The program sits in a strip at the bottom of the screen and can remotely control a browser above it to look up any word in the reference of choice.

History: This program has got off to an unfortunate start in that its output was lumped together with a test for the much more complicated POS/concordance generating robot. The two legitimate complaints, namely that it ought to use the inflection template and that verbs with ise/ize endings should have an alternative spelling section have been addressed.

Test: There is little point in running a test at the moment since the result would be indistinguishable from manually added words. I have, however, added the word 'cinematise' and its inflections in the exact format they would be added by the assistant. Thus if there are any comments to make about the formatting of that word they will also apply to the assistant.

Running: If there is a consensus that this program is safe to run, I think that at least the infinitive should run as a standard user account so that the definitions and examples are reviewed. It might be a good idea to add the inflections via a robot account so that the recent changes do not get cluttered up with near identical entries. I can easily arrange that the software adds different parts of speech from different accounts.

Any and all comments, corrections, and suggestions for improvement are welcome.

Should anyone else wish to run this program for their own additions (assuming the consensus is that it may be used), they are welcome to have a copy ('doze only, I'm afraid - wine?). It could easily be converted for nouns/adjectives/adverbs, and it really does remove a great deal of the unproductive grunt work from adding entries. Moglex 10:50, 25 November 2006 (UTC)[reply]
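To make the "Operation" step above a little more concrete, here is a rough sketch of how regular inflections might be proposed from an infinitive. It is not the assistant's actual code (which has not been posted here); the function name and the rules are assumptions, and the rules deliberately ignore consonant doubling and irregular verbs, which is exactly why the proposals would need operator approval before anything is saved.

```python
VOWELS = "aeiou"

def propose_inflections(infinitive):
    """Return (third-person singular, simple past, present participle).

    Naive regular-verb rules only; the results are proposals for a human
    to approve or correct, not guaranteed forms.
    """
    ends_consonant_y = (
        infinitive.endswith("y")
        and len(infinitive) > 1
        and infinitive[-2] not in VOWELS
    )

    # Third-person singular: watch -> watches, carry -> carries, love -> loves
    if infinitive.endswith(("s", "x", "z", "ch", "sh")):
        tps = infinitive + "es"
    elif ends_consonant_y:
        tps = infinitive[:-1] + "ies"
    else:
        tps = infinitive + "s"

    # Simple past: love -> loved, carry -> carried, watch -> watched
    if infinitive.endswith("e"):
        past = infinitive + "d"
    elif ends_consonant_y:
        past = infinitive[:-1] + "ied"
    else:
        past = infinitive + "ed"

    # Present participle: tie -> tying, love -> loving, see -> seeing
    if infinitive.endswith("ie"):
        participle = infinitive[:-2] + "ying"
    elif infinitive.endswith("e") and not infinitive.endswith(("ee", "oe", "ye")):
        participle = infinitive[:-1] + "ing"
    else:
        participle = infinitive + "ing"

    return tps, past, participle
```

A verb such as "stop" would come out as "stoped" and "stoping" under these rules, so the approval step is doing real work.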

This seems good to me, but I have four comments:
  • Connel has a bot which, I think, can do much of this. It would be worth discussing with him to ensure the most appropriate methods are used for every case. In particular:
This isn't really a robot, though. Its primary purpose is to enable verbs (at the moment) to be added more efficiently and consistently by undertaking the grunt work in creating an entry. Adding the other inflections is really more of a by-product to avoid generating large quantities of unsatisfied links (although it also has the benefit that it is easy to add examples for all inflections which seems a nice feature). Of course, it would be less work to take it out! Moglex 19:52, 25 November 2006 (UTC)[reply]
I thought I said before that I liked the idea of this manual supplement to TheCheatBot et al. My concern was entering things as only one part of speech, when multiple apply. (My bots add entire entries, only if there is no stub already entered.) There are many combinations that I specifically exclude, opting instead to just skip the term(s). So having a semi-automated supplement is a Good Thing (tm), as long as it works right. --Connel MacKenzie 07:14, 26 November 2006 (UTC)[reply]
The fundamental rule that I built into my system was: Do not, under any circumstances, replace or modify a section written by someone else. Apart from that, there are two common cases. A TPS that is also a plural relies on the operator to tell the software that this is the case, after which it adds either, both or neither POS, depending on what is already present. For the PPL/adjective case, at present I must admit I have assumed that the PPL can always function as an adjective. Of course, it would only take one counter example to show that approach is wrong, and it would need to be made 'turn-offable'. It follows the same scheme as the TPS/plural case in that it will then add only what's missing. Moglex 08:55, 26 November 2006 (UTC)[reply]
  • What are you intending to do when you find a word like loves which is both plural of a noun and 3pps of a verb? I seem to recall that Connel's bot can do both, provided there is not already an entry for one or the other. If you do just one, you may actually be making more human work rather than saving it (but check with Connel -- I may well be wrong).
I click on a button and it generates the plural as well, it then merges the created entry with the extant entry giving precedence to what is already present. i.e. it will not overwrite anyone else's work. Moglex 19:52, 25 November 2006 (UTC)[reply]
As I said before, my bots only create entries that don't exist (only.) Partly by design, mainly for pragmatic reasons. Since my bots have to wait for any given XML dump to refresh the feeder lists, I don't see any fundamental problem with this. I do wish to caution Moglex that for a very long time, if I detected any combination, the uploading was skipped (therefore, you should expect an abnormally high number of terms that do have more than one part of speech - particularly if wikilinked from inflection lines elsewhere.) --Connel MacKenzie 07:14, 26 November 2006 (UTC)[reply]
Looking again at that aspect, the only problem that I can see at the moment (more will undoubtedly pop up if it's ever run again) is that if the ppl is present then the adjective is not added. This does require a change to actually read the page again just before the root is processed to check for the presence of the adjective (since I don't want to spend time providing examples for adjectives that are going to be skipped a few seconds later). Moglex 08:55, 26 November 2006 (UTC)[reply]
  • I suggest you either steer away from rare words, or build in a feature that checks books.google.com or something similar, to see that the word is actually common enough to meet our CFI. For example, cinematise, cinematised and cinematising may just meet our CFI, but I doubt that cinematises does (there are no b.g.c hits for it). (BTW, the same words in -ize are fine.) This will get much worse if you start comparing adjectives, since the comparative and superlative forms usually get many fewer cites.
Interesting problem. I hadn't really considered that. I had assumed, perhaps somewhat naively, that if the infinitive passed muster then the three main inflections would automatically follow since they could crop up at any time.
I also have made that same assumption. I'll dig around for the relevant conversations... --Connel MacKenzie 07:14, 26 November 2006 (UTC)[reply]
Having thought some more about this, I'm rather concerned about not making that assumption. It leads to two problems: 1) Permanent red links where an infinitive has been inflected but one POS does not meet the CFI, and 2) in the dreaded ize/ise cases, anyone who looks up a word they have heard will most likely use the ending appropriate for their location (or birth location), and it would be a pity if they couldn't find it just because that version has never made it into print there. Moglex 08:55, 26 November 2006 (UTC)[reply]
This issue suggests something I've suspected already: perhaps the automatic links to the inflected forms (which are so often red) aren't such a good idea, after all. More on this in a new thread below. —scs 22:39, 26 November 2006 (UTC)[reply]
Maybe it would be better to omit the inflections altogether if people are worried about inflections that don't meet CFI creeping in. Although I could automate the Google books search, it isn't reliable. Try looking for 'bodge', and you think you're OK, but look at the entries and, well, I really never knew there were so many people called 'Bodge'. Moglex 19:52, 25 November 2006 (UTC)[reply]
Well, I see where you're coming from (although "throw enough mud at the wall and some of it will stick" isn't actually a verb :) ), but isn't it rather throwing the baby out with the bath water to ban all two word verb constructs just because you might find something you can't inflect? This seems a very valid point to pursue because it's just the kind of thing that is likely to confuse people learning English who end up put outing the cat after airing freight something to their mother back home.
Enginear's recollection matches mine. Multi-word terms, hyphenated terms, idioms, phrases and proverbs all wikify all the component words, so that someone can see individual word inflections only. For idioms, there are additional rules (that I don't always understand at first glance) for putting the main entry idiom at the most linguistically basic form, with redirects from the common forms. (Idioms are one of the few things that get an exemption for redirects - most everything else cannot use a redirect.) My bots skip all multi-word and hyphenated terms because of this previous rule. --Connel MacKenzie 07:14, 26 November 2006 (UTC)[reply]
From the POV of this tool, the easiest thing to do is just not to inflect anything with more than one word. It makes less work for the operator! The only problem with that is that it denies English learners the information about how they should inflect cases such as those mentioned above. I suppose in most cases it's obvious, provided they know which of the words is the verb, but 'air freight', for example is more difficult: Which is the verb? Perfectly obvious to a native English speaker but to learners, are you airing something that happens to be freight, or freighting something by air. Moglex 08:55, 26 November 2006 (UTC)[reply]
You have a very valid complaint, that en.wiktionary is documentation-poor, for things like this. However, if you wish to push for inflecting phrases, you have a long uphill battle, as that has already been a non-issue for a very long time now. Such a thing would be a separate beer parlour discussion, followed by a month long WT:VOTE before that change could reasonably be considered. My suggestion that you edit entries manually for a couple weeks before automating what you think is correct, would eliminate misconceptions like this. --Connel MacKenzie 16:18, 27 November 2006 (UTC)[reply]
But I did spend a couple of weeks adding several hundred words manually. At the beginning I made quite a few mistakes, towards the end a lot fewer. You must appreciate that the only way I was going to find out about the inflected compound verb rule was by making the mistake. Had I settled down to add a couple of dozen of that type of verb manually, I could quite easily have done that in an hour or so and if no one had happened to check any of those entries during that period, I would not have been apprised of the error until all 48 words had been added. The assistant element in all this is a complete red herring. In fact, because I was using the assistant and thus, as it was new, carrying out a further level of checking, there were actually fewer words added during the period that you are objecting to. --Moglex 08:32, 28 November 2006 (UTC)[reply]
Thanks for the comments Moglex 19:52, 25 November 2006 (UTC)[reply]
    • I ran a spelling checker on this before posting this time. Apologies if that has changed anyone else's text but it doesn't show the context when pointing out errors. Moglex 08:55, 26 November 2006 (UTC)[reply]
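The "never overwrite anyone else's work" rule discussed in this thread can be illustrated with a small sketch. It assumes, purely for illustration, that a page is held as a mapping from part-of-speech heading to section text; the function name and that representation are hypothetical and are not taken from the assistant or from any bot mentioned above.

```python
def merge_sections(existing, generated):
    """Add generated sections only where the page has no section of that name.

    Anything already on the page takes precedence and is left untouched.
    """
    merged = dict(existing)  # copy, so the caller's page data is not modified
    for heading, text in generated.items():
        if heading not in merged:
            merged[heading] = text
    return merged


# Example: "loves" already has a hand-written Noun section.
page = {"Noun": "Plural of love (written by another editor)."}
proposed = {
    "Noun": "Plural of love.",
    "Verb": "Third-person singular of love.",
}
result = merge_sections(page, proposed)
```

In this example the existing Noun section survives unchanged and only the missing Verb section is added.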

Frysk - that's not West Frysian

Dear all,

I would like to comment on the entry to the Frysk wiktionary on the main page - it has been called "West Frysian". That seems a rather unfortunate and - moreover - incorrect translation. 'Westfries' happens to be the adjective and the dialect spoken in a region in Holland called West-Friesland. Just like there is a region in Germany called Ost-Friesland. (Just Google with 'Westfries' and you'll find some info). I propose to change the name to simply 'Frysian'.

Cheers,

Sander

P.S. I hope I put this comment in the right section - I'm completely new to this...

Both the Meta site [7] and Wikipedia give West Frisian as the English translation of Frysk. It is unfortunate that the term also applies to the dialect in West-Friesland, but this sort of confusion is not uncommon in English. The Pennsylvania Dutch are actually German settlers in the US. Dutch is the language called Nederlands, but Deutsch is called German. The term used on the Main Page is correct for English, though it may be confusing. --EncycloPetey 00:36, 30 November 2006 (UTC)[reply]

Achieving consensus for a written policy on robots: 1 - robots and assistants

Having fallen foul of the lack of any up to date written robot policy on Wiktionary, and at Enginear's suggestion, I'm going to try and sound out opinion about some of the aspects of proposing and approving robots and assistants.

Firstly, since there is no mention of assistants on WT:BOT some stuff about those.

The current status

As assistants are not even mentioned on the Wiktionary robot page, but are mentioned on the Wikipedia robot page, and as the Wiktionary robot page explicitly asks you to read the Wikipedia robot page, it seemed entirely logical to assume that the silence of WT:BOT on the subject meant that the policy on assistants devolved to the entirely sensible policy mentioned on WP:B.

This states that approval is not required for assistants.

Why make a distinction?

  1. There already is a distinction: Your browser is a robot. It takes what you want to communicate to Wikipedia and packs shed loads of html and other protocol information around it before sending it to the servers. It similarly unpacks it on the way back. Making a distinction based on whether you are using a browser or some other software is entirely artificial.
  2. The rules for robots are designed to protect the dictionary from damage. The reason special rules are needed is that robots can do a lot of damage very quickly, hundreds of times more quickly than an unaided human, and they can go on for a lot longer. Although an assistant is likely to enable faster editing, if the task genuinely qualifies for an assistant (see later), then it probably won't speed up edits by that much - two to three times maybe. Furthermore, robots should operate with robot flags to ensure that they don't clutter up the recent changes list. If you accept the prime criterion for considering a task to be an assistant rather than a robot specified below to be valid, by their very nature (requiring human decision making) assistant tasks should appear in the change list so that the human decision is checked.
  3. If you have a rule, it should be enforceable. If someone wants to run an assistant surreptitiously, they merely need to set it to only make a change every minute or so (with some random element added), and (apart from the complete consistency - right or wrong) it will be impossible to tell that one is being used.
  4. Assistants should be encouraged because:
    1. They make the operator more productive
    2. They make the work less boring, meaning that people are likely to do more
    3. They promote consistency and, providing they are well written, save others from having to tidy up entries later
  5. Whilst a robot is unlikely to be needed quickly (and if it is, its approval could no doubt be expedited), if someone feels they want to use an assistant, and goes to the trouble of writing one, making them wait for several weeks for discussions and votes may well lead to their becoming bored and moving on to other things.
  6. With a robot, the potential problem is the problem with the payload multiplied by the rate at which editing is performed. With an assistant, the problem is far more with what the operator is doing, and this really isn't down to the automation. An unassisted operator working quickly when the servers are responding well can muck up a hell of a lot of pages in an hour.

In summary, I would suggest that an assistant is very different from a robot and the rules governing one, whilst important, should be far less stringent than those governing a full robot. Moglex 20:47, 25 November 2006 (UTC) — This unsigned comment was added by Moglex (talkcontribs).[reply]

On that, I agree. Perhaps the most important aspect of 'assistants' is that they be announced well in advance to being used, so that code can be reviewed, etc. Unannounced assistants should have very specific penalties (i.e. the spambot attacks of the last two evenings.) (Note: I am not saying Moglex had anything to do with those - I am suggesting the opposite! However, misuse must be directly addressed.) --Connel MacKenzie 16:09, 27 November 2006 (UTC)[reply]
I'm not concerned here with penalties, but as you can see, my proposal very much requires announcement and keeping the community informed. If you check WP:BOT, you will find that it excludes assistants from the approval process and does not, in fact, impose any requirement for prior announcement.

How to classify an assistant

Firstly, I think it is important to make some restriction on what you can classify as an assistant, since if your definition is too wide the system could be abused by, for example, an 'assistant' that placed words on the screen one after another and only required that you press the enter key to set off the edit. This would allow you to initiate 1800 edits in a minute simply by using the keyboard auto-repeat, and would clearly be an abuse.

One possible way of excluding abuse of the designation would be to say that you cannot use an assistant for a task that could be performed by a robot. In other words, unless you can show a definite and positive requirement for human intervention, either for composing text or verifying that an edit is appropriate, your task cannot be considered for assistant status.

This would automatically allow tasks such as new word entry, and re-wikification of articles since these clearly cannot reasonably be performed by a robot.

It would similarly exclude tasks such as routine reformatting, changing section ordering, and replacing deprecated templates with their up to date equivalents.

Further, in cases where some edits require intervention and others do not, there should be a minimum proportion of edits requiring intervention. That is, you cannot write a robot that detects that one in a hundred edits requires human intervention, let it process the others unaided, and call it an assistant. If any pre-processing filtering can be done without intervention then that must be done and the automatically processable edits passed to a properly authorised robot. Sounds complicated, but I think in practice it would be easy to spot the difference.

A further restriction (I think) should be that an assistant cannot churn through the database in real time. It should either have a command file prepared by examining a downloaded database (and recheck selected words for intervening edits before processing them) or it should require each word or root to be entered manually.
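
The "recheck for intervening edits" step can be sketched roughly as follows. This is only an illustration in Python: it assumes the command file records, for each word, the last-revision timestamp seen in the downloaded database, and that some lookup (here the hypothetical `live_timestamp` argument) can report the current timestamp of the live page. None of these names come from an existing tool.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Command:
    """One prepared edit from the command file (hypothetical format)."""
    title: str           # page the assistant intends to edit
    dump_timestamp: str  # last-revision timestamp seen in the downloaded dump


def split_stale(commands: Iterable[Command],
                live_timestamp: Callable[[str], str]
                ) -> Tuple[List[Command], List[Command]]:
    """Separate commands that are still safe to run from those whose page
    has been edited since the dump was taken.

    live_timestamp is an assumed lookup returning the current
    last-revision timestamp for a title; how it talks to the site is
    deliberately left out of this sketch.
    """
    safe: List[Command] = []
    stale: List[Command] = []
    for cmd in commands:
        if live_timestamp(cmd.title) == cmd.dump_timestamp:
            safe.append(cmd)
        else:
            stale.append(cmd)  # page changed in the meantime: needs a fresh look
    return safe, stale
```

Anything that lands in the stale list would go back to the operator rather than being processed from the out-of-date command file.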

What should the formalities be?

As stated earlier, I think the main problem with an assistant lies with what the operator is actually trying to do and the way he or she is trying to do it rather than the program running amok and devastating the database.

An initial stab at the process for setting up an assistant would be:

  1. Prior to any use, in the BP
    1. Specify what you are trying to achieve
    2. Specify the method you are using to pick pages to edit
    3. Delimit the areas that are the responsibility of the operator and those that are the responsibility of the software
    4. Specify how you will handle any unusual cases you encounter
    5. Specify the formats you will be using (by manually editing a few pages to show variations you intend to use).
  2. After a couple of days
    1. Take note of all comments
    2. Modify any techniques that have been shown inadequate
    3. Produce a revised BP entry
  3. After two more days
    1. Announce when you will run your assistant for the first time
    2. Use the assistant to make some specified number of edits
    • I think Connel has some recommendations for linking the pages modified so that people can see the results of the test.
  4. After the test
    1. Publish a list of pages on which your assistant has made changes
    2. Wait for 24 hours for comments
    3. If anyone points out anything you have done that contravenes a written guideline or is indisputably in error, correct it immediately.
  5. At this point, if no objections are made, it should be pretty safe to use the assistant.

Of course, this assumes that the potential user acts in good faith and does not, for their sample runs, pick examples that they know will avoid contentious or complex edits that will occur in more prolonged use.

To be continued.

All comments welcomed. Moglex 20:47, 25 November 2006 (UTC)[reply]

If you replace the word "day" with "week" (or month) in all of the above, it might be more reasonable. Most of the long-standing contributors have very little time they can donate to this project and get only a couple hours each week. If they spend the first hour reading the beer parlour, they have less time to make meaningful contributions. If they've accumulated a list of words they wish to work on, then any beer parlour distractions can be viewed as counter-productive (to say the least!) --Connel MacKenzie 16:01, 27 November 2006 (UTC)[reply]
Thank you for your comment. Five weeks to five months for something that just saves a bit of typing does seem rather a long time. I fear that it would deter many from bothering. I wasn't aware that people found reading the BP such a chore - there certainly seems to be plenty of traffic here. Also, I wasn't envisaging that proposals would be appearing on a daily, or even weekly, basis. I know that Wiktionary is not Wikipedia, but it does seem a shame that whilst Wikipedia suggests that it might take up to a week to get approval for a full robot, you think it should take between five weeks and close to half a year to get approval for something that is manually supervised. I wonder what timescale you think would be suitable for a full blown robot. Moglex 17:10, 27 November 2006 (UTC)[reply]

Achieving consensus for a(n up to date) written policy on robots: 2 - robots

Having re-read WT:BOT for about the tenth time, it seems to me that most of the elements of the policy are already agreed, or nearly agreed, but in places they are a bit vague and there is a major problem in that no one actually seems to believe them.

For example, the few tests I ran that set off the recent furore contravened only one element of the process listed below:

Process

If you think you have a task likely to be done much more easily by a bot, try using the python wikipediabot under your own account   
and see if it helps. Only if you've made sure that it will significantly help getting the task done, create an additional account 
for your bot tasks. The bot account's name should contain your own username, and an indication of the kind of tasks the bot will be 
performing. Create links from the bot account's user page to your own user page, and preferably vice versa as well.

Then, do a test run (under the bot account) on some 10-100 entries until you're certain everything goes well. If the edits are minor (as most bot-performed tasks should be),
please mark the edits as minor so that they do not swamp the Recent Changes list.

If all goes well, post a request for bot status at the Beer parlour. Clearly state:

The contravention being that I forgot to change accounts before running the test.

And yet, even people who were being helpful accused me of behaving in a way that would destroy carefully built up trust.

Well, trust is a two way street. If you want to be able to trust people to obey the rules, you need to ensure that they can easily and unambiguously determine what those rules are. Whilst I can easily and unambiguously determine the rules for an initial test from the above, and despite going far beyond those rules in terms of keeping people informed of what I was planning, I was still accused of acting in a reckless and irresponsible manner.

The only actual problems were one ill formed POS and some minor formatting problems relating to 'invisible' guidelines that are contradicted by the actual written guidelines.

So, it seems to me there is a need to:

  1. Check the current consensus on the rules for a robot.
  2. State those rules clearly and unambiguously so that if you believe someone has acted incorrectly you can say exactly what they did or did not do in error.
  3. Make sure that these rules are the ones Wiktionary users actually find, rather than have them follow the help pages to the Wikipedia bot page and have to find out by chance that they're looking at the wrong one.
  4. It should also be made very clear which parts of the Wikipedia page actually still apply to Wiktionary (or, preferably (in my opinion), copy the relevant sections to WT:BOT).
  5. Accept that if someone follows written guidelines and those guidelines are out of date, that is the fault of those who changed the guidelines and did not document the fact rather than the user who followed the published rules.

Looking at what is already present, and the associated discussion page (and bearing in mind my experiences wrt these rules), it looks as if the following need to be discussed, agreed, and restated:

  1. How much information does the potential robot operator need to give the community before testing starts.
    Personally, I think quite a bit. If the robot is truly simple, it isn't a chore, and if it isn't, I know from many years of experience that no one ever thinks of everything at the beginning. (There may be an exception if someone is automating a task that they have been doing for an extended period by hand). By and large, exposing your idea to scrutiny is likely to save you embarrassment later on and possibly a lot of reversions. And apart from that, why turn down free consultancy?
  2. How many edits should be made for an initial test.
  3. Should there be subsequent longer tests.
    Anything other than the most trivial of tasks (and how you decide what qualifies for that epithet, I don't know) should probably undergo at least two tests, possibly more, increasing the number of edits on each test. That way you need to make one correction for that 'one in a hundred' exception in a medium sized test rather than 1,000 in a 100,000 edit run.
  4. What actually constitutes a consensus to go ahead.
  5. What should be the timescales for each part of the process.
    Again, there is the problem of determining what constitutes a trivial task, but having said that I would have thought that, start to finish, a week for minor tasks and two weeks for major ones would be about right. That's assuming everything goes smoothly. Objections, improvements and errors in tests would obviously extend the timescale.
  6. Under what circumstances should the robot operator undo edits made by the robot.
    1. If written guidelines are ignored.
    2. If 'obvious' errors are made.
    3. If any change to the operation of the robot is made between test runs such that an unwritten guideline that was not ignored in an earlier test (when others would have been able to comment), is now ignored.
    4. If the edits deviate from what was stated.
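
The arithmetic behind staged tests (point 3 above) can be illustrated with a few lines of Python. This is only a sketch: the 'one in a hundred' rate is the figure used in the discussion, and the run sizes are illustrative assumptions.

```python
def expected_exceptions(edits: int, one_in: int = 100) -> int:
    """Edits expected to hit the exceptional case, assuming it turns up
    once in every `one_in` edits (an assumed, illustrative rate)."""
    return edits // one_in


# Compare a medium-sized test with a full run at the same exception rate.
for run_size in (100, 1_000, 100_000):
    print(run_size, "edits ->", expected_exceptions(run_size), "to fix by hand")
```

Catching the exception in the 100-edit test means one correction; missing it until the 100,000-edit run means a thousand.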

To be continued.

All comments gratefully received. Moglex 11:49, 26 November 2006 (UTC)[reply]



Please add comments above this line. Below it is a personal message that Connel wrote to me and my reply to it. It is nothing to do with the article above.


Since you are so easily offended, I wish to remind you that everything I have written below is in a neutral tone. Nothing is intended to be harsh (but, if you so choose, you could interpret it that way - I certainly have no control of your emotions.) Genuine complaints are just that - genuine complaints, not personal attacks.
I am not that easily upset :-). What does upset me is a) people writing a long list of criticisms, some of them entirely spurious, without doing the smallest amount of research to see if the criticisms are valid, and b) people changing the 'rules' from those published and then blaming me when I act in accordance with the rules as published. I'm sorry to be blunt, but that is the action of a bully.
You have put the cart before the horse. The purpose of rewriting the policy that way was to encourage established users with a clear way of getting bot approval. You, on the other hand, automated first, then visited to see what our formats are like.
That is clearly absolute nonsense. I have added several hundred words, all bar about 50 by entirely manual editing. The vast majority have not been corrected in any way and I meticulously check my watchlist to see how words I added are edited, and incorporate the rules I infer from that into my editing procedure.
The adage of "it is easier to ask forgiveness, than permission" does almost work well, when you are learning your way around. But not when you are automating uploads.
The primary method of detecting 'bot' activities is by the rate of editing. Obviously, you were not typing those entries yourself. That simple rate of editing is what tipped your hand.
Of course, the fact that I followed the only existing guideline concerning assistants that can be found through following instructions on WT:BOT and actually posted a message to say what I was going to be doing meant that any hand tipping was irrelevant. I'm not trying to beat the system here. If I was trying to be covert, I'd simply have caused the assistant to space the edits out. I've said it before, but it does not seem to have sunk in: If you look at the time taken to add the 30-40 words added by the assistant, I could have done the job considerably faster by hand. If you look at two sets of words I entered manually yesterday: "cohere + infl and cinematise + infl", you will see that in each case all four parts were entered in 2-3 minutes, well under a minute a word, whereas in the time you are complaining about the rate of addition was more like 3 mins per word.
For minor assistants, I certainly use my monobook.js to a great extent. Despite minor problems with it, I invite others to use it (WT:PREFS.) However, it is code that is inherently open to review, and every effort possible is made to ensure that others are able to comment about specific features long before they are used for any volume.
Your assistant, on the other hand, is written in some unknown language, with closed source code. So, it was easily mistaken for similar python tools (such as were used for Spanish inflections) that achieve the exact same result. My apologies for calling your mystery assistant a "'bot." Do understand, that in this context there can't be a distinction.
Apart, perhaps, from the rather important distinction that it was only actually doing edits at the rate of one every 2-3 minutes whereas a robot would have been doing them at the very least 2-3 times faster than that, and the fact that all edits were individually checked and dispatched by an individual human keypress. Oh, and the other minor indication: that each root had a custom definition and example. If you believe my programming skills enabled me to write a robot to do that I'm extremely flattered.
One of the precepts of the policy you are reviewing, for this version anyhow, is that bot tasks are wanted activities. You seem to be automating for the sake of automating.
I'm automating for the sake of being more productive (and I'm very happy for anyone else who wishes to, to use any tool I create). If I find that to add a word takes two minutes of definition writing, and three minutes of emboldening, italicising and waiting for the browser to refresh, then there is clearly a lot of time being wasted and if I have the ability to save that time, I will try to do so. That seems to be an entirely reasonable attitude. YMMV.
Without an understanding of the existing en.wiktionary formats, that obviously had a tremendous negative effect.
Let's just get this into perspective, shall we? For the assistant, 32 words were added. The identified problems were: there was no indication that some of them were regional (not a mandatory requirement according to the documentation), the inflections did not use the templates (specifically allowed according to extant documentation), and a couple of multi-word roots were inflected (I have still not seen any documentation to say this is disallowed - although obviously I accept that that is the consensus view, and it's hardly damaged the dictionary).
In fact, it's hard to see how anything that I did whilst I happened to be using the assistant has caused any damage. Rather, there are a few words that are not formatted according to the current standard. They join the tens of thousands of others that can be similarly categorised. Obviously I would have preferred not to make those mistakes, but I can only follow rules that are actually written down.
My suggestion that you take a week or two to actually do manual edits, and learn general formatting rules here first still stands.
I've been here several weeks, and done several hundred edits. I can't learn rules if people don't document them.
Well, no, not quite: the whole reason for the usual advice to lurk quietly for a few weeks before doing anything flagrant (advice that was equally valid in Usenet newsgroups, as you may recall) is to learn all the unwritten rules. If all the rules were faithfully written down, and if you were a careful reader, the advice to wait a few weeks would be unnecessary; you could dive in on day one doing full-blown editing. But alas, it's a rare community where all the rules are faithfully written down. —s
If WT:ELE had said to use templates, I would have used them. If WT:ELE had said not to inflect multi-word infinitives, I would not have done so. If you had spent a fraction of the time keeping WT:ELE up to date that you have arguing with me, none of this would have happened. Since you have been kind enough to offer me advice, let me reciprocate: I think perhaps you should consider a little priority adjustment.
That does not mean entering 24 chapter volumes of rants on WT:BP.
Aside to Connel: that was not "neutral tone". —s
I'm sorry you got upset at my defending myself against the attack (and it was an attack) you made. Had you done a little research and not filled the list with spurious complaints about words that very easily pass CFI, odd and still entirely unexplained allegations of probable copy violation, and incorrect assertions about the use of a robot, and limited yourself to politely explaining the deviations from current formatting policy, the errors would have been quickly corrected and we could have moved on.
Your verbosity needs to be checked. I don't understand why you think people can even read a 10,000 word post, nor why you seem to get upset when you don't have a response within 5 minutes (when no one can reasonably have even read your entire post yet!) --Connel MacKenzie 15:56, 27 November 2006 (UTC)[reply]
I don't understand that complaint. I don't think I've made any 10,000 word posts and I do not recollect complaining that anyone hadn't read anything?
I was wondering about that too. —scs 21:12, 27 November 2006 (UTC)[reply]

To summarise, I think that the real problem here is that you saw a handful of words that, whilst formatted according to written guidelines, are not in line with the undocumented consensus view. You also noticed that the edits came in groups and correctly inferred that something other than a standard browser was in use. Where you seem to be making your mistake is in concentrating on the irrelevance of how the words were added (irrelevant because, overall, they were added far more slowly than if they'd been edited by hand), and ignoring the extraordinarily simple solution, which was to politely and concisely explain where the formatting is in error and then use your undoubted talents to take a few minutes ensuring that the documentation is up to date so that this situation does not arise in the future. Moglex 18:23, 27 November 2006 (UTC)[reply]

Proposal for a robot to partially synchronize verbs with ise/ize endings

Name

MogBotize

Robot or assistant

Robot - no human intervention.
Edit: 15:42 27 Nov 06 Initial human scan to identify exceptions such as prize/prise, surmise/-, surprise/- etc.

Purpose

  1. To find instances of verbs where an 'ise' ending exists and an 'ize' ending does not and vice versa.
  2. By extension to do the same for the (verb) endings 'ises/izes', 'ised/ized' and 'izing/ising'.
  3. In these cases, to copy the extant verb to the missing verb, (replacing a redirect page if necessary).
  4. To ensure that each resultant page has an appropriate 'Alternative spellings' heading
  5. To ensure that each resultant page has consistent 'ise/ize' endings for the infinitive and inflections used in the meanings and examples. This only applies to words of the root being processed and would not change any other, similar, endings that happened to be present.
  6. Where an adjective definition exists for the PPl, it would be treated in the same manner.
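
Steps 1 and 2 of the purpose above (finding forms whose counterpart is missing) might look something like the following in Python. This is a rough sketch under stated assumptions, not the robot's actual code: `titles` stands for the list of page titles taken from the dump, and `exceptions` for the human-reviewed list of words such as prize/prise and surprise that must be skipped. Both names are hypothetical. Note that it works on spelling alone and does not check that a title is actually a verb, which is why the exception list matters.

```python
from typing import Dict, Iterable, Optional, Set

# Endings named in the proposal, each mapped to its counterpart.
ENDING_PAIRS: Dict[str, str] = {
    "ise": "ize", "ises": "izes", "ised": "ized", "ising": "izing",
    "ize": "ise", "izes": "ises", "ized": "ised", "izing": "ising",
}


def counterpart(word: str) -> Optional[str]:
    """Return the other spelling of a word with one of the listed endings,
    or None if the word has none of them."""
    for ending, other in ENDING_PAIRS.items():
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)] + other
    return None


def missing_counterparts(titles: Iterable[str],
                         exceptions: Set[str]) -> Dict[str, str]:
    """Map each existing title to its missing -ise/-ize counterpart.

    Titles in `exceptions` (the assumed human-reviewed skip list) are
    ignored, as are counterparts that appear in that list.
    """
    existing = set(titles)
    missing: Dict[str, str] = {}
    for word in sorted(existing):
        if word in exceptions:
            continue
        other = counterpart(word)
        if other is not None and other not in existing and other not in exceptions:
            missing[word] = other
    return missing
```

For example, given the titles realise, realize, organised, prise and prize, with prise and prize on the exception list, only organised would be reported, with organized as the page to create.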

Method of selecting candidates

Examination of latest xml dump checked against current database at time of edit

Adds new words

Yes, as purpose

Adds new sections

Yes, in the case where the adjective is already present but the verb is not and vice versa.

Modifies existing sections

Yes - in cases where this or a similar process has already been undertaken and the spellings in the page body are inconsistent.

Handling of exceptions

None that I have thought of - yet.

Formats used

As originals. No new formats defined.

Proposed initial test

2 infinitives, 8 words in total


Comments and suggestions welcome Moglex 17:06, 26 November 2006 (UTC)[reply]

The bot needs some way of knowing whether the entry is using -ise/-ize as a suffix, since the "rule" doesn't work for other cases, eg prise and prize. I would guess that all valid words would be at least six letters long. Perhaps the number of "exceptions" for longer words is so small that a human scan of a dump would be the quickest option.
We know that all valid "-ise" words also have a "-ize" option. However, I'm not sure that all valid "-ize" words also have a "-ise" option. (I suspect someone here knows though.)
My opinion, although I doubt it will be shared by everyone, is that if one form exists, the other one automatically exists as well since English is a spoken language as well as a written one and the accepted written realisation of a verb that sounds like 'realise' is 'realise' in some regions and 'realize' in others. Also any user ought to be able to look up a word they've heard using the written variant that's most natural to them (I realise that this is not possible for a great many words given the state of English spelling). — This unsigned comment was added by Moglex (talkcontribs).
With such a crystal clear example given above,...ah nevermind. --Connel MacKenzie 16:04, 27 November 2006 (UTC)[reply]
The example shows very clearly that the task needs human supervision to check the target list, and I incorporated that in the proposal as soon as it was pointed out. Just because a problem is identified does not mean that the entire project should be abandoned. If people carried on like that we would still be living in caves (if we ever did). Moglex 16:36, 27 November 2006 (UTC)[reply]
How about an 'assistant' to sign your talk page entries? --Connel MacKenzie 16:04, 27 November 2006 (UTC)[reply]
I'd just remembered that I'd forgotten and come back to add the sig but you got in ahead of me. Such is life. BTW, please see the Wikipedia 'bot page if you are still having problems with the idea of assistants. They are well explained there. Moglex 16:11, 27 November 2006 (UTC)[reply]
Re what to do about words so rare that they do not meet CFI, see suggestion in thread below. --Enginear 14:39, 27 November 2006 (UTC)[reply]
Strong oppose This cannot be reasonably automated. For manual updating, confer User:Connel MacKenzie/US vs. UK. (That is, the part that can be automated, already has been.) --Connel MacKenzie 15:34, 27 November 2006 (UTC)[reply]
I'm not sure of the difference, from the POV of reliable editing, between using a robot after a human assessment of the word list and doing the job manually. Perhaps you could flesh out your objections a little? Moglex 16:29, 27 November 2006 (UTC)[reply]
Oppose - However, I think automation could proceed on a first step, namely to identify all such existing or potential pairs and create a page where they are listed as such. Human eyes could then be used to winnow the list down by removing inappropriate pairs. Once this is done, I could see the refined list possibly being tested on a bot. Otherwise I see severe difficulties. --EncycloPetey 00:24, 28 November 2006 (UTC)[reply]
I think preparing a list for general review is an excellent idea and I'll do it this morning. Where should such a page go? -Moglex 09:00, 28 November 2006 (UTC)[reply]
Make it a "subpage" of your user page and link it here in the discussion. --EncycloPetey 16:33, 28 November 2006 (UTC)[reply]
What about US/UK tags? They couldn't be copied over. And I wouldn't be surprised if it isn't always a clear distinction. But where you're knowledgeable of the differences the objections are dropped. Except I don't see why alternative spelling isn't sufficient, but I wouldn't block this on that basis. DAVilla 07:47, 28 November 2006 (UTC)[reply]
The US/UK tags are a problem because they are very inconsistent. If they're under an 'alternative spelling' heading, it's not a problem because that can be remade, but if they are embedded in the definition (which I have seen) it gets a lot more difficult. In fact, on this aspect alone it may be that the task turns from being robotic to one where an assistant is more appropriate.
The only reason alternative spelling isn't sufficient (just my opinion) is that it means people can't get to a word that is in the dictionary but spelled the 'wrong' way for their educational locale. I don't want to drag up another much mulled over argument, but surely, if alternative spelling is sufficient, then that implies that the words are the same which implies that a redirect would be appropriate (or a common page included under two entry pages with the regional information). --Moglex 09:00, 28 November 2006 (UTC)[reply]
Except that we don't use redirects. Each legitimate spelling can (and should) get its own entry. Otherwise, if say we had redirected crystalise to crystalize, but someone comes along and creates an entry for crystalise as a verb in Foo-bosh (not a real language), then we lose that redirect information on creation of the new article. It just isn't a feasible approach to use redirects in a multi-lingual dictionary. --EncycloPetey 16:33, 28 November 2006 (UTC)[reply]
Yes, on misspellings dependent on locale, others have said the same. DAVilla 04:08, 30 November 2006 (UTC)[reply]

Rethinking the automatic linking of inflected forms?

Currently, our inflection line templates like {{en-noun}} and {{en-verb}} automatically generate links to the inflected forms. In many cases, of course, these links are red. And for that and a couple of other reasons, I'm wondering if the links are such a good idea, after all.

The links aren't too useful to our users. If they're not red, in most cases all they point to is a stubby page which says e.g. "twiddled: Past tense of twiddle", which leads right back to the page we started from.

And the links end up having a significant cost. They goad us into creating all those stubby declined-form pages. I guess we have decided we want to have separate pages for every plural, past tense, and participle, so perhaps that's okay. (I wonder, though, if half the reason we decided on having the separate pages was the impetus from the automatically-generated redlinks.) But it's not that simple, because up above someone suggested that full CFI criteria apply to every declined form, meaning that if citations for the declined forms can't be found (or if nobody can be bothered to look for them), the automatically-generated links will stay fiery red.

It could be argued that the automatically-generated links could be useful to our readers, since the declined-form pages could contain distinct information like pronunciations and translations. I'd say, though, that on the day any significant portion of the declined-form pages have that information, the argument holds some water, but not until. In the meantime, I think we'd lose nothing, and maybe even gain some little things, if we modified the templates to turn the automatic link generation off.

(Remember, there are two classes of reasons to link things: (1) because the links will be useful, and (2) because you can. Needless to say, reason (1) is a much better reason. Does it apply to these links? That's the question.)

scs 22:41, 26 November 2006 (UTC)[reply]

I absolutely agree that links that generate a two stage circular path are pointless. I'd also agree that having a separate page for each inflection is, per se, unnecessary. However, it is necessary to make up for a limit in the software, namely that you cannot (AFAIA) have multiple (i.e. different) headwords on a page. I would think the ideal situation would be that if someone entered or clicked on 'looking', they would end up on a page that said something like:

Look      verb  noun
looks     verb  plural
looked    verb
looking   verb  adjective

1 To see; to search ...

(where the small words are on page links)

Being able to get to 'look' by clicking 'looking' either directly, as above, or in the way it is done now is very useful for language learners.

Of course using the scheme I outlined above you'd also need all the POS's on the same page but that could well be a good thing because it keeps the relevant information together. Moglex 08:50, 27 November 2006 (UTC)[reply]

This issue arose because of people adding plurals based on a past language (eg Greek) from which the headword had been taken, where actually the plural we normally use is formed by adding s. I've forgotten the absurd examples which arose on RFV, but I made the suggestion that we could use the normal CFI process to determine whether these "academic" (sic) plurals were valid or not. I have used similar arguments about variant spellings and misspellings. Logically, if as we claim, we are non-prescriptive and recording actual usage, what other line can we take? I should emphasise that this is my PoV, and although it has attracted a little positive comment, I haven't heard enough discussion to guess what the community view is.
To backtrack a little from what I said before, I think that the burden of proof for "standard" inflections/variants is perhaps different. For "non-standard" inflections/variants, they will fail any RfD unless someone finds adequate cites. Perhaps, "standard" inflections/variants should be considered valid unless someone tries hard to find cites and fails. They would therefore be generated automatically in inflection lines, and entries might be made unchecked by bots, but could be deleted later if a check came up blank.
However, I do think that (if it is easy to implement) a check should be made in any automated process to see whether a books.google.com search gives any hits. While it can sometimes be difficult to find three valid cites even out of twenty b.g.c. hits, in my experience, the absence of any hits generally means that we can't find three online cites anywhere.
It's easy enough to implement, but I'm a little concerned at the ethics of using Google books as a database for checking around 20,000 words when they are not getting their quid pro quo of presenting their advertising payload to a human being for each search. Moglex 16:07, 27 November 2006 (UTC)[reply]
Incidentally, my proposal was that, where we find no evidence of (say) a present participle, we would use the normal template but with present participle not cited in place of the inflection. If we found some minor evidence of the word, but insufficient for CfI, we could perhaps say present participle rare, possibly ...ing. --Enginear 15:19, 27 November 2006 (UTC)[reply]
As a reminder, the reason we do not want multiple headwords, but need separate pages for each spelling, is because we work in "all" languages, not just English. --Enginear 15:21, 27 November 2006 (UTC)[reply]

I don't know who'd be up to it, but it might be better presentation to show the links in black unless the page actually exists. That would be cosmetic. As far as changing how things are done, I don't see any compelling reasons. We might have to sort out how to handle obviously correct inflections for extremely rare though well accepted words. The less accepted have been fought bitterly, e.g. necroposting preceded necropost in attestation. DAVilla 18:12, 27 November 2006 (UTC)[reply]

In English, it is quite often the case that the inflected forms exist as another word. While runs is the plural of run (noun) it is also the third-person singular simple present indicative (verb). It is also a plurale tantum noun in its own right that doesn't have a corresponding singular definition. So, if we turn off the red-links, we lose this important aspect of cross-linking. Also, just because the entry exists and the link is blue, it doesn't mean that the linked entry has relevant information. I've seen this especially among Spanish, Italian, French, and other Latin-derived languages. There is an entry in existence for a plural or for a feminine, but it's not in the relevant language for the link. So the fact that a link exists doesn't mean squat. The current system may not be the desired ideal, but I think it's better than the suggested alternatives. Far better is for people with some skill at creating entries, but who don't feel up to heavy lifting, to begin adding citations and pronunciations to inflected forms. --EncycloPetey 00:36, 28 November 2006 (UTC)[reply]

Yale errors in single hanja entries

There are a lot of errors in the Yale romanization of single Korean hanja that have been there since the entries were created. They consist of putting the "y" on the wrong side of the vowels a and e, fairly consistently, but not entirely. And some of them have since been fixed. The bad news is that there are 1007 of them to be fixed. The good news is that the 1007 can be identified by code, and fixed.

See User:Robert Ullmann/Korean Yale. I'd appreciate it if anyone with knowledge or interest could look at this list; if/when it is correct, I can fix the problems in a very short while with a little pybot magic ;-) Robert Ullmann 07:07, 27 November 2006 (UTC)[reply]
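One way such entries might be identified by code is sketched below. This is not the script actually used; it assumes each entry supplies a single Hangul syllable together with its recorded Yale romanization, and the names `medial_index` and `looks_transposed` are hypothetical. It relies on the standard Unicode arithmetic for precomposed Hangul syllables and on the Yale spellings ay (ㅐ), ya (ㅑ), ey (ㅔ) and ye (ㅕ).

```python
HANGUL_BASE = 0xAC00      # first precomposed Hangul syllable
HANGUL_COUNT = 11172      # number of precomposed syllables

# Medial vowel index -> (expected Yale spelling, spelling with "y" transposed).
# Only the four vowels where the swap yields another valid spelling are listed.
SWAPPABLE = {
    1: ("ay", "ya"),  # ㅐ
    2: ("ya", "ay"),  # ㅑ
    5: ("ey", "ye"),  # ㅔ
    6: ("ye", "ey"),  # ㅕ
}


def medial_index(syllable):
    """Return the vowel (medial) index of one precomposed Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < HANGUL_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    # Each initial has 21 vowels, and each vowel has 28 finals.
    return (offset // 28) % 21


def looks_transposed(syllable, yale):
    """True if the romanization has the "y" on the wrong side of a or e."""
    pair = SWAPPABLE.get(medial_index(syllable))
    if pair is None:
        return False
    expected, swapped = pair
    return swapped in yale and expected not in yale
```

A check like this only flags candidates; the flagged list would still need review by someone who knows the romanization, as requested above.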

wikrastination

Hi all-

Some friends and I have begun to use the word "wikrastinate" to describe the act of browsing Wikipedia's seemingly endless cross-references instead of doing quick research. The usage of this word is spreading. I created an entry for this word, but was quickly crushed.

This is what I have:

entry removed

I have read the rules, but see that there are other words of a similar nature. In fact there is an entire section on Wiki- derivatives. What can I do?

Ellensn 16:15, 27 November 2006 (UTC)[reply]

Obviously, you haven't read our criteria for inclusion. There is a section on wiki-jargon, not a section on wiki-neologisms. Those would go to our list of made up words. Only. --Connel MacKenzie 16:22, 27 November 2006 (UTC)[reply]
When I deleted wikrastination the first time, I added it to WT:LOP#W. It is still there. --Versageek 18:31, 27 November 2006 (UTC)[reply]

context tags

How can a non-prescriptive dictionary such as ourselves determine when a word has gained common acceptance, rather than being considered colloquial or informal? I would submit that a citation in the serious treatment of varied academic subjects every five years over the last hundred years should suffice as proof. Does anyone disagree? DAVilla 03:10, 29 November 2006 (UTC)[reply]

That's a rather stringent criterion. Does a word really have to have been in use in academic literature for 100 years before it's no longer informal? Some words are only formal, even though they've been in existence for only a decade or two, such as streptophyte. --EncycloPetey 03:15, 29 November 2006 (UTC)[reply]
Too stringent, you say! But would it be sufficient? DAVilla 03:22, 29 November 2006 (UTC)[reply]
Yes, but so would having the official stamp of approval from seven national governments and three Hindu deities. Sufficient, yes; necessary, no. (See also: overkill). --EncycloPetey 03:25, 29 November 2006 (UTC)[reply]
Will seven state governments and analyses in three independent religions do? DAVilla 15:01, 29 November 2006 (UTC)[reply]
That's pretty abstract. Do you have a specific example? --Connel MacKenzie 03:17, 29 November 2006 (UTC)[reply]
Of course I do. :-P DAVilla 03:22, 29 November 2006 (UTC)[reply]
Well, I'll say instead, that is too abstract. I agree with the overkill sentiment, at this point in time. One or two examples please. --Connel MacKenzie 18:15, 30 November 2006 (UTC)[reply]
For "irregardless", no, I don't think that 100 years of misuse proves much of anything. By the way, you really ought to use /Citation sub-pages when adding more than nine quotations, right? --Connel MacKenzie 19:29, 30 November 2006 (UTC)[reply]
But would you at least consider it to be colloquial, or do you insist there must be something jocular about decisions rendered by the 9th Circuit Court of Appeals? DAVilla 21:43, 30 November 2006 (UTC)[reply]
And by the way, it's a bit of a reach to call all the quotations misuse considering five of them, plus others I didn't include, were brought into existence before the word was even acknowledged, meaning a 19th Century Wiktionary would have included the term at a time there weren't yet any stylebooks written proclaiming the evils of its use. DAVilla 23:19, 1 December 2006 (UTC)[reply]

New translation table template

Hello all, I've come across three new templates that are a lot smarter than the current translation table templates {{top}}, {{mid}} and {{bottom}}. These are:

{{trans-top}}
{{trans-mid}}
{{trans-bottom}}

Their output looks like this:

Translations

I prefer the new template as it gives a smarter appearance to our translation tables. I think we had better discuss its adoption as the standard template, at least for large translation tables such as in the article orange.--Williamsayers79 11:33, 29 November 2006 (UTC)[reply]

I agree, but I would prefer to see a more evident "show" button: (1) you need to notice it, and (2) the new user will need to discover its effect. Of course, once discovered it is extremely useful.
Can we look forward to similar effects having wider use? Some time ago I was told off for columnising "Related terms" or something similar. I note that orange has "Derived terms" in 3 columns - very neat. — Saltmarsh 15:08, 29 November 2006 (UTC)[reply]
I, too, think the 'Show' button should be more obvious. Beobach972 02:11, 30 November 2006 (UTC)[reply]
I think the text should be idiot-proofed to say 'Click here to see translations'. bd2412 T 19:48, 3 December 2006 (UTC)[reply]

I created these as a demonstration in response to a Grease Pit discussion; they could certainly use some tweaking. They "borrow" the CSS definitions from Template:nav; someone ought to do more general CSS, etc. I agree the (show) could be more visible, but against that, it is in the "standard", i.e. expected, place from WP, etc. Might be better if it directly followed the gloss. But this is controlled by the nav CSS at present. Robert Ullmann 05:35, 30 November 2006 (UTC)[reply]

I'm still unsure about the renaming "top/mid/bottom" to "trans-top/trans-mid/trans-bot". (Or was that trans-bottom? {{bot}} is for the language name, but if part of {{trans-bot}} then the three character name might be more consistent.) Someone please be bold with MediaWiki:Monobook.css to make the "show" show bigger. Has a one or two week vote for this started on WT:VOTE yet? The demonstrations of this that I've seen have been quite compelling. The format of having the gloss as the first parameter is very clean, in comparison to the current format. Support despite my discomfort regarding the template names. --Connel MacKenzie 08:17, 30 November 2006 (UTC)[reply]
By the way, the previous discussion of these is here: WT:BP#Collapsible translations sections, here: WT:GP#Dynamic feature on Wikinews, and here: Wiktionary_talk:Translations#Hiding translations. Connel, every time this is brought up, you suggest putting this to a vote, and I'm game, but I'm not writing a conversion bot. :-) However, perhaps it may be best to either give it a longer voting period or wait until after the holiday season. Thoughts? BTW, thank you to Robert Ullmann for adapting these! --Jeffqyzt 21:11, 12 December 2006 (UTC)[reply]
Well, on one hand it is a useful innovation. On the other hand, I'd like to see some of the proponents follow through. Read as: Connel doesn't want to write this bot either, therefore he hasn't proposed the vote, lest he get stuck with another item on his too-long to-do list.  :-)
I'd think that such an obvious improvement would be pretty automatic, when put to a vote. But I've been disastrously blindsided by making that assumption in the past. How long do you mean, when you say a longer voting period, anyway? January 31st? February 28th? February 29th? (2008!) --Connel MacKenzie 20:05, 13 December 2006 (UTC)[reply]
This system is very useful, not just for long translation sections, but for any extended text that may be of interest to only a small number of users. I have used it on two Italian suffixes -ere and -ire to hide lists of verbs. SemperBlotto 08:25, 30 November 2006 (UTC)[reply]
It looks like the show/hide buttons require JavaScript, but the default for users without JavaScript support is to show the translations. Excellent. —scs 22:07, 30 November 2006 (UTC)[reply]
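The conversion bot mentioned above might, at its core, look something like the following sketch. It is an illustration only: the function name `convert_translation_table` is hypothetical, the gloss is assumed to be supplied by the caller (working it out from the entry is the hard part and is not attempted here), and only the first {{top}} ... {{bottom}} table in the text is rewritten.

```python
def convert_translation_table(wikitext, gloss):
    """Rewrite the first {{top}}/{{mid}}/{{bottom}} table to the trans-* templates.

    The gloss becomes the first parameter of {{trans-top}}. Text outside the
    table is left alone, so a later {{bottom}} that closes a {{top3}}-style
    list is not touched.
    """
    start = wikitext.find("{{top}}")
    if start == -1:
        return wikitext  # no old-style table present
    end = wikitext.find("{{bottom}}", start)
    if end == -1:
        return wikitext  # unterminated table: leave it for a human
    end += len("{{bottom}}")

    table = wikitext[start:end]
    table = table.replace("{{top}}", "{{trans-top|" + gloss + "}}", 1)
    table = table.replace("{{mid}}", "{{trans-mid}}")
    table = table.replace("{{bottom}}", "{{trans-bottom}}")
    return wikitext[:start] + table + wikitext[end:]
```

Restricting the rewrite to the span between {{top}} and its {{bottom}} is a deliberate choice, since {{bottom}} is shared with the column templates used for other lists.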

In the absence of an argument, can the top template be made to use the "headword"? ––Saltmarsh 10:57, 1 December 2006 (UTC)[reply]

Not an idea I like. The template is supposed to take an argument, and using the headword is not a legitimate result. Not providing an argument should result in an error, ugly or dressy but an error the same. DAVilla 15:36, 4 December 2006 (UTC)[reply]
Or better yet, perhaps the absence of any parameter could result in adding the page to a special subcategory of Request for Cleanup until someone adds the necessary parameter? --EncycloPetey 00:40, 5 December 2006 (UTC)[reply]

Maybe this is too radical, but since we can now hide the translation sections, wouldn't it make more sense to include them actually within the definition sections? That was unfeasible when they were huge lists, but now that it's just an extra line it seems neater, and avoids the rather inelegant necessity of repeating or summarising the definitions under the translation section. Widsith 10:24, 5 December 2006 (UTC)[reply]

That is a very good idea; I'd thought about that. It would take some serious work to make the template appear well under the definition. There are other issues though: there are a lot of programs that expect the translations to appear as a list/lists under the L4 Translations header; this would be a very radical change to them. (they presently just ignore top/trans-top/midX/etc.) It also would hit programs that expect to ignore translations. And it would take a format change to the lines themselves, to avoid breaking the # numeric sequence. It is just do-able, but there is lots of messiness; probably best not to. *sigh* Robert Ullmann 11:03, 5 December 2006 (UTC)[reply]

Saltmarsh, putting long lists into columns is always a good idea (I've been doing it with the sometimes very long lists of derived terms I've been adding: green is a particularly extensive example), it's just that you need to make sure you use the {{topx}} and {{midx}} templates (where "x" is a number between 2 and 4 inclusive, indicating the number of columns) instead of {{top}} and {{mid}}, which are supposed to be used only for translations. (The {{bottom}} template is used with both sets of templates.) — Paul G 16:31, 6 December 2006 (UTC)[reply]
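As an illustration of the column templates just described, here is a small sketch that lays a list of terms out in wikitext columns. The function name `columnise` is hypothetical, and the exact placement of {{midx}} between columns and the `* [[term]]` line format are assumptions rather than documented template behaviour.

```python
def columnise(terms, columns=3):
    """Return wikitext placing terms in 2 to 4 columns with {{topx}}/{{midx}}."""
    if not 2 <= columns <= 4:
        raise ValueError("columns must be between 2 and 4 inclusive")

    # Ceiling division: number of terms in each column.
    per_column = -(-len(terms) // columns)

    lines = ["{{top%d}}" % columns]
    for i in range(columns):
        if i > 0:
            # Column break between consecutive columns.
            lines.append("{{mid%d}}" % columns)
        chunk = terms[i * per_column:(i + 1) * per_column]
        lines.extend("* [[%s]]" % term for term in chunk)
    # {{bottom}} closes the table for every column count.
    lines.append("{{bottom}}")
    return "\n".join(lines)
```

Calling `columnise(["a", "b", "c", "d"], 2)` puts two terms in each column, separated by a single {{mid2}}.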


OK, I've started a Vote on updating the standard at WT:ELE. --Jeffqyzt 18:25, 19 December 2006 (UTC)[reply]
Why has User:DAVilla added "Translations to be checked" to the {{trans-top}} display text? Was that done in error? —Stephen 05:47, 31 December 2006 (UTC)[reply]