Wiktionary talk:Frequency lists/PG/2005/08/1-10000
One more: in the first subsection titled 1-100 there are only 99 words. But in the other subsections we have 100 words.
Here's a few problems I can see:
- Apostrophe is included as a word character in English in positions where it is only rarely permitted such as the first character: ('tis in 2601-2700 is an exception)
  - 'I (701 - 800)
  - 'The (2301 - 2400)
  - 'You (2801 - 2900)
  - 't (3201 - 3300)
  - 'And (3901 - 4000)
  - 'What (3901 - 4000)
  - 'It (4101 - 4200)
  - 'Oh (4201 - 4300)
- "." seems to be blindly included as a word character even when it's not surrounded by a letter on both sides or is accompanied by an apostrophe:
  - it.' (6101 - 6200)
  - you.' (7501 - 7600)
  - me.' (7801 - 7900)
- URLs are present - is it possible that Gutenberg's newsletters are being included as well as the out-of-copyright books?:
  - pobox.com (5401 - 5500)
  - gutenberg.net (6201 - 6300)
  - pglaf.org. (9801 - 9900)
Most of these are easily filtered out. In the first case I think we'd be better off with fewer false positives at the cost of a small number of false negatives.
- Thanks. I'd just like to say, gah! I had punctuation characters converted to spaces in an earlier iteration; I'm not sure what I goofed up on this last round. Oh wait, these frequency lists are not the latest version, and don't correspond to the template:rank entries. Double gah! --Connel MacKenzie T C 22:09, 8 December 2005 (UTC)
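For anyone regenerating the lists, the filtering suggested above could look roughly like the sketch below. It is not the script that produced these pages (that isn't shown here), and the names `TOKEN_RE`, `LEADING_APOSTROPHE_WORDS`, `URL_SUFFIXES` and `tokenize` are all made up for illustration. It assumes plain ASCII English text, treats an apostrophe or period as part of a word only when it has a letter on both sides, keeps a leading apostrophe only for a small allowlist, and drops tokens that look like domain names.

```python
import re
from collections import Counter

# A run of letters; an apostrophe or period is kept only when it sits
# between two letters (so "don't" survives, but "it.'" becomes "it").
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['.][A-Za-z]+)*")

# Hypothetical allowlist of words that really do start with an apostrophe.
LEADING_APOSTROPHE_WORDS = {"'tis", "'twas"}

# Hypothetical, non-exhaustive list of suffixes used to spot domain names.
URL_SUFFIXES = (".com", ".net", ".org")


def tokenize(text):
    """Return lowercased word tokens, skipping quote marks and URLs."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        word = match.group(0).lower()
        # Drop things like pobox.com or gutenberg.net.
        if word.endswith(URL_SUFFIXES):
            continue
        # Restore a leading apostrophe only for allowlisted words,
        # accepting a few false negatives to avoid entries like 'The.
        start = match.start()
        if start > 0 and text[start - 1] == "'":
            candidate = "'" + word
            if candidate in LEADING_APOSTROPHE_WORDS:
                word = candidate
        tokens.append(word)
    return tokens


if __name__ == "__main__":
    sample = "'I said it.' 'Tis you.' See pobox.com and pglaf.org. Don't go."
    print(Counter(tokenize(sample)).most_common())
```

The allowlist is the "fewer false positives" trade-off: rare forms such as 't would be counted without their apostrophe unless someone adds them by hand.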
I was searching for some common 3 letter words, and here's what I found:
For j: joy, job, jag, joy, jos, jeg, jar and jaw...but no jug! For k: key, kun, kan, kam and kin...but no kid! For z: zoo, zou and zal...but no zip!
Hard to see how this can really be useful as any sort of approximation of the 10000 most commonly used English words, which is what I was looking for. 122.107.225.220
Why's 'Gutenberg' #243? How often does that come up?
- Presumably because the scanner includes the copyright text at the start of every book (that contains Gutenberg). Conrad.Irwin 03:14, 19 March 2009 (UTC)
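If that is the cause, one way to keep 'Gutenberg' out of the counts would be to cut the licence text off each file before counting. The sketch below is only a guess at how that might be done: it assumes each file marks the book body with lines beginning `*** START OF` and `*** END OF`, which may not hold for every file in the collection, and `strip_boilerplate` is a hypothetical helper name. Files without those lines are returned unchanged apart from trimmed whitespace.

```python
import re

# Assumed marker lines around the book body; not every file may have them.
START_RE = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)


def strip_boilerplate(text):
    """Return the text between the start and end markers, if present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    # Begin after the start marker line, or at the top if it is missing.
    body_start = start.end() if start else 0
    # Stop before the end marker line, or at the bottom if it is missing.
    if end and end.start() >= body_start:
        body_end = end.start()
    else:
        body_end = len(text)
    return text[body_start:body_end].strip()


if __name__ == "__main__":
    sample = (
        "Project Gutenberg licence text\n"
        "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "Call me Ishmael.\n"
        "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "More licence text, gutenberg.net\n"
    )
    print(strip_boilerplate(sample))
```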
Why is 'la' in the first 100 words? It links to http://en.wiktionary.org/wiki/la, where it has number 481 (and even that is a very low number for the meaning of 'la' as a syllable used in solfège (music)). Lcla
Crappy
This page is really crappy, to be blunt. Can we delete it? --Newfriendforyou 00:27, 23 July 2011 (UTC)