Wiktionary talk:Frequency lists/Dutch wordlist
Numbers probably inaccurate
Some of the editors have stated that they adjusted the counts. It is not clear whether the others did so, so the counts may not be accurate. The edits have removed lots of entries, so I adjusted the "number of words" accordingly. 伟思礼 (talk) 04:52, 16 October 2020 (UTC)
Cleanup
- "Lk" is not a Dutch word. The first letter is in fact a capitalized "i". → Added its frequency number to the number of "ik" (first-person singular).
- Same for the non-existing word "lemand" → "iemand".
- If the first letter of "één" is capitalized, it's written "Eén" (without the first accent). → Added the frequency numbers of "eén" and "één", and moved the word to its new, higher position in the list.
- "Oke" is a misspelling of "oké". → Added the frequency numbers of "oke" and "oké", and moved the word to its new, higher position in the list.
- "Dr" is not a word, but an abbreviation of "doctor". → Added a dot, "dr."
- Changed "mw" (abbreviation for "maatschappelijk werk" ("social work")) to the more likely "mw." ("Mrs").
- "John(ny)", "Harry", "Charlie", "Donald", "Diane", "Flynn", "Jane", "Claire", "Michael", "Mike", "Max", "Will(y)", "Christian", "Christopher", "Alex(ander)", "Ray(mond)", "Bill(y)", "Sarah", "Jim(my)", "Phil", "Vanessa", "Amy", "George", "David", "Ruth", "Charles", "Christine", "Julian", "Jordan", "Eddie", "Paul(ie)", "Tina", "Sam", "San" and "York" are (parts of) names. → Removed from list.
- "Sir", "Mr", "Ms" and "Mrs" are English words. Dutch subtitlers tend to not translate these forms of address. → Removed from list. (The same goes for "Miss", but since this word is used in Dutch for winners of beauty contests as well, I'm not sure if I should remove it.)
- "The", "dude" and "new" are English words. → Removed from list.
- "Peter" ("godfather"), "mark" ("mark"), "chris" (a kind of gymnastic skill), "daisy" (a kind of biscuit) and "joe" ("yah", "yup") are in fact Dutch words. But they're fairly unusual and therefore there's no doubt in my mind that these are actually names. → Removed from list.
- Capitalized "Engeland", "Londen", "Parijs", "Mexico", "Amerika", "Amerikanen", "Amerikaan", "Amerikaans", "Amerikaanse", "Nederlandse", "Duitsland", "Duits", "Frankrijk", "Spanje", "CIA" and "FBI".
- This is all the time I have. --Caudex Rax ツ (talk) 16:27, 26 July 2014 (UTC)
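The merge-and-move steps described above (folding the count of a misread or misspelled form into the correct form, then re-ranking) can be sketched roughly as follows. This is only an illustration, not the procedure the editors actually used: the function name `merge_variants` is hypothetical and the counts in `sample` are made up.

```python
def merge_variants(counts, variants):
    """Fold each variant's count into its canonical form and re-rank.

    counts   -- dict mapping word -> frequency
    variants -- dict mapping misread/misspelled form -> canonical form
    """
    merged = dict(counts)  # work on a copy, leave the input untouched
    for wrong, right in variants.items():
        if wrong in merged:
            # Add the variant's count to the canonical entry and drop the variant.
            merged[right] = merged.get(right, 0) + merged.pop(wrong)
    # Highest frequency first; ties broken alphabetically for a stable order.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up counts, for illustration only (not taken from the real list).
sample = {"ik": 1000, "lk": 50, "oké": 300, "oke": 250,
          "iemand": 400, "lemand": 5, "goed": 500}
fixes = {"lk": "ik", "oke": "oké", "lemand": "iemand"}
ranked = merge_variants(sample, fixes)
```

With these invented numbers, "oké" overtakes "goed" once the count for "oke" is added, which is the "moved to its new, higher position" effect mentioned in the list.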
- Although ais is technically the plural of ai (twee ais = two exclamations of disappointment or pain) and aïs is A♯, I'm pretty confident that in the subtitles it originally was aIs: als with a capital I instead of a lowercase l.
- ‘Ian’, ‘Abby’, ‘Jean’, ‘Julia’, ‘Stark’, ‘Rita’, ‘Bella’, ‘Bell’, ‘El’, ‘Ellen’, ‘Ed’, ‘Ron’, ‘Blake’, ‘Zack’, ‘Dick’, ‘Mickey’, ‘Mikey’, ‘Ted’, ‘Lee’, ‘Stan’, ‘Ford’, ‘Stone’, ‘Evan’, ‘Kane’, ‘Zoe’, ‘Luke’, ‘Dale’, ‘Kate’, ‘Julie’, ‘Nate’, ‘Tim’, ‘Francis’, ‘Franklin’, ‘Frankie’, ‘Anna’, ‘Anne’, ‘Annie’, ‘Ann’, ‘Vic’, ‘Eve’, ‘Fred’, ‘Kong’, ‘Lane’, ‘Grace’, ‘Green’, ‘Grand’, ‘Booth’, ‘Seth’, ‘Ali’, ‘Lois’, ‘Mia’, ‘Kim’, ‘Walt’, ‘Walter’, ‘Alan’, ‘Lex’, ‘Eli’, ‘Pam’, ‘Homer’, ‘Brad’, ‘Steve’, ‘Leo’, ‘Leon’, ‘Leonard’, ‘Marie’, ‘Vince’, ‘Vincent’, ‘Hall’, ‘Carl’, ‘Carol’, ‘Pearl’, ‘River’, ‘Omar’, ‘Rex’, ‘Andrew’, ‘Anderson’, ‘Andy’, ‘Sandy’, ‘Randy’, ‘Miranda’, ‘Amanda’, ‘Tyler’, ‘Miller’, ‘Tom’, ‘Thomas’, ‘Daniël’, ‘Dana’, ‘Bart’, ‘Albert’, ‘Robert’, ‘Martha’, ‘Arthur’, ‘Chuck’, ‘Dave’, ‘Eric’, ‘Henry’, ‘Jake’, ‘Jerry’, ‘Jones’, ‘Kevin’, ‘Martin’, ‘Marty’, ‘Mary’, ‘Scott’, ‘William’, ‘Williams’, ‘Rachel’, ‘Bobby’, ‘Roy’, ‘Troy’, ‘Doug’, ‘Brown’, ‘Jay’, ‘Tony’, ‘Anton’, ‘Nick’, ‘Nicky’, ‘Nicole’, ‘Rock’, ‘Rocky’, ‘House’, ‘Rick’, ‘Ricky’, ‘Patrick’, ‘Richard’, ‘Richie’, ‘Rico’, ‘Samantha’, ‘Carter’, ‘Jackson’, ‘Hank’, ‘Cameron’, ‘Mitchell’, ‘Jack’, ‘O'Neill’, ‘Jonas’, ‘Quinn’, ‘Teal'c’, ‘Jennifer’, ‘Rodney’, ‘Elizabeth’, ‘Chloe’, ‘Nicholas’, ‘Matthew’, ‘Wallace’, ‘Young’, ‘Jackie’, ‘Jacob’, ‘Jamie’, ‘Michelle’, ‘Neil’, ‘Jonathan’, ‘Joey’, ‘Johnson’, ‘Jenny’, ‘Lisa’, ‘Matt’, ‘Ryan’, ‘Larry’, ‘Emily’, ‘Lucy’, ‘Kelly’, ‘Kyle’, ‘Taylor’, ‘Barry’, ‘Terry’, ‘Wayne’, ‘Gary’, ‘Lily’, ‘Sally’, ‘Harvey’, ‘Molly’, ‘Jersey’, ‘Teddy’, ‘Penny’, ‘Betty’, ‘Nancy’, ‘Sammy’, ‘Ashley’, ‘Fox’, ‘Dexter’, ‘Dylan’, ‘Kenny’, ‘Sonny’, ‘Freddy’, ‘Tracy’, ‘Anthony’, ‘Wendy’, ‘Barney’, ‘Maya’, ‘Stanley’, ‘Casey’, ‘Jeremy’, ‘Lenny’, ‘Benny’, ‘Manny’, ‘Percy’, ‘Murphy’, ‘Cindy’, ‘Clay’, ‘Audrey’, ‘Riley’, ‘Scully’, ‘Peyton’, ‘Haley’, ‘Cody’, ‘Phoenix’, ‘Felix’, ‘Guy’, ‘Jeffrey’, ‘Wesley’, ‘Scylla’, ‘Kennedy’, ‘Brian’, ‘Lawrence’, ‘Lucas’, ‘Sara’, ‘Margaret’, ‘Maria’, ‘Liz’, ‘Hannah’, ‘Samuel’,
‘Barnes’, ‘Elena’, ‘Benjamin’, ‘Norman’, ‘Gus’, ‘Allison’, ‘Nelson’, ‘Ross’, ‘Barbara’, ‘Catherine’, ‘Katherine’, ‘Katie’, ‘Laura’, ‘Lauren’, ‘Brooke’, ‘Smith’, ‘White’, ‘Buddy Holly’, ‘Beverly’, ‘Bailey’, ‘Bud’, ‘Jeff’, ‘Noah’, ‘Shaw’, ‘Bob’, ‘Bruce’, ‘Chase’, ‘Heather’, ‘Harold’, ‘Harris’, ‘Hector’, ‘Howard’, ‘Beth’, ‘Joseph’, ‘Josh’, ‘Keith’, ‘Mitch’, ‘Ralph’, ‘Carlos’, ‘Caroline’, ‘Carrie’, ‘Karen’, ‘Oscar’, ‘Picard’, ‘Susan’, ‘Sue’, ‘Bauer’, ‘Eva’, ‘Emma’, ‘Louis’, ‘Lou’, ‘Lewis’, ‘Owen’ and ‘Baker’ are names and, going by Caudex's standard, don't belong on the list.
- Jeans are jeans (plurale tantum) or a spijkerbroek.
- The el (plural ellen) exists as an old unit of measurement, but it's nearly extinct. No way it would rank higher than mijl. Also, ‘El’ is used as a part of names, Spanish ones in particular.
- I think the abbreviation e.d. would have gotten split into e and d.
- ‘Mickey’ could be part of Mickey Finn (probably entered Dutch in the 20th C., always used in full) but I think it wouldn't make the top 5k.
- I'm pretty sure ‘Ford’ doesn't refer to the car brand, since other car brands, some of which are much more common, aren't represented in the list.
- A stone is also an old British unit of measurement, but I wouldn't expect it to make the top 5k.
- Technically the verb form kane exists, but it's a mostly pre-1950 suffix on a probably post-1950 verb. Similarly, dale exists, but I cannot imagine it being common in subtitles.
- A frankie is a small franc coin, I don't consider that likely to make the list.
- The word green meaning grove den is rare. It could also be a golf term, but the absence of hole and other golf terms makes this unlikely in my opinion.
- A grand is called a mille in Dutch.
- The word kim means horizon, but I wouldn't expect it to make this list.
- The word walt means ‘is on a rolling boil’ but that certainly wouldn't be in the top 5k.
- There is the combination (au) bain-marie, but ‘bain’ didn't make the list, so yeah.
- The word hall also occurs in the combinations hall of fame and hall of shame but ‘fame’ and ‘shame’ aren't on the list, or in the sense of the meeting room in a hotel, but that's even rarer. I think here it's either a surname or part of the name of a locality, building, room, &c., such as the Hall of the Dutch Senate or Albert Hall.
- The word carol (Christmas song) is really rare and I wouldn't expect to see it here.
- There's actually a Pearl River, but I don't think combining these gives a sensible ranking. I expected this to be part of Pearl Harbor, but even ‘harbor’ isn't on the list, so I suspect most occurrences are just personal names.
- An ongelovige thomas is someone who is hard to convince of things or someone who is in the eyes of a Christian a faithless person. A tommy is a British soldier.
- The rare word amanda is used for a variety of wildly different kinds of bread and confectionery containing almonds.
- A British constable is called a bobby, but I imagine the name will overshadow it here.
- A rachel is an almost forgotten word for a narrow wooden beam.
- The word brown-out is really rare, although it has recently entered Belgium's societal energy discussion. Apart from that though, you'll likely never encounter it.
- A nick is also someone's handle on the internet.
- Although rock is a musical genre, other common genres are absent, except for house which I suspect is mostly used as a name, or possibly as part of a toponym, here. If it were used as a musical genre you'd also expect ‘roll’ to show up in the list as part of the word rock-'n-roll.
- The nowadays rare word richard is a slur for a very rich person.
- A jack is a jacket though. I didn't really know what to do, so in the end I decided to deduct about 2k from the frequency to try to account for the name ‘Jack’, but this is a really rough guesstimate. Why o why did the author remove all caps? Yeah, I know he gives a reason, but the reason sucks.
- The word jonas is also a form of the verb jonassen.
- A phoenix is a feniks.
- A jersey is also a kind of knitted sweater, but coordinating terms aren't on the list and the word is rare.
- A teddy is a teddybeer.
- A penny is also a coin, but it's rare and if used as such the plural is pence and this isn't on the list even though it would be much more likely to show up.
- The bob is the designated driver, and bob is also a term for a hairdo. But the name is far more common.
- An Oscar is of course also an Academy Award. But usually it's a name.
- An eva or evaatje is also a kind of short little apron.
- The rare word lou is street slang for no or not, and Lou are two languages with a combined total of about 1½k speakers.
- A baker is a midwife's assistant or a form of the verb bakeren, to dry-nurse or to swaddle. Yeah, it's rare and antiquated. And I've got a hunch that a lot of mentions are actually part of ‘221B Baker Street’, which should be removed for the same reason we've removed all those names.
- Although l's is a non-standard plural of the letter l (it's in the Green Booklet, but Language Phone states only z has two plurals) and L.S. means lectori salutem, I think this is really just ls: is with a lowercase l instead of a capital I.
- Mn is a misspelling of m'n.
- St., nl. and mevr. are abbreviations.
- ‘com’ is not a Dutch word; its primary use is in URLs.
- ‘By’, ‘bay’, ‘lake’, ‘east’, ‘south’, ‘hill’ and ‘street’ are English words, presumably mainly mentioned as parts of toponyms. ‘And’ is also an English word.
- he is a misspelling of hé. (Note 1: the word he does exist as an antonym of homo but it's really rare, certainly not among the top 5k. Note 2: I think it's unlikely that he is a misspelling of hè because that would run contrary to common pronunciation rules. People would use heh if they cannot use è.) Hee is an alternative spelling of hé. Hey is a misspelling of hé.
- Although ‘mac’ could be part of ‘McDonald's’, big mac or Macintosh, neither comparable names like ‘Febo’, ‘Burger King’, ‘Whopper’, ‘Windows’ or ‘Microsoft’ nor the word pc are on the list, so I think it's more likely just a part of a character name in almost all instances.
- Although a back is a defender, the absence of spits, linksback and rechtsback means that this is an English word, probably used as the surname ‘Back’.
- ‘von’ is a common part of German names, not a Dutch word. ‘da’ is a common part of Romance names, not a Dutch word.
- Although ‘on’ can be used as parts of phrases like on speaking terms, by itself it isn't a Dutch word. Same goes for ‘all’ in e.g. all right, and ‘to’ in e.g. place to be. Same for French ‘le’ in e.g. excusez le mot.
- ‘for’ is English, I don't know how it ended up ranking this high in the subtitles. ‘Simply’ is also English; I think it's used a lot in names for shows. ‘Day’ is used mostly as part of D-day, Independence Day and the like, but separately they wouldn't have made this list.
- ‘boo’ is puzzling. If it were a misspelling of boe, you'd expect that to be on the list as well. I think it's probably the name ‘Boo’.
- Em is a misspelling of 'm. It could also be an abbreviation, em. but I don't consider that likely. What is worrying though is that 'm should have been on the list. I strongly suspect the original creator of the list filtered out all single letters. If 'm hadn't been filtered out, it might have had a frequency value of 200k, maybe even 600k or more... which would give it a ballpark ranking somewhere between 130 and 50 or so. Compared to that, the contribution of ‘em’ is not significant; we can only accept that a very common word is missing from this list.
Rough stats for: | Google results | Google books | YouTube (this year) | This list (w/o heb) |
---|---|---|---|---|
heb hem | 211 | 647 | 574 | 730248 |
heb 'm | 179 | 341 | 241 | ?????? |
heb 'em | 140 | 187 | 17 | 2330 |
em. | 144 (incl. Eng. &c.) | 91 | 1 | ?? |
- Uh and euh are misspellings of eh.
- Capitalised Rome, San Francisco, Frans, Franse, Fransen, New York, Italië, Italiaans, Italiaanse, VS, NS, Hitler, Lets, Engels, Engelse, China, Chinese, Times, Texas, Californië, Hollywood, Sydney, Brooklyn, Führer, Miami, Miami Beach and Sun.
- 5829 ‘San’'s are still unaccounted for, assuming Francisco as a personal name is rare enough not to make the list.
- 8058 ‘New’'s are still unaccounted for, assuming the city York doesn't occur often enough to make the list.
- The abbreviation vs. exists, but it's somewhat rare in comparison and I would expect to find it less in subtitles.
- let is a form of letten. And I'm fuzzy on this, but I also think in tennis a service is a let if the ball touched the net or if the opponent wasn't ready. Not a common term outside of tennis though.
- An engels is a forgotten unit of measure of about 1½ gram.
- I didn't capitalise chinees (Chinese food, Chinese restaurant) and chinezen (to eat Chinese / to chase the dragon) because I've got a hunch that the lower case meanings are more common.
- There are other named beaches, such as the ones in Normandy, but Miami Beach is the most famous by far.
- I expect -ste to be a suffix occurring in combinations like 442ste regiment.
- Since ‘versa’ is missing, vice- is probably a prefix.
- I think in subtitles ‘no’ is almost always used in the combination in no time, rather than as the abbreviation no.
- Although up is a word, it's ultra-rare. I'm sure this was 7Up before the original uploader ‘cleaned’ the list. Compare other drinks in the list and their frequencies.
- ‘hmm’ is a misspelling of hm.
- It's tempting to look at ‘Miss’ and ‘World’ and create a ‘Miss World’ entry, but honestly I don't think it would rank that high. I think it's more likely that ‘Miss’ is part of character names like ‘Miss Marple’ and that ‘World’ ranks so high mostly because it's part of so many film titles. If you look at the frequency of ‘Frodo’ (2720) you can see how even a single series can skew the statistics. I'm removing both entries.
- ln is a misspelling of in with a lowercase l instead of a capital I.
- ‘nlondertitels’ is a subtitle website. ‘Suurtje’ is a prolific subber.
- Although make is a verb form, I think the creator of this list split all hyphenated words, which means that this was part of the word make-up before the butchering. The word make-over also exists, but it's much rarer.
- Perhaps somewhat optimistically combined ‘earl’ and ‘grey’ into earl grey. It's possible both terms are sometimes used in names, but I'd expect it to be tea fairly often. Please compare its ranking with tea and other drinks and decide for yourself whether this makes some sense or not. In any case, this still leaves 305 ‘earl’s unaccounted for.
- Similarly, I merged all-in-one, leaving 1674 ‘all’s unaccounted for. Yes, I know ‘one’ is also used in other combinations, but even for the most common ones the other parts didn't show up on this list.
- Although teq (usually misspelled as TEQ) could mean toxische equivalentie (toxic equivalent), I think it's more likely it's a misspelling of Tea or part of the name of a start-up that considers itself too hip to use the spelling ‘tech’.
- Fixed the fossilised forms 's morgens, 's ochtends, 's avonds and 's nachts.
- Because in films and on television the words jezus and christus are mostly used as curses and only rarely in the religious sense, I've removed the capitals. There's an ongoing movement in Dutch orthography to use capitals only sparingly, e.g. u is no longer capitalised, nor hij in the Bible, nor personal names used as archetypes, just to name a few examples.
- ‘se’ is part of per se. (Actually, there are some more combinations with se but they are rare in subtitles.)
- I'm not sure what to do with ‘you’. It's clearly not a Dutch word and only used in English phrases, probably almost exclusively in the curse fuck you. Yet ‘fuck’ (6109) has fewer occurrences than ‘you’ (6554). Considering that by my sloppy estimation, ‘fuck’ is followed by ‘you’ in ¾ of all cases, that leaves about 2k ‘you’s unaccounted for.
- yo and jo are alternative spellings of the same word. I don't think it's settled yet what the official spelling will end up being. Since it's an English loanword, let's spell it as such for now. By the way, jo is the vocative form of jij in Surinam Dutch.
- Okay is a misspelling of oké. Whoa is a misspelling of wow, not to be confused with wauw which means ‘wow’.
- ‘CTU’ is a fictional agency, among other things.
- ‘Blue’ can be used in several combinations, but the most common one by far is out of the blue. This still leaves 30939 ‘the’s unaccounted for.
- pappa is an alternative spelling alongside papa.
- Although Th is an element, Th. is a double initial and t.h. means ‘for rent’, I think we're dealing with the English ordinal suffix ‘-th’ here, which is common in film titles and addresses.
- sex is a misspelling of seks. Interestingly, sexy still has an x.
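The "unaccounted for" figures in the list above are simple subtractions. As a rough sketch, using only the numbers quoted for ‘fuck’ (6109) and ‘you’ (6554) and the ¾ estimate from that bullet (the helper name `unaccounted` and the `share` parameter are assumptions made for illustration):

```python
def unaccounted(total, partner_total, share=1.0):
    """Occurrences of a word left over after pairing it with a partner word.

    total         -- frequency of the word being explained (e.g. 'you')
    partner_total -- frequency of the word it is assumed to pair with (e.g. 'fuck')
    share         -- estimated fraction of partner occurrences that form the phrase
    """
    return total - round(partner_total * share)


# 'fuck' (6109) is estimated to be followed by 'you' in 3/4 of all cases.
you_left = unaccounted(6554, 6109, share=0.75)
print(you_left)  # 1972, i.e. "about 2k"
```

The result depends entirely on the estimated share, so it is no more precise than the guess behind it.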