Wiktionary talk:Frequency lists/Hungarian frequency list 1-10000
A user tagged the page for deletion with the following comment:
This list counts the frequency of words found on Hungarian webpages. These pages contain non-Hungarian material as well. For example, the list includes "and", which is an English word; the Hungarian equivalent is "és". Other striking examples: "unique, economy, air, their" and even "Washington". The second problem: common abbreviations, for example "tom", shortened from "tudom". Third mistake: inflected and uninflected forms of the same word both appear: "Magyarország - Hungary" and "Magyarországon - in Hungary". Fourth error: nonsense entries. "Volt" is the past-tense form of the irregular verb "van". "Meg" is a verbal prefix indicating the completed state of the action. And last: there are several spelling mistakes, and the "List of 10 000" words lists 10 054 words. All in all, I recommend the deletion of the article. It also cites no sources. (This comment was unsigned.)
- Specific complaints about the usefulness of a frequency list, or about methodological errors someone has noticed, are certainly no reason to blindly delete such a compilation.
- The template to use to nominate things for deletion is "rfd". The "delete" template is only a helpful link for lost Wikipedians.
- The frequency lists naturally cannot cite every source; they can only state when and how the web pages used in the analysis were found.
What corpus was used to compile this list? I cannot really see whether it is of any value. It contains a lot of English words that are never used in Hungarian (like bill, theory, etc.). Besides, most of the words are NOT in their root form but in an agglutinated one (for example, "with his permission" instead of "permission"). All in all, this list seems to me worse than having nothing at all.
Here is a properly compiled frequency list: http://mokk.bme.hu/resources/webcorpus/index_html
Please do delete this garbage. — This comment was unsigned.
- Left a welcome message on the anon page with the following: I noticed that you added comments to the Hungarian "frequency list". Thank you for the link pointing to the Hungarian webcorpus. I did take a look at it hoping to see a better list, but unfortunately found just as many foreign words and inflected forms as in the Wiktionary list. Your concern is valid and your help in cleaning up the list would be much appreciated. Thanks. Panda10 10:28, 31 August 2008 (UTC)
- This list is very useful for anyone studying Hungarian. If you see errors, improve the list by fixing them, but don’t throw out the baby with the bathwater. —Stephen 10:43, 1 September 2008 (UTC)
I have tried to clean up the list by removing the English words and moving proper nouns to a secondary list. During the review I noticed that the list is fairly unlikely to be what it claims, i.e. it is unlikely that the top 100 or even the top 1000 words are really the most frequent. It has a strong bias towards IT and political vocabulary, which is made worse by the problems listed above. I would recommend deleting this list, since it is misleading. I've also found that many words appear in the list 3-4 times in different positions...
What would be much better is a list that:
- is case-insensitive
- has passed a spell-checking filter
- excludes abbreviations
- has been passed through a stemmer or a lemmatizer to normalize for morphological variation (i.e. words are listed under a lemma form)
- is compiled from better material, e.g. full text from the Hungarian Online Library plus a filtered Hungarian Wikipedia dump plus news articles or legal texts
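The criteria above could be sketched roughly as follows. This is only a minimal illustration in Python, not the method used for the existing list. The names KNOWN_WORDS, ABBREVIATIONS and LEMMAS are hypothetical stand-ins for a real spell-checking dictionary, an abbreviation list and a stemmer or lemmatizer, and the tiny word sets are assumptions taken from the examples in this discussion.

```python
import re
from collections import Counter

# Hypothetical stand-ins (assumptions for illustration only): a real pipeline
# would use a full Hungarian spell-checking dictionary and a proper stemmer
# or lemmatizer instead of these tiny hand-written tables.
KNOWN_WORDS = {"magyarország", "magyarországon", "és", "volt", "van", "tudom"}
ABBREVIATIONS = {"tom"}
LEMMAS = {"magyarországon": "magyarország", "volt": "van", "tudom": "tud"}

# Runs of letters only (Unicode-aware, so accented letters are kept).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def frequency_list(texts, top_n=10000):
    """Return up to top_n (lemma, count) pairs, most frequent first."""
    counts = Counter()
    for text in texts:
        for token in TOKEN_RE.findall(text):
            word = token.lower()           # case-insensitive counting
            if word in ABBREVIATIONS:      # exclude abbreviations
                continue
            if word not in KNOWN_WORDS:    # crude spell-check / language filter
                continue
            # Count inflected forms under their lemma, so each word
            # appears only once in the resulting list.
            counts[LEMMAS.get(word, word)] += 1
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = ["Magyarország és Magyarországon", "Volt and van, tom tudom"]
    for lemma, count in frequency_list(sample):
        print(lemma, count)
```

Because forms are merged under one lemma before counting, a word cannot show up several times in different positions, and the top_n cut-off keeps the list at the advertised length.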