User talk:Francis Tyers

From Wiktionary, the free dictionary

Welcome

Hello, and welcome to Wiktionary!

If you have edited Wikipedia, you probably already know some basics, but Wiktionary does have a few conventions of its own. Please take a moment to learn our basics before jumping in.

First, all articles should be in our standard format, even if they are not yet complete. Please take a moment to familiarize yourself with it. You can use one of our pre-defined article templates by typing the name of a non-existent article into the search box and hitting 'Go'. You can link Wikipedia pages, including your user page, using [[w:pagename]], {{pedia}}, or {{wikipedia}}.

Notice that article titles are case-sensitive and are not capitalized unless, like proper nouns, they are ordinarily capitalized (Poland or January). Also, take a moment to familiarize yourself with our criteria for inclusion, since Wiktionary is not an encyclopedia. Don't go looking for a Village pump – we have a Beer parlour. Note that while Wikipedia likes redirects, Wiktionary deletes most redirects (especially spelling variations) in favor of short entries. Please do not copy entries here from Wikipedia if they are in w:CAT:MtW; they are moved by bot and will appear presently in the Transwiki: namespace.

A further major caveat is that a "Citation" on Wiktionary is synonymous with a "Quotation": we use these primary sources to construct dictionary definitions from evidence of the word being used. "References" (aka "Citations" on Wikipedia) are used predominantly for verifying Etymologies and usage notes, not the definitions themselves. This is partly to avoid copyright violation, and partly to ensure that we don't fall into the trap of adding "list words", or words that, while often defined, are never used in practice.

Warning to experienced Wikipedians:
This is not Wikipedia; it is run in a very different manner, and assuming similarity will likely get you blocked for causing disruption. This has happened too frequently in the past, which leads to a (possibly unfair) prejudice about your motives and actions here. You should note particularly that being bold is not encouraged where it goes against any of our policies or against the community consensus, which is generally strong.

Changing policy pages on Wiktionary is very strongly discouraged. If you think something needs changing, discuss it at WT:BP, following which we may formally VOTE on the issue.

You should also note that Wiktionary has very different user-space policies: we are here to build a dictionary, and your user-page exists only to facilitate that. In particular we have voted to explicitly ban all userboxes with the exception of {{Babel}}; please do not create or use them.

We hope you enjoy editing Wiktionary and being a Wiktionarian. --Ivan Štambuk 06:42, 3 December 2008 (UTC)

mesto

Hello Francis.

I've noticed that you've copy/pasted a few Croatian entries of mine to Bosnian and Serbian, and Serbian to Bosnian and Croatian. That is generally completely OK, except for a few problems in the process that I noted and would like to inform you of:

  • Standard Croatian is based on the Ijekavian reflex of the Common Slavic jat phoneme. Forms written as mesto occur only dialectally (mostly in Kajkavian and Northern Čakavian speeches), and as such represent sub-standard forms which should be corroborated by citations, not being present in e.g. normal dictionaries. Of course, they're no less "Croatian" than the officially codified subdialect, and I myself have been creating lots of those (e.g. misto with citations of Ikavian Čakavian and Štokavian Croatian writers) in the Ikavian reflex of jat, and plan to do so for every other instance where jat occurs (and not only with the reflex of jat, but of *t'/*d', strong jer vocalized as *e not *a etc. - all to the ==Alternative forms== section). A small additional problem is that this written <e> as an Ekavian jat reflex is phonetically distinct from /e/ in Serbian Ekavian, and dialectologists mark this with special diacritics, but for the time being the usual notation is OK (i.e. until the Institute publishes the massive Rječnik hrvatskoga kajkavskoga književnoga jezika on the Web, which should be very soon).
  • This "Bosniak language" (which Bosniaks call bosanski "Bosnian"), newly invented by Muslims in the 1990s, exists only as a codified language (some would say, a mixture of Croatian and Serbian), so entries like mesto, which moreover don't even appear in organic idioms spoken by large populations of Bosniaks, should have no place being formatted as ==Bosnian==.
  • I noticed that you've copied the Croatian inflection template {{hr-decl-dan}} to a Serbian entry. All of those special Croatian inflectional templates are obsolete and should be substituted with the general one, {{hr-decl-noun}}, which accepts all possible inflected forms one by one. Why is that? Because there is this little thing called pitch accent which can unpredictably alternate within the paradigm, and with more than 300 morphological-accentological paradigms existing for nouns only, it would be pointless to create 300+ templates to cover all the corner cases. One day I'll write a bot that will add inflection for all Croatian words (verifying it against HML) and in another pass generate the appropriate accents.

Cheers, and don't hesitate to ask me about anything you find confusing. --Ivan Štambuk 07:07, 3 December 2008 (UTC)

There are {{bs-decl-noun}} and {{sr-decl-noun}}. Templates for inflection tables vary from language to language in layout (one day they'll probably be generally customizable via CSS, but lots of general issues need to be settled before that). No one can copyright inflected forms of a lemma. It says (c) on the site, but it cannot be proved that you actually got it from there. Though in this particular case, if necessary, permission from Mr. Tadić, who is behind the engine, might as well be asked for. Also, before bot-adding (for which you need to start a vote expressing rationale and demonstrating proper functioning), paradigms are usually checked by a literate native speaker or someone similarly knowledgeable. Most of the HML generated forms can be reused for Serbian too, taking into account differences such as 'pisaću' vs. 'pisat ću' etc. --Ivan Štambuk 08:16, 3 December 2008 (UTC)

I don't know whether it would be infringement or not, but I can e.g. use their database to extract the inflectional pattern of a lexeme, and generate the forms via my own algorithm. But I'm sure that possible copyright issues will be resolved when the time comes. The engine is non-commercial, published by an academic institution, after all.
požar seems to be present in all Slavic languages, but I can't find a ref. for a Proto-Slavic reconstruction of *požarъ (everywhere I looked it's described as po+žar, which is in fact a wrong etymology, as it's an inherited word of the Common Slavic lexical stock, not a later morphological derivative). I usually do add etyms for all Slavic languages that I see that have WT entries, but when I edit one section I just add it to it. --Ivan Štambuk 09:23, 3 December 2008 (UTC)
  • [1] - Note that Čakavian is a Croat-only dialect and has no place in a Serbian language entry. --Ivan Štambuk 02:03, 6 December 2008 (UTC)
  • [2] - also note that per WT:ELE language sections are supposed to be separated by ---- and sorted alphabetically (the latter is enforced by the AutoFormat bot). --Ivan Štambuk 02:08, 6 December 2008 (UTC)
  • I am 100% literate only in Croatian, and sincerely don't have much desire to indulge in the thorough expansion and amending of Serbian entries, esp. with the dual-script maintenance hell and the subconscious prejudice of "is this literary Serbian of today?". Sometimes when I'm in the mood, I do create bs/sr entries as a part of the Slavic cognates series (cf. the contents of Category:Proto-Slavic language). Essentially, my stance is that cloning the existing Croatian entries into bs/sr subsections doesn't add any real new content to WT, and hence is not a type of mental activity worth pursuing compared with the millions of other interesting tasks pending my intervention that this project presents. --Ivan Štambuk 15:53, 6 December 2008 (UTC)
Because Serbian is written in both Cyrillic and Latin, and that is the official policy of the standard language guaranteed by the constitution (though the exact formulation in the constitution gives mild preference to Cyrillic IIRC). Cyrillic is the traditional script used for centuries, closely associated with Serbian national pride, literary tradition, Orthodoxy and the association with Eastern cultural provenience. The usage of Latin script is chiefly a result of various "unification" efforts with Croatian envisioned by 19th century Romantic and naive "Illyrians", but ultimately coming to be enforced by certain totalitarian regimes whose names and practices we don't need to mention explicitly here and now. Most native Serbs are familiar with both scripts, but 90% of Serbs in diaspora can only read Latin script, and they represent a much more numerous potential user base for Wiktionary than domicile Serbs (suppose they want to improve their language skills). Though when it comes to that, the only 2 Serbian editors editing from IPs I've managed to detect during the last year acted in a highly disturbing manner (one of them is a well-known sci.lang troll allergic to Ottoman Turkish loanword etymologies, who has been replacing them with his imaginative Slavic-root alternatives; the other one had a high appetite for adding words of exclusive Croatian usage as ==Serbian==, and was probably not that literate as he made numerous errors - just the other day he added a word meaning 'fist' but translated it as 'hand' - mistakes on basic words such as this usually indicate non-native proficiency). But nevertheless, the potential for "passive users" is still out there, even though it cannot be directly measured (90% of Web users just use content, and don't attempt to create it).
You can indeed generate Latin script out of Cyrillic for Serbian, but not vice versa, and there is no way for a machine to know where an 'nj' sequence in writing is monophonemic /nj/ and where it is /n/+/j/. Admittedly, words like this are rare (инјекција, надживети..), but should nevertheless be checked by a human.
Note that per WT:ELE entries can have multiple etymologies, multiple PoS sections, and an arbitrary number of meanings for each one, each meaning being marked with context labels, or additional qualifiers such as {{pf}}/{{impf}} in the inflection line. Not to mention additional sections for the inflection, related terms... It is best done manually by means of the "copy-paste" method where applicable, for a template such as that wouldn't scale beyond the creation of the most primitive type of entries. --Ivan Štambuk 17:14, 6 December 2008 (UTC)
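The one-way nature of the script conversion mentioned above can be sketched in a few lines of Python. This is only an illustration, not any existing bot's code: the function name cyr_to_lat is made up here, and the table covers lowercase letters only (input is lowercased first).

```python
# Serbian Cyrillic -> Latin mapping (lowercase letters only).
# Three Cyrillic letters map to Latin digraphs: љ -> lj, њ -> nj, џ -> dž.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def cyr_to_lat(word: str) -> str:
    """Transliterate a Serbian Cyrillic word to Latin script (hypothetical helper).

    The word is lowercased first; characters outside the table pass through unchanged.
    """
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in word.lower())


# Cyrillic -> Latin is deterministic:
print(cyr_to_lat("инјекција"))  # н + ј, two separate letters
print(cyr_to_lat("коњ"))        # њ, a single letter

# The reverse is not: the single letter њ and the sequence нј both come out as "nj",
# so a Latin "nj" cannot be mapped back without knowing the word.
print(cyr_to_lat("њ") == cyr_to_lat("нј"))
```

The same collision exists for дж versus џ (as in надживети), which is why the Latin form cannot be converted back mechanically.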
Well, Cyrillic script often serves in practise as a badge of Serbdom, esp. when confronted with Roman script (great "reformer" Karadžić's roots of it, and the "najsavršenije pismo na svetu" arguments are often emphasised). From my experience, the younger population is not that bothered with the cultural prominence of Cyrillic, and some are utterly disgusted by it (like the owner of the biggest Serbian [and Balkans] IT forum [3] who made a script that converts submitted Cyrillic to Latin... a few years ago there was even a "riot" amongst some of the moderators and users (proud Serbs :) for not being able to write in Cyrillic, and some important folks left...)
I am not familiar with any online Serbian language corpus. If you have a list of extracted and properly spelled Serbian lexemes in Cyrillic that can be compared to a Roman-script list for the purpose of extracting the aforementioned exceptions, feel free to dump it somewhere on the Web :)
Try writing any kind of non-trivial code in the MediaWiki templates, one of the most horrid and degenerate mini-PLs ever conceived (the only one I know that doesn't use lazy evaluation for conditionals), and you'll see how simple things can get ugly. c/p is much easier. --Ivan Štambuk 17:57, 6 December 2008 (UTC)
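For the extraction of exceptions suggested above, one possible shortcut, assuming a Cyrillic word list is available, is to flag every word containing a two-letter sequence whose Latin spelling coincides with a digraph. The sequence list and the function names below are assumptions made for this sketch, not part of any existing tool.

```python
# Cyrillic two-letter sequences that are spelled like the Latin digraphs nj, lj, dž
# once transliterated (assumed list for this sketch).
AMBIGUOUS_SEQUENCES = ("нј", "лј", "дж")


def needs_human_check(cyrillic_word: str) -> bool:
    """True if the word's Latin spelling would contain a misleading digraph."""
    word = cyrillic_word.lower()
    return any(seq in word for seq in AMBIGUOUS_SEQUENCES)


def find_exceptions(words):
    """Return the words from a Cyrillic word list that should be checked by hand."""
    return [w for w in words if needs_human_check(w)]


sample = ["коњ", "инјекција", "надживети", "љубав"]
print(find_exceptions(sample))
```

Words such as коњ and љубав are not flagged, because their digraph-like Latin spelling comes from a single Cyrillic letter.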

language order

Note: When a page has a section for more than one language, Wiktionary puts those languages in alphabetical order by their English name, with Translingual and English coming first (if there is such a section). So, if a page has a Croatian, Serbian, and Bosnian section, the languages should be ordered: Bosnian, Croatian, Serbian. --EncycloPetey 21:30, 4 December 2008 (UTC)

I think there is a bot that does it after each change, isn't there? So it's no problem if I keep doing it the easier way. - Francis Tyers 22:01, 4 December 2008 (UTC)
My primary language is English. Yes, there is a bot that makes changes, but it doesn't always make them correctly (just 99.99% of the time). If you make an error in header level (which is easy to do), the page could be reformatted incorrectly by the bot. It is therefore better practice to put the items in order to begin with. --EncycloPetey 22:14, 4 December 2008 (UTC)
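The ordering rule described in this thread can be written as a small sort key. This is a sketch in Python of the rule as stated here (Translingual, then English, then the rest alphabetically by English name); it is not the bot's actual code, and the function name is made up.

```python
# Sections that always come first, in this order (per the rule quoted above).
PRIORITY = {"Translingual": 0, "English": 1}


def order_language_sections(names):
    """Sort language section names: Translingual, English, then alphabetical."""
    return sorted(names, key=lambda name: (PRIORITY.get(name, 2), name))


print(order_language_sections(["Croatian", "Serbian", "Bosnian"]))
print(order_language_sections(["Serbian", "English", "Bosnian", "Translingual"]))
```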

Inflection tables

Hi, if you use {| class="inflection-table" then red-links will appear black without the need for #ifexist trickery. Conrad.Irwin 10:14, 26 April 2009 (UTC)

Wiktionary:Votes/2010-04/Voting_policy

Just letting you know of this surprisingly contentious vote. Input from more Wiktionarians such as yourself would be much appreciated. Thanks. The uſer hight Bogorm converſation 12:42, 22 May 2010 (UTC)

Etymologies

Please don't copy and paste from other websites. It is a copyright violation to do so. Mglovesfun (talk) 11:05, 30 September 2010 (UTC)

Database dump analysis

Am I correct in assuming from Vahagn's talkpage that you are able to do database dump analysis? I don't want to pressure you, but if you can, there are some (unfortunately non-Armenian) lists of entries that I could really use. Well, if you're able and interested, I'd be glad to hear it, but even if you just don't want to, that's fine too. Thanks! —Μετάknowledgediscuss/deeds 15:28, 2 December 2012 (UTC)

(I'll respond here, you can respond wherever is convenient) Thank you so much! The lists I need are of the following types:
  1. entries in a certain language with a certain character in the pagetitle
  2. entries in a certain language with a certain character in the pagetitle when that character is not next to or combined with another character
  3. entries in a certain category that lack a certain string
Are you able to do any of those? —Μετάknowledgediscuss/deeds 15:38, 2 December 2012 (UTC)
The third one is easiest. What category and what string? - Francis Tyers (talk) 15:45, 2 December 2012 (UTC)
Awesome! There's a few... probably the big ones are members of Category:Yiddish nouns without the string {{yi-noun| and members of Category:Yiddish adjectives without the string {{yi-adj| or {{yi-adjective|. Thanks! —Μετάknowledgediscuss/deeds 15:49, 2 December 2012 (UTC)
(If you don't want to fill up your talkpage, you can put them on User:Metaknowledge/Yiddish headword.) —Μετάknowledgediscuss/deeds 16:17, 2 December 2012 (UTC)
  • Nouns: done.
  • Adjectives: done

Again, thank you so much for your gracious help! —Μετάknowledgediscuss/deeds 03:37, 3 December 2012 (UTC)

That should be it, if you find any bugs, let me know and I'll try and regenerate. - Francis Tyers (talk) 14:17, 3 December 2012 (UTC)
Well, it seems suspicious that there are so few entries, especially so few adjectives, but I guess that's a good thing (less work for me to do!). Thank you again! —Μετάknowledgediscuss/deeds 04:58, 4 December 2012 (UTC)
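For readers wondering what such list requests amount to, here is a rough Python sketch of the first and third types. It is not the script that was actually used: it assumes the dump has already been reduced to (title, wikitext) pairs for the pages of interest, the function names are hypothetical, and the sample pages are invented toy data.

```python
def entries_lacking(pages, required_any):
    """Titles of pages whose wikitext contains none of the given strings.

    `pages` is an iterable of (title, wikitext) pairs, assumed to be the
    members of one category already extracted from a dump.
    """
    return [title for title, text in pages
            if not any(s in text for s in required_any)]


def titles_with_char(pages, language, char):
    """Titles containing `char` on pages that have a section for `language`.

    The section test is a naive substring match on the level-2 header.
    """
    header = f"=={language}=="
    return [title for title, text in pages
            if char in title and header in text]


# Invented toy data standing in for dump contents.
pages = [
    ("alef", "==Yiddish==\n{{yi-noun|g=m}}"),
    ("beys", "==Yiddish==\n'''beys'''"),
    ("giml", "==Yiddish==\n{{yi-adjective|x}}"),
]

print(entries_lacking(pages, ["{{yi-noun|"]))
print(entries_lacking(pages, ["{{yi-adj|", "{{yi-adjective|"]))
print(titles_with_char(pages, "Yiddish", "y"))
```

The second request type (a character not combined with another) would need a precise definition of "combined" before it could be sketched, so it is left out here.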

Armenian corpus

Hi! This analyzed corpus is out of copyright, if you're interested. --Vahag (talk) 17:15, 8 December 2012 (UTC)

Wow!!!!! Nice :D It will be really useful for testing the analyser! - Francis Tyers (talk) 17:37, 8 December 2012 (UTC)