Wiktionary:Languages
This is a Wiktionary policy, guideline or common practices page. Specifically, it is a policy think tank, working to develop a formal policy.
- For a list of all language codes, see Wiktionary:List of languages.
- For information on how to add or remove a language from Wiktionary, see Help:Adding and removing languages.
Wiktionary includes many words in many languages. This page details the conventions and practices relating to the variety of languages on Wiktionary.
Criteria for inclusion
Language information
To distinguish languages, Wiktionary gives each a unique name and a unique code, which identify it. Other information is also collected.
Language names
Wiktionary calls each language it includes by a distinct name. This name is used in headers, translation tables, categories, appendices, and some other places. Most languages only have one name, but some may be known by multiple names. In this case, one of the language's names is chosen for use in Wiktionary. This name is referred to as the canonical name of the language. Canonical language names are chosen by consensus. Whenever possible, common English names of languages are used, and diacritics are avoided. Attested names (names which meet WT:CFI) are strongly preferred.
Canonical names must be unique, meaning that a name must refer to at most one language. When two or more languages are commonly known by the same name, Wiktionary distinguishes them by choosing a different canonical name for each one, using a variety of means:
- In many cases, the languages are also known by other names. One of those other names is then chosen so that it is unique. For example, the language of the Pyu city-states, though called "Pyu" by some scholars, is called "Tircul" (code: pyx) on Wiktionary, to distinguish it from the language of Papua New Guinea which is called "Pyu" (code: pby).
- Alternative spellings of the same name can also be used to distinguish languages with otherwise identical names. For example, the Riang language of India and Bangladesh (code: ria) goes by the name "Reang" on Wiktionary, to distinguish it from the "Riang" of Burma/Myanmar (code: ril).
- If languages cannot be distinguished by alternative names, the place where each language is spoken is appended in parentheses after its name, as in the case of "Buli (Ghana)" (code: bwu) and "Buli (Indonesia)" (code: bzq).
- If languages go by the same name and are spoken in the same place, they can be disambiguated by their linguistic families. For example, "Mor (Austronesian)" (code: mhz) and "Mor (Papuan)" (code: moq) are both spoken in Indonesia.
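The uniqueness requirement can be illustrated with a minimal Python sketch. It is only an illustration, not the way Wiktionary's own Module:data consistency check is written; the table holds just the examples given above, and the function name `find_duplicate_names` is hypothetical.

```python
from collections import defaultdict

# Code -> canonical name, using only the examples given above.
CANONICAL_NAMES = {
    "pyx": "Tircul",
    "pby": "Pyu",
    "ria": "Reang",
    "ril": "Riang",
    "bwu": "Buli (Ghana)",
    "bzq": "Buli (Indonesia)",
    "mhz": "Mor (Austronesian)",
    "moq": "Mor (Papuan)",
}


def find_duplicate_names(names):
    """Return {canonical name: sorted codes} for names shared by two or more codes."""
    codes_by_name = defaultdict(list)
    for code, name in names.items():
        codes_by_name[name].append(code)
    return {
        name: sorted(codes)
        for name, codes in codes_by_name.items()
        if len(codes) > 1
    }
```

With the table above the result is empty, because every name has been disambiguated. If both Buli languages were simply called "Buli", the function would report that name together with both codes.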
Language codes
Each language on Wiktionary also has a unique code assigned to it, usually consisting of two or three letters. This code is used to identify languages when including templates in entries. Language names are not used in this case because they are longer and less precise, as the above section illustrates. Topical categories also use the language code as part of their names.
The list of standard language codes can be found at Wiktionary:List of languages and the list of special language codes, including etymology-only languages, can be found at the subpage Wiktionary:List of languages/special.
Wiktionary chooses codes for languages as follows, in order of priority:
- If the language has a two-letter code in the ISO 639-1 standard, then that code is used. Wikipedia has a list of ISO 639-1 codes.
  - A few languages are represented on Wiktionary by 639-1 codes the ISO has deprecated. This is generally the case when the ISO has come to consider a lect to be a group of languages, but Wiktionary still considers it a single language. Serbo-Croatian, for example, is represented by sh.
- If the language has a three-letter code in the ISO 639-3 standard, then that code is used. Wikipedia has a list of ISO 639-3 codes. For translingual terms the code mul is used.
- If the language has a three-letter code in the ISO 639-2 standard, then that code is used. This is quite rare.
- Any language which does not have an ISO code, but which is to be included in Wiktionary, has a new Wiktionary-specific "exceptional" code devised for it. This code consists of two parts. The first part is the nearest three-letter (ISO) family code from ISO 639-5; it is followed by a hyphen. The second part is a series of three lowercase letters which approximate the language name. (No digits, upper case letters, etc. are used: IANA tags allow these, case independent, but MediaWiki software is more restrictive.) For example, Gallo is roa-gal: "roa" is the ISO 639-5 code for Romance languages, "gal" abbreviates "Gallo".
  - In a very few cases, the Wikimedia Foundation Language Committee has already devised a code of this form to represent the language, using it in the subdomain part of the URL of the language's wiki projects; in that case, we use the Wikimedia code. For example, the WMF uses map-bms for Banyumasan (the Banyumasan Wikipedia is map-bms.wikipedia.org), so Wiktionary also represents Banyumasan using this code. If the Wikimedia code is of a different form, it is not used by Wiktionary; for example, Tarantino has the Wikimedia code roa-tara, but the Wiktionary code roa-tar.
  - If no family to which the language belongs has an ISO code, or it is not known which family the language belonged to, the prefix mis is used: for example, Kassite is represented by the code mis-kas.
  - If the language is a substrate, the prefix qsb is used rather than qfa-sub.
  - Ancestor or "proto-" languages (which are generally reconstructed, though some are directly attested, like Proto-Norse) are assigned exceptional codes consisting of the language family's code with "-pro" added to the end: Proto-Germanic, for example, is represented by the code gem-pro. Because the entire family code is used as the first part of the code, the code may be longer than seven characters: for example, Proto-Mixe-Zoque is nai-miz-pro.
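The shape of exceptional codes described above can be sketched in Python as follows. This is a rough illustration under stated assumptions: the helper names (`make_exceptional_code`, `make_proto_code`, `looks_exceptional`) are hypothetical and do not correspond to any Wiktionary module, and the pattern covers only the examples on this page.

```python
import re

# Hyphen-separated groups of exactly three lowercase letters,
# e.g. "roa-gal", "mis-kas" or "nai-miz-pro". Assumption: this pattern is
# inferred from the examples on this page, not taken from Wiktionary's code.
EXCEPTIONAL_CODE = re.compile(r"[a-z]{3}(-[a-z]{3})+")


def make_exceptional_code(prefix, abbreviation):
    """Join a family prefix (or "mis"/"qsb") and a three-letter abbreviation."""
    if not re.fullmatch(r"[a-z]{3}", abbreviation):
        raise ValueError("abbreviation must be three lowercase letters")
    return f"{prefix}-{abbreviation}"


def make_proto_code(family_code):
    """Append "-pro" to the whole family code, however long that code is."""
    return f"{family_code}-pro"


def looks_exceptional(code):
    """Check whether a code has the hyphenated three-letter-group shape."""
    return EXCEPTIONAL_CODE.fullmatch(code) is not None
```

For instance, `make_exceptional_code("roa", "gal")` gives roa-gal and `make_proto_code("nai-miz")` gives nai-miz-pro. The Wikimedia code map-bms passes `looks_exceptional`, whereas roa-tara does not, since its second part has four letters; this matches the text's remark that a Wikimedia code "of a different form" is not adopted.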
Not all lects which have been assigned codes by the ISO are assigned codes or included by Wiktionary. This is the case for some constructed languages, for example. There are also many lects to which the ISO has assigned codes but which are not treated as distinct languages on Wiktionary. For example, the ISO assigned Moldovan/Moldavian the 639-1 code mo, but Wiktionary regards it as a form of Romanian and represents it and Romanian by the same code ro. See Wiktionary:Language treatment for more information.
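The priority order for choosing a code, together with the Moldovan/Romanian merger, might be sketched in Python like this. The function `choose_code` and the table `MERGED_INTO` are hypothetical names used only for illustration; the sketch does not model deprecated codes or lects that Wiktionary excludes altogether. The ISO 639-3 code eng for English is not mentioned on this page and is used here only as a familiar example.

```python
# ISO codes that Wiktionary folds into another language's code.
# Only the example from the text is listed.
MERGED_INTO = {"mo": "ro"}


def choose_code(iso639_1=None, iso639_3=None, iso639_2=None, exceptional=None):
    """Return the first available code in the priority order described above."""
    for candidate in (iso639_1, iso639_3, iso639_2, exceptional):
        if candidate:
            # A lect treated as part of another language shares that language's code.
            return MERGED_INTO.get(candidate, candidate)
    raise ValueError("no code available for this language")
```

Under these assumptions, `choose_code(iso639_1="en", iso639_3="eng")` returns en because the two-letter code takes priority, `choose_code(exceptional="roa-gal")` falls through to the exceptional code, and `choose_code(iso639_1="mo")` returns ro.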
Mismatch with Wikimedia codes
In a small number of cases, there is a mismatch between the (typically ISO-derived) code used by Wiktionary to represent a language and the code used by the Wikimedia Foundation. For example, Aromanian is represented on Wiktionary and in ISO 639-3 by the code rup, but the WMF uses the code roa-rup and locates the Aromanian Wikipedia at roa-rup.wikipedia.org. The templates such as Template:wikipedia which Wiktionary uses to link to its sister projects accept only Wiktionary codes. To enable linking to projects (such as the Aromanian Wikipedia) for which the WMF uses special codes, Module:wikimedia languages maps Wiktionary codes to Wikimedia codes, and Module:languages performs the reverse mapping.
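The two-way mapping can be pictured with a small Python sketch. It does not reproduce Module:wikimedia languages or Module:languages, which are Lua modules; the table and function names are hypothetical, and only the two mismatches mentioned on this page are listed.

```python
# Wiktionary code -> Wikimedia code, for the mismatches mentioned on this page.
WIKTIONARY_TO_WIKIMEDIA = {
    "rup": "roa-rup",       # Aromanian
    "roa-tar": "roa-tara",  # Tarantino
}

# The reverse table is derived so that the two directions cannot drift apart.
WIKIMEDIA_TO_WIKTIONARY = {wm: wt for wt, wm in WIKTIONARY_TO_WIKIMEDIA.items()}


def to_wikimedia(code):
    """Map a Wiktionary code to the Wikimedia code; most codes are unchanged."""
    return WIKTIONARY_TO_WIKIMEDIA.get(code, code)


def to_wiktionary(code):
    """Map a Wikimedia code back to the Wiktionary code."""
    return WIKIMEDIA_TO_WIKTIONARY.get(code, code)


def wikipedia_host(code):
    """Build the Wikipedia host name for a Wiktionary language code."""
    return f"{to_wikimedia(code)}.wikipedia.org"
```

Here `wikipedia_host("rup")` yields roa-rup.wikipedia.org, while a code with no mismatch, such as map-bms, passes through unchanged.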
Language families
Wiktionary sorts languages into families. Most families are related through descent from a common ancestor, but a few are merely categories, such as "creoles and pidgins". Wiktionary records which family a language belongs to in the data modules of Module:languages. Like languages, families are represented by unique codes and have unique canonical names.
- English belongs to the West Germanic languages (code: gmw).
- Serbo-Croatian belongs to the South Slavic languages (code: zls).
- Abenaki belongs to the Algonquian languages (code: alg).
- Classical Nahuatl belongs to the Nahuan languages (code: azc-nah).
Some languages are not naturally descended from other languages, but show other origins. These use special types of families:
- The widely-used constructed language Esperanto is an artificial language (code: art).
- Chavacano, a creole language, is grouped under the creole or pidgin languages (code: crp).
Scripts used by a language
Wiktionary also records which scripts (writing systems) each language is written in. This information is used primarily by modules to detect and format non-Latin-alphabet text automatically. Scripts, too, have unique codes and canonical names.
- English is written in the Latin script (code: Latn).
- Serbo-Croatian is written in both the Latin and the Cyrillic scripts (codes: Latn and Cyrl).
Finding and organising terms in a language
Every language has a main category which contains all terms that the English Wiktionary has for that language. This category is named using the canonical name of the language, followed by the word "language". For example, the main category for English is Category:English language. If the canonical name of the language already ends in the word "language", nothing is added (hence Category:American Sign Language).
The main category for a language will have a variety of subcategories, which organise terms in various ways. The most important is the "lemma" category tree, which organises all lemmas in a language by their part of speech. As Wiktionary is always being expanded and improved upon, not all languages have their own categories yet, and certain subcategories may still be empty or missing. Categories are created as needed, when new entries are added to them. When content is added in a language lacking a category, the category can simply be created using the {{auto cat}} template, as long as the name follows the standard format used by other languages.
Languages generally also have a page which contains information that is useful to users who want to create or edit entries in that language. This page is named "Wiktionary:About (canonical name of language)", for example Wiktionary:English entry guidelines or Wiktionary:About Spanish. These pages contain a wide variety of information, depending on what other editors have found useful to note. They may explain which templates to use, specific conventions regarding spelling, pronunciation or transliteration, and more. By convention, a shortcut redirect is created to these pages for easy access, named WT:A(language code). For example, WT:AEN redirects to Wiktionary:About English (for which the code is en).
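The naming conventions in this section can be summarised in a short Python sketch. The function names are hypothetical and the code only restates the rules given above; it does not cover pages that depart from the pattern, such as Wiktionary:English entry guidelines, or shortcuts for hyphenated codes, which this page does not describe.

```python
def main_category(canonical_name):
    """Name of a language's main category."""
    words = canonical_name.split()
    # Nothing is appended if the name already ends in the word "language".
    if words and words[-1].lower() == "language":
        return f"Category:{canonical_name}"
    return f"Category:{canonical_name} language"


def about_page(canonical_name):
    """Conventional name of the page with guidance on editing a language."""
    return f"Wiktionary:About {canonical_name}"


def about_shortcut(code):
    """Conventional shortcut redirect: WT:A followed by the upper-cased code."""
    return f"WT:A{code.upper()}"
```

For example, `main_category("English")` gives Category:English language, `main_category("American Sign Language")` gives Category:American Sign Language, and `about_shortcut("en")` gives WT:AEN.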
Storing and retrieving language information
Templates and modules use a system for storing and retrieving the various pieces of information that may be associated with a language. The module Module:languages is used to retrieve all language-related information from other modules. This module cannot be used directly in a template, so instead there is another module named Module:languages/templates, which allows templates to access the information.
An overview of all basic information about a language, such as its canonical name, alternative names, code, family or scripts, can be looked up at Wiktionary:List of languages (or WT:LL for short). This is useful if you need to look up the code for a particular language, or need to know what the canonical name of a language is.
The data itself is not stored in Module:languages, but instead is contained in a number of data modules (see Category:Language data modules).
For instructions on how to edit this information, see the documentation of any of the data modules.
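As a rough picture of the kind of record and lookup involved, here is a minimal Python sketch. It is not the layout of the actual data modules, which are written in Lua; the `Language` class, its field names and the lookup functions are assumptions made for illustration, and the two entries use only facts stated on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    code: str            # unique language code
    canonical_name: str  # unique canonical name
    family: str          # family code
    scripts: tuple       # script codes the language is written in


LANGUAGES = (
    Language("en", "English", "gmw", ("Latn",)),
    Language("sh", "Serbo-Croatian", "zls", ("Latn", "Cyrl")),
)

# Both codes and canonical names are unique, so either can serve as a key.
BY_CODE = {lang.code: lang for lang in LANGUAGES}
BY_NAME = {lang.canonical_name: lang for lang in LANGUAGES}


def get_by_code(code):
    """Return the Language with this code, or None if it is unknown."""
    return BY_CODE.get(code)


def get_by_canonical_name(name):
    """Return the Language with this canonical name, or None if it is unknown."""
    return BY_NAME.get(name)
```

Keeping the data in one table and deriving both indexes from it mirrors the separation described above between the stored data and the code that retrieves it.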
Etymology-only languages
Some lects (e.g. dialects, chronolects and topolects) are given their own language codes and can be used in many types of templates in place of full language codes, but don't have their own L2 language entries. An example is Classical Persian, which is given a code fa-cls, but whose entries are listed under the ==Persian== header (corresponding to language code fa). The term "etymology-only language" was originally appropriate, as these lects could generally only be used in etymology templates such as {{inh}}, {{bor}} and {{der}}, but their use has now been expanded well beyond these templates, and the term "etymology-only language" is now considered a misnomer. There is consensus (established in the Beer Parlour) to rename them to "language varieties", but this has not been done yet (as of June 2024).
The full list of etymology-only codes can be found in Wiktionary:List of languages/special#Etymology-only languages, and the source module that describes them is Module:etymology languages/data.
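The relationship between an etymology-only code and the full language whose header it uses can be sketched in Python as below. The table and function names are hypothetical and are not taken from Module:etymology languages/data; only the Classical Persian example from the text is included.

```python
# Etymology-only code -> code of the full language it belongs to.
ETYMOLOGY_ONLY_PARENT = {"fa-cls": "fa"}

# Canonical names of full languages (only the example from the text).
CANONICAL_NAMES = {"fa": "Persian"}


def full_language_code(code):
    """Resolve an etymology-only code to its full language; other codes are unchanged."""
    return ETYMOLOGY_ONLY_PARENT.get(code, code)


def l2_header(code):
    """Return the L2 header under which entries for this code are listed."""
    return f"=={CANONICAL_NAMES[full_language_code(code)]}=="
```

Both `l2_header("fa-cls")` and `l2_header("fa")` return ==Persian==, reflecting that Classical Persian has a code of its own but no L2 header of its own.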
See also
- Wiktionary:Dialects
- Wiktionary:Families
- Wiktionary:Scripts
- Module:data consistency check, which checks for non-unique canonical names and other problems
- Wiktionary:Wikimedia language codes for the relationship between WMF project URLs and the language codes