Wiktionary:Corpora
This page is dedicated to listing collections of texts useful for the work of creating a dictionary. These collections are often known as "corpora" or less commonly "corpuses". Many of them feature functions like full-text search, term frequency information and collocation search.
For a more user-friendly introduction to some of the most prominent corpora, as well as other resources like dictionaries, see Wiktionary:Quotations/Resources. Another page, Wiktionary:Searchable external archives, focuses more specifically on resources that can reliably provide citations meeting Wiktionary's criteria for inclusion.
Note that corpora that contain text in multiple languages, but in which English text makes up a significant portion, are listed in the English table below, with the word "Multilingual" included in their "Dialect" entry.
If there are any other resources that you know of which aren't listed here, please do add them or suggest them on the talk page.
English
Name | Resource Type | Size in words | Size in texts | Dialect | Start year | End year | Original Medium | Available Medium | Genre | Re-use restrictions | Access restrictions | Date of entry update |
---|---|---|---|---|---|---|---|---|---|---|---|---|
News on the Web (NOW) | Corpus, Tagged | 10^10 * 2 | 10^7 * 3 | (Various)[1][2] | 2010 | Present | Written, Computer, Internet | Written | Nonfiction, News | None | Free registration required | 2022/10/31 |
iWeb: The Intelligent Web-based Corpus | Corpus, Tagged | 10^10 * 1 | 10^7 * 2 | (Various)[3][2] | 2017 | 2017 | Written, Computer, Internet | Written | General, esp. Nonfiction | None | Free registration required | 2022/10/31 |
Global Web-Based English (GloWbE) | Corpus, Tagged | 10^9 * 2 | 10^5 * 2 | (Various)[1][2] | 2012 | 2013 | Written, Computer, Internet | Written | General, esp. Nonfiction | None | Free registration required | 2022/10/31 |
Wikipedia Corpus | Corpus, Tagged | 10^9 * 2 | 10^6 * 4 | (Various) | 2014 | 2014 | Written, Computer, Internet | Written | Nonfiction, Encyclopedia | None | Free registration required | 2022/10/30 |
Coronavirus Corpus[4] | Corpus, Tagged | 10^9 * 2 | 10^6 * 2 | (Various)[1][2] | 2020 | 2022 | Written, Computer, Internet | Written | Nonfiction, News, COVID-19 | None | Free registration required | 2024/05/10 |
Corpus of Contemporary American English (COCA) | Corpus, Tagged | 10^9 * 1 | 10^5 * 5 | American | 1990 | 2019 | Multimedia | Written | General, esp. Nonfiction | None | Free registration required | 2023/03/27 |
Early English Books Online (EEBO) | Corpus, Tagged | 10^8 * 8 | 10^4 * 3 | British | 1470 (apprx.) | 1690 (apprx.) | Written, Books, Print | Written | General | None | Free registration required | 2022/10/30 |
Early English Books Online (EEBO) TCP | Corpus, Untagged | - | 10^4 * 6 | British | 1475 | 1700 | Written, Books, Print | Written | General | None | None | 2022/10/31 |
Early English Books Online (EEBO, V2) | Corpus, Untagged | 10^8 * 6 | 10^4 * 1 | British | 1470 (apprx.) | 1690 (apprx.) | Written, Books, Print | Written | General | None | Free registration required | 2022/11/02 |
Filmot | Library | - | 10^8 * 5 | (Various, Multilingual) | 2005 (apprx.) | Present | Spoken, General | Audio-visual | General, esp. Nonfiction | None | None | 2022/10/30 |
YouGlish | Library | - | 10^8 * 1 | (Various) | 2005 (apprx.) | Present | Spoken, Formal[5] | Audio-visual | Nonfiction | None | None | 2022/10/30 |
TED Corpus Search Engine (TCSE) | Corpus, Tagged | 10^7 * 1 | 10^3 * 5 | (Various) | 2007 | 2023[6] | Spoken, Formal, Speeches | Audio-visual | Nonfiction | None | None | 2022/10/30 |
Archive-It Collections | Library | - | 10^6 * 2 | (Various) | 1996 | Present | Written, Computer, Internet | Written | General, esp. Nonfiction | None | None | 2022/10/30 |
ACL Anthology Reference Corpus (ARC) | Corpus, Tagged | 10^7 * 6 | 10^4 * 2 | (Various) | 1979 | 2015 | Written, Periodicals, Journals | Written | Nonfiction, Academic, NLP | None | None | 2022/10/30 |
COVID-19 Open Research Dataset (CORD-19) | Corpus, Tagged | 10^9 * 3 | 10^5 * 7 | (Various) | 1922[7] | 2020[8] | Written, Periodicals, Journals | Written | Nonfiction, Academic | None | None | 2022/10/30 |
EcoLexicon English | Corpus, Tagged | 10^7 * 2 | 10^3 * 2 | (Various) | 1973 | 2016 | Written | Written | Nonfiction, Academic, Environment | None | None | 2022/10/30 |
Lipstick Alley | Social Media | - | - | American, African | 2000 | Present | Written, Computer, Social Media, Forum | Written | General, esp. Nonfiction, Celebrity News | None | Free registration required[9] | 2023/06/23 |
Corpus of Regional African American Language (CORAAL) | Corpus, Untagged | 10^6 * 1 | 10^2 * 2 | American, African | 1968 | 2017 | Spoken, Interviews | Audio | General, Sociolinguistic interviews[10] | None | None | 2022/10/31 |
Google Trends | Trends | - | - | (Various, Multilingual) | 2004 | Present | Written, Computer, Internet Searches | Written | General | None | None | 2022/10/31 |
Google Ngrams | Trends | 10^7 * 2[11][12] | 10^12 * 2[11][13][12] | (Various, Multilingual)[14] | 1470[12] | Present | Multimedia | Written | General | None | None | 2022/10/31 |
Google Books | Library | - | 10^7 * 4 | (Various, Multilingual) | 1400 (apprx.) | Present | Multimedia | Written | General | None | None | 2022/10/31 |
Google Scholar | Library | - | 10^8 * 1[15] | (Various, Multilingual) | 1700 (apprx.) | Present | Written, Periodicals, Journals | Written | Nonfiction, Academic; Law | None | None | 2023/01/19 |
Corpus of Middle English Prose and Verse | Corpus, Untagged | - | 10^2 * 3 | Middle | 1000 | 1500 | Written, Books, Print | Written | General, esp. Nonfiction | None | None | 2022/10/31 |
Michigan Corpus of Upper-level Student Papers (MICUSP) | Corpus, Untagged | 10^6 * 3 | 10^2 * 8 | (Various, ESL[16]) | 2002 | 2009 | Written, College Work | Written | Nonfiction, Academic | Restrictions on commercial use[17] | None | 2022/12/28 |
Michigan Corpus of Academic Spoken English (MiCASE)[18][19] | Corpus, Untagged | 10^6 * 2 | 10^2 * 2 | American (mostly) | 1998 | 2001 | Spoken, Formal, Speeches | Audio,[18] Written | Nonfiction, Academic | Restrictions on commercial use[20] | None | 2022/10/31 |
British Academic Spoken English Corpus (BASE) | Corpus, Tagged | 10^6 * 1 | 10^2 * 2 | British | 1998 | 2005 | Spoken, Formal, Speeches | Written | Nonfiction, Academic | None | None | 2022/11/02 |
British Academic Written English Corpus (BAWE) | Corpus, Tagged | 10^6 * 7 | 10^3 * 3 | British | 2000 | 2007 | Written, College Work | Written | Nonfiction, Academic | None | None | 2022/11/02 |
Public Papers of the Presidents of the United States | Library | - | 10^2 * 1 | American | 1938 | 2002 | Multimedia | Written | Nonfiction, Politics | None | None | 2023/06/17 |
Google Groups | Social Media | - | - | (Various) | 1981 | 2024 | Written, Computer, Social Media, Usenet | Written | General, esp. Nonfiction | None | None | 2024/03/20 |
UsenetArchives.com[21] | Social Media | - | 10^8 * 7 | (Various) | 1981[22] | Present? | Written, Computer, Social Media, Usenet | Written | General, esp. Nonfiction | None | None | 2024/03/20 |
Narkive | Social Media | - | 10^8 * 3 | (Various) | 1990 (apprx.) | Present | Written, Computer, Social Media, Usenet | Written | General, esp. Nonfiction | None | None | 2024/03/20 |
Europeana | Library | - | 10^7 * 2 | (Various, Multilingual) | 0400 (apprx.) | Present | Multimedia | Multimedia | General | None | None | 2022/10/31 |
Internet Archive | Library | - | 10^7 * 6 | (Various, Multilingual) | - | Present | Multimedia | Multimedia | General | None | Free registration required | 2022/10/31 |
Eighteenth Century Collections Online (ECCO) TCP | Corpus, Untagged | - | 10^3 * 2 | (Various, Multilingual) | 1701 | 1800 | Written, Books, Print | Written | General | None | None | 2022/10/31 |
Old Bailey Corpus (OBC) 2.0 | Corpus, Tagged | 10^7 * 4 | 10^6 * 1 | British (various dialects) | 1720 | 1913 | Spoken, Formal, Court Proceedings | Written | Nonfiction, Law, Courts, Criminal | None | Free registration required | 2022/10/31 |
Old Bailey Proceedings Online | Corpus, Untagged | 10^8 * 1 | - | British (various dialects) | 1674 | 1913 | Spoken, Formal, Court Proceedings | Written | Nonfiction, Law, Courts, Criminal | None | None | 2022/10/31 |
Royal Society Corpus (RSC) 6.0.1 Open | Corpus, Tagged | 10^7 * 8 | 10^4 * 2 | British | 1665 | 1920 | Written, Periodicals, Journals, Print | Written | Nonfiction, Academic | None | Free registration required | 2022/10/31 |
Royal Society Corpus (RSC) 6.0.4 Open with Topics | Corpus, Tagged | 10^8 * 3 | 10^4 * 2 | British | 1665 | 1920 | Written, Periodicals, Journals, Print | Written | Nonfiction, Academic | None | Free registration required | 2022/10/31 |
Twitter | Social Media | 10^12 * 3[23] | - | (Various, Multilingual) | 2005 | Present | Written, Computer, Social Media, Twitter | Written | General, esp. Nonfiction | None | None | 2022/10/31 |
SocialGrep (Reddit) Corpora | Corpus, Untagged | - | 10^7 * 9 | (Various) | 2005 (apprx.) | Present? | Written, Computer, Social Media, Reddit | Written | General, esp. Nonfiction | None | None | 2022/10/31 |
Europarl 7 Sample, English | Corpus, Tagged | 10^7 * 2 | 10^3 * 8 | International/ELF[24] | 2007 | 2011 | Spoken, Formal, Legislative Proceedings | Written | Nonfiction, Law, Legislatures | None | None | 2022/11/01 |
Europarl 3, English | Corpus, Tagged | 10^7 * 2 | 10^2 * 7 | International/ELF[24] | 1996 | 2006 | Spoken, Formal, Legislative Proceedings | Written | Nonfiction, Law, Legislatures | None | Free registration required | 2022/11/01 |
TARA | Corpus, Tagged | 10^5 * 9 | 10^4 * 2 | British | 2006 (apprx.) | 2006 (apprx.) | Written, Periodicals, Newspapers, Print | Written | Nonfiction, News | None | Free registration required | 2022/11/01 |
British National Corpus (BNC) | Corpus, Tagged | 10^8 * 1 | 10^3 * 4 | British | 1960 | 1993 | Multimedia | Written | General | None | Free registration required | 2022/11/01 |
British National Corpus (BNC) Sampler | Corpus, Tagged | 10^6 * 2 | 10^2 * 2 | British | 1975 | 1993 | Multimedia | Written | General | None | Free registration required | 2022/11/01 |
Phrases in English (BNC)[25][26] | Corpus, Tagged | 10^8 * 1 | 10^3 * 4 | British | 1960 | 1993 | Multimedia | Written | General | None | None | 2023/02/12 |
Just The Word (BNC)[25] | Corpus, Tagged | 10^8 * 1 | 10^3 * 4 | British | 1960 | 1993 | Multimedia | Written | General | None | None | 2023/02/12 |
British English 2006 (BE06) | Corpus, Tagged | 10^6 * 1 | 10^2 * 5 | British | 2003 | 2008 | Written | Written | General | None | Free registration required | 2022/11/01 |
American English 2006 (AME06) | Corpus, Tagged | 10^6 * 1 | 10^2 * 5 | American | 2006 (apprx.) | 2006 (apprx.) | Written | Written | General | None | Free registration required | 2022/11/22 |
Hansard Corpus (British Parliament) | Corpus, Tagged | 10^9 * 2 | 10^6 * 8 | British | 1803 | 2005 | Spoken, Formal, Legislative Proceedings | Written | Nonfiction, Law, Legislatures | None | Free registration required | 2022/11/01 |
British Parliament Hansard | Library | - | - | British | 1800 | Present | Spoken, Formal, Legislative Proceedings | Written | Nonfiction, Law, Legislatures | None | None | 2022/11/01 |
Australian Parliament Hansard | Library | - | - | Australian | 1901 | Present | Spoken, Formal, Legislative Proceedings | Written | Nonfiction, Law, Legislatures | None | None | 2022/11/01 |
Canadian House of Commons Hansard | Library | - | - | Canadian | 2002 | Present | Spoken, Formal, Legislative Proceedings | Written | Nonfiction, Law, Legislatures | None | None | 2022/11/01 |
New Zealand Parliament Hansard | Library | - | - | New Zealand | 1854 | Present | Spoken, Formal, Legislative Proceedings | Written | Nonfiction, Law, Legislatures | None | None | 2022/11/01 |
GovInfo (United States) | Library | - | - | American | 1793 | Present | Multimedia | Written | Nonfiction, Law | None | None | 2022/11/01 |
Transgender Usenet Archive (TUA) | Corpus, Untagged | - | 10^5 * 4 | (Various) | 1994 | 2013 | Written, Computer, Social Media, Usenet | Written | General, Transgender Topics | None | None | 2022/11/01 |
Science Forums | Social Media | - | 10^5 * 1 | (Various) | 1992 | 2014 | Written, Computer, Social Media, BBS | Written | Nonfiction, Science | None | None | 2022/11/01 |
TextFiles.com | Library | - | - | (Various) | 1980 (apprx.) | 1995 (apprx.) | Multimedia | Multimedia | General, esp. Nonfiction, Technology | None | None | 2022/11/01 |
LDS General Conference Corpus | Corpus, Tagged | 10^7 * 3 | 10^4 * 1 | American | 1851 | Present | Spoken, Formal, Speeches | Written | Religious, Latter Day Saints | None | None | 2022/11/01 |
FidoNet Echomail Archive | Social Media | - | - | (Various) | 1990 (apprx.) | 2016 (apprx.) | Written, Computer, Social Media, FidoNet | Written | General, esp. Nonfiction, Technology | None | None | 2022/11/01 |
FidoNet HolySmoke Archive | Library | - | 10^5 * 4 | (Various) | 1993 | 2004 | Written, Computer, Social Media, FidoNet | Written | Nonfiction, Religion | None | None | 2022/11/02 |
Dúchas Project | Library | - | 10^6 * 2 | Irish | 1900 (apprx.) | 1940 (apprx.) | Multimedia | Written | Fiction, Folklore | None | None | 2022/11/02 |
Freiburg-Brown Corpus of American English (FROWN) | Corpus, Tagged | 10^6 * 1 | 10^2 * 5 | American | 1992 | 1992 | Written, Print | Written | General | None | Free registration required | 2022/11/02 |
Brown Corpus Family | Corpus, Tagged | 10^6 * 1 | 10^3 * 2 | - | - | - | Written, Print | Written | General | None | Free registration required | 2022/11/02 |
Brown Family (C8 tags) | Corpus, Tagged | 10^6 * 6 | 10^3 * 2 | (Various) | 1931 | 1991 | Written, Print | Written | General | None | Free registration required | 2022/11/02 |
Brown Corpus[27] | Corpus, Tagged | 10^6 * 1 | 10^3 * 1 | American | 1961 | 1961 | Written, Print | Written | General | None | None | 2022/11/02 |
Corpus of English Dialogues | Corpus, Tagged | 10^6 * 1 | 10^2 * 2 | British(?) | 1560 | 1760 | Multimedia | Written | General, Dialogues | None | Free registration required | 2022/11/02 |
Florence Early English Newspapers (FEEN) | Corpus, Tagged | 10^5 * 3 | -[28] | British(?) | 1620 | 1649 | Written, Periodicals, Newspapers, Print | Written | Nonfiction, News | None | None | 2023/03/27 |
Transhistorical Corpus of Written English | Corpus, Tagged | 10^5 * 5 | 10^2 * 8 | (Various) | 1405 | 2019 | Written | Written | General | None | None | 2022/11/02 |
Linguistic Landscape Corpus | Corpus, Tagged | 10^6 * 5 | 10^2 * 6 | (Various) | 1997 | 2018 | Written | Written | Nonfiction, Academic | None | Free registration required | 2022/11/02 |
ICNALE Online[29] | Corpus, Tagged | 10^6 * 4 | 10^4 * 2 | (Various, ESL[16])[30] | 2007 (apprx.) | 2022 (apprx.) | Multimedia, College Work | Multimedia | Nonfiction, Academic | None | None | 2022/11/02 |
European Football Championship Interpreting Corpus (EFCIC) | Corpus, Tagged | 10^4 * 1 | 10^1 * 1 | - | 2020 | 2020 | Spoken, Entertainment, Interpretation, Interview | Written | Nonfiction, Sports | None | None | 2022/11/02 |
UkWac Complete[31] | Corpus, Tagged | 10^9 * 2 | 10^6 * 3 | British[2] | 2005 (apprx.) | 2007 (apprx.) | Written, Computer, Internet | Written | General | None | None | 2022/11/02 |
UkWac Small[31] | Corpus, Tagged | 10^7 * 8 | 10^5 * 1 | British[2] | 2005 (apprx.) | 2007 (apprx.) | Written, Computer, Internet | Written | General | None | None | 2022/11/02 |
Postcard Archive @ Florida State University[32] | Library | - | 10^3 * 3[33] | (Various) | 1829 (apprx.) | 2016 (apprx.) | Written, Postcards | Written | Nonfiction, Postcards | None | None | 2022/11/06 |
PlayPhrase.me | Corpus, Tagged | - | 10^6 * 8[34] | (Various) | 1970 (apprx.) | Present? | Spoken, Entertainment, Movies | Audio-visual | Fiction, Movies | None | None | 2022/11/07 |
European Union DGT-UD: English | Corpus, Tagged | 10^8 * 1 | 10^4 * 5 | International/ELF[24] | 1948 (apprx.) | 2016 | Written, Legislative Acts | Written | Nonfiction, Law, Legislatures | None | None | 2022/11/16 |
Opus-MontenegrinSubs 1.0: English | Corpus, Tagged | 10^5 * 5 | 10^2 * 2 | (Various) | 2007 | 2013 | Spoken, Entertainment, Television | Written | Fiction, Television | None | None | 2022/11/16 |
Archive of Our Own (AO3) | Library | - | 10^7 * 1 | (Various) | 2007 | Present | Written, Computer, Internet | Written | Fiction, Short Stories, Fan Works[35] | None | None | 2022/11/22 |
SCP Foundation | Library | - | 10^3 * 2 | (Various) | 2007 | Present | Written, Computer, Internet | Written | Fiction, Short Stories, Sci-Fi[35] | None | None | 2022/11/22 |
NEWS-GB (British newspapers) | Corpus, Tagged | 10^8 * 2 | - | British | 2004 (apprx.) | 2004 (apprx.) | Written, Print | Written | Nonfiction, News | None | None | 2022/11/22 |
INTERNET-EN | Corpus, Tagged | 10^8 * 2 | 10^4 * 5 | (Various) | 2006 (apprx.) | 2006 (apprx.) | Written, Computer, Internet | Written | General | None | None | 2022/11/22 |
BLOGS-EN (Political blogs) | Corpus, Tagged | 10^8 * 5 | - | (Various) | 2008 (apprx.) | 2008 (apprx.) | Written, Computer, Internet | Written | Nonfiction, Politics | None | None | 2022/11/22 |
Manually Annotated Sub-Corpus (MASC) | Library[36] | 10^5 * 5 | 10^2 * 4 | American | 1990 (apprx.) | 2010 (apprx.) | Multimedia | Written | General | None | None | 2022/11/23 |
Lancaster Newsbooks Corpus (1654 part) | Corpus, Tagged | 10^5 * 9 | 10^2 * 2 | British | 1653 | 1654 | Written, Periodicals, Newspaper, Print | Written | Nonfiction, News | None | Free registration required | 2022/11/23 |
The Mail Archive | Library | - | 10^8 * 2 | (Various) | 1990 | Present | Written, Computer, Mailing List | Written | Nonfiction, esp. Coding and Computers | None | None | 2022/11/26 |
CataList (LISTSERV catalog)[37] | Library | - | -[38] | (Various) | 1990 (apprx.) | Present | Written, Computer, Mailing List | Written | Nonfiction | None | None | 2022/11/28 |
United Nations Digital Library | Library | - | 10^5 * 7[39] | (Various, International/ELF[24]) | 1875[40] | Present | Multimedia | Multimedia | Nonfiction, Politics | None | None | 2022/11/29 |
Genius.com | Library | - | - | (Various, Multilingual) | 1900 (apprx.) | Present | Spoken, Entertainment, Music | Written | General, Music | None | None | 2022/12/06 |
Chronicling America | Library | - | - | American | 1777 | 1963 | Written, Periodicals, Newspaper, Print | Written | Nonfiction, News | None | None | 2022/12/06 |
Library of Congress | Library | - | 10^6 * 3[41] | (Various, Multilingual) | 1470 (apprx.) | Present | Multimedia | Multimedia | General | None | None | 2022/12/06 |
World Radio History | Library | - | 10^5 * 1[42] | (Various, Multilingual)[43] | 1900 (apprx.) | Present | Written, Periodicals, Magazines, Print | Written | Nonfiction, Radio; Television; Music | None | None | 2022/12/06 |
Google News Newspapers Archive | Library | - | 10^6 * 6[44][45] | (Various, Multilingual)[43] | 1738 (apprx.) | 2009 | Written, Periodicals, Magazines, Print | Written | Nonfiction, News | None | None | 2022/12/14 |
VESPA[46] | Corpus | 10^6 * 2 | 10^2 * 9 | International/ESL[16] | 2008 (apprx.) | 2008 (apprx.) | Written, College Work | Written | Nonfiction, Academic | Restriction to non-profit educational use only[47] | Free registration required | 2022/12/28 |
I-EN (Internet English Corpus) | Corpus, Tagged | 10^8 * 2 | - | (Various) | 2005 | 2005 | Written, Computer, Internet | Written | Nonfiction, News? | None | None | 2022/12/28 |
I-EN-CC (Internet English Creative Commons Corpus) | Corpus, Tagged | 10^8 * 2 | - | (Various) | 2005 (apprx.) | 2005 (apprx.) | Written, Computer, Internet | Written | Nonfiction, News? | None | None | 2022/12/28 |
Springfield! Springfield! | Library | - | 10^5 * 2 | (Various) | 1910 (apprx.) | Present | Spoken, Entertainment, Movies and Television | Written | General | None | None | 2023/03/27 |
Issuu | Library | - | 10^7 * 5[48] | (Various, Multilingual) | 2000 (apprx.)[49] | Present | Written, Periodicals, Magazines | Written | Nonfiction | None | Free registration required for full access[50] | 2023/01/19 |
Smithsonian Transcription Center | Library | - | -[51] | American | 1400 (apprx.)[52] | Present | Written | Written | Nonfiction | None | None | 2023/01/22 |
Voices Remembering Slavery: Freed People Tell Their Stories | Library | 10^4 * 7[44][53] | 10^1 * 3[54] | American, African | 1932 | 1975[55] | Spoken, Interviews | Audio | General, Anthropological interviews | None | None | 2023/01/28 |
Born in Slavery: Slave Narratives from the Federal Writers' Project | Library | - | 10^3 * 2[56] | American, African[57] | 1936 | 1938 | Written | Written | Nonfiction, Biographies[57][58] | None | None | 2023/01/28 |
Corpus of Historical American English (COHA)[59] | Corpus, Tagged | 10^8 * 5 | 10^5 * 1 | American | 1820 | 2019 | Multimedia | Written | General | None | Free registration required | 2023/02/14 |
The TV Corpus | Corpus, Tagged | 10^8 * 3 | 10^4 * 8 | (Various)[60] | 1950 | 2017 | Spoken, Entertainment, Television | Written | General | None | Free registration required | 2023/03/27 |
The Movie Corpus | Corpus, Tagged | 10^8 * 2 | 10^4 * 3 | (Various)[60] | 1930 | 2018 | Spoken, Entertainment, Movies | Written | General | None | Free registration required | 2023/03/27 |
Corpus of American Soap Operas (CASO) | Corpus, Tagged | 10^8 * 1 | 10^4 * 2 | American | 2001 | 2012 | Spoken, Entertainment, Movies | Written | Fiction, Television, Soap Operas | None | Free registration required | 2023/03/27 |
Corpus of US Supreme Court Opinions | Corpus, Tagged | 10^8 * 1 | 10^4 * 3 | American | 1790 (apprx.) | 2019 (apprx.)[61] | Written | Written | Nonfiction, Law, Courts, Constitutional | None | Free registration required | 2023/02/16 |
TIME Magazine Corpus | Corpus, Tagged | 10^8 * 1 | 10^5 * 3[62] | American | 1923 | 2006 | Written, Periodicals, Magazines, Print | Written | Nonfiction, News | None | Free registration required | 2023/02/16 |
Corpus of Online Registers of English (CORE) | Corpus, Tagged | 10^7 * 5 | 10^4 * 5 | (Various)[63] | 2013 (apprx.) | 2016 (apprx.) | Written, Computer, Internet | Written | General | None | Free registration required | 2023/02/16 |
Strathy Corpus of Canadian English | Corpus, Tagged | 10^7 * 5 | 10^3 * 1 | Canadian | 1921[64] | 2011[64] | Multimedia | Written | General | None | Free registration required | 2023/02/16 |
Biodiversity Heritage Library | Library | - | 10^5 * 3[65] | (Various, Multilingual) | 1400 (apprx.) | Present | Written | Written | Nonfiction, Academic, Biology | None | None | 2023/02/23 |
African American Writers, 1892-1912 (AAW) | Corpus, Untagged | 10^5 * 5 | 10^0 * 8 | American, African | 1892 | 1912 | Written | Written | General | None | None | 2023/03/15 |
Children's Literature (ChiLit) | Corpus, Untagged | 10^6 * 4 | 10^1 * 7 | (Unclear)[66] | ? | ? | Written | Written | Fiction, Children | None | None | 2023/03/15 |
The Philadelphia Neighborhood Corpus of LING560 Studies (PNC)[67] | Corpus | 10^6 * 2 | 10^2 * 3 | American | 1972 | Present?[68] | Spoken, Interviews | Written | (Unclear) | Restrictions on excerpt size[69] | Yes[70] | 2023/03/15 |
British Pathé[71] | Library | - | 10^5 * 2 | British | 1896 | 1984 | Spoken, Formal | Audio-visual | Nonfiction, News | None? | None | 2023/04/06 |
Newspapers.com | Library | - | 10^5 * 8 | (Various)[72] | 1690 | Present | Written, Periodicals, Newspaper, Print | Written | Nonfiction, News | None | Wikipedia Library access available. Paid subscription otherwise required. Free trials are available. | 2023/04/30 |
NewspaperArchive | Library | - | 10^7 * 2[45][73] | (Various, Multilingual)[74] | 1607 | Present | Written, Periodicals, Newspaper, Print | Written | Nonfiction, News | None | Wikipedia Library access available. Paid subscription otherwise required. Free trials are available. | 2024/06/16 |
PressReader | Library | - | ? | (Various) | ? | Present | Written, Periodicals, Newspaper, Print | Written | Nonfiction, News | None | Some snippets freely visible, most content requires paid subscription. Free trials are available. | 2023/05/31 |
ProQuest | Library | - | ? | (Various) | ? | Present | Written, Periodicals, Newspaper, Print | Written | Nonfiction, News | None | Wikipedia Library access available. Some snippets freely visible, most content requires paid subscription. Free trials are available. | 2023/05/31 |
Welsh Newspapers | Library | - | ?[75] | Welsh,[76] Multilingual | 1804 | 1919 | Written, Periodicals, Newspaper, Print | Written | Nonfiction, News | None? | None | 2023/08/08 |
Welsh Journals | Library | - | ?[77] | Welsh, Multilingual | 1735 | 2007 | Written, Periodicals, Print | Written | General | None? | None | 2023/08/08 |
Crime and Punishment Database | Library | - | - | English?[78] | 1730 | 1830 | Written, Formal, Court Records | Written | Nonfiction, Law, Courts, Criminal | None? | None | 2023/08/08 |
American Archive of Public Broadcasting | Library | - | 10^5 * 1[79] | (Various, Multilingual)[43] | 1931[80] | Present | Spoken, esp. Formal | Audio-visual | General, esp. Nonfiction | None | None, additional content available on-site at GBH or the Library of Congress. | 2023/11/01 |
Buckeye Speech Corpus | Corpus, Tagged | 10^6 * 3 | 10^2 * 4 | American | 1999 | 2000 | Spoken, Interviews | Audio, Written | General, Sociolinguistic interviews[81] | Restriction to educational and research use only | Free registration required | 2024/02/19 |
Westminster Detective Library | Library | 10^7 * 5[44][82] | 10^4 * 2[82][83] | American | 1818 | 1891 | Written, Periodicals, Newspapers, Print[84] | Written | Fiction, Short Stories, Detective Stories | None | None | 2024/02/26 |
Usenet Archive (UTZOO Wiseman/Zach Barth) | Social Media | - | 10^6 * 2[85] | (Various) | 1981 | 1991 | Written, Computer, Social Media, Usenet | Written | General, esp. Nonfiction | None | None | 2024/03/20 |
Searchids.com[86] | Library[87] | 10^7 * 7[88] | 10^7 * 2[89] | (Various) | 2006 | 2006 | Written, Computer, Internet Searches | Written | General | Restriction to non-commercial research use only[90] | None | 2024/04/11 |
Freiburg Corpus of English Dialects (FRED) - Interactive Database[91] | Corpus, Untagged[92] | 10^6 * 1[93] | 10^2 * 1[93] | British (various dialects) | 1970[93] | 2000[93][94] | Spoken, Interviews | Audio, Written | Nonfiction, History, Oral History[93] | None | None | 2024/05/09 |
MTSamples.com[95][96] | Library | 10^6 * 3[97] | 10^3 * 5[98] | (Various?) | 2007[99] | 2023 | Written, Computer | Written | Nonfiction, Medicine | Requires attribution[100] | None | 2024/12/16 |
Non-English
Name | Language | Language Code | Resource Type | Size in words | Size in texts | Start year | End year | Original medium | Available medium | Genre | Use restrictions | Access restrictions | Date of entry update |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Czech National Corpus[101] | Czech | cs | Corpus, Tagged | ? | ? | ? | ? | ? | Multimedia | General | None? | None | 2024/06/18 |
Polish National Corpus[101] | Polish | pl | Corpus, Tagged | 10^9 * 2 | ? | ? | ? | ? | Written | General | None? | None | 2023/02/12 |
Russian National Corpus[101] | Russian | ru | Corpus, Tagged | 10^9 * 2[102] | 10^6 * 5[102] | 1100 | Present | Multimedia | Written | General | Restriction to non-commercial linguistic use only[103] | None | 2023/02/12 |
Turkish National Corpus[104] | Turkish | tr | Corpus, Tagged? | 10^7 * 5 | 10^3 * 6 | 1990 | 2009 | Written[105] | Written | General | Restriction to educational use only[106] | Free registration required | 2023/02/12 |
Bruno Corpus[107] | Spanish | es | Corpus, Untagged | 10^6 * 1 | 10^2 * 5 | ? | 2010 (apprx.) | Written | Written? | General | None | None | 2023/02/12 |
Braun Corpus[107] | German | de | Corpus, Untagged | 10^6 * 1 | 10^2 * 5 | ? | 2008 (apprx.) | Written | Written? | General | None | None | 2023/02/12 |
Corpus del Español: Genre/Historical | Spanish | es | Corpus, Tagged | 10^8 * 1 | 10^4 * 1 | 1200 (apprx.) | 2000 (apprx.) | Multimedia, esp. Written | Written | General | None | Free registration required | 2023/03/24 |
Corpus del Español: Web/Dialects | Spanish[108][109] | es | Corpus, Tagged | 10^9 * 2 | 10^6 * 2 | 2010 (apprx.) | 2014 | Written, Computer, Internet | Written | General | None | Free registration required | 2023/03/24 |
Corpus del Español: NOW | Spanish[108][110] | es | Corpus, Tagged | 10^9 * 7 | 10^7 * 1 | 2012 | 2019 | Written, Computer, Internet | Written | Nonfiction, News | None | Free registration required | 2023/03/24 |
Corpus del Español del Siglo XXI (CORPES)[111] | Spanish | es | Corpus, Tagged | 10^8 * 4 | 10^5 * 4 | 2001 | 2022 | Multimedia, esp. Written | Multimedia, esp. Written | General | None? | None | 2023/03/24 |
Lemko and Karpatska Rus’ Archive[112] | Carpathian Rusyn | rue | Library | - | 10^3 * 2 | 1928 | 1989 | Written, Periodicals, Newspaper, Print | Written | Nonfiction, News | None? | None | 2024/06/18 |
Spauda[112] | Lithuanian | lt | Library | - | ? | 1886 | 2015 | Written, Periodicals, Newspaper, Print | Written | Nonfiction, News | None? | None | 2023/04/04 |
Gallica | French | fr | Library | - | 10^7 | ? | ? | Written, Periodicals, Newspaper, Print | Written | General, esp. Nonfiction, News | None | None | 2023/05/31 |
RetroNews | French | fr | Library | - | ? (>10^6 * 3) | 1631 | 1951 | Written, Periodicals, Newspaper, Print | Written | Nonfiction, News | None | None | 2023/05/31 |
The Database of Early Cantonese Bible | Cantonese | yue | Corpus, Untagged? | - | 10^0 * 7 | 1863 | 1927 | Written, Religious Text | Written | Religious, Christianity, Bible Passages | None? | None | 2023/12/10 |
The Database of Early Christian Literature | Cantonese | yue | Corpus, Untagged? | - | 10^0 * 5 | 1845 (apprx.) | 1906 | Written, Books, Print | Written | Religious, Christianity | None? | None | 2023/12/10 |
A Comprehensive Edition of Tocharian Manuscripts (CEToM) | Tocharian B, Tocharian A | txb, xto | Corpus, Tagged | 10^5 * 2[113] | 10^3 * 2[114] | 500 (apprx.)[115] | 700 (apprx.)[115] | Written | Written | General, esp. Religious, Buddhism | None? | None | 2024/05/05 |
Glossary
The following is a brief explanation of how various terms are used in describing and categorizing the corpora on this page.
- Access restrictions: Any barriers to accessing the resource's contents, such as registration or paying a subscription. A number of resources can be accessed through the Wikipedia Library for free.
- Apprx.: "Approximately", used to indicate that a date or quantity was estimated or not exactly known when inputted, but a best guess was given.
- Available medium: The format through which the language can be accessed in the resource, such as written text or speech recorded in a video.
- Esp.: "Especially", used to qualify the most common quality of a corpus, even if there are notable exceptions.
- Hyphen (-): The symbol "-" is used in tables for information about a corpus that cannot be readily determined or approximated.
- Library: Collection of texts gathered with a wide net and without linguistics work particularly in mind. It must be possible to search the contents of these texts.
- Original Medium: The way the language was originally produced, whether it was spoken, written, etc.
- Question mark (?): The symbol "?" is used in tables for information about a corpus that has not yet been determined, but probably could be.
- Social media: A live website or other online center for mass user communication, or the attempt at a near complete archive of such. If the resource is an archive with a particular focus, then it is considered a library or corpus.
- Re-use restrictions: Unique restrictions on the distribution of the resource's contents beyond general copyright law, in particular restrictions on commercial use or restrictions to academic users only. This restriction is particularly relevant to Wiktionary, where all content must be redistributable commercially per Wiktionary's CC BY-SA 4.0 license.
- Strikethrough (
- Tagged Corpus: Collection of texts gathered within a specific scope with linguistics work at least partly in mind. The contents of the texts are marked by part of speech, meaning, pragmatics, or any other method.
- Text: A continuous use of language published, released, or spoken as a coherent work. This could be a forum post in a thread, a book, an issue of a magazine, or a speech.
- Untagged Corpus: Collection of texts gathered within a specific scope with linguistics work at least partly in mind. The contents of the texts are not marked by part of speech, meaning, pragmatics, or any other method.
Other lists and databases
Name | Language | Language Code | Size in corpora | Date of entry update |
---|---|---|---|---|
Corpus Resource Database (CoRD) | Translingual, esp. English | mul, en | 10^2 * 1 | 2023/02/13 |
Czech National Corpus KonText interface | Translingual | mul | 10^3 * 1[116] | 2023/02/13 |
English-Corpora.org | English | en | 10^1 * 2 | 2023/02/13 |
Leipzig Corpora Collection | Translingual | mul | 10^3 * 1 | 2023/02/13 |
Lextutor Web Concordance English | English | en | 10^1 * 5 | 2023/02/13 |
Lextutor Web Concordance French | French | fr | 10^1 * 2 | 2023/02/13 |
LINDAT/CLARIAH-CZ Corpora | Translingual | mul | 10^2 * 7 | 2023/02/13 |
Linguistic Data Consortium (LDC) | Translingual | mul | 10^3 * 1 | 2023/02/13 |
Martin Weisser's On-line Corpora of English | Translingual, esp. English | mul, en | 10^1 * 2 | 2023/02/13 |
SketchEngine | Translingual | mul | 10^1 * 2[117] | 2023/02/13 |
University of Warwick list of free online corpora | English | en | 10^1 * 2 | 2023/02/13 |
University of Edinburgh Scots and Scottish English corpora | Scots, English | sco, en | 10^1 * 3 | 2023/02/13 |
Translingual | mul | 10^3 * 2 | 2024/04/22 | |
CLARIN.SI Online Concordancers | Translingual, esp. Slovene | mul, sl | 10^2 * 2 | 2023/02/26 |
CLARIN.SI Corpus Repository | Translingual, esp. Slovene | mul, sl | 10^2 * 2 | 2023/02/26 |
CLARINO Corpuscle | Translingual, esp. Norwegian | mul, no | 10^1 * 6 | 2023/02/26 |
CLARINO Corpus Repository | Translingual, esp. Norwegian | mul, no | 10^1 * 4 | 2023/02/26 |
Online Resources for African American Language (ORAAL), external data sources | English | en | 10^1 * 1 | 2023/03/15 |
Online Resources for African American Language (ORAAL), supplements | English | en | 10^0 * 4 | 2024/05/08 |
Corpus Linguistics in Context (CLiC) | English | en | 10^0 * 5 | 2023/03/15 |
The Spanish Corpus | Spanish | es | 10^0 * 4 | 2023/03/24 |
Pennsylvania State University scripts and transcripts of popular film, TV, and sports | English[119] | en | 10^1 * 2 | 2023/04/02 |
/r/Screenwriting Guide to Finding Scripts Online | English[119] | en | 10^1 * 2 | 2023/04/02 |
BBC.com[120] | Translingual | mul | 10^1 * 3 | 2024/04/22 |
Corpus4U.org[121][122] | English, Chinese | en, zh | 10^2 * 2 | 2023/06/17 |
Beijing Foreign Studies University CQPweb[123] | Translingual | mul | 10^2 * 2 | 2023/06/17 |
Lancaster University CQPweb | Translingual, esp. English | mul, en | 10^2 * 1 | 2023/06/17 |
Hong Kong University of Science and Technology Resources for Chinese Linguistics | Chinese, esp. Cantonese | zh, yue | 10^0 * 3 | 2023/12/10 |
PolyU Corpus of Spoken Chinese, links to other corpora and databases | Translingual, esp. Chinese | mul, zh | 10^2 * 1 | 2024/01/13 |
Duke University list of collections of African American oral histories | English | en | 10^1 * 1 | 2024/05/08 |
OPUS Open Parallel Corpus Collection | Translingual | mul | 10^3 * 1 | 2024/06/18 |
OPUS Multilingual Search Interface | Translingual | mul | 10^2 * 4 | 2024/06/18 |
See also Wiktionary:Searchable external archives and Wiktionary:Quotations/Resources.
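The sizes in the tables on this page are written in a "10^E * M" notation (for example, "10^3 * 2" for roughly two thousand), sometimes followed by footnote markers. For anyone who wants to sort or compare entries programmatically, the following is a minimal Python sketch of how such a cell could be converted to a plain number. The function name `parse_size` and the regular expression are illustrative assumptions, not part of any Wiktionary tooling.

```python
import re
from typing import Optional

# Matches the page's "10^E * M" size notation, optionally followed by
# footnote markers such as "[116]". (Pattern is an assumption based on
# the cells that appear in the tables.)
SIZE_PATTERN = re.compile(r"^\s*10\^(\d+)\s*\*\s*(\d+)\s*(?:\[\d+\])*\s*$")


def parse_size(cell: str) -> Optional[int]:
    """Convert a size cell like '10^3 * 2' to an integer.

    Returns None for cells that do not use the notation, such as the
    "-" and "?" placeholders described in the glossary.
    """
    match = SIZE_PATTERN.match(cell)
    if match is None:
        return None
    exponent = int(match.group(1))
    multiplier = int(match.group(2))
    return multiplier * 10 ** exponent
```

For instance, `parse_size("10^2 * 7")` yields 700, while `parse_size("?")` yields `None`. Because the listed sizes are rounded to one significant figure, the results are only order-of-magnitude estimates.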
Notes
- ↑ 1.0 1.1 1.2 Specifically Australia, Bangladesh, Canada, Ghana, Great Britain, Hong Kong, India, Ireland, Jamaica, Kenya, Malaysia, New Zealand, Nigeria, Pakistan, Philippines, Singapore, South Africa, Sri Lanka, Tanzania, the United States
- ↑ 2.0 2.1 2.2 2.3 2.4 2.5 Note that dialect information in internet-derived corpora tends to be somewhat inaccurate because of accidental inclusion of texts in other dialects.
- ^ Specifically Australia, Canada, Ireland, New Zealand, the United Kingdom, and the United States
- ^ Note that this corpus is a sub-corpus of the NOW corpus.
- ^ Particularly speeches and interviews
- ^ As of 2024-05-10, the latest change log entry was from 2023/12/26.
- ^ Most after 2005
- ^ Most before 2017
- ^ An account is required to use the site's built in search function. Nonetheless, the forum threads can still be viewed and navigated without hindrance when logged out.
- ^ Tyler Kendall, Charlie Farrington (2023 June) “CORAAL User Guide”, in Corpus of Regional African American Language[1], retrieved 2024-05-09:
- The core components of CORAAL focus on AAL in Washington DC, […] CORAAL:DC […] is comprised of over 100 sociolinguistic interviews […] In addition to CORAAL:DC, CORAAL includes several smaller components to provide regional breadth. As of July 2021, there are six supplemental components: CORAAL:ATL, which includes 14 sociolinguistic interviews from speakers living in Atlanta, Georgia; CORAAL:DTA, which includes 40 sociolinguistic interviews from the Detroit Dialect Study collected in 1966; CORAAL:LES, comprised of 10 sociolinguistic interviews of speakers from the Lower East Side of New York City; CORAAL:PRV, which includes 15 sociolinguistic interviews from the town of Princeville, a rural African American community in central North Carolina; CORAAL:ROC, which includes 14 sociolinguistic interviews from Rochester, a city in Western Upstate New York; and CORAAL:VLD, which includes 12 speakers from Valdosta, a small city in South Georgia. […] Interviews are sociolinguistic styled interviews on topics such as life in Valdosta, personal histories, and high school sports.
- ↑ 11.0 11.1 Note that this number only represents the size of the English portion of the 2020 release of the corpus.
- ↑ 12.0 12.1 12.2 For specific details, see the "Total counts for Dependencies" file hosted in the Dependencies Downloads Index for the English part of the 2020 release which contains word and book counts for each of the years in the corpus as described on the main Ngram Viewer Exports page.
- ^ Anna L. Shparberg (2021 July) “Google Books Ngram Viewer”, in The Charleston Advisor, volume 23, number 1, Annual Reviews, pages 16–19
- ^ Note that the "British English" and "American English" sub-corpora of Google Ngram are sometimes very inaccurate or misleading because of the accidental inclusion of texts in other dialects. Consider color vs colour and airplane vs aeroplane in the "British English" corpus. In both cases, Google Ngram shows the forms as being roughly equally common from 2000-2019, which is blatantly untrue.
- ^ Madian Khabsa, C. Lee Giles (2014 May 9) “The Number of Scholarly Documents on the Public Web”, in PLOS ONE, volume 9, number 5
- ↑ 16.0 16.1 16.2 "English as a Second Language"
- ^ The corpus' fair use statement says that "if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required" which is incompatible with Wiktionary's license.
- ↑ 18.0 18.1 Audio files are available separately on TalkBank.org.
- ^ The corpus' manual can be accessed online.
- ^ The corpus' fair use statement says that "if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required" which is incompatible with Wiktionary's license.
- ^ As of 2024-05-10, the search function seems to be very slow or entirely broken. The groups and discussion threads are still manually navigable, though. The website can also still be searched using Google.
- ^ The website incorporates the UTZOO Wiseman archive.
- ^ Based on back-of-the-napkin extrapolation of data at the Internet Live Stats website.
- ↑ 24.0 24.1 24.2 24.3 "English as a Lingua Franca"
- ↑ 25.0 25.1 The website is composed of a series of search tools, including n-gram and concordance search, based on the BNC.
- ^ Selection of different tools can be done through the "Grams" menu in the top left of the page.
- ^ Full name "Brown University Standard Corpus of Present-Day American English"
- ^ The corpus is made of six "texts", but looking at their descriptions reveals that each one is actually a compilation of multiple texts. For example, "feen4" is described as "7 separate titles". Overall, the exact number of independent texts included is unclear.
- ^ Full name "International Corpus Network of Asian Learners of English, Online Version"
- ^ Shin Ishikawa (2022 April 12) “The ICNALE […] ”, in language.sakura.ne.jp[2], SAKURA Internet, archived from the original on 2022-08-14: “The ICNALE includes […] speeches and essays produced by college students […] in ten countries/ regions in Asia (China, Hong Kong, Indonesia, Japan, Korea, Pakistan, the Philippines, Singapore/ Malaysia, Taiwan, and Thailand) as well as English native speakers.”
- ↑ 31.0 31.1 The name is based on an abbreviation of the phrase "UK web as corpus".
- ^ To search the collection, select either "User-Added Text (Back)" or "User-Added Text (Front)" under "Narrow by Specific Fields", then select "contains" from the drop down just to the right and then enter the search term next to that and hit enter. Note that the overall quality and style of the data presented in the collection varies considerably.
- ^ As of 2023/03/07, 2,574 cards have the field "Writing on Card (Yes or No)?" marked as "yes". Nonetheless, there are cards that do have handwriting on them and have the field marked "no".
- ^ Approximately, the site actually lists its size as "7,600,186 phrases" (emphasis added).
- ↑ 35.0 35.1 Though not exclusively short stories, the format dominates the library.
- ^ Although MASC is technically a corpus, it is only directly available through a web browser as a library. A complete copy of MASC as a corpus can be downloaded, though, and then processed with another application.
- ^ Many, if not most, of the LISTSERVs in the catalog do not have publicly accessible archives.
- ^ The catalog describes itself (as of 28 Nov 2022) as containing 58,100 public lists, each of which contains a number of messages.
- ^ Approximately, not all items cataloged in the library are available online. In particular, it seems none of the around 300,000 speeches cataloged are available online.
- ^ Most after 1945
- ^ Number of items which are both available online and have their language marked as "English".
- ^ Approximately, based on a search of the collection for the basic words "a" and "the".
- ↑ 43.0 43.1 43.2 The number of non-English items is small.
- ↑ 44.0 44.1 44.2 Approximately, based on a statistical calculation.
- ↑ 45.0 45.1 Note that this number represents the number of newspaper issues in the archive.
- ^ Full name "Varieties of English for Specific Purposes dAtabase"
- ^ The corpus' end-user license states "Grant of the Product license entitles Licensee to use the Product for non-profit educational and/or linguistic research purposes only. [...] Licensees agree not to lease, sell, or commercially exploit the results of their searches (such as texts, concordances, metadata)." which is incompatible with Wiktionary's license.
- ^ Per https://issuu.com/about as of 2023/01/19
- ^ Issuu was founded in 2006 but includes some older publications uploaded since then; most of those are from after 1990, if not 2000.
- ^ Registration is required in order to turn "safe mode" off/show explicit search results.
- ^ Unclear. The collection is organized by "projects" which sometimes correspond to individual texts (such as diaries or funeral programs) and other times correspond to a collection of short texts (such as notes or letters). There were 11,372 projects on 2023/01/22. The length of projects is reported by the number of pages they contain. Using random sampling, it was estimated that the total length of all projects was around 2 million pages on 2023/01/22.
- ^ Most after 1800
- ^ Note that some transcripts were incomplete when this number was calculated.
- ^ Each interview in the collection, regardless of the number of parts it has, is considered one text. According to the "Faces and Voices from the Presentation" article, 26 interviews are in the collection.
- ^ Most from before 1950.
- ^ See Appendix I: Narratives in the Slave Narrative Collection by State for numerical breakdown by state
- ↑ 57.0 57.1 Norman R. Yetman (2001) “The Limitations of the Slave Narrative Collection”, in Library of Congress[3], published c. 2017
- ^ The narratives are based on interviews, but because of the lack of ground-truth audio recordings and doubts about the accuracy of the published versions of the narratives, they are categorized here as "Nonfiction, Biographies" rather than "General, Anthropological interviews" or similar.
- ^ Note that the COHA was updated in 2021.
- ↑ 60.0 60.1 Specifically "United States/Canada", "United Kingdom/Ireland", "Australia/New Zealand", and "Miscellaneous".
- ^ Note that the corpus is listed as going up to the "present", but as of 2023/02/16 the most recent section is the 2010s implying that no opinions from later decades are included.
- ^ Note that this number reflects the number of articles in the corpus, not the number of issues of TIME Magazine in the corpus.
- ^ Specifically Australia, Canada, New Zealand, the United Kingdom, and the United States
- ↑ 64.0 64.1 Note that the Queen's University page describing the corpus describes the start year as 1970 and end year as 2010 despite english-corpora.org providing a source spreadsheet which spans the years 1921 to 2011 and its corpus description page showing a time span from the 1920s to 2010s.
- ^ On the website, this number is associated with how many "volumes" are available and is listed alongside the number of "titles" (10^5 * 2) as well as the number of pages. The exact meaning of the terms "volumes" and "titles" in this context is unclear.
- ^ Note that although the corpus does explicitly mention its contents, I have not put in the effort to determine the dialect of each of the included texts.
- ^ The website for the corpus is now offline for unclear reasons, but it is presumably still possible to access the corpus by contacting the university.
- ^ The corpus' description implies that it is a continually expanding project, but in 2018 the page had not been updated in 5 years (since 2013), which may suggest the project stopped expanding around the same time.
- ^ An apparently genuine archived version of the corpus' confidentiality agreement does state "If I need to cite more than one paragraph (300 words) in a publication, I will obtain permission from the Philadelphia Neighborhood Corpus Committee".
- ^ An archive of the corpus' home page states that "only members of the research group have access".
- ^ Note that searches cover both metadata and transcripts for newsreels simultaneously.
- ^ Specifically Australia, Canada, Ireland, New Zealand, Panama, and the United Kingdom.
- ^ Derived from the "Newspaper Publication" list. A more direct view of the data as collected on 2024/06/16 can be seen in this published Google Sheet.
- ^ About 90% of the publications are based in predominantly Anglophone countries (the United States [12263], Australia [1223], the United Kingdom [811], Canada [525], Ireland [50], New Zealand [19]) while the rest are from a wide variety of countries. Information derived from the "Newspaper Publication" list. A more direct view of the data as collected on 2024/06/16 can be seen in this published Google Sheet.
- ^ Issue counts are provided for individual publications, but not for the entire collection. 12.7 million articles in English are available, though, with each issue featuring many articles.
- ^ A few publications originate from regions outside Wales, in particular three from London, one from the United States, and one from Argentina. An additional publication has no region listed though its "issuing body note" states "Published in Caernarfon by Thomas Jones", with Caernarfon being in Wales.
- ^ Issue counts are provided for individual publications, but not for the entire collection. 363 thousand pages in English are available, though, with each issue featuring many pages.
- ^ The English Wikipedia article on the Court of Great Sessions in Wales stated on 2023-08-08 that "[o]f the 217 judges who sat on its benches [...], only 30 were Welshmen". Those involved in keeping the court's records likely had a similar makeup, and so the database's dialect likely reflects England rather than Wales.
- ^ This number represents the number of recordings available online.
- ^ This date represents the earliest year specified for any recording in the archive, though that recording does not have audio. It is not immediately clear what the earliest recording with audio is. The earliest audio-only recording is from 1938.
- ^ “Buckeye Corpus Information”, in Buckeye Corpus[4], c. 2005, retrieved 2024-05-09: “After a significant amount of piloting different protocols for eliciting large amounts of unmonitored speech, a modified sociolinguistic interview format was chosen.”
- ↑ 82.0 82.1 Note that this number was calculated to include the about 25% of work listings which were placeholders on 2024/02/24 but should eventually become full entries, and to exclude the about 15% of work listings which were redirects to other listings on the same date.
- ^ Based on the fact that the list pages for browsing works display 25 works at a time and there are 78 pages to browse as of 2024/02/26.
- ^ Not explicitly stated, but browsing the collection on 2024-02-26 revealed only newspapers being cited as the source of the stories provided.
- ^ Samantha Cole (2020 October 13) “2.1 Million of the Oldest Internet Posts Are Now Online for Anyone to Read”, in Vice[5], archived from the original on 2020-10-13: “Around 2.1 million posts from between February 1981 and June 1991 from Henry Spencer's UTZOO NetNews Archive are archived at the Usenet Archive for anyone to browse.”
- ^ There is also a mirror site, Explicit-Id.com
- ^ Though the site does feature a built in search function, it is significantly limited and prone to errors. For this reason, I've classified it as a "library" rather than a "corpus". A complete copy of the original data can be downloaded (see here for details) and processed with another application, though.
- ^ From the number of queries multiplied by the average of 3.5 words per query mentioned in the scientific article that originally accompanied the data: Greg Pass, Abdur Chowdhury, Cayley Torgeson (2006 May) “A Picture of Search”, in Proceedings of the First International Conference on Scalable Information Systems, Hong Kong, page 2
- ^ Number of queries, per the README included with the data
- ^ This requirement is incompatible with Wiktionary's license.
- ^ Although the database indexes and shows results for the entirety of FRED, audio and transcripts are only viewable for the FRED Sampler (FRED-S) portion. For this reason, most of the information presented in this table is based on the FRED-S, not the complete FRED.
- ^ Although tagged transcripts can be downloaded from the database, the search function only allows for the plaintext transcripts to be searched.
- ↑ 93.0 93.1 93.2 93.3 93.4 Benedikt Szmrecsanyi, Nuria Hernández (2007) “Manual of Information to Accompany the Freiburg Corpus of English Dialects Sampler (“FRED-S”)”, in FreiDok Plus[6], archived from the original on 2013-04-02
- ^ Most before 1990.
- ^ As of 2024-12-16, the search function seems to be broken. It can still be manually searched using Google.
- ^ A scrape of the corpus from about 2018 is also available as a CSV with the registration of a free account.
- ^ Based on a word count of the 2018 scrape, which has a similar number of transcription samples in it as the live website.
- ^ According to the website, as of the last update on 2023-07-07.
- ^ Based on Wayback Machine records.
- ^ Per the landing page.
- ↑ 101.0 101.1 101.2 Note that multiple sub-corpora and related corpora can be searched on the site.
- ↑ 102.0 102.1 Note that these numbers represent the size of all the corpora on the site tallied together.
- ^ The corpus' terms FAQ states "All data published under [this website] are available exclusively for non-commercial use for research and educational purposes [...] they can only be used as sources of examples (citations) illustrating a particular linguistic phenomenon." This requirement is incompatible with Wiktionary's license.
- ^ As of 2023-02-12 the query interface was offline.
- ^ The corpus' about page states that it is specifically 98% written and 2% spoken.
- ^ The corpus' user agreement states "TUD sadece araştırma ve sunum amaçlı kullanıma açıktır ve fikri mülkiyet hakları tümüyle Sağlayıcıya aittir." (roughly, '[the corpus] is available for research and presentation purposes only and the intellectual property rights remain the sole property of the Provider.') This requirement is incompatible with Wiktionary's license.
- ↑ 107.0 107.1 This corpus was designed to imitate the English-language Brown Corpus.
- ↑ 108.0 108.1 Specifically Argentina, Bolivia, Chile, Colombia, Costa Rica, Cuba, Dominican Republic, Ecuador, Guatemala, Honduras, Mexico, Nicaragua, Panama, Paraguay, Peru, Puerto Rico, El Salvador, Spain, United States, Uruguay, Venezuela.
- ^ Note that dialect information in internet-derived corpora is usually somewhat inaccurate because of the accidental inclusion of texts in other dialects. This issue is addressed on the website with the conclusion that the "categorization is quite good".
- ^ Note that dialect information in internet-derived corpora is usually somewhat inaccurate because of the accidental inclusion of texts in other dialects. This issue was addressed for the related Web/Dialects corpus with the conclusion that the "categorization is quite good" so a similar level of quality may exist for this corpus.
- ^ Note that the CORPES is currently undergoing continuous revision and so this information may be out of date. To be specific, the information presented is for version 0.99.
- ↑ 112.0 112.1 Note that the newspapers were published in the United States.
- ^ Using the number of "total tokens" listed under "Types of complete words (including unresolved akṣaras)" on 2024-04-22.
- ^ Using the number of manuscripts publicly available on 2024-04-22.
- ↑ 115.0 115.1 George S. Lane, Douglas Q. Adams (2013 July 16) “Tocharian languages”, in Encyclopedia Britannica[7], retrieved 2024-05-05: “Documents from AD 500–700”
- ^ Approximately, it is difficult to see the full list of corpora in order to get an accurate estimate.
- ^ Note that this number reflects the number of corpora freely available. Including the corpora which require a subscription or special permission the number comes up to 722 as of 2023/0/13.
- ^ Note that the database has not been updated since 2016 and has a somewhat buggy search system.
- ↑ 119.0 119.1 Not confirmed to be English exclusively, but probably almost all English.
- ^ The BBC publishes news online in a wide variety of languages which can then be searched manually using a search engine like Google. The languages are specifically Arabic, Azeri, Bangla, Burmese, Chinese, French, Hausa, Hindi, Indonesian, Japanese, Kinyarwanda, Kirundi, Kyrgyz, Marathi, Nepali, Pashto, Persian, Portuguese, Russian, Sinhala, Somali, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, Uzbek, and Vietnamese.
- ^ The forum is primarily written in Chinese, though some posts are in English.
- ^ The section which primarily hosts links to corpuses is labeled "专题研究" (Google Translate translates this as "Special Research".)
- ^ Both user ID and password are "test" for freely available corpora.
Further reading
- Corpus linguistics on Wikipedia.
- Text corpus on Wikipedia.