Wiktionary:Corpora

This page is dedicated to listing collections of texts useful for the work of creating a dictionary. These collections are often known as "corpora" or less commonly "corpuses". Many of them feature functions like full-text search, term frequency information and collocation search.

For a more user-friendly introduction to some of the most prominent corpora, as well as other resources like dictionaries, see Wiktionary:Quotations/Resources. Another page, Wiktionary:Searchable external archives, focuses more specifically on resources that can reliably provide citations meeting Wiktionary's criteria for inclusion.

Note that corpora containing text in multiple languages, but in which English text makes up a significant portion, are listed in the English table below, with the word "Multilingual" included in their "Dialect" column.

If there are any other resources that you know of which aren't listed here, please do add them or suggest them on the talk page.

English

English corpora table
Name	Resource Type	Size in words	Size in texts	Dialect	Start year	End year	Original Medium	Available Medium	Genre	Re-use restrictions	Access restrictions	Date of entry update
News on the Web (NOW)	Corpus, Tagged	10^10 * 2	10^7 * 3	(Various)^[1]^[2]	2010	Present	Written, Computer, Internet	Written	Nonfiction, News	None	Free registration required	2022/10/31
iWeb: The Intelligent Web-based Corpus	Corpus, Tagged	10^10 * 1	10^7 * 2	(Various)^[3]^[2]	2017	2017	Written, Computer, Internet	Written	General, esp. Nonfiction	None	Free registration required	2022/10/31
Global Web-Based English (GloWbE)	Corpus, Tagged	10^9 * 2	10^5 * 2	(Various)^[1]^[2]	2012	2013	Written, Computer, Internet	Written	General, esp. Nonfiction	None	Free registration required	2022/10/31
Wikipedia Corpus	Corpus, Tagged	10^9 * 2	10^6 * 4	(Various)	2014	2014	Written, Computer, Internet	Written	Nonfiction, Encyclopedia	None	Free registration required	2022/10/30
Coronavirus Corpus^[4]	Corpus, Tagged	10^9 * 2	10^6 * 2	(Various)^[1]^[2]	2020	2022	Written, Computer, Internet	Written	Nonfiction, News, COVID-19	None	Free registration required	2024/05/10
Corpus of Contemporary American English (COCA)	Corpus, Tagged	10^9 * 1	10^5 * 5	American	1990	2019	Multimedia	Written	General, esp. Nonfiction	None	Free registration required	2023/03/27
Early English Books Online (EEBO)	Corpus, Tagged	10^8 * 8	10^4 * 3	British	1470 (apprx.)	1690 (apprx.)	Written, Books, Print	Written	General	None	Free registration required	2022/10/30
Early English Books Online (EEBO) TCP	Corpus, Untagged	-	10^4 * 6	British	1475	1700	Written, Books, Print	Written	General	None	None	2022/10/31
Early English Books Online (EEBO, V2)	Corpus, Untagged	10^8 * 6	10^4 * 1	British	1470 (apprx.)	1690 (apprx.)	Written, Books, Print	Written	General	None	Free registration required	2022/11/02
Filmot	Library	-	10^8 * 5	(Various, Multilingual)	2005 (apprx.)	Present	Spoken, General	Audio-visual	General, esp. Nonfiction	None	None	2022/10/30
YouGlish	Library	-	10^8 * 1	(Various)	2005 (apprx.)	Present	Spoken, Formal^[5]	Audio-visual	Nonfiction	None	None	2022/10/30
TED Corpus Search Engine (TCSE)	Corpus, Tagged	10^7 * 1	10^3 * 5	(Various)	2007	2023^[6]	Spoken, Formal, Speeches	Audio-visual	Nonfiction	None	None	2022/10/30
Archive-It Collections	Library	-	10^6 * 2	(Various)	1996	Present	Written, Computer, Internet	Written	General, esp. Nonfiction	None	None	2022/10/30
ACL Anthology Reference Corpus (ARC)	Corpus, Tagged	10^7 * 6	10^4 * 2	(Various)	1979	2015	Written, Periodicals, Journals	Written	Nonfiction, Academic, NLP	None	None	2022/10/30
COVID-19 Open Research Dataset (CORD-19)	Corpus, Tagged	10^9 * 3	10^5 * 7	(Various)	1922^[7]	2020^[8]	Written, Periodicals, Journals	Written	Nonfiction, Academic	None	None	2022/10/30
EcoLexicon English	Corpus, Tagged	10^7 * 2	10^3 * 2	(Various)	1973	2016	Written	Written	Nonfiction, Academic, Environment	None	None	2022/10/30
Lipstick Alley	Social Media	-	-	American, African	2000	Present	Written, Computer, Social Media, Forum	Written	General, esp. Nonfiction, Celebrity News	None	Free registration required^[9]	2023/06/23
Corpus of Regional African American Language (CORAAL)	Corpus, Untagged	10^6 * 1	10^2 * 2	American, African	1968	2017	Spoken, Interviews	Audio	General, Sociolinguistic interviews^[10]	None	None	2022/10/31
Google Trends	Trends	-	-	(Various, Multilingual)	2004	Present	Written, Computer, Internet Searches	Written	General	None	None	2022/10/31
Google Ngrams	Trends	10^7 * 2^[11]^[12]	10^12 * 2^[11]^[13]^[12]	(Various, Multilingual)^[14]	1470^[12]	Present	Multimedia	Written	General	None	None	2022/10/31
Google Books	Library	-	10^7 * 4	(Various, Multilingual)	1400 (apprx.)	Present	Multimedia	Written	General	None	None	2022/10/31
Google Scholar	Library	-	10^8 * 1^[15]	(Various, Multilingual)	1700 (apprx.)	Present	Written, Periodicals, Journals	Written	Nonfiction, Academic; Law	None	None	2023/01/19
Corpus of Middle English Prose and Verse	Corpus, Untagged	-	10^2 * 3	Middle	1000	1500	Written, Books, Print	Written	General, esp. Nonfiction	None	None	2022/10/31
Michigan Corpus of Upper-level Student Papers (MICUSP)	Corpus, Untagged	10^6 * 3	10^2 * 8	(Various, ESL^[16])	2002	2009	Written, College Work	Written	Nonfiction, Academic	Restrictions on commercial use^[17]	None	2022/12/28
Michigan Corpus of Academic Spoken English (MiCASE)^[18]^[19]	Corpus, Untagged	10^6 * 2	10^2 * 2	American (mostly)	1998	2001	Spoken, Formal, Speeches	Audio,^[18] Written	Nonfiction, Academic	Restrictions on commercial use^[20]	None	2022/10/31
British Academic Spoken English Corpus (BASE)	Corpus, Tagged	10^6 * 1	10^2 * 2	British	1998	2005	Spoken, Formal, Speeches	Written	Nonfiction, Academic	None	None	2022/11/02
British Academic Written English Corpus (BAWE)	Corpus, Tagged	10^6 * 7	10^3 * 3	British	2000	2007	Written, College Work	Written	Nonfiction, Academic	None	None	2022/11/02
Public Papers of the Presidents of the United States	Library	-	10^2 * 1	American	1938	2002	Multimedia	Written	Nonfiction, Politics	None	None	2023/06/17
Google Groups	Social Media	-	-	(Various)	1981	2024	Written, Computer, Social Media, Usenet	Written	General, esp. Nonfiction	None	None	2024/03/20
UsenetArchives.com^[21]	Social Media	-	10^8 * 7	(Various)	1981^[22]	Present?	Written, Computer, Social Media, Usenet	Written	General, esp. Nonfiction	None	None	2024/03/20
Narkive	Social Media	-	10^8 * 3	(Various)	1990 (apprx.)	Present	Written, Computer, Social Media, Usenet	Written	General, esp. Nonfiction	None	None	2024/03/20
Europeana	Library	-	10^7 * 2	(Various, Multilingual)	0400 (apprx.)	Present	Multimedia	Multimedia	General	None	None	2022/10/31
Internet Archive	Library	-	10^7 * 6	(Various, Multilingual)	-	Present	Multimedia	Multimedia	General	None	Free registration required	2022/10/31
Eighteenth Century Collections Online (ECCO) TCP	Corpus, Untagged	-	10^3 * 2	(Various, Multilingual)	1701	1800	Written, Books, Print	Written	General	None	None	2022/10/31
Old Bailey Corpus (OBC) 2.0	Corpus, Tagged	10^7 * 4	10^6 * 1	British (various dialects)	1720	1913	Spoken, Formal, Court Proceedings	Written	Nonfiction, Law, Courts, Criminal	None	Free registration required	2022/10/31
Old Bailey Proceedings Online	Corpus, Untagged	10^8 * 1	-	British (various dialects)	1674	1913	Spoken, Formal, Court Proceedings	Written	Nonfiction, Law, Courts, Criminal	None	None	2022/10/31
Royal Society Corpus (RSC) 6.0.1 Open	Corpus, Tagged	10^7 * 8	10^4 * 2	British	1665	1920	Written, Periodicals, Journals, Print	Written	Nonfiction, Academic	None	Free registration required	2022/10/31
Royal Society Corpus (RSC) 6.0.4 Open with Topics	Corpus, Tagged	10^8 * 3	10^4 * 2	British	1665	1920	Written, Periodicals, Journals, Print	Written	Nonfiction, Academic	None	Free registration required	2022/10/31
Twitter	Social Media	10^12 * 3^[23]	-	(Various, Multilingual)	2005	Present	Written, Computer, Social Media, Twitter	Written	General, esp. Nonfiction	None	None	2022/10/31
~~SocialGrep (Reddit) Corpora~~^[24]	Corpus, Untagged	-	10^7 * 9	(Various)	2005 (apprx.)	Present?	Written, Computer, Social Media, Reddit	Written	General, esp. Nonfiction	None	None	2025/02/20
Europarl 7 Sample, English	Corpus, Tagged	10^7 * 2	10^3 * 8	International/ELF^[25]	2007	2011	Spoken, Formal, Legislative Proceedings	Written	Nonfiction, Law, Legislatures	None	None	2022/11/01
Europarl 3, English	Corpus, Tagged	10^7 * 2	10^2 * 7	International/ELF^[25]	1996	2006	Spoken, Formal, Legislative Proceedings	Written	Nonfiction, Law, Legislatures	None	Free registration required	2022/11/01
TARA	Corpus, Tagged	10^5 * 9	10^4 * 2	British	2006 (apprx.)	2006 (apprx.)	Written, Periodicals, Newspapers, Print	Written	Nonfiction, News	None	Free registration required	2022/11/01
British National Corpus (BNC)	Corpus, Tagged	10^8 * 1	10^3 * 4	British	1960	1993	Multimedia	Written	General	None	Free registration required	2022/11/01
British National Corpus (BNC) Sampler	Corpus, Tagged	10^6 * 2	10^2 * 2	British	1975	1993	Multimedia	Written	General	None	Free registration required	2022/11/01
Phrases in English (BNC)^[26]^[27]	Corpus, Tagged	10^8 * 1	10^3 * 4	British	1960	1993	Multimedia	Written	General	None	None	2023/02/12
Just The Word (BNC)^[26]	Corpus, Tagged	10^8 * 1	10^3 * 4	British	1960	1993	Multimedia	Written	General	None	None	2023/02/12
British English 2006 (BE06)	Corpus, Tagged	10^6 * 1	10^2 * 5	British	2003	2008	Written	Written	General	None	Free registration required	2022/11/01
American English 2006 (AME06)	Corpus, Tagged	10^6 * 1	10^2 * 5	American	2006 (apprx.)	2006 (apprx.)	Written	Written	General	None	Free registration required	2022/11/22
Hansard Corpus (British Parliament)	Corpus, Tagged	10^9 * 2	10^6 * 8	British	1803	2005	Spoken, Formal, Legislative Proceedings	Written	Nonfiction, Law, Legislatures	None	Free registration required	2022/11/01
British Parliament Hansard	Library	-	-	British	1800	Present	Spoken, Formal, Legislative Proceedings	Written	Nonfiction, Law, Legislatures	None	None	2022/11/01
Australian Parliament Hansard	Library	-	-	Australian	1901	Present	Spoken, Formal, Legislative Proceedings	Written	Nonfiction, Law, Legislatures	None	None	2022/11/01
Canadian House of Commons Hansard	Library	-	-	Canadian	2002	Present	Spoken, Formal, Legislative Proceedings	Written	Nonfiction, Law, Legislatures	None	None	2022/11/01
New Zealand Parliament Hansard	Library	-	-	New Zealand	1854	Present	Spoken, Formal, Legislative Proceedings	Written	Nonfiction, Law, Legislatures	None	None	2022/11/01
GovInfo (United States)	Library	-	-	American	1793	Present	Multimedia	Written	Nonfiction, Law	None	None	2022/11/01
Transgender Usenet Archive (TUA)	Corpus, Untagged	-	10^5 * 4	(Various)	1994	2013	Written, Computer, Social Media, Usenet	Written	General, Transgender Topics	None	None	2022/11/01
~~Science Forums~~^[28]	~~Social Media~~	-	~~10^5 * 1~~	~~(Various)~~	~~1992~~	~~2014~~	~~Written, Computer, Social Media, BBS~~	~~Written~~	~~Nonfiction, Science~~	~~None~~	~~None~~	~~2025/04/16~~
TextFiles.com	Library	-	-	(Various)	1980 (apprx.)	1995 (apprx.)	Multimedia	Multimedia	General, esp. Nonfiction, Technology	None	None	2022/11/01
LDS General Conference Corpus	Corpus, Tagged	10^7 * 3	10^4 * 1	American	1851	Present	Spoken, Formal, Speeches	Written	Religious, Latter Day Saints	None	None	2022/11/01
FidoNet Echomail Archive	Social Media	-	-	(Various)	1990 (apprx.)	2016 (apprx.)	Written, Computer, Social Media, FidoNet	Written	General, esp. Nonfiction, Technology	None	None	2022/11/01
FidoNet HolySmoke Archive	Library	-	10^5 * 4	(Various)	1993	2004	Written, Computer, Social Media, FidoNet	Written	Nonfiction, Religion	None	None	2022/11/02
Dúchas Project	Library	-	10^6 * 2	Irish	1900 (apprx.)	1940 (apprx.)	Multimedia	Written	Fiction, Folklore	None	None	2022/11/02
Freiburg-Brown Corpus of American English (FROWN)	Corpus, Tagged	10^6 * 1	10^2 * 5	American	1992	1992	Written, Print	Written	General	None	Free registration required	2022/11/02
Brown Corpus Family	Corpus, Tagged	10^6 * 1	10^3 * 2	-	-	-	Written, Print	Written	General	None	Free registration required	2022/11/02
Brown Family (C8 tags)	Corpus, Tagged	10^6 * 6	10^3 * 2	(Various)	1931	1991	Written, Print	Written	General	None	Free registration required	2022/11/02
Brown Corpus^[29]	Corpus, Tagged	10^6 * 1	10^3 * 1	American	1961	1961	Written, Print	Written	General	None	None	2022/11/02
Corpus of English Dialogues	Corpus, Tagged	10^6 * 1	10^2 * 2	British(?)	1560	1760	Multimedia	Written	General, Dialogues	None	Free registration required	2022/11/02
Florence Early English Newspapers (FEEN)	Corpus, Tagged	10^5 * 3	-^[30]	British(?)	1620	1649	Written, Periodicals, Newspapers, Print	Written	Nonfiction, News	None	None	2023/03/27
Transhistorical Corpus of Written English	Corpus, Tagged	10^5 * 5	10^2 * 8	(Various)	1405	2019	Written	Written	General	None	None	2022/11/02
Linguistic Landscape Corpus	Corpus, Tagged	10^6 * 5	10^2 * 6	(Various)	1997	2018	Written	Written	Nonfiction, Academic	None	Free registration required	2022/11/02
ICNALE Online^[31]	Corpus, Tagged	10^6 * 4	10^4 * 2	(Various, ESL^[16])^[32]	2007 (apprx.)	2022 (apprx.)	Multimedia, College Work	Multimedia	Nonfiction, Academic	None	None	2022/11/02
European Football Championship Interpreting Corpus (EFCIC)	Corpus, Tagged	10^4 * 1	10^1 * 1	-	2020	2020	Spoken, Entertainment, Interpretation, Interview	Written	Nonfiction, Sports	None	None	2022/11/02
UkWac Complete^[33]	Corpus, Tagged	10^9 * 2	10^6 * 3	British^[2]	2005 (apprx.)	2007 (apprx.)	Written, Computer, Internet	Written	General	None	None	2022/11/02
UkWac Small^[33]	Corpus, Tagged	10^7 * 8	10^5 * 1	British^[2]	2005 (apprx.)	2007 (apprx.)	Written, Computer, Internet	Written	General	None	None	2022/11/02
Postcard Archive @ Florida State University^[34]	Library	-	10^3 * 3^[35]	(Various)	1829 (apprx.)	2016 (apprx.)	Written, Postcards	Written	Nonfiction, Postcards	None	None	2022/11/06
PlayPhrase.me	Corpus, Tagged	-	10^6 * 8^[36]	(Various)	1970 (apprx.)	Present?	Spoken, Entertainment, Movies	Audio-visual	Fiction, Movies	None	None	2022/11/07
European Union DGT-UD: English	Corpus, Tagged	10^8 * 1	10^4 * 5	International/ELF^[25]	1948 (apprx.)	2016	Written, Legislative Acts	Written	Nonfiction, Law, Legislatures	None	None	2022/11/16
Opus-MontenegrinSubs 1.0: English	Corpus, Tagged	10^5 * 5	10^2 * 2	(Various)	2007	2013	Spoken, Entertainment, Television	Written	Fiction, Television	None	None	2022/11/16
Archive of Our Own (AO3)	Library	-	10^7 * 1	(Various)	2007	Present	Written, Computer, Internet	Written	Fiction, Short Stories, Fan Works^[37]	None	None	2022/11/22
SCP Foundation	Library	-	10^3 * 2	(Various)	2007	Present	Written, Computer, Internet	Written	Fiction, Short Stories, Sci-Fi^[37]	None	None	2022/11/22
NEWS-GB (British newspapers)	Corpus, Tagged	10^8 * 2	-	British	2004 (apprx.)	2004 (apprx.)	Written, Print	Written	Nonfiction, News	None	None	2022/11/22
INTERNET-EN	Corpus, Tagged	10^8 * 2	10^4 * 5	(Various)	2006 (apprx.)	2006 (apprx.)	Written, Computer, Internet	Written	General	None	None	2022/11/22
BLOGS-EN (Political blogs)	Corpus, Tagged	10^8 * 5	-	(Various)	2008 (apprx.)	2008 (apprx.)	Written, Computer, Internet	Written	Nonfiction, Politics	None	None	2022/11/22
Manually Annotated Sub-Corpus (MASC)	Library^[38]	10^5 * 5	10^2 * 4	American	1990 (apprx.)	2010 (apprx.)	Multimedia	Written	General	None	None	2022/11/23
Lancaster Newsbooks Corpus (1654 part)	Corpus, Tagged	10^5 * 9	10^2 * 2	British	1653	1654	Written, Periodicals, Newspaper, Print	Written	Nonfiction, News	None	Free registration required	2022/11/23
The Mail Archive	Library	-	10^8 * 2	(Various)	1990	Present	Written, Computer, Mailing List	Written	Nonfiction, esp. Coding and Computers	None	None	2022/11/26
CataList (LISTSERV catalog)^[39]	Library	-	-^[40]	(Various)	1990 (apprx.)	Present	Written, Computer, Mailing List	Written	Nonfiction	None	None	2022/11/28
United Nations Digital Library	Library	-	10^5 * 7^[41]	(Various, International/ELF^[25])	1875^[42]	Present	Multimedia	Multimedia	Nonfiction, Politics	None	None	2022/11/29
Genius.com	Library	-	-	(Various, Multilingual)	1900 (apprx.)	Present	Spoken, Entertainment, Music	Written	General, Music	None	None	2022/12/06
Chronicling America	Library	-	-	American	1777	1963	Written, Periodicals, Newspaper, Print	Written	Nonfiction, News	None	None	2022/12/06
Library of Congress	Library	-	10^6 * 3^[43]	(Various, Multilingual)	1470 (apprx.)	Present	Multimedia	Multimedia	General	None	None	2022/12/06
World Radio History	Library	-	10^5 * 1^[44]	(Various, Multilingual)^[45]	1900 (apprx.)	Present	Written, Periodicals, Magazines, Print	Written	Nonfiction, Radio; Television; Music	None	None	2022/12/06
Google News Newspapers Archive	Library	-	10^6 * 6^[46]^[47]	(Various, Multilingual)^[45]	1738 (apprx.)	2009	Written, Periodicals, Magazines, Print	Written	Nonfiction, News	None	None	2022/12/14
VESPA^[48]	Corpus	10^6 * 2	10^2 * 9	International/ESL^[16]	2008 (apprx.)	2008 (apprx.)	Written, College Work	Written	Nonfiction, Academic	Restriction to non-profit educational use only^[49]	Free registration required	2022/12/28
I-EN (Internet English Corpus)	Corpus, Tagged	10^8 * 2	-	(Various)	2005	2005	Written, Computer, Internet	Written	Nonfiction, News?	None	None	2022/12/28
I-EN-CC (Internet English Creative Commons Corpus)	Corpus, Tagged	10^8 * 2	-	(Various)	2005 (apprx.)	2005 (apprx.)	Written, Computer, Internet	Written	Nonfiction, News?	None	None	2022/12/28
Springfield! Springfield!	Library	-	10^5 * 2	(Various)	1910 (apprx.)	Present	Spoken, Entertainment, Movies and Television	Written	General	None	None	2023/03/27
Issuu	Library	-	10^7 * 5^[50]	(Various, Multilingual)	2000 (apprx.)^[51]	Present	Written, Periodicals, Magazines	Written	Nonfiction	None	Free registration required for full access^[52]	2023/01/19
Smithsonian Transcription Center	Library	-	-^[53]	American	1400 (apprx.)^[54]	Present	Written	Written	Nonfiction	None	None	2023/01/22
Voices Remembering Slavery: Freed People Tell Their Stories	Library	10^4 * 7^[46]^[55]	10^1 * 3^[56]	American, African	1932	1975^[57]	Spoken, Interviews	Audio	General, Anthropological interviews	None	None	2023/01/28
Born in Slavery: Slave Narratives from the Federal Writers' Project	Library	-	10^3 * 2^[58]	American, African^[59]	1936	1938	Written	Written	Nonfiction, Biographies^[59]^[60]	None	None	2023/01/28
Corpus of Historical American English (COHA)^[61]	Corpus, Tagged	10^8 * 5	10^5 * 1	American	1820	2019	Multimedia	Written	General	None	Free registration required	2023/02/14
The TV Corpus	Corpus, Tagged	10^8 * 3	10^4 * 8	(Various)^[62]	1950	2017	Spoken, Entertainment, Television	Written	General	None	Free registration required	2023/03/27
The Movie Corpus	Corpus, Tagged	10^8 * 2	10^4 * 3	(Various)^[62]	1930	2018	Spoken, Entertainment, Movies	Written	General	None	Free registration required	2023/03/27
Corpus of American Soap Operas (CASO)	Corpus, Tagged	10^8 * 1	10^4 * 2	American	2001	2012	Spoken, Entertainment, Movies	Written	Fiction, Television, Soap Operas	None	Free registration required	2023/03/27
Corpus of US Supreme Court Opinions	Corpus, Tagged	10^8 * 1	10^4 * 3	American	1790 (apprx.)	2019 (apprx.)^[63]	Written	Written	Nonfiction, Law, Courts, Constitutional	None	Free registration required	2023/02/16
TIME Magazine Corpus	Corpus, Tagged	10^8 * 1	10^5 * 3^[64]	American	1923	2006	Written, Periodicals, Magazines, Print	Written	Nonfiction, News	None	Free registration required	2023/02/16
Corpus of Online Registers of English (CORE)	Corpus, Tagged	10^7 * 5	10^4 * 5	(Various)^[65]	2013 (apprx.)	2016 (apprx.)	Written, Computer, Internet	Written	General	None	Free registration required	2023/02/16
Strathy Corpus of Canadian English	Corpus, Tagged	10^7 * 5	10^3 * 1	Canadian	1921^[66]	2011^[66]	Multimedia	Written	General	None	Free registration required	2023/02/16
Biodiversity Heritage Library	Library	-	10^5 * 3^[67]	(Various, Multilingual)	1400 (apprx.)	Present	Written	Written	Nonfiction, Academic, Biology	None	None	2023/02/23
African American Writers, 1892-1912 (AAW)	Corpus, Untagged	10^5 * 5	10^0 * 8	American, African	1892	1912	Written	Written	General	None	None	2023/03/15
Children's Literature (ChiLit)	Corpus, Untagged	10^6 * 4	10^1 * 7	(Unclear)^[68]	?	?	Written	Written	Fiction, Children	None	None	2023/03/15
The Philadelphia Neighborhood Corpus of LING560 Studies (PNC)^[69]	Corpus	10^6 * 2	10^2 * 3	American	1972	Present?^[70]	Spoken, Interviews	Written	(Unclear)	Restrictions on excerpt size^[71]	Yes^[72]	2023/03/15
British Pathé^[73]	Library	-	10^5 * 2	British	1896	1984	Spoken, Formal	Audio-visual	Nonfiction, News	None?	None	2023/04/06
Newspapers.com	Library	-	10^5 * 8	(Various)^[74]	1690	Present	Written, Periodicals, Newspaper, Print	Written	Nonfiction, News	None	Wikipedia Library access available. Paid subscription otherwise required. Free trials are available.	2023/04/30
NewspaperArchive	Library	-	10^7 * 2^[47]^[75]	(Various, Multilingual)^[76]	1607	Present	Written, Periodicals, Newspaper, Print	Written	Nonfiction, News	None	Wikipedia Library access available. Paid subscription otherwise required. Free trials are available.	2024/06/16
PressReader	Library	-	?	(Various)	?	Present	Written, Periodicals, Newspaper, Print	Written	Nonfiction, News	None	Some snippets freely visible, most content requires paid subscription. Free trials are available.	2023/05/31
ProQuest	Library	-	?	(Various)	?	Present	Written, Periodicals, Newspaper, Print	Written	Nonfiction, News	None	Wikipedia Library access available. Some snippets freely visible, most content requires paid subscription. Free trials are available.	2023/05/31
Welsh Newspapers	Library	-	?^[77]	Welsh,^[78] Multilingual	1804	1919	Written, Periodicals, Newspaper, Print	Written	Nonfiction, News	None?	None	2023/08/08
Welsh Journals	Library	-	?^[79]	Welsh, Multilingual	1735	2007	Written, Periodicals, Print	Written	General	None?	None	2023/08/08
Crime and Punishment Database	Library	-	-	English?^[80]	1730	1830	Written, Formal, Court Records	Written	Nonfiction, Law, Courts, Criminal	None?	None	2023/08/08
American Archive of Public Broadcasting	Library	-	10^5 * 1^[81]	(Various, Multilingual)^[45]	1931^[82]	Present	Spoken, esp. Formal	Audio-visual	General, esp. Nonfiction	None	None, additional content available on-site at GBH or the Library of Congress.	2023/11/01
Buckeye Speech Corpus	Corpus, Tagged	10^6 * 3	10^2 * 4	American	1999	2000	Spoken, Interviews	Audio, Written	General, Sociolinguistic interviews^[83]	Restriction to educational and research use only	Free registration required	2024/02/19
Westminster Detective Library	Library	10^7 * 5^[46]^[84]	10^4 * 2^[84]^[85]	American	1818	1891	Written, Periodicals, Newspapers, Print^[86]	Written	Fiction, Short Stories, Detective Stories	None	None	2024/02/26
Usenet Archive (UTZOO Wiseman/Zach Barth)	Social Media	-	10^6 * 2^[87]	(Various)	1981	1991	Written, Computer, Social Media, Usenet	Written	General, esp. Nonfiction	None	None	2024/03/20
Searchids.com^[88]	Library^[89]	10^7 * 7^[90]	10^7 * 2^[91]	(Various)	2006	2006	Written, Computer, Internet Searches	Written	General	Restriction to non-commercial research use only^[92]	None	2024/04/11
Freiburg Corpus of English Dialects (FRED) - Interactive Database^[93]	Corpus, Untagged^[94]	10^6 * 1^[95]	10^2 * 1^[95]	British (various dialects)	1970^[95]	2000^[95]^[96]	Spoken, Interviews	Audio, Written	Nonfiction, History, Oral History^[95]	None	None	2024/05/09
MTSamples.com^[97]^[98]	Library	10^6 * 3^[99]	10^3 * 5^[100]	(Various?)	2007^[101]	2023	Written, Computer	Written	Nonfiction, Medicine	Requires attribution^[102]	None	2024/12/16
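
The size columns in the table above use the notation `10^N * M`, giving the order of magnitude first and the leading digit second, so `10^10 * 2` stands for roughly 2 × 10^10 (twenty billion). For anyone who copies the table into a script to sort or filter it, the following is a minimal Python sketch of how such a cell could be converted to a number. The function name `parse_size` and both regular expressions are illustrative assumptions, not part of any Wiktionary tooling; footnote markers such as `^[11]` are stripped first, and any cell that does not match the notation exactly (for example `-` or `?`) yields `None`.

```python
import re

# Matches the page's size notation, e.g. "10^9 * 2" (read as 2 x 10^9).
# A bare "10^7" with no multiplier is also accepted.
SIZE_PATTERN = re.compile(r"10\^(\d+)(?:\s*\*\s*(\d+))?")
# Matches footnote markers such as "^[11]".
FOOTNOTE_PATTERN = re.compile(r"\^\[\d+\]")


def parse_size(cell):
    """Return the approximate count for a size cell, or None if not given.

    Hypothetical helper for illustration only.
    """
    cleaned = FOOTNOTE_PATTERN.sub("", cell).strip()
    match = SIZE_PATTERN.fullmatch(cleaned)
    if match is None:
        # Covers "-", "?", struck-through cells, and anything else
        # that is not written exactly in the 10^N * M notation.
        return None
    exponent = int(match.group(1))
    multiplier = int(match.group(2)) if match.group(2) else 1
    return multiplier * 10 ** exponent
```

With this helper, `parse_size("10^9 * 2")` gives `2000000000`, and rows could be ordered by size with something like `sorted(rows, key=lambda r: parse_size(r[2]) or 0)`, assuming each row has already been split on tabs.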

Non-English

Non-English corpora table
Name	Language	Language Code	Resource Type	Size in words	Size in texts	Start year	End year	Original medium	Available medium	Genre	Re-use restrictions	Access restrictions	Date of entry update
Czech National Corpus^[103]	Czech	cs	Corpus, Tagged	?	?	?	?	?	Multimedia	General	None?	None	2024/06/18
Polish National Corpus^[103]	Polish	pl	Corpus, Tagged	10^9 * 2	?	?	?	?	Written	General	None?	None	2023/02/12
Russian National Corpus^[103]	Russian	ru	Corpus, Tagged	10^9 * 2^[104]	10^6 * 5^[104]	1100	Present	Multimedia	Written	General	Restriction to non-commercial linguistic use only^[105]	None	2023/02/12
Turkish National Corpus^[106]	Turkish	tr	Corpus, Tagged?	10^7 * 5	10^3 * 6	1990	2009	Written^[107]	Written	General	Restriction to educational use only^[108]	Free registration required	2023/02/12
Bruno Corpus^[109]	Spanish	es	Corpus, Untagged	10^6 * 1	10^2 * 5	?	2010 (apprx.)	Written	Written?	General	None	None	2023/02/12
Braun Corpus^[109]	German	de	Corpus, Untagged	10^6 * 1	10^2 * 5	?	2008 (apprx.)	Written	Written?	General	None	None	2023/02/12
Corpus del Español: Genre/Historical	Spanish	es	Corpus, Tagged	10^8 * 1	10^4 * 1	1200 (apprx.)	2000 (apprx.)	Multimedia, esp. Written	Written	General	None	Free registration required	2023/03/24
Corpus del Español: Web/Dialects	Spanish^[110]^[111]	es	Corpus, Tagged	10^9 * 2	10^6 * 2	2010 (apprx.)	2014	Written, Computer, Internet	Written	General	None	Free registration required	2023/03/24
Corpus del Español: NOW	Spanish^[110]^[112]	es	Corpus, Tagged	10^9 * 7	10^7 * 1	2012	2019	Written, Computer, Internet	Written	Nonfiction, News	None	Free registration required	2023/03/24
Corpus del Español del Siglo XXI (CORPES)^[113]	Spanish	es	Corpus, Tagged	10^8 * 4	10^5 * 4	2001	2022	Multimedia, esp. Written	Multimedia, esp. Written	General	None?	None	2023/03/24
Lemko and Karpatska Rus’ Archive^[114]	Carpathian Rusyn	rue	Library	-	10^3 * 2	1928	1989	Written, Periodicals, Newspaper, Print	Written	Nonfiction, News	None?	None	2024/06/18
Spauda^[114]	Lithuanian	lt	Library	-	?	1886	2015	Written, Periodicals, Newspaper, Print	Written	Nonfiction, News	None?	None	2023/04/04
Gallica	French	fr	Library	-	10^7	?	?	Written, Periodicals, Newspaper, Print	Written	General, esp. Nonfiction, News	None	None	2023/05/31
RetroNews	French	fr	Library	-	? (>10^6 * 3)	1631	1951	Written, Periodicals, Newspaper, Print	Written	Nonfiction, News	None	None	2023/05/31
The Database of Early Cantonese Bible	Cantonese	yue	Corpus, Untagged?	-	10^0 * 7	1863	1927	Written, Religious Text	Written	Religious, Christianity, Bible Passages	None?	None	2023/12/10
The Database of Early Christian Literature	Cantonese	yue	Corpus, Untagged?	-	10^0 * 5	1845 (apprx.)	1906	Written, Books, Print	Written	Religious, Christianity	None?	None	2023/12/10
A Comprehensive Edition of Tocharian Manuscripts (CEToM)	Tocharian B, Tocharian A	txb, xto	Corpus, Tagged	10^5 * 2^[115]	10^3 * 2^[116]	500 (apprx.)^[117]	700 (apprx.)^[117]	Written	Written	General, esp. Religious, Buddhism	None?	None	2024/05/05
Manx Corpus Search	Manx	gv	Corpus, Untagged	10^6 * 2	10^2 * 7	1610	2012	Multimedia, esp. Written	Written	General?	None?	None	2025/03/10
Comprehensive Aramaic Lexicon Project	Aramaic	arc	Corpus, Tagged	?	?	BCE 900 (apprx.)	1300 (apprx.)	Written	Written	?	None?	None	2025/03/24

Glossary

The following is a brief explanation of how various terms are used in describing and categorizing the corpora on this page.

Access restrictions: Any barriers to accessing the resource's contents, such as registration or paying a subscription. A number of resources can be accessed through the Wikipedia Library for free.
Apprx.: "Approximately", used to indicate that a date or quantity was estimated or not exactly known when inputted, but a best guess was given.
Available medium: The format through which the language can be accessed in the resource, such as written text or speech recorded in a video.
Esp.: "Especially", used to qualify the most common quality of a corpus, even if there are notable exceptions.
Hyphen (-): The symbol "-" is used in tables for information about a corpus that cannot be readily determined or approximated.
Library: Collection of texts gathered with a wide net and without linguistics work particularly in mind. It must be possible to search the contents of these texts.
Original Medium: The way the language was originally produced, whether spoken, written, etc.
Question mark (?): The symbol "?" is used in tables for information about a corpus that has not yet been determined, but probably could be.
Social media: A live website or other online center for mass user communication, or an attempt at a near-complete archive of one. If the resource is an archive with a particular focus, then it is considered a library or corpus.
Re-use restrictions: Unique restrictions on the distribution of the resource's contents beyond general copyright law, in particular restrictions on commercial use or restrictions to academic users only. This is particularly relevant to Wiktionary, where all content must be redistributable commercially per Wiktionary's CC BY-SA 4.0 license.
Strikethrough (~~text~~): Resources whose names are struck through were nonfunctional or otherwise broken at the time of the entry's last update.
Tagged Corpus: Collection of texts gathered within a specific scope with linguistics work at least partly in mind. The contents of the texts are marked by part of speech, meaning, pragmatics, or any other method.
Text: A continuous use of language published, released, or spoken as a coherent work. This could be a forum post in a thread, a book, an issue of a magazine, or a speech.
Untagged Corpus: Collection of texts gathered within a specific scope with linguistics work at least partly in mind. The contents of the texts are not marked by part of speech, meaning, pragmatics, or any other method.
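
The conventions above for "Apprx.", the hyphen, and the question mark can be applied mechanically when reading the "Start year" and "End year" columns. The following Python sketch shows one possible way to do so. The names `YearCell` and `parse_year` and the status labels are assumptions made for illustration, and the sketch simplifies in two places: a trailing "?" on "Present" is dropped, and BCE years are stored as negative numbers purely for sorting.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches footnote markers such as "^[7]".
FOOTNOTE_PATTERN = re.compile(r"\^\[\d+\]")
# Matches a year of up to four digits, optionally prefixed by "BCE".
YEAR_PATTERN = re.compile(r"(BCE\s+)?(\d{1,4})")


@dataclass
class YearCell:
    year: Optional[int]  # negative for BCE; None when no year is given
    approximate: bool    # True when the cell is marked "(apprx.)"
    status: str          # "known", "present", "undeterminable",
                         # "undetermined", or "unparsed"


def parse_year(cell: str) -> YearCell:
    """Interpret a year cell using the glossary conventions (hypothetical helper)."""
    text = FOOTNOTE_PATTERN.sub("", cell).strip()
    approximate = "(apprx.)" in text
    text = text.replace("(apprx.)", "").strip()
    if text == "-":
        # Hyphen: cannot be readily determined or approximated.
        return YearCell(None, approximate, "undeterminable")
    if text in ("?", ""):
        # Question mark: not yet determined, but probably could be.
        return YearCell(None, approximate, "undetermined")
    if text.rstrip("?") == "Present":
        return YearCell(None, approximate, "present")
    match = YEAR_PATTERN.fullmatch(text)
    if match is None:
        return YearCell(None, approximate, "unparsed")
    year = int(match.group(2))
    if match.group(1):
        year = -year  # simplification: BCE stored as a negative number
    return YearCell(year, approximate, "known")
```

For example, `parse_year("1470 (apprx.)")` reports the year 1470 flagged as approximate, while `parse_year("-")` and `parse_year("?")` are kept distinct, so a script can tell information that cannot be determined from information that is merely missing.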

Other lists and databases

Other lists and databases table
Name	Language	Language Code	Size in corpora	Date of entry update
Corpus Resource Database (CoRD)	Translingual, esp. English	mul, en	10^2 * 1	2023/02/13
Czech National Corpus KonText interface	Translingual	mul	10^3 * 1^[118]	2023/02/13
English-Corpora.org	English	en	10^1 * 2	2023/02/13
Leipzig Corpora Collection	Translingual	mul	10^3 * 1	2023/02/13
Lextutor Web Concordance English	English	en	10^1 * 5	2023/02/13
Lextutor Web Concordance French	French	fr	10^1 * 2	2023/02/13
LINDAT/CLARIAH-CZ Corpora	Translingual	mul	10^2 * 7	2023/02/13
Linguistic Data Consortium (LDC)	Translingual	mul	10^3 * 1	2023/02/13
Martin Weisser's On-line Corpora of English	Translingual, esp. English	mul, en	10^1 * 2	2023/02/13
SketchEngine	Translingual	mul	10^1 * 2^[119]	2023/02/13
University of Warwick list of free online corpora	English	en	10^1 * 2	2023/02/13
University of Edinburgh Scots and Scottish English corpora	Scots, English	sco, en	10^1 * 3	2023/02/13
~~SHACHI Database of Language Resources~~^[120]	Translingual	mul	10^3 * 2	2024/04/22
CLARIN.SI Online Concordancers	Translingual, esp. Slovene	mul, sl	10^2 * 2	2023/02/26
CLARIN.SI Corpus Repository	Translingual, esp. Slovene	mul, sl	10^2 * 2	2023/02/26
CLARINO Corpuscle	Translingual, esp. Norwegian	mul, no	10^1 * 6	2023/02/26
CLARINO Corpus Repository	Translingual, esp. Norwegian	mul, no	10^1 * 4	2023/02/26
Online Resources for African American Language (ORAAL), external data sources	English	en	10^1 * 1	2023/03/15
Online Resources for African American Language (ORAAL), supplements	English	en	10^0 * 4	2024/05/08
Corpus Linguistics in Context (CLiC)	English	en	10^0 * 5	2023/03/15
The Spanish Corpus	Spanish	es	10^0 * 4	2023/03/24
Pennsylvania State University scripts and transcripts of popular film, TV, and sports	English^[121]	en	10^1 * 2	2023/04/02
/r/Screenwriting Guide to Finding Scripts Online	English^[121]	en	10^1 * 2	2023/04/02
BBC.com^[122]	Translingual	mul	10^1 * 3	2024/04/22
Corpus4U.org^[123]^[124]	English, Chinese	en, zh	10^2 * 2	2023/06/17
Beijing Foreign Studies University CQPweb^[125]	Translingual	mul	10^2 * 2	2023/06/17
Lancaster University CQPweb	Translingual, esp. English	mul, en	10^2 * 1	2023/06/17
Hong Kong University of Science and Technology Resources for Chinese Linguistics	Chinese, esp. Cantonese	zh, yue	10^0 * 3	2023/12/10
PolyU Corpus of Spoken Chinese, links to other corpora and databases	Translingual, esp. Chinese	mul, zh	10^2 * 1	2024/01/13
Duke University list of collections of African American oral histories	English	en	10^1 * 1	2024/05/08
OPUS Open Parallel Corpus Collection	Translingual	mul	10^3 * 1	2024/06/18
OPUS Multilingual Search Interface	Translingual	mul	10^2 * 4	2024/06/18
Stanford Large Network Dataset Collection	Translingual	mul	10^1 * 1	2025/03/03

Notes

^ Go back to top

↑ ^1.0 ^1.1 ^1.2 Specifically Australia, Bangladesh, Canada, Ghana, Great Britain, Hong Kong, India, Ireland, Jamaica, Kenya, Malaysia, New Zealand, Nigeria, Pakistan, Philippines, Singapore, South Africa, Sri Lanka, Tanzania, the United States
↑ ^2.0 ^2.1 ^2.2 ^2.3 ^2.4 ^2.5 Note that dialect information in internet-derived corpora tends to be somewhat inaccurate because of accidental inclusion of texts in other dialects.
^ Specifically Australia, Canada, Ireland, New Zealand, the United Kingdom, and the United States
^ Note that this corpus is a sub-corpus of the NOW corpus.
^ Particularly speeches and interviews
^ As of 2024-05-10, the latest change log entry was from 2023/12/26.
^ Most after 2005
^ Most before 2017
^ An account is required to use the site's built-in search function. Nonetheless, the forum threads can still be viewed and navigated without hindrance when logged out.
^ Tyler Kendall, Charlie Farrington (2023 June) “CORAAL User Guide”, in Corpus of Regional African American Language‎^[1], retrieved 2024-05-09:
The core components of CORAAL focus on AAL in Washington DC, […] CORAAL:DC […] is comprised of over 100 sociolinguistic interviews […] In addition to CORAAL:DC, CORAAL includes several smaller components to provide regional breadth. As of July 2021, there are six supplemental components: CORAAL:ATL, which includes 14 sociolinguistic interviews from speakers living in Atlanta, Georgia; CORAAL:DTA, which includes 40 sociolinguistic interviews from the Detroit Dialect Study collected in 1966; CORAAL:LES, comprised of 10 sociolinguistic interviews of speakers from the Lower East Side of New York City; CORAAL:PRV, which includes 15 sociolinguistic interviews from the town of Princeville, a rural African American community in central North Carolina; CORAAL:ROC, which includes 14 sociolinguistic interviews from Rochester, a city in Western Upstate New York; and CORAAL:VLD, which includes 12 speakers from Valdosta, a small city in South Georgia. […] Interviews are sociolinguistic styled interviews on topics such as life in Valdosta, personal histories, and high school sports.
↑ ^11.0 ^11.1 Note that this number only represents the size of the English portion of the 2020 release of the corpus.
↑ ^12.0 ^12.1 ^12.2 For specific details, see the "Total counts for Dependencies" file hosted in the Dependencies Downloads Index for the English part of the 2020 release which contains word and book counts for each of the years in the corpus as described on the main Ngram Viewer Exports page.
^ Anna L. Shparberg (2021 July) “Google Books Ngram Viewer”, in The Charleston Advisor, volume 23, number 1, Annual Reviews, →DOI, pages 16–19
^ Note that the "British English" and "American English" sub-corpora of Google Ngram are sometimes very inaccurate/misleading because of the accidental inclusion of texts in other dialects. Consider color vs colour and airplane vs aeroplane in the "British English" corpus. In both cases, Google Ngram shows the forms as being roughly equally common from 2000-2019, which is blatantly untrue.
^ Madian Khabsa, C. Lee Giles (2014 May 9) “The Number of Scholarly Documents on the Public Web”, in PLOS ONE, volume 9, number 5, →DOI, →ISSN
↑ ^16.0 ^16.1 ^16.2 "English as a Second Language"
^ The corpus' fair use statement says that "if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required" which is incompatible with Wiktionary's license.
↑ ^18.0 ^18.1 Audio files are available separately on TalkBank.org.
^ The corpus' manual can be accessed online.
^ The corpus' fair use statement says that "if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required" which is incompatible with Wiktionary's license.
^ As of 2024-05-10, the search function seems to be very slow or entirely broken. The groups and discussion threads are still manually navigable, though. The website can also still be searched using Google.
^ The website incorporates the UTZOO Wiseman archive.
^ Based on back-of-the-napkin extrapolation of data at the Internet Live Stats website.
^ The corpus appears to have gone down starting around mid 2024 based on archived crawls in the Wayback Machine and remains down as of 2025-02-20.
↑ ^25.0 ^25.1 ^25.2 ^25.3 "English as a Lingua Franca"
↑ ^26.0 ^26.1 The website is composed of a series of search tools, including n-gram and concordance search, based on the BNC.
^ Selection of different tools can be done through the "Grams" menu in the top left of the page.
^ Based on captures of the website stored in the Wayback Machine, it looks like the website went down some time between 2024-03-03 and 2024-06-18.
^ Full name "Brown University Standard Corpus of Present-Day American English"
^ The corpus is made of six "texts", but looking at their descriptions reveals that each one is actually a compilation of multiple texts. For example, "feen4" is described as "7 separate titles". Overall, the exact number of independent texts included is unclear.
^ Full name "International Corpus Network of Asian Learners of English, Online Version"
^ Shin Ishikawa (2022 April 12) “The ICNALE […] ”, in language.sakura.ne.jp‎^[2], SAKURA Internet, archived from the original on 2022-08-14: “The ICNALE includes […] speeches and essays produced by college students […] in ten countries/ regions in Asia (China, Hong Kong, Indonesia, Japan, Korea, Pakistan, the Philippines, Singapore/ Malaysia, Taiwan, and Thailand) as well as English native speakers.”
↑ ^33.0 ^33.1 The name is based on an abbreviation of the phrase "UK web as corpus".
^ To search the collection, select either "User-Added Text (Back)" or "User-Added Text (Front)" under "Narrow by Specific Fields", then select "contains" from the drop down just to the right and then enter the search term next to that and hit enter. Note that the overall quality and style of the data presented in the collection varies considerably.
^ As of 2023/03/07, 2,574 cards have the field "Writing on Card (Yes or No)?" marked as "yes". Nonetheless, there are cards that do have handwriting on them and have the field marked "no".
^ Approximate; the site actually lists its size as "7,600,186 phrases" (emphasis added).
↑ ^37.0 ^37.1 Though not exclusively short stories, the format dominates the library.
^ Although MASC is technically a corpus, it is only directly available through a web browser as a library. A complete copy of MASC as a corpus can be downloaded, though, and then processed with another application.
^ Many, if not most, of the LISTSERVs in the catalog do not have publicly accessible archives.
^ The catalog describes itself (as of 28 Nov 2022) as containing 58,100 public lists, each of which contains a number of messages.
^ Approximate; not all items cataloged in the library are available online. In particular, it seems none of the around 300,000 speeches cataloged are available online.
^ Most after 1945
^ Number of items which are both available online and have their language marked as "English".
^ Approximately, based on a search of the collection for the basic words "a" and "the".
↑ ^45.0 ^45.1 ^45.2 The number of non-English items is small.
↑ ^46.0 ^46.1 ^46.2 Approximately, based on a statistical calculation.
↑ ^47.0 ^47.1 Note that this number represents the number of newspaper issues in the archive.
^ Full name "Varieties of English for Specific Purposes dAtabase"
^ The corpus' end-user license states "Grant of the Product license entitles Licensee to use the Product for non-profit educational and/or linguistic research purposes only. [...] Licensees agree not to lease, sell, or commercially exploit the results of their searches (such as texts, concordances, metadata)." which is incompatible with Wiktionary's license.
^ Per https://issuu.com/about as of 2023/01/19
^ Issuu was founded in 2006 but includes some older publications uploaded since then; most of those are from after 1990, if not 2000.
^ Registration is required in order to turn "safe mode" off/show explicit search results.
^ Unclear. The collection is organized by "projects" which sometimes correspond to individual texts (such as diaries or funeral programs) and other times correspond to a collection of short texts (such as notes or letters). There were 11,372 projects on 2023/01/22. The length of projects is reported by the number of pages they contain. Using random sampling, it was estimated that the total length of all projects was around 2 million pages on 2023/01/22.
^ Most after 1800
^ Note that some transcripts were incomplete when this number was calculated.
^ Each interview in the collection, regardless of the number of parts it has, is considered one text. According to the "Faces and Voices from the Presentation" article, 26 interviews are in the collection.
^ Most from before 1950.
^ See Appendix I: Narratives in the Slave Narrative Collection by State for numerical breakdown by state
↑ ^59.0 ^59.1 Norman R. Yetman (2001) “The Limitations of the Slave Narrative Collection”, in Library of Congress‎^[3], published c. 2017
^ The narratives are based on interviews, but because of the lack of ground-truth audio recordings and doubts about the accuracy of the published versions of the narratives, they are categorized here as "Nonfiction, Biographies" rather than "General, Anthropological interviews" or similar.
^ Note that the COHA was updated in 2021.
↑ ^62.0 ^62.1 Specifically "United States/Canada", "United Kingdom/Ireland", "Australia/New Zealand", and "Miscellaneous".
^ Note that the corpus is listed as going up to the "present", but as of 2023/02/16 the most recent section is the 2010s implying that no opinions from later decades are included.
^ Note that this number reflects the number of articles in the corpus, not the number of issues of TIME Magazine in the corpus.
^ Specifically Australia, Canada, New Zealand, the United Kingdom, and the United States
↑ ^66.0 ^66.1 Note that the Queen's University page describing the corpus describes the start year as 1970 and end year as 2010 despite english-corpora.org providing a source spreadsheet which spans the years 1921 to 2011 and its corpus description page showing a time span from the 1920s to 2010s.
^ On the website, this number is associated with how many "volumes" are available and is listed alongside the number of "titles" (10^5 * 2) as well as the number of pages. The exact meaning of the terms "volumes" and "titles" in this context is unclear.
^ Note that although the corpus does explicitly mention its contents, I have not put in the effort to determine the dialect of each of the included texts.
^ The website for the corpus is now offline for unclear reasons, but it is presumably still possible to access the corpus by contacting the university.
^ The corpus' description implies that it is a continually expanding project, but in 2018 the page had not been updated in 5 years (since 2013), which may suggest the project stopped expanding around the same time.
^ An apparently genuine archived version of the corpus' confidentiality agreement does state "If I need to cite more than one paragraph (300 words) in a publication, I will obtain permission from the Philadelphia Neighborhood Corpus Committee".
^ An archive of the corpus' home page states that "only members of the research group have access".
^ Note that searches cover both metadata and transcripts for newsreels simultaneously.
^ Specifically Australia, Canada, Ireland, New Zealand, Panama, and the United Kingdom.
^ Derived from the "Newspaper Publication" list. A more direct view of the data as collected on 2024/06/16 can be seen in this published Google Sheet.
^ About 90% of the publications are based in predominantly Anglophone countries (the United States [12263], Australia [1223], the United Kingdom [811], Canada [525], Ireland [50], New Zealand [19]) while the rest are from a wide variety of countries. Information derived from the "Newspaper Publication" list. A more direct view of the data as collected on 2024/06/16 can be seen in this published Google Sheet.
^ Issue counts are provided for individual publications, but not for the entire collection. 12.7 million articles in English are available, though, with each issue featuring many articles.
^ A few publications originate from regions outside Wales, in particular three from London, one from the United States, and one from Argentina. An additional publication has no region listed though its "issuing body note" states "Published in Caernarfon by Thomas Jones", with Caernarfon being in Wales.
^ Issue counts are provided for individual publications, but not for the entire collection. 363 thousand pages in English are available, though, with each issue featuring many pages.
^ The English Wikipedia article on the Court of Great Sessions in Wales stated on 2023-08-08 that "[o]f the 217 judges who sat on its benches [...], only 30 were Welshmen". Those involved in keeping the court's records likely had a similar make up and so the database's dialect likely reflects England rather than Wales.
^ This number represents the number of recordings available online.
^ This date represents the earliest year specified for any recording in the archive, though that recording does not have audio. It is not immediately clear what the earliest recording with audio is. The earliest audio-only recording is from 1938.
^ “Buckeye Corpus Information”, in Buckeye Corpus‎^[4], c. 2005, retrieved 2024-05-09: “After a significant amount of piloting different protocols for eliciting large amounts of unmonitored speech, a modified sociolinguistic interview format was chosen.”
↑ ^84.0 ^84.1 Note that this number was calculated to include the roughly 25% of work listings which were placeholders on 2024/02/24 but should eventually become full entries, and to exclude the roughly 15% of work listings which were redirects to other listings on the same date.
^ Based on the fact that the list pages for browsing works display 25 works at a time and there are 78 pages to browse as of 2024/02/26.
^ Not explicitly stated, but browsing the collection on 2024-02-26 revealed only newspapers being cited as the source of the stories provided.
^ Samantha Cole (2020 October 13) “2.1 Million of the Oldest Internet Posts Are Now Online for Anyone to Read”, in Vice‎^[5], archived from the original on 2020-10-13: “Around 2.1 million posts from between February 1981 and June 1991 from Henry Spencer's UTZOO NetNews Archive are archived at the Usenet Archive for anyone to browse.”
^ There is also a mirror site, Explicit-Id.com
^ Though the site does feature a built-in search function, it is significantly limited and prone to errors. For this reason, I've classified it as a "library" rather than a "corpus". A complete copy of the original data can be downloaded (see here for details) and processed with another application, though.
^ From the number of queries multiplied by the average of 3.5 words per query mentioned in the scientific article that originally accompanied the data: Greg Pass, Abdur Chowdhury, Cayley Torgeson (2006 May) “A Picture of Search”, in Proceedings of the First International Conference on Scalable Information Systems, Hong Kong, →DOI, page 2
^ Number of queries, per the README included with the data
^ This requirement is incompatible with Wiktionary's license.
^ Although the database indexes and shows results for the entirety of FRED, audio and transcripts are only viewable for the FRED Sampler (FRED-S) portion. For this reason, most of the information presented in this table is based on the FRED-S, not the complete FRED.
^ Although tagged transcripts can be downloaded from the database, the search function only allows for the plaintext transcripts to be searched.
↑ ^95.0 ^95.1 ^95.2 ^95.3 ^95.4 Benedikt Szmrecsanyi, Nuria Hernández (2007) “Manual of Information to Accompany the Freiburg Corpus of English Dialects Sampler (“FRED-S”)”, in FreiDok Plus‎^[6], archived from the original on 2013-04-02
^ Most before 1990.
^ As of 2024-12-16, the search function seems to be broken. It can still be manually searched using Google.
^ A scrape of the corpus from about 2018 is also available as a CSV with the registration of a free account.
^ Based on a word count of the 2018 scrape, which has a similar number of transcription samples in it as the live website.
^ According to the website, as of the last update on 2023-07-07.
^ Based on Wayback Machine records.
^ Per the landing page.
↑ ^103.0 ^103.1 ^103.2 Note that multiple sub-corpora and related corpora can be searched on the site.
↑ ^104.0 ^104.1 Note that these numbers represent the size of all the corpora on the site tallied together.
^ The corpus' terms FAQ states "All data published under [this website] are available exclusively for non-commercial use for research and educational purposes [...] they can only be used as sources of examples (citations) illustrating a particular linguistic phenomenon." This requirement is incompatible with Wiktionary's license.
^ As of 2023-02-12 the query interface was offline.
^ The corpus' about page states that it is specifically 98% written and 2% spoken.
^ The corpus' user agreement states "TUD sadece araştırma ve sunum amaçlı kullanıma açıktır ve fikri mülkiyet hakları tümüyle Sağlayıcıya aittir." (roughly, '[the corpus] is available for research and presentation purposes only and the intellectual property rights remain the sole property of the Provider.') This requirement is incompatible with Wiktionary's license.
↑ ^109.0 ^109.1 This corpus was designed to imitate the English-language Brown Corpus.
↑ ^110.0 ^110.1 Specifically Argentina, Bolivia, Chile, Colombia, Costa Rica, Cuba, Dominican Republic, Ecuador, Guatemala, Honduras, Mexico, Nicaragua, Panama, Paraguay, Peru, Puerto Rico, El Salvador, Spain, United States, Uruguay, Venezuela.
^ Note that dialect information in internet-derived corpora is usually somewhat inaccurate because of the accidental inclusion of texts in other dialects. This issue is addressed on the website with the conclusion that the "categorization is quite good".
^ Note that dialect information in internet-derived corpora is usually somewhat inaccurate because of the accidental inclusion of texts in other dialects. This issue was addressed for the related Web/Dialects corpus with the conclusion that the "categorization is quite good" so a similar level of quality may exist for this corpus.
^ Note that the CORPES is currently undergoing continuous revision and so this information may be out of date. To be specific, the information presented is for version 0.99.
↑ ^114.0 ^114.1 Note that the newspapers were published in the United States.
^ Using the number of "total tokens" listed under "Types of complete words (including unresolved akṣaras)" on 2024-04-22.
^ Using the number of manuscripts publicly available on 2024-04-22.
↑ ^117.0 ^117.1 George S. Lane, Douglas Q. Adams (2013 July 16) “Tocharian languages”, in Encyclopedia Britannica‎^[7], retrieved 2024-05-05: “Documents from AD 500–700”
^ Approximate; it is difficult to see the full list of corpora in order to get an accurate estimate.
^ Note that this number reflects the number of corpora freely available. Including the corpora which require a subscription or special permission, the number comes up to 722 as of 2023/0/13.
^ Note that the database has not been updated since 2016 and has a somewhat buggy search system.
↑ ^121.0 ^121.1 Not confirmed to be English exclusively, but probably almost all English.
^ The BBC publishes news online in a wide variety of languages which can then be searched manually using a search engine like Google. The languages are specifically Arabic, Azeri, Bangla, Burmese, Chinese, French, Hausa, Hindi, Indonesian, Japanese, Kinyarwanda, Kirundi, Kyrgyz, Marathi, Nepali, Pashto, Persian, Portuguese, Russian, Sinhala, Somali, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, Uzbek, and Vietnamese.
^ The forum is primarily written in Chinese, though some posts are in English.
^ The section which primarily hosts links to corpuses is labeled "专题研究" (Google Translate translates this as "Special Research").
^ Both user ID and password are "test" for freely available corpora.
