Eastern Armenian National Corpus
Website Date: 2009-09-10 (Archive)
Date Submitted: 2009-05-31
Announcement ID: 168976
The Eastern Armenian National Corpus (EANC) is a comprehensive linguistic database of annotated texts in Standard Eastern Armenian (SEA), the language spoken in the Republic of Armenia.
EANC is:
* a comprehensive corpus with about 110 million tokens
* a powerful search engine for making complex lexical and morphological queries
* a learner’s corpus including English translations for frequent tokens
* a diachronic corpus covering SEA texts from the mid-19th century to the present
* a mixed corpus consisting of both written discourse and oral discourse
* an annotated corpus with morphological and metatext tagging
* an open access corpus
* an electronic library with full access to over 100 Armenian classic titles
Objective
The current state of Eastern Armenian studies requires new approaches and linguistic tools to validate key empirical hypotheses and findings, as well as to expand the field of research. A corpus-based approach will make it possible to revisit aspects of traditional grammar that have not been sufficiently studied and will facilitate the development of new descriptive and theoretical concepts.
EANC provides linguists with a searchable annotated database of Eastern Armenian. It includes empirical linguistic data ranging from classical Standard Eastern Armenian literature to Yerevan street talk recorded and transcribed in 2008.
The immediate objective of EANC is to help linguists find and explore sentences (occurrences) in SEA texts that meet specific search criteria. EANC allows searching for:
* wordforms and lexemes
* part-of-speech categories, morphological attributes, and inflection types
* punctuation
* contextual queries and collocations
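The search criteria listed above can be pictured with a minimal Python sketch. It assumes each token carries a wordform, a lexeme, a part-of-speech tag, and a set of morphological attributes. The `Token` class, the tag names, and the toy token lists are hypothetical illustrations, not EANC's actual data model or query syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated token (hypothetical structure, not EANC's own)."""
    wordform: str
    lemma: str
    pos: str
    features: frozenset = frozenset()


def matches(token, wordform=None, lemma=None, pos=None, features=()):
    """Return True if the token satisfies every criterion that was given."""
    if wordform is not None and token.wordform != wordform:
        return False
    if lemma is not None and token.lemma != lemma:
        return False
    if pos is not None and token.pos != pos:
        return False
    # All requested morphological attributes must be present on the token.
    return set(features) <= token.features


def search(sentences, **criteria):
    """Return the sentences that contain at least one matching token."""
    return [s for s in sentences if any(matches(t, **criteria) for t in s)]


# Toy token lists for illustration only; these are not real corpus sentences.
SENTENCES = [
    [Token("գրքեր", "գիրք", "N", frozenset({"pl", "nom"}))],
    [
        Token("գրքի", "գիրք", "N", frozenset({"sg", "gen"})),
        Token("տուն", "տուն", "N", frozenset({"sg", "nom"})),
    ],
]

by_lexeme = search(SENTENCES, lemma="գիրք")               # lexeme query
plural_nouns = search(SENTENCES, pos="N", features={"pl"})  # POS + attribute
```

A lexeme query returns every sentence containing any inflected form of the word, whereas a wordform query matches only the exact surface form.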
EANC also lets researchers build a user-defined subcorpus, such as a single-author subcorpus or a subcorpus restricted to specific genres and/or periods.
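As a rough illustration of what building a user-defined subcorpus involves, the following Python sketch filters documents by author, genre, and period. The `Document` fields, the `build_subcorpus` function, and the placeholder records are assumptions made for this example; they do not describe EANC's interface or holdings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Metadata for one corpus document (hypothetical fields)."""
    title: str
    author: str
    genre: str
    year: int


def build_subcorpus(documents, author=None, genres=None, period=None):
    """Select documents matching the given author, genres, and year range.

    `period` is an inclusive (first_year, last_year) pair. Any criterion
    left as None is ignored.
    """
    selected = []
    for doc in documents:
        if author is not None and doc.author != author:
            continue
        if genres is not None and doc.genre not in genres:
            continue
        if period is not None and not (period[0] <= doc.year <= period[1]):
            continue
        selected.append(doc)
    return selected


# Placeholder records for illustration only, not actual EANC entries.
DOCS = [
    Document("Title 1", "Author A", "fiction", 1890),
    Document("Title 2", "Author A", "press", 1995),
    Document("Title 3", "Author B", "fiction", 2005),
]

single_author = build_subcorpus(DOCS, author="Author A")
modern_fiction = build_subcorpus(DOCS, genres={"fiction"}, period=(1950, 2009))
```

Criteria combine conjunctively here, so adding a criterion can only narrow the selection.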
Since EANC provides samples of actual SEA usage across periods, genres, and discourse formats, it can also serve as a powerful educational resource. English translations are provided for about 85 percent of the tokens, which makes the corpus easier to use for non-native speakers such as learners of Armenian. EANC can also be used in fields such as literary and cultural studies, journalism, and history.
EANC is a "national" corpus in the sense that it attempts to build the fullest possible representation of the Eastern Armenian language in all its culturally and socially significant aspects, following in the tradition of existing online national corpora such as the British National Corpus and the Russian National Corpus.
Importantly, EANC is as much about corpus linguistics as it is about Armenian studies. The EANC team aims to build a modern, flexible linguistic database that can be used as a platform for creating corpora of other languages, exploring statistical approaches to language description, and applying natural language processing methods.
Composition
EANC is designed as a comprehensive corpus, with the aim of including as many Standard Eastern Armenian texts as practicable. As of March 2009, EANC comprises about 110 million tokens. Overall, we have been guided by the goal of comprehensive representation: all literary, scientific, and oral texts available to us have been indexed for search. The only exceptions are certain widely available text types, such as electronic press and legal documents, whose presence has been limited for the sake of balance among genres.
Because of its comprehensive nature, EANC is inherently different from the corpora of "major" languages, such as the Russian National Corpus or the British National Corpus (BNC), which select their texts. The BNC additionally imposes a limit on the number of words per document, truncating longer texts. EANC, on the other hand, includes the great majority of all extant Eastern Armenian literary texts. In this respect, EANC is similar to the Czech National Corpus or the Slovak National Corpus.
The written discourse subcorpus of EANC includes 836 fiction texts, both prose and poetry (including 206 translated fiction titles), 7,858 newspaper issues and a sizeable collection of scientific and other non-fiction texts.
The SEA oral discourse subcorpus (3 million tokens) is an important structural element of EANC. It comprises spontaneous dialogs, task-oriented interviews, TV talk shows, films, and other audio recordings, all transcribed for EANC. Recently added samples of online communication, which are intermediate between the oral and written registers, have been placed in the oral subcorpus.
Each of the 9,960 document entries in EANC is labeled with metatext information specifying genre and other bibliographic details (e.g., date of creation or publication and name of the author).