Basic Persian words based on press texts

Article type: Research article

Authors

1 Associate Professor, Department of Linguistics, Allameh Tabataba'i University; Director of the Fundamental Research Group for the Development of Teaching Persian to Foreigners, Allameh Tabataba'i University

2 Researcher / Saadi Foundation

3 Member of the Fundamental Research Group for the Development of Teaching Persian to Foreigners, Allameh Tabataba'i University

Abstract

Vocabulary teaching is one of the most important components of foreign language teaching and can affect all four main language skills (listening, speaking, reading and writing). According to research on vocabulary teaching, the high-frequency and basic words of a language are especially important because they are easy to learn and widely used in everyday language. A list of high-frequency or basic words is a set of words that occur more frequently in a linguistic corpus. To build a suitable corpus, press texts from seven different areas (culture, society, politics, sports, fiction, economics and science) were used, and within 100 working days a corpus of 2,400 texts comprising 1,200,000 words was compiled. Then, using software designed for this research, the words were labeled by type, and from among them the 2,000 words that occurred more than 50 times were introduced as the basic words of Persian based on press texts.

Keywords


Article title [English]

Basic Persian words based on press texts

Authors [English]

  • Rezamorad Sahraee 1
  • Amir Hosein Mojiri Forushani 2
  • Morvarid Talebi 3
1 Associate Professor, Department of Linguistics, and Director of the Fundamental Research Group for Teaching Persian to Foreigners, Allameh Tabataba’i University
2 Researcher / Saadi Foundation
3 Member of the Fundamental Research Group for Teaching Persian to Foreigners, Allameh Tabataba’i University
Abstract [English]

No information can be conveyed without knowledge of a language's words. Vocabulary teaching is one of the most important components of foreign language teaching and can affect all four main language skills (listening, speaking, reading and writing). The first step in teaching vocabulary is to obtain a list of basic words. According to studies in this area, high-frequency and basic words are significant in language teaching because they are easy to learn and frequently used in everyday conversation. A high-frequency word list, or frequency dictionary, is a set of words that recur more often than others in a collection of texts (a corpus). Basic words are generally extracted through corpus-based research, and the output of any linguistic corpus can be a basic word list (depending on the language and the type of texts in the corpus).
Since 1897, many lists of basic words have been compiled for different languages of the world. More of this research has been done on English than on any other language. Since 1971, research has also been conducted to extract frequent words in Persian.
Thorndike, E. L. (1921) presented 30,000 English basic words. In 1923, Ogden, C. K. & Richards, I. A. listed 850 English basic words. Dolch, E. W. (1936) listed 220 English basic words. West (1953) presented 2,000 English basic words. Coxhead, A. (2000) derived basic words in four areas: art, commerce, law and science. In 2001, Verlinde, S. & Selva, T. provided a list of frequent words in French. Lists of 100 and 1,000 English basic words were extracted by Fry, E. B., Polk, J. E., & Fountoukidis, D. (2000). Jones, R. L. & Tschirner, E. (2006) extracted 4,307 frequent German words. Davies, M. & Gardner, D. (2010) extracted 1,000 to 5,000 English basic words. Other examples are the list of 100 frequent words of the Oxford English Dictionary and the 3,000 basic words of the Longman dictionary.
In Iran, scholars who have worked on basic words include Barahani, M. T. (1975), Imen, L. (1978), Safarpour, A. (1991), Tahriryan, M. H. (1994), Hasani, H. (2005), Gharavi Qouchani, M. (2006), Doroody et al. (2008), Alahmad et al. (2009), Bijankhan, M. (1390) and Nematzadeh et al. (2011).
This research had two main stages. a) Extracting texts and registering them in a database: 8 people worked for 100 working days, extracting 3 texts a day with an average of 500 words each from three newspapers, with every text belonging to one of seven areas (culture, society, politics, sports, fiction, economics and science). The result was a corpus of 1,203,589 words (2,401 texts).
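As a rough consistency check on these figures, the collection rate can be multiplied out. The sketch below assumes that the rate of 3 texts per day applies to each of the 8 people, which the abstract does not state explicitly; the variable names are illustrative only.

```python
# Planned collection rate, as described in the abstract.
RESEARCHERS = 8
WORKING_DAYS = 100
TEXTS_PER_DAY = 3          # assumption: per researcher, per working day
AVG_WORDS_PER_TEXT = 500   # average text length given in the abstract

expected_texts = RESEARCHERS * WORKING_DAYS * TEXTS_PER_DAY
expected_words = expected_texts * AVG_WORDS_PER_TEXT

print(expected_texts, expected_words)  # prints: 2400 1200000
```

Under that assumption the planned totals (2,400 texts, about 1.2 million words) are close to the reported corpus size.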
The software for this project was written specifically for this research in the PHP programming language and is web-based. The word types in the software are "noun", "verb" (and in particular "compound verb"), "preposition", "proper noun", "adjective" and "adverb". In addition, each text in the corpus carries metadata: the type of text (cultural, social, political, etc.), the name of the newspaper, the date of printing, the date the text was typed, the date of frequency extraction and the name of the researcher. A list of broken plurals and their singular forms was also compiled so that they could be identified, for example "آثار: اثر". A list of inflectional affixes (prefixes and suffixes) and the rules governing them was provided as well. This list specifies which prefixes or suffixes each type of word can take. For example, a verb can start with the prefix "می", and the suffix "ات" can be added to the word "آیه", but only after the letter "ه" is deleted from the end of the word.
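The broken-plural list and the affix rules could be represented as simple lookup tables. The following is a minimal Python sketch, not the project's PHP implementation; the table layouts and the function names (`singular_of`, `can_take_prefix`, `attach_suffix`) are hypothetical, and only the three examples mentioned in the abstract are encoded.

```python
# Hypothetical table of broken plurals mapped to their singular forms.
BROKEN_PLURALS = {"آثار": "اثر"}

# Hypothetical rule table: which prefixes each word category may take.
ALLOWED_PREFIXES = {"verb": {"می"}}

# Hypothetical rule table: (category, suffix) -> letter deleted from the end
# of the word before the suffix is attached ("" means nothing is deleted).
SUFFIX_RULES = {("noun", "ات"): "ه"}


def singular_of(word: str) -> str:
    """Return the singular of a listed broken plural, else the word itself."""
    return BROKEN_PLURALS.get(word, word)


def can_take_prefix(category: str, prefix: str) -> bool:
    """Check whether the rule table allows this prefix on this category."""
    return prefix in ALLOWED_PREFIXES.get(category, set())


def attach_suffix(word: str, category: str, suffix: str) -> str:
    """Attach a suffix, first deleting the final letter the rule names."""
    key = (category, suffix)
    if key not in SUFFIX_RULES:
        raise ValueError(f"no rule for suffix {suffix!r} on {category!r}")
    deleted = SUFFIX_RULES[key]
    if deleted and word.endswith(deleted):
        word = word[: -len(deleted)]
    return word + suffix
```

With these tables, `attach_suffix("آیه", "noun", "ات")` drops the final "ه" and yields "آیات", and `singular_of("آثار")` yields "اثر".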
Each word in the software also has attributes such as prefix, suffix, word root, word category, text numbers, word frequency and main word (the form used in the text).
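One way to picture such a word entry is as a small record type. The sketch below is illustrative only: the class name `WordRecord`, its field names and the sample values are assumptions rather than the schema of the project's database.

```python
from dataclasses import dataclass, field


@dataclass
class WordRecord:
    """Hypothetical record mirroring the attributes listed in the abstract."""
    main_word: str                 # the form as used in the text
    root: str                      # word root
    category: str                  # e.g. "noun", "verb", "adjective"
    prefix: str = ""
    suffix: str = ""
    text_numbers: set[int] = field(default_factory=set)  # texts containing it
    frequency: int = 0             # total number of occurrences

    def record_occurrence(self, text_number: int) -> None:
        """Register one more occurrence of the word in the given text."""
        self.text_numbers.add(text_number)
        self.frequency += 1


# Illustrative values only; the text numbers are made up.
record = WordRecord(main_word="آثار", root="اثر", category="noun")
for text_number in (12, 12, 87):
    record.record_occurrence(text_number)
```

Keeping the text numbers as a set separates "how many texts contain the word" from "how often it occurs overall", which are different measures of how basic a word is.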
b) Labeling words: Labeling involves specifying the word category (noun, adjective, verb or preposition) and the lemma. Words were recorded with their derivational affixes but without inflectional affixes and clitics. Verbs were recorded in the infinitive. Compound verbs were also recorded in infinitive form, and their nominal and verbal parts were not separated.
At the end of this stage, the words of the texts (1,203,598 words) were obtained along with the lemma of each word. After corrections to labels and lemmas, the "Basic words" table (2,150 words with a frequency of 50 or above) was obtained. In addition, lists of the 50 most frequent nouns, prepositions and adjectives and of the 20 most frequent adverbs in Persian were obtained. Then the basic words specific to each topic (500 words per topic with a frequency of 12 or above) were obtained.
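The frequency cut-off described here can be sketched as a count over labeled tokens followed by a threshold filter. This is a minimal Python illustration, not the project's software; the function name `basic_words`, the `(lemma, category)` token format and the toy data are assumptions.

```python
from collections import Counter


def basic_words(labeled_tokens, min_frequency=50):
    """Return (lemma, category, count) for lemmas at or above the threshold.

    `labeled_tokens` is an iterable of (lemma, category) pairs, one per
    running word of the corpus. Results are ordered by descending count.
    """
    counts = Counter(labeled_tokens)
    return [
        (lemma, category, n)
        for (lemma, category), n in counts.most_common()
        if n >= min_frequency
    ]


# Toy data (made up): only the first two lemmas reach the threshold of 50.
tokens = (
    [("بودن", "verb")] * 60
    + [("کتاب", "noun")] * 50
    + [("اثر", "noun")] * 49
)
result = basic_words(tokens)
```

The same function with `min_frequency=12`, applied to the tokens of a single topic, would correspond to the topic-specific lists mentioned above.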

Keywords [English]

  • corpus-based research
  • newspaper texts corpus
  • teaching vocabulary
  • basic words
  • high-frequency Persian words