Exploring Newspaper Language : Using the Web to Create and Investigate a Large Corpus of Modern Norwegian.
Title:
Exploring Newspaper Language : Using the Web to Create and Investigate a Large Corpus of Modern Norwegian.
Author:
Andersen, Gisle.
ISBN:
9789027274991
Added Author:
Physical Description:
1 online resource (362 pages)
Series:
Studies in Corpus Linguistics
Contents:
Exploring Newspaper Language -- Editorial page -- Title page -- LCC data -- Table of contents -- Building a large corpus based on newspapers from the web -- 1. Introduction -- 2. An overview of the Norwegian Newspaper Corpus and its system architecture -- 2.1 Text harvesting -- 2.2 Boilerplate and duplicate removal -- 2.3 Language classification -- 2.4 Text annotation -- 2.4.1 Annotation of source, date and author information -- 2.4.2 Topic classification -- 2.4.3 Part-of-speech tagging -- 2.5 Search system and user interface -- 2.5.1 Corpus WorkBench -- 2.5.2 Corpuscle -- 2.6 Extraction of new words -- 2.7 Classification of new words -- 2.7.1 Anglicism detection -- 2.8 Frequency profiling and lexical database entry -- 2.9 Identification of multiword expressions -- 3. The content of the research contributions to this book -- 4. Concluding remarks -- References -- Part I. Exploiting the web as a corpus - Methods and tools -- Corpuscle - a new corpus management platform for annotated corpora -- 1. Introduction -- 2. Design principles -- 3. Querying the corpus -- 4. API and Web interface -- 4.1 The API -- 4.2 The Web interface -- 5. Editing and manual annotation -- 6. Evaluation and concluding remarks -- References -- OBT+stat -- 1. Introduction -- 2. Background -- 2.1 The history of the Oslo-Bergen Tagger -- 2.2 State of the art for Norwegian POS taggers -- 3. The architecture of the Oslo-Bergen Constraint Grammar Tagger -- 4. Methodology of improvements to the Oslo-Bergen Tagger -- 5. Dealing with left-over ambiguities in the Oslo-Bergen Tagger -- 5.1 Morphological ambiguities -- 5.2 Lemma ambiguities -- 6. Statistical disambiguation -- 7. Modelling challenges and engineering concerns -- 8. Evaluation of the statistical module -- 8.1 How to evaluate -- 8.2 Evaluation results -- 9. Conclusion -- References.

Exploring corpora through syntactic annotation -- 1. Introduction -- 2. Treebanking -- 3. INESS - the Norwegian treebanking infrastructure -- 4. Searching for complex syntactic constructions in a treebank -- 4.1 Passive constructions -- 4.2 Relative clauses -- 5. Conclusion -- References -- Collocations and statistical analysis of n-grams -- 1. Introduction -- 2. Background -- 2.1 Multiword Expressions (MWEs) -- 2.2 Collocations -- 3. Methodology -- 3.1 Data and n-gram extraction -- 3.2 Post-processing of n-gram lists -- 3.3 Contingency tables -- 3.3.1 Bigram Contingency Tables -- 3.3.2 Trigram Contingency Tables -- 3.4 Bigram Association Measures -- 3.5 Trigram Association Measures -- 4. Results -- 4.1 Bigrams -- 4.2 Trigrams -- 5. Conclusion and Future Work -- References -- Automatic topic classification of a large newspaper corpus -- 1. Introduction -- 2. Background and related work -- 2.1 The rule-based approach -- 2.2 The pattern-matching approach -- 2.3 Promising results -- 3. Material -- 3.1 Manual annotation -- 3.2 Feature extraction -- 3.3 Cleaning the text -- 3.4 The gold standard -- 4. Overview of our final approach -- 5. Our approach in detail -- 5.1 Hypothesis -- 5.2 Defining categories -- 5.3 Tools -- 5.4 Programming and experimenting -- 6. Data and experimental evaluation -- 6.1 Measuring inter-annotator agreement -- 6.2 Possible sources of error -- 7. Results -- 8. Conclusions and future work -- References -- A data-driven approach to anglicism identification in Norwegian -- 1. Introduction -- 2. Background -- 2.1 Anglicisms -- 2.2 Modelling anglicisms computationally -- 2.3 A machine learning approach to anglicisms -- 3. Methodology -- 3.1 Machine learning algorithm: TiMBL -- 3.2 Lexical data -- 3.4 Building training vectors -- 3.5 Trigrams and frequencies -- 4. Experiments and results -- 4.1 Simple frequency -- 4.2 Productivity.

4.3 Automatic feature selection -- 4.4 Discussion of the results -- 5. Future work -- Reference -- Part II. Corpus-based case studies -- A corpus-based study of the adaptation of English import words in Norwegian -- 1. Introduction -- 2. Concepts and terminology -- 3. Previous research on norwegification -- 4. Inventory, data and methods -- 4.1 The three datasets -- 4.2 The Corpuscle interface to the Norwegian Newspaper Corpus -- 4.3 Step-by-step analysis -- 5. Degree of norwegification in the three datasets -- 5.1 Overall results for the three datasets -- 5.2 A more detailed look at the words from the 1997 spelling reform -- 5.3 Degree of norwegification of words from the 2004 spelling reform -- 5.4 Degree of unsolicited norwegification -- 6. Towards an analysis of factors affecting the variation between N/E forms -- 6.1 Complexity of the orthographic change -- 6.2 Collocations -- 6.3 Morphology -- 6.4 Homography -- 6.5 Semantic factors -- 7. Concluding remarks -- References -- Norm clusters in written Norwegian -- 1. Introduction -- 2. A pilot study -- 3. The texts -- 4. Data extraction and processing -- 5. Case I: Feminine nouns in Bokmål -- 5.1 Correspondence analysis -- 5.2 Implication analysis -- 6. Case II: Weak verbs in Bokmål -- 7. Case III: Infinitives in Nynorsk -- 7.1 Correspondence analysis -- 7.2 Implication analysis -- 8. Conclusion -- References -- Lexical neography in modern Norwegian -- 1. Introduction -- 2. What is a lexical neologism -- 2.1 Neoformatives -- 2.2 Neosemanticisms -- 2.3 Neophrasemes -- 2.4 A new definition of neologisms -- 3. Neologisms as a transitional phenomenon -- 4. How are new words coined? -- 5. Why do we need new words? -- 6. Previous registration of neologisms in Norwegian -- 7. Statistical studies of neologisms in modern Norwegian -- 7.1 Manual evaluation of the automatically generated new word candidates.

7.2 Semiautomatic retrieval of neologisms from the Norwegian Newspaper Corpus -- 7.3 Results and discussion -- 8. Conclusion and further work -- References -- Ash compound frenzy -- 1. Introduction -- 2. Data, method and results -- 2.1 Selection and preparation of material -- 2.2 Selection of time frame -- 2.3 Quantitative analysis -- 3. Discussion -- 3.1 Remarks on the empirical and quantitative methods -- 3.2 Linguistic remarks on specific compounds -- References -- 4. Appendices -- 4.1 Frequency list exclusive of hapax legomena -- 4.2 New words on April 14-15 -- 4.3 Type and token frequencies per date -- Financial jargon in a general newspaper corpus -- 1. Introduction -- 2. Background -- 3. Methodology -- 3.1 Corpus material and selection of financial jargon -- 3.2 Investigating the influence of English loan words -- 4. Results and discussion -- 4.1 The concepts of financial crisis and credit crunch -- 4.2 The concept of subprime and related concepts -- 4.3 The concept of hedge fund -- 4.4 A note on the use of acronyms -- 4.5 Inflectional forms -- 5. Concluding remarks -- References -- Appendix I -- Metonymic extension and vagueness -- 1. Introduction -- 2. Theoretical framework -- 2.1 Metonymic categories and the ambiguity/vagueness cline -- 2.2 Domains and dimensions in metonymy -- 3. Material and method -- 3.1 Material -- 3.2 Method -- 4. Results and discussion -- 4.1 Results -- 4.2 Discussion -- 5. Concluding remarks -- References -- Spatial metaphors in present-day Norwegian newspaper language -- 1. Introduction -- 2. Cognitive views on language -- 3. The up/down schema in Norwegian newspaper language -- 3.1 The up dimension -- 3.1.1 Høy/høyere -- 3.1.2 Stige and løfte -- 3.1.3 Summary -- 3.2 The down dimension -- 3.2.1 Lav/lavere -- 3.2.2 Synke/senke -- 3.2.3 Summary -- 4. Conclusion -- References.

Doing historical linguistics using contemporary data -- 1. Introduction -- 2. Stage theory -- 3. Deverbal nouns, prototypes and diachronic paths -- 4. Methods and data -- 4.1 Using a synchronic corpus only -- 5. Using older texts from the corpus -- 6. Concluding remarks -- References -- Name index -- Subject index.
Summary:
Retrieving linguistic data from earlier stages of a language is notoriously difficult. Large electronic corpora combined with frequency data can solve this task to some extent. In this article I focus on the use of token frequency as described in functional Grammaticalization Theory. Deverbal nouns are non-prototypical members of the noun class; as they get older, they tend to develop into more prototypical nouns, a process that Grammaticalization Theory calls lexicalization. This was tested in 2004 on some zero-suffix nouns in the Norwegian Newspaper Corpus, using modern texts only. In this article I test these findings using older texts from the same corpus.
Notes:
Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2017. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.