Computational Linguistics and Intelligent Text Processing


Book Description

CICLing 2010 was the 11th Annual Conference on Intelligent Text Processing and Computational Linguistics. The CICLing conferences provide a wide-scope forum for discussion of the art and craft of natural language processing research as well as the best practices in its applications. This volume contains three invited papers and the regular papers accepted for oral presentation at the conference. The papers accepted for poster presentation were published in a special issue of another journal (see information on the website). Since 2001, the proceedings of CICLing conferences have been published in Springer's Lecture Notes in Computer Science series, as volumes 2004, 2276, 2588, 2945, 3406, 3878, 4394, 4919, and 5449.

The volume is structured into 12 sections:

– Lexical Resources
– Syntax and Parsing
– Word Sense Disambiguation and Named Entity Recognition
– Semantics and Dialog
– Humor and Emotions
– Machine Translation and Multilingualism
– Information Extraction
– Information Retrieval
– Text Categorization and Classification
– Plagiarism Detection
– Text Summarization
– Speech Generation

The 2010 event received a record high number of submissions in the history of the CICLing series: a total of 271 papers by 565 authors from 47 countries were submitted for evaluation by the International Program Committee (see Tables 1 and 2). This volume contains revised versions of 61 papers, by 152 authors, selected for oral presentation; the acceptance rate was 23%.




Language, Culture, Computation: Computational Linguistics and Linguistics


Book Description

This Festschrift volume is published in honor of Yaacov Choueka on the occasion of his 75th birthday. The present three-volume liber amicorum, several years in gestation, honors this outstanding Israeli computer scientist and is dedicated to him and to his scientific endeavors. Yaacov's research has had a major impact not only within the walls of academia, but also in the daily life of lay users of technology that originated from his research. An especially amazing aspect of the temporal span of his scholarly work is that half a century after his influential research of the early 1960s, a project in which he is currently involved is proving to be a sensation, as will become apparent from what follows. Yaacov Choueka began his research career in the theory of computer science, dealing with basic questions regarding the relation between mathematical logic and automata theory. From formal languages, Yaacov moved to natural languages. He was a founder of natural-language processing in Israel, developing numerous tools for Hebrew. He is best known for his primary role, together with Aviezri Fraenkel, in the development of the Responsa Project, one of the earliest full-text retrieval systems in the world. More recently, he has headed the Friedberg Genizah Project, which is bringing the treasures of the Cairo Genizah into the digital age. This third part of the three-volume set covers a range of topics related to language, ranging from linguistics to applications of computation to language, using linguistic tools. The papers are grouped in topical sections on: natural language processing; representing the lexicon; and neologisation.




Quantitative Semantics and Soft Computing Methods for the Web: Perspectives and Applications


Book Description

The Internet has been acknowledged as a recent technological revolution, due to its significant impact on society as a whole. Nevertheless, precisely because of this impact, limitations of the current Internet are becoming apparent, in particular its inability to automatically take into account the meaning of online documents. Proposals for taking meaning into account have begun to appear, chiefly the so-called Semantic Web, which includes a set of technologies, such as RDF, based on new markup languages. Though these technologies may be technically sound, practical limitations, such as the high level of training required to construct Semantic Web pages and the small proportion of Web pages that currently carry Semantic Web markup, make the Semantic Web marginal today and for the foreseeable future. Quantitative Semantics and Soft Computing Methods for the Web: Perspectives and Applications provides relevant theoretical frameworks and the latest empirical research findings on quantitative, soft-computing, and approximate methods for dealing with Internet semantics. The target audience of this book is composed of professionals and researchers working in the fields of information and knowledge related technologies (e.g., information science and technology, computer science, Web science, and artificial intelligence).




Handbook of Linguistic Annotation


Book Description

This handbook offers a thorough treatment of the science of linguistic annotation. Leaders in the field guide the reader through the process of modeling, creating an annotation language, building a corpus, and evaluating it for correctness. It is essential reading for both computer scientists and linguistics researchers.

Linguistic annotation is an increasingly important activity in computational linguistics because of its critical role in the development of language models for natural language processing applications. Part one of this book covers all phases of the linguistic annotation process, from annotation scheme design and choice of representation format through both the manual and automatic annotation process, evaluation, and iterative improvement of annotation accuracy. The second part of the book includes case studies of annotation projects across the spectrum of linguistic annotation types, including morpho-syntactic tagging, syntactic analyses, a range of semantic analyses (semantic roles, named entities, sentiment, and opinion), time, event, and spatial analyses, and discourse-level analyses, including discourse structure and co-reference. Each case study addresses the various phases and processes discussed in the chapters of part one.
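Among the evaluation phases covered in part one, one standard step is measuring inter-annotator agreement before an annotation scheme is judged reliable. The following minimal sketch computes Cohen's kappa, a common chance-corrected agreement measure, over two invented label sequences; it illustrates the general technique and is not code from the handbook.

```python
# A minimal sketch of inter-annotator agreement via Cohen's kappa.
# The two toy label sequences below are invented for illustration.
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n        # raw agreement
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in ca) / (n * n)     # chance agreement
    return (observed - expected) / (1 - expected)

ann1 = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
ann2 = ["NOUN", "VERB", "NOUN", "NOUN", "NOUN", "VERB"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.7
```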




Essential Speech and Language Technology for Dutch


Book Description

The book provides an overview of more than a decade of joint R&D efforts in the Low Countries on HLT for Dutch. It not only presents the state of the art of HLT for Dutch in the areas covered but, even more importantly, describes the resources (data and tools) for Dutch that have been created and are now available to both academia and industry worldwide. The contributions cover many areas of human language technology for Dutch: corpus collection (including IPR issues) and corpus building (in particular, one corpus aiming at a collection of 500M word tokens), lexicology, anaphora resolution, a semantic network, parsing technology, speech recognition, machine translation, text generation (summaries), web mining, information extraction, and text-to-speech, to name the most important ones. The book also shows how a medium-sized language community (spanning two territories) can create a digital language infrastructure (resources, tools, etc.) as a basis for subsequent R&D. At the same time, it bundles contributions from almost all the HLT research groups in Flanders and the Netherlands, and hence offers a view of their recent research activities. The targeted readers are mainly researchers in human language technology, in particular those focusing on Dutch: researchers active in larger networks such as CLARIN, META-NET, and FLaReNet, and those participating in conferences such as ACL, EACL, NAACL, COLING, RANLP, CICLing, LREC, CLIN and DIR (both held in the Low Countries), InterSpeech, ASRU, ICASSP, ISCA, EUSIPCO, CLEF, and TREC. In addition, some chapters will also interest human language technology policy makers and even science policy makers in general.




Empirical Methods in Natural Language Generation


Book Description

Natural language generation (NLG) is a subfield of natural language processing (NLP) that is often characterized as the study of automatically converting non-linguistic representations (e.g., from databases or other knowledge sources) into coherent natural language text. In recent years the field has evolved substantially. Perhaps the most important new development is the current emphasis on data-oriented methods and empirical evaluation. Several factors have had a considerable impact on the field: progress in related areas such as machine translation, dialogue system design, and automatic text summarization, with the resulting awareness of the importance of language generation; the increasing availability of suitable corpora in recent years; and the organization of shared tasks for NLG, in which different teams of researchers develop and evaluate their algorithms on a shared, held-out data set. This book offers the first comprehensive overview of recent empirically oriented NLG research.




The Oxford Handbook of Computational Linguistics


Book Description

Ruslan Mitkov's highly successful Oxford Handbook of Computational Linguistics has been substantially revised and expanded in this second edition. Alongside updated accounts of the topics covered in the first edition, it includes 17 new chapters on subjects such as semantic role labelling, text-to-speech synthesis, translation technology, opinion mining and sentiment analysis, and the application of Natural Language Processing in educational and biomedical contexts, among many others. The volume is divided into four parts that examine, respectively: the linguistic fundamentals of computational linguistics; the methods and resources used, such as statistical modelling, machine learning, and corpus annotation; key language processing tasks including text segmentation, anaphora resolution, and speech recognition; and the major applications of Natural Language Processing, from machine translation to author profiling. The book will be an essential reference for researchers and students in computational linguistics and Natural Language Processing, as well as those working in related industries.




Competition in Language Change


Book Description

This book addresses one of the most pervasive questions in historical linguistics – why variation becomes stable rather than being eliminated – by revisiting the hitherto neglected history of the English dative alternation. The alternation between a nominal and a prepositional ditransitive pattern (John gave Mary a book vs. John gave a book to Mary) emerged in Middle English and is closely connected to broader changes at that time. Accordingly, the main quantitative investigation focuses on ditransitive patterns in the Penn-Helsinki Parsed Corpus of Middle English; in addition, the book employs an Evolutionary Game Theory model. The results are approached from an ‘evolutionary construction grammar’ perspective, combining evolutionary thinking with diachronic constructionist notions, and the alternation’s emergence is interpreted as a story of constructional innovation, competition, cooperation and co-evolution. The book not only provides a thorough and detailed analysis of the history of one of the most-discussed syntactic phenomena in English, but, by fusing two frameworks and employing two different methodologies, also presents a highly innovative approach to a problem of relevance to historical linguistics in general.
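To give a flavor of the Evolutionary Game Theory component (the model below is a generic sketch with an invented payoff matrix, not the book's actual model): when each construction's payoff depends on how frequent it is, and each variant fares better when rare, replicator dynamics drives the population to a stable mixture of the two patterns instead of eliminating one.

```python
# A generic replicator-dynamics sketch of two competing ditransitive
# variants. The payoff matrix is invented so that each variant does
# better when rare, which yields stable variation (a mixed equilibrium).
payoff = [[1.0, 2.0],   # nominal pattern:       "gave Mary a book"
          [2.0, 1.0]]   # prepositional pattern: "gave a book to Mary"

x = 0.9  # initial share of the nominal pattern
for _ in range(50):
    f_nom = payoff[0][0] * x + payoff[0][1] * (1 - x)    # fitness of variant 1
    f_prep = payoff[1][0] * x + payoff[1][1] * (1 - x)   # fitness of variant 2
    mean = x * f_nom + (1 - x) * f_prep
    x = x * f_nom / mean  # discrete replicator update
print(round(x, 3))  # converges to 0.5: both variants persist
```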




User Interfaces to the Web of Data based on Natural Language Generation


Book Description

This book explores how Virtual Research Environments based on Semantic Web technologies support research interactions with RDF data at various stages of corpus-based analysis. It analyzes the Web of Data in terms of human readability, derives labels from variables in SPARQL queries, applies Natural Language Generation to improve user interfaces to the Web of Data by verbalizing SPARQL queries and RDF graphs, and presents a method for automatically inducing RDF graph verbalization templates via distant supervision.
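As a flavor of what verbalizing a SPARQL query can look like, here is a minimal template-based sketch; the label table, the example query, and the triple-pattern regular expression are illustrative assumptions, not the system presented in the book. Note how the fallback derives a label from the variable name itself, in the spirit of the approach described above.

```python
# A hypothetical sketch of template-based SPARQL verbalization.
import re

# Toy label lookup; unknown terms fall back to a label derived from the
# variable name, echoing the book's label-derivation idea.
LABELS = {
    "dbo:birthPlace": "was born in",
    "?person": "the person",
    "?city": "the city",
}

def verbalize_triple(subj: str, pred: str, obj: str) -> str:
    """Render one triple pattern as an English clause via a flat template."""
    def label(t: str) -> str:
        return LABELS.get(t, t.lstrip("?").replace("_", " "))
    return f"{label(subj)} {label(pred)} {label(obj)}"

query = "SELECT ?person WHERE { ?person dbo:birthPlace ?city . }"
m = re.search(r"\{\s*(\S+)\s+(\S+)\s+(\S+)\s*\.\s*\}", query)
if m:
    print("Find everything such that " + verbalize_triple(*m.groups()))
    # -> Find everything such that the person was born in the city
```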




Syntax-based Statistical Machine Translation


Book Description

This unique book provides a comprehensive introduction to the most popular syntax-based statistical machine translation models, filling a gap in the current literature for researchers and developers in human language technologies. While phrase-based models have previously dominated the field, syntax-based approaches have proved a popular alternative, as they elegantly solve many of the shortcomings of phrase-based models. The heart of this book is a detailed introduction to decoding for syntax-based models. The book begins with an overview of synchronous context-free grammar (SCFG) and synchronous tree-substitution grammar (STSG) along with their associated statistical models. It also describes how three popular instantiations (Hiero, SAMT, and GHKM) are learned from parallel corpora. It introduces and details hypergraphs and associated general algorithms, as well as algorithms for decoding with both tree and string input. Special attention is given to efficiency, including search approximations such as beam search and cube pruning, data structures, and parsing algorithms. The book consistently highlights the strengths (and limitations) of syntax-based approaches, including their ability to generalize phrase-based translation units, their modeling of specific linguistic phenomena, and their function of structuring the search space.
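To make the notion of a synchronous rule concrete, the following minimal sketch (the toy rule and word pairs are invented; this is not code from the book) shows how a single SCFG rule pairs a source and a target right-hand side and reorders co-indexed nonterminals, which is how such rules generalize flat phrase pairs.

```python
# A toy synchronous context-free grammar (SCFG) expansion. Integers in a
# right-hand side index co-linked nonterminal slots; strings are terminals.
from typing import List, Tuple, Union

Sym = Union[str, int]                  # terminal word or slot index
Rule = Tuple[List[Sym], List[Sym]]     # (source RHS, target RHS)

def derive(rule: Rule, children: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Expand one SCFG rule, substituting each child's (source, target)
    pair into its co-indexed slot on both sides simultaneously."""
    def expand(rhs: List[Sym], side: int) -> str:
        return " ".join(children[s][side] if isinstance(s, int) else s
                        for s in rhs)
    return expand(rule[0], 0), expand(rule[1], 1)

# X -> <"the" X1 "of" X2, X2 "no" X1>: one rule captures a reordering that
# a flat phrase pair could only memorize for one specific word sequence.
rule: Rule = (["the", 0, "of", 1], [1, "no", 0])
src, tgt = derive(rule, [("house", "ie"), ("my friend", "tomodachi")])
print(src)  # the house of my friend
print(tgt)  # tomodachi no ie
```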