Unicode Demystified


Book Description

Unicode is a critical enabling technology for developers who want to internationalize applications for global environments. But, until now, developers have had to turn to standards documents for crucial information on utilizing Unicode. In Unicode Demystified, one of IBM's leading software internationalization experts covers every key aspect of Unicode development, offering practical examples and detailed guidance for integrating Unicode 3.0 into virtually any application or environment. Writing from a developer's point of view, Rich Gillam presents a systematic introduction to Unicode's goals, evolution, and key elements. Gillam illuminates the Unicode standards documents with insightful discussions of character properties, the Unicode character database, storage formats, character sequences, Unicode normalization, character encoding conversion, and more. He presents practical techniques for text processing, locating text boundaries, searching, sorting, rendering text, accepting user input, and other key development tasks. Along the way, he offers specific guidance on integrating Unicode with other technologies, including Java, JavaScript, XML, and the Web. For every developer building internationalized applications, internationalizing existing applications, or interfacing with systems that already utilize Unicode.




Unicode Explained


Book Description

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. There are hundreds of different encoding systems for mapping characters to numbers, but Unicode promises a single mapping. Unicode enables a single software product or website to be targeted across multiple platforms, languages and countries without re-engineering. It's no wonder that industry giants like Apple, Hewlett-Packard, IBM and Microsoft have all adopted Unicode. Containing everything you need to understand Unicode, this comprehensive reference from O'Reilly takes you on a detailed tour of the complex world of characters. For starters, it explains how to identify and classify characters, whether they're common, uncommon, or exotic. It then shows you how to type them, utilize their properties, and process character data in a robust manner. The book is broken up into three distinct parts. The first few chapters provide a tutorial presentation of Unicode and character data, giving you a firm grasp of the terminology you need to reference various components, including character sets, fonts and encodings, glyphs and character repertoires. The middle section offers more detailed information about using Unicode and other character codes. It explains the principles and methods of defining character codes, describes some of the widely used codes, and presents code conversion techniques. It also discusses properties of characters, collation and sorting, line breaking rules and Unicode encodings. The final four chapters cover more advanced material, such as programming to support Unicode. You simply can't afford to be without the nuggets of valuable information detailed in Unicode Explained.




The Unicode cookbook for linguists


Book Description

This text is a practical guide for linguists and programmers who work with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together at the intersection between the Unicode Standard and the International Phonetic Alphabet. Although these standards are often met with frustration by users, they nevertheless provide language researchers and programmers with a consistent computational architecture needed to process, publish and analyze lexical data from the world's languages. Thus we bring to light common, but not always transparent, pitfalls which researchers face when working with Unicode and IPA. Having identified and overcome these pitfalls involved in making writing systems and character encodings syntactically and semantically interoperable (to the extent that they can be), we created a suite of open-source Python and R tools to work with languages using orthography profiles that describe author- or document-specific orthographic conventions. In this cookbook we describe a formal specification of orthography profiles and provide recipes using open-source tools to show how users can segment text, analyze it, identify errors, and transform it into different written forms for comparative linguistics research. This book is a prime example of open publishing as envisioned by Language Science Press. It is open access, has accompanying open-source software, has open peer review, versioning and so on.




Unicode Tutorials - Herong's Tutorial Examples


Book Description

This Unicode tutorial book is a collection of notes and sample code written by the author while he was learning Unicode himself. Topics include Character Sets and Encodings; GB2312/GB18030 Character Set and Encodings; JIS X0208 Character Set and Encodings; Unicode Character Set; Basic Multilingual Plane (BMP); Unicode Transformation Formats (UTF); Surrogates and Supplementary Characters; Unicode Character Blocks; Python Support of Unicode Characters; Java Character Set and Encoding; Java Encoding Maps, Counts and Conversion. Updated in 2024 (Version v5.32) with minor changes. For latest updates and free sample chapters, visit https://www.herongyang.com/Unicode.




Unicode Blocks - Herong's Notes


Book Description

This book is a collection of notes on Unicode code point blocks written by the author while he was learning Unicode himself. Topics include an introduction to Unicode character sets and code blocks, and a list of Unicode code blocks with character samples. Updated in 2024 (Version v5.32) with minor changes. For latest updates and free sample chapters, visit https://www.herongyang.com/Unicode-Blocks.







The Unicode Standard, Version 4.0


Book Description

• Most detailed, comprehensive guide to the Unicode programming standard.
• Created and authorized by the Unicode Consortium: the world's leading hardware and software vendors.
• Accompanying CD-ROM contains the entire Unicode Character Database, plus other materials.




The Unicode Standard


Book Description

The Unicode Standard is a new international standard used to encode written characters for storage in computer files or transmission over communication lines. This book is the authorized description and guide to this new standard. It is an essential reference for computer programmers and software developers who deal with multilingual text. Volume 1 covers alphabets in countries across Europe, Africa, and the Indian subcontinent.




Language Culture Type


Book Description

Language Culture Type grew out of the first international type-design competition, the 2001 bukva: raz!, whose goal was to promote global cultural pluralism, interaction, and diversity in typographic communications. The book lavishly presents the winning entries, along with information about each typeface, its language, and its designer. A series of essays gives context for the interplay of types and languages in the world today -- including the attempt to mesh all existing scripts into a single digital encoding system called Unicode. It also delves into the specific issues around developing typefaces for the many linguistic cultures in the world, from the various Cyrillic letterforms to Vietnam's ancient ideographic script.