Wastholm.com

spaCy is a modern Python library for industrial-strength Natural Language Processing. In this free and interactive online course, you'll learn how to use spaCy to build advanced natural language understanding systems, using both rule-based and machine learning approaches.

Recently, I became curious about the origin of the common Japanese word “ありがとう” (arigatou), which is used in modern Japanese to express gratitude or simply say “Thank you”. I had heard from several people that it originally meant something like “It is hard for me to exist”, and for some time I accepted this explanation. After all, one way to write this word is “有り難う”, which contains a form of “aru” (to exist) and a form of “gatai”, which can be attached to a verb to mean that the action is difficult to do.

The Swedish expression, like its German model “einem die Stange halten”, has two different origins according to the Svenska Akademiens ordbok, each with a slightly different original meaning. In the first case, the pole (“stången”) refers to a lance or similar weapon: the first rank in an older battle formation was armed with lances that they held against the enemy, and as long as they “held the pole” against the enemy, the enemy could not advance. In the second case, the referee in a medieval duel separated the combatants with a pole when one party gave up and declared himself defeated. A now-extinct variant of that expression was “hålla stången rätt emellan (två personer)” (“to hold the pole right between (two people)”), meaning ‘to mediate properly’.

A couple of weeks ago, my colleague Yuko Tamura wrote an insightful article about the many ways of abbreviating Japanese words and phrases. Today, I would like to follow up on this with a little overview of things abbreviated in the domain of grammar, where Japanese appears to be just as rigorous with cutting things out as it is with lexical expressions.

Being George Orwell's thoughts on how to write well, and his formulation of six rules:

i. Never use a metaphor, simile or other figure of speech which you are used to seeing in print.

ii. Never use a long word where a short one will do.

iii. If it is possible to cut a word out, always cut it out.

iv. Never use the passive where you can use the active.

v. Never use a foreign phrase, a scientific word or a jargon word if you can think of an everyday English equivalent.

vi. Break any of these rules sooner than say anything outright barbarous.

KanjiVG (Kanji Vector Graphics) provides vector graphics and other information about kanji used by the Japanese language. For each character, it provides an SVG file which gives the shape and direction of its strokes, as well as the stroke order. Each file is also enriched with information about the components of the character such as the radical, or the type of stroke employed.

It is very easy to create stroke order diagrams, animations, kanji dictionaries, and much more using KanjiVG. See Projects using KanjiVG for a growing list of applications of the KanjiVG data.
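As a rough illustration of how such a file might be read, here is a minimal sketch that lists the strokes of one character in stroke order. It assumes that strokes are `path` elements appearing in drawing order and that the stroke type sits in a `kvg:type` attribute; the inline sample is a simplified, hand-written stand-in with invented path data, not an actual KanjiVG file, and `list_strokes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
KVG_NS = "http://kanjivg.tagaini.net"

# Simplified stand-in for a KanjiVG-style file for 二 (two horizontal strokes).
# The path coordinates are invented for illustration only.
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg" xmlns:kvg="http://kanjivg.tagaini.net">
  <g id="kvg:04e8c" kvg:element="二">
    <path id="kvg:04e8c-s1" kvg:type="㇐" d="M30,35 C40,36 60,34 75,33"/>
    <path id="kvg:04e8c-s2" kvg:type="㇐" d="M12,80 C30,82 70,79 95,78"/>
  </g>
</svg>"""


def list_strokes(svg_text):
    """Return (stroke_number, stroke_type, path_data) tuples in document order."""
    root = ET.fromstring(svg_text)
    strokes = []
    # Assumption: document order of <path> elements equals stroke order.
    for number, path in enumerate(root.iter(f"{{{SVG_NS}}}path"), start=1):
        stroke_type = path.get(f"{{{KVG_NS}}}type")  # None if the attribute is absent
        strokes.append((number, stroke_type, path.get("d")))
    return strokes


for number, stroke_type, data in list_strokes(SAMPLE):
    print(number, stroke_type, data)
```

The path data in each tuple could then be fed to any SVG renderer to draw the strokes one at a time.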

This analogy to lossy compression is not just a way to understand ChatGPT’s facility at repackaging information found on the Web by using different words. It’s also a way to understand the “hallucinations,” or nonsensical answers to factual questions, to which large language models such as ChatGPT are all too prone. These hallucinations are compression artifacts, but—like the incorrect labels generated by the Xerox photocopier—they are plausible enough that identifying them requires comparing them against the originals, which in this case means either the Web or our own knowledge of the world. When we think about them this way, such hallucinations are anything but surprising; if a compression algorithm is designed to reconstruct text after ninety-nine per cent of the original has been discarded, we should expect that significant portions of what it generates will be entirely fabricated.

As mentioned earlier, this is because most of Hokkaido’s Japanese place names are derived from アイヌ語 (the Ainu language). There was a conscious effort by 和人 (Wajin, ethnic Japanese) in the 19th century to map out Hokkaido with Japanese place names under assimilation policies of the Edo Period (1603-1868).

Not all Ainu place names were rendered into Japanese in the same way, however. Some took the sound of the Ainu name, some took the meaning and some were shortened. For example, 札幌 (Sapporo) is shortened from “sat poro pet” (dry, big river). 旭川 (Asahikawa), on the other hand, comes from a name misheard by 和人 as “cup pet” (morning sun river), but is thought to have had a different original name like “cuk pet” (autumn river). The misheard meaning was translated into 旭 (asahi, morning sun) combined with 川 (kawa, river).

With several thousand characters to contend with, how were the Japanese able to use typewriters before the advent of digital technology? The answer is the kanji typewriter (和文タイプライター or 邦文タイプライター), which was invented by Kyota Sugimoto in 1915. This invention was deemed so important that it was selected as one of the ten greatest Japanese inventions by the Japanese Patent Office during their 100th anniversary celebrations in 1985. Here are some photos of that first model. (Photos courtesy Canon Semiconductor Equipment.)

Suppose we want to combine a BERT-based named entity recognition (NER) model with a rule-based NER model built on top of spaCy. Although BERT's NER exhibits extremely high performance, it is usually combined with rule-based approaches for practical purposes. In such cases, what often bothers us is that tokens of spaCy and BERT are different, even if the input sentences are the same. For example, let's say the input sentence is "John Johanson 's house"; BERT tokenizes this sentence like ["john", "johan", "##son", "'", "s", "house"] and spaCy tokenizes it like ["John", "Johanson", "'s", "house"]. To combine the outputs, we need to calculate the correspondence between the two different token sequences. This correspondence is the "alignment".
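To make the idea of an alignment concrete, here is a minimal sketch in plain Python. It is not the algorithm of any particular library: it assumes that WordPiece continuation pieces are marked with "##", that the BERT tokenizer is uncased, and that both token sequences spell out exactly the same characters once those differences are normalized away. The helper names `normalize`, `char_owners`, and `align` are made up for this illustration.

```python
def normalize(token):
    """Strip the WordPiece continuation marker and lowercase (assumed conventions)."""
    if token.startswith("##"):
        token = token[2:]
    return token.lower()


def char_owners(tokens):
    """For each character of the concatenated normalized text, give its token index."""
    owners = []
    for index, token in enumerate(tokens):
        owners.extend([index] * len(normalize(token)))
    return owners


def align(tokens_a, tokens_b):
    """Return (a2b, b2a): for each token, the indices of overlapping tokens on the other side."""
    text_a = "".join(normalize(t) for t in tokens_a)
    text_b = "".join(normalize(t) for t in tokens_b)
    if text_a != text_b:
        # This sketch only handles sequences that match exactly after normalization.
        raise ValueError("token sequences do not spell the same text")

    a2b = [[] for _ in tokens_a]
    b2a = [[] for _ in tokens_b]
    # Walk both sequences character by character; tokens sharing a character overlap.
    for ia, ib in zip(char_owners(tokens_a), char_owners(tokens_b)):
        if not a2b[ia] or a2b[ia][-1] != ib:
            a2b[ia].append(ib)
        if not b2a[ib] or b2a[ib][-1] != ia:
            b2a[ib].append(ia)
    return a2b, b2a


bert_tokens = ["john", "johan", "##son", "'", "s", "house"]
spacy_tokens = ["John", "Johanson", "'s", "house"]
bert2spacy, spacy2bert = align(bert_tokens, spacy_tokens)
print(bert2spacy)  # [[0], [1], [1], [2], [2], [3]]
print(spacy2bert)  # [[0], [1, 2], [3, 4], [5]]
```

With `spacy2bert`, an entity span predicted over BERT pieces can be mapped back onto spaCy tokens, for example pieces 1 and 2 both land on "Johanson". Real inputs often differ in more than case and "##" markers (accents, unknown tokens, dropped characters), which this exact-match sketch does not handle.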
