KanjiVG (Kanji Vector Graphics) provides vector graphics and other information about the kanji used in Japanese. For each character, there is an SVG file that gives the shape and direction of its strokes, as well as the stroke order. Each file is also annotated with information about the character's components, such as the radical, and with the type of each stroke.

It is very easy to create stroke order diagrams, animations, kanji dictionaries, and much more using KanjiVG. See Projects using KanjiVG for a growing list of applications of the KanjiVG data.

This analogy to lossy compression is not just a way to understand ChatGPT’s facility at repackaging information found on the Web by using different words. It’s also a way to understand the “hallucinations,” or nonsensical answers to factual questions, to which large language models such as ChatGPT are all too prone. These hallucinations are compression artifacts, but—like the incorrect labels generated by the Xerox photocopier—they are plausible enough that identifying them requires comparing them against the originals, which in this case means either the Web or our own knowledge of the world. When we think about them this way, such hallucinations are anything but surprising; if a compression algorithm is designed to reconstruct text after ninety-nine per cent of the original has been discarded, we should expect that significant portions of what it generates will be entirely fabricated.

As mentioned earlier, this is because most of Hokkaido’s Japanese place names are derived from アイヌ語 (the Ainu language). There was a conscious effort by 和人 (Wajin, ethnic Japanese) in the 19th century to map out Hokkaido with Japanese place names under assimilation policies of the Edo Period (1603-1868).

Not all Ainu place names were rendered into Japanese in the same way, however. Some took the sound of the Ainu name, some took the meaning, and some were shortened. For example, 札幌 (Sapporo) is shortened from “sat poro pet” (dry, big river). 旭川 (Asahikawa), on the other hand, comes from a name misheard by 和人 as “cup pet” (morning sun river), although the original name is thought to have been something different, like “cuk pet” (autumn river). The misheard meaning was translated into 旭 (asahi, morning sun) combined with 川 (kawa, river).

With several thousand characters to contend with, how were the Japanese able to use typewriters before the advent of digital technology? The answer is the kanji typewriter (和文タイプライター or 邦文タイプライター), which was invented by Kyota Sugimoto in 1915. This invention was deemed so important that the Japanese Patent Office selected it as one of the ten greatest Japanese inventions during its 100th anniversary celebrations in 1985. Here are some photos of that first model. (Photos courtesy Canon Semiconductor Equipment.)

Suppose we want to combine a BERT-based named entity recognition (NER) model with a rule-based NER model built on top of spaCy. Although BERT-based NER performs extremely well, in practice it is usually combined with rule-based approaches. The difficulty in such cases is that spaCy and BERT produce different tokens even when the input sentence is the same. For example, given the input sentence "John Johanson 's house", BERT tokenizes it as ["john", "johan", "##son", "'", "s", "house"], while spaCy tokenizes it as ["John", "Johanson", "'s", "house"]. To combine the two models' outputs, we need to compute the correspondence between the two token sequences. This correspondence is the "alignment".
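To make the idea of an alignment concrete, here is a minimal Python sketch. It is not the library's actual algorithm: it assumes the two token sequences spell exactly the same text once they are lowercased and the WordPiece "##" prefix is removed, and it simply matches tokens character by character. The function names (`normalize`, `char_owners`, `align`) are made up for this illustration.

```python
def normalize(tokens):
    """Lowercase tokens and drop the WordPiece '##' continuation prefix.

    Assumption: this is enough to make both tokenizations spell the same text.
    """
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok.startswith("##") and len(tok) > 2:
            tok = tok[2:]
        out.append(tok)
    return out


def char_owners(tokens):
    """For each character of the concatenated tokens, record its token index."""
    owners = []
    for i, tok in enumerate(tokens):
        owners.extend([i] * len(tok))
    return owners


def align(tokens_a, tokens_b):
    """Return (a2b, b2a): for each token, the indices of overlapping tokens."""
    norm_a, norm_b = normalize(tokens_a), normalize(tokens_b)
    if "".join(norm_a) != "".join(norm_b):
        raise ValueError("token sequences do not spell the same text")
    a2b = [[] for _ in tokens_a]
    b2a = [[] for _ in tokens_b]
    # Walk both sequences one character at a time; token indices only grow,
    # so comparing against the last stored index is enough to avoid duplicates.
    for i, j in zip(char_owners(norm_a), char_owners(norm_b)):
        if not a2b[i] or a2b[i][-1] != j:
            a2b[i].append(j)
        if not b2a[j] or b2a[j][-1] != i:
            b2a[j].append(i)
    return a2b, b2a


bert = ["john", "johan", "##son", "'", "s", "house"]
spacy = ["John", "Johanson", "'s", "house"]
a2b, b2a = align(bert, spacy)
print(a2b)  # [[0], [1], [1], [2], [2], [3]]
print(b2a)  # [[0], [1, 2], [3, 4], [5]]
```

Here `a2b[1] == [1]` and `a2b[2] == [1]` say that both "johan" and "##son" belong to spaCy's "Johanson". A real aligner also has to cope with text that does not match exactly (accents stripped, unknown tokens, and so on), which this sketch rejects with an error.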

Free and Open Source Machine Translation API, entirely self-hosted. Unlike other APIs, it doesn't rely on proprietary providers such as Google or Azure to perform translations.

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

Now I will concede that certain terms of venery have made the transition from factoid to actual phrase. Pod of whales. Troop of monkeys. Gaggle of geese. Pack of wolves. Those tend to be used for animals that naturally live in small groups, and those are fine. Keep ’em.

They’re not the ones that annoy me. But “murder of crows,” and the like—the ones that people giggle over despite no actual instance of anyone using the term to refer to a flock of crows maybe ever in history—those need to go.

Accuracy is part of the reason. Bandwidth is another. Why use our limited brain space on fake animal facts when there are so many interesting things that are actually true? Wombats don’t form wisdoms, but they poop cubes. Did you know that? Cubes! You’ll blow them away at bar trivia with that one.

Well, actually, there are a ton of different ways to say “father” in Japanese, and what better day to take a look at them than today?

"Today" being yesterday, the third Sunday in June, or Father's Day (父の日).

Japanese language exercises aimed at school children but also great for non-native learners like me. For me it didn't work in Firefox, which is my preferred browser, but this could possibly be because of my paranoid privacy-enhancing browser extensions.

