People who create web forms, databases, or ontologies are often unaware of how much people’s names can differ from country to country. They build their forms or databases in a way that assumes too much on the part of foreign users. This article first introduces some of the different styles used for personal names, and then looks at some of the possible implications for handling those names on the Web.

Click on the parts that are in the kanji you are looking for. You can click on them again to de-select them.

Amongst the thousands of languages spoken across the world, here are just eighty. How many can you distinguish between?

Nicholas Ostler, author of Ad Infinitum, a history of Latin, and Chairman of the Foundation for Endangered Languages, compares Latin's presence on the internet (interretialis) to that of a small European language: it is comparable to "Icelandic, Lithuanian or Slovenian". Ostler emails his brother in Latin for fun, and enthusiasts maintain websites such as Circulus Latinus Interretialis (Internet Latin Circle), Grex Latine Loquentium (Flock of those Speaking Latin) and the connected online paper Ephemeris. The Finnish radio station YLE even broadcasts news in Latin.

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, translation, and more.

Last week, while working on new features for our product, I had to find a quick and efficient way to extract the main topics/objects from a sentence. Since I’m using Python, I initially thought it would be a very easy task to achieve with NLTK. When I tried its default tools (POS tagger, parser…), I did get quite accurate results, but performance was pretty bad, so I had to find a better way. Like I did in my previous post, I’ll start with the bottom line: here you can find my code for extracting the main topics/noun phrases from a given sentence. It works fine with real sentences (from a blog/news article). It’s a bit less accurate than the default NLTK tools, but it is much faster!
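To make the idea of noun-phrase extraction concrete, here is a minimal sketch in plain Python. It is not the author's code and does not use NLTK: it assumes the sentence has already been tagged with Penn Treebank style part-of-speech tags (supplied by hand below), and the helper names `noun_phrases` and `_flush` are hypothetical. It simply groups runs of adjectives and nouns that end in a noun.

```python
def _flush(run, phrases):
    """Append the current run to phrases if it ends in a noun."""
    # Drop trailing adjectives so that every phrase ends in a noun.
    while run and not run[-1][1].startswith("NN"):
        run.pop()
    if run:
        phrases.append(" ".join(word for word, _ in run))


def noun_phrases(tagged):
    """Return simple noun phrases from a list of (word, tag) pairs.

    Assumes Penn Treebank style tags: JJ* for adjectives, NN* for nouns.
    """
    phrases, current = [], []
    for word, tag in tagged:
        if tag.startswith("JJ") or tag.startswith("NN"):
            current.append((word, tag))
        else:
            # Any other tag ends the current candidate phrase.
            _flush(current, phrases)
            current = []
    _flush(current, phrases)
    return phrases


# Hand-tagged example sentence; a real pipeline would get tags from a tagger.
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"), (".", "."),
]
print(noun_phrases(tagged))  # ['quick brown fox', 'lazy dog']
```

A single pass like this trades accuracy for speed, since it never builds a parse tree; the quality of the result depends entirely on the tagger that produces the input.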

translate.google.com/toolkit, posted May '13 by peter in conversion free language nlp online

Google Translator Toolkit is a powerful and easy-to-use editor that helps translators work faster and better.

So what characters can you count on nearly everyone being able to see? To answer this question, I looked at the characters in the intersection of several common fonts: Verdana, Georgia, Times New Roman, Arial, Courier New, and Droid Sans. My thought was that this would make a very conservative set of characters. There are 585 characters supported by all the fonts listed above. Most of the characters with code points up to U+01FF are included. This range includes the code blocks for Basic Latin, Latin-1 Supplement, Latin Extended-A, and some of Latin Extended-B. The rest of the characters in the intersection are Greek and Cyrillic letters and a few scattered symbols. The flat (♭), natural (♮), sharp (♯), and gradient (∇) symbols didn’t make the cut.
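The intersection step can be sketched in a few lines of Python. This is only an illustration, assuming each font's coverage is already available as a set of code points; the font names and the tiny coverage sets below are made up, and `common_characters` is a hypothetical helper, not the author's script.

```python
def common_characters(coverage):
    """Return the sorted code points that every font in coverage supports.

    coverage maps a font name to an iterable of integer code points.
    """
    sets = [set(code_points) for code_points in coverage.values()]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Hypothetical toy data: real coverage would be read from the font files.
coverage = {
    "FontA": {0x0041, 0x0042, 0x00E9, 0x266D},  # A, B, é, ♭
    "FontB": {0x0041, 0x0042, 0x00E9, 0x03B1},  # A, B, é, α
    "FontC": {0x0041, 0x00E9, 0x03B1, 0x266D},  # A, é, α, ♭
}

shared = common_characters(coverage)
print([f"U+{cp:04X}" for cp in shared])  # ['U+0041', 'U+00E9']
```

Adding more fonts can only shrink the result, which is why the intersection gives a conservative set.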

Swedish, adding to all the awesomeness, has proven especially adept at coining new words for the new circumstances occasioned by new technologies. Below, some of the best Swedologisms I could find, via the Swedish news site The Local. We should, obviously, incorporate them into English as soon as possible.

Down in the depths of your organisation, you have a treasure-trove of valuable data. But how hard is it for your users to retrieve it? Salvage your data with a natural language interface - ask your app English questions, get clear answers and reports back.
