Last week, while working on new features for our product, I had to find a quick and efficient way to extract the main topics/objects from a sentence. Since I’m using Python, I initially thought it would be a very easy task to achieve with NLTK. However, when I tried its default tools (POS tagger, parser…), the results were quite accurate but performance was pretty bad, so I had to find a better way. As in my previous post, I’ll start with the bottom line: here you can find my code for extracting the main topics/noun phrases from a given sentence. It works fine with real sentences (from a blog or news article). It’s a bit less accurate than the default NLTK tools, but it works much faster!
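To make the idea of pulling noun phrases out of a tagged sentence concrete, here is a minimal sketch. It is not the code from the post: it assumes the sentence has already been POS-tagged with Penn Treebank style tags (hand-written below), and the function name `extract_noun_phrases` is made up for illustration. It simply groups optional adjectives followed by one or more nouns.

```python
# Sketch only: groups runs of adjective/noun tags into candidate noun phrases.
# The tagged input is assumed to come from some POS tagger; here it is hand-written.

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
MODIFIER_TAGS = {"JJ", "JJR", "JJS"}


def extract_noun_phrases(tagged_tokens):
    """Return phrases made of optional adjectives followed by one or more nouns."""
    phrases = []
    current = []        # words of the phrase being built
    has_noun = False    # whether the current run already contains a noun

    for word, tag in tagged_tokens:
        if tag in NOUN_TAGS:
            current.append(word)
            has_noun = True
        elif tag in MODIFIER_TAGS and not has_noun:
            current.append(word)
        else:
            # Any other tag closes the current run; keep it only if it has a noun.
            if has_noun:
                phrases.append(" ".join(current))
            current = []
            has_noun = False
            # An adjective right after a noun run starts a new candidate phrase.
            if tag in MODIFIER_TAGS:
                current.append(word)

    if has_noun:
        phrases.append(" ".join(current))
    return phrases


# Hand-assigned tags (an assumption, not tagger output).
tagged = [("Pattern", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("web", "NN"), ("mining", "NN"), ("module", "NN")]
print(extract_noun_phrases(tagged))
```

A rule this simple will miss many constructions (prepositional attachments, conjunctions), which is the usual trade-off between speed and accuracy for this kind of shortcut.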

Google Translator Toolkit is a powerful and easy-to-use editor that helps translators work faster and better.

Down in the depths of your organisation, you have a treasure-trove of valuable data. But how hard is it for your users to retrieve it? Salvage your data with a natural language interface - ask your app English questions, get clear answers and reports back.

This is an interview with Gabriel Weinberg, founder of Duck Duck Go and all-around startup guru, on what DDG’s architecture looks like in 2012.

An app offering real-time translations is to allow people in Japan to speak to foreigners over the phone with both parties using their native tongue.

NTT Docomo - the country's biggest mobile network - will initially convert Japanese to English, Mandarin and Korean, with other languages to follow.

Even though the translations are bound to be hilariously bad sometimes, this may still be useful in some situations.

So what exactly did we achieve? Our research has dramatically increased the number of authors that can be distinguished using writing-style analysis: from about 300 to 100,000. More importantly, the accuracy of our algorithms drops off gently as the number of authors increases, so we can be confident that they will continue to perform well as we scale the problem even further. Our work is therefore the first time that stylometry has been shown to have serious implications for online anonymity.

Pattern is a web mining module for the Python programming language.

It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics) and data visualization (graph networks).
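For readers unfamiliar with the tf-idf and cosine similarity metrics mentioned above, the following is a small pure-Python sketch of the general technique. It does not use Pattern's API; the function names and the specific weighting (relative term frequency times the natural log of inverse document frequency) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Map each tokenized document to a {term: weight} dict."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    ["web", "mining", "python"],
    ["web", "mining", "module"],
    ["phonetic", "string", "matching"],
]
vectors = tf_idf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # shares "web" and "mining"
print(cosine_similarity(vectors[0], vectors[2]))  # no shared terms
```

Terms that occur in every document get a weight of zero under this weighting, so they do not contribute to the similarity.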

The module is bundled with 30+ example scripts.

This article shows you how to write a relatively simple script to extract text paragraphs from large chunks of HTML code, without knowing its structure or the tags used. It works on news articles and blog pages with worthwhile text content, among others…
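The blurb does not say how the script works, so the sketch below shows one common heuristic for this kind of task rather than the article's own method: keep the lines of an HTML page where most of the characters are text rather than markup. The function names, the regular expression, and the `min_density` and `min_length` thresholds are all assumptions for illustration, and stripping tags with a regular expression is a simplification that will not cope with scripts, comments, or markup spread over several lines.

```python
import re

# Crude tag matcher (assumption: tags do not span lines or contain ">").
TAG_RE = re.compile(r"<[^>]+>")


def text_density(line):
    """Fraction of a line's characters that remain after stripping tags."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text = TAG_RE.sub("", stripped).strip()
    return len(text) / len(stripped)


def extract_text_lines(html, min_density=0.5, min_length=40):
    """Return the text of lines that are long enough and mostly text."""
    kept = []
    for line in html.splitlines():
        text = TAG_RE.sub("", line).strip()
        if len(text) >= min_length and text_density(line) >= min_density:
            kept.append(text)
    return kept


sample = "\n".join([
    '<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>',
    "<p>This article shows you how to write a relatively simple script to extract text paragraphs.</p>",
    '<div class="footer"><span>Contact</span></div>',
])
print(extract_text_lines(sample))
```

Navigation bars and footers tend to be short and tag-heavy, so they fall below both thresholds, while article paragraphs pass.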

Jellyfish is a Python library for doing approximate and phonetic matching of strings.


String comparison:

* Levenshtein Distance
* Damerau-Levenshtein Distance
* Jaro Distance
* Jaro-Winkler Distance
* Match Rating Approach Comparison
* Hamming Distance

Phonetic encoding:

* American Soundex
* Metaphone
* NYSIIS (New York State Identification and Intelligence System)
* Match Rating Codex
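As an illustration of what two of the string comparison measures listed above compute, here is a pure-Python sketch of the textbook Levenshtein and Hamming distances. It deliberately does not call Jellyfish, and its behaviour (for example, raising an error for strings of unequal length in the Hamming case) follows the classic definitions and may differ from the library's.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("classic Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


print(levenshtein_distance("kitten", "sitting"))
print(hamming_distance("karolin", "kathrin"))
```

The Levenshtein function keeps only two rows of the dynamic-programming table, so memory use grows with the length of the shorter string rather than with the product of both lengths.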

Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux.
