The basic insight behind Levenshtein automata is that it's possible to construct a Finite state automaton that recognizes exactly the set of strings within a given Levenshtein distance of a target word. We can then feed in any word, and the automaton will accept or reject it based on whether the Levenshtein distance to the target word is at most the distance specified when we constructed the automaton. Further, due to the nature of FSAs, it will do so in O(n) time with the length of the string being tested. Compare this to the standard Dynamic Programming Levenshtein algorithm, which takes O(mn) time, where m and n are the lengths of the two input words! It's thus immediately apparrent that Levenshtein automaton provide, at a minimum, a faster way for us to check many words against a single target word and maximum distance - not a bad improvement to start with!

Of course, if that were the only benefit of Levenshtein automata, this would be a short article. There's much more to come, but first let's see what a Levenshtein automaton looks like, and how we can build one.

When you think about it, OpenSea would actually be much "better" in the immediate sense if all the web3 parts were gone. It would be faster, cheaper for everyone, and easier to use. For example, to accept a bid on my NFT, I would have had to pay over $80-$150+ just in ethereum transaction fees. That puts an artificial floor on all bids, since otherwise you'd lose money by accepting a bid for less than the gas fees. Payment fees by credit card, which typically feel extortionary, look cheap compared to that. OpenSea could even publish a simple transparency log if people wanted a public record of transactions, offers, bids, etc to verify their accounting.

However, if they had built a platform to buy and sell images that wasn't nominally based on crypto, I don't think it would have taken off. Not because it isn't distributed, because as we've seen so much of what's required to make it work is already not distributed. I don't think it would have taken off because this is a gold rush. People have made money through cryptocurrency speculation, those people are interested in spending that cryptocurrency in ways that support their investment while offering additional returns, and so that defines the setting for the market of transfer of wealth.

But while there are many tools to automatically renew certificates for publicly available webservers (certbot, simp_le, I wrote about how to do that 3 years back), it's hard to find any useful information about how to issue certificates for internal non Internet facing servers and/or devices with Let's Encrypt.

This article is not an attempt to explain containers in one go. Instead, it's a front-page for my multi-year study of the domain. It outlines the said learning path and then walks you through it, pointing to more in-depth write-ups on this same blog.

Mastering containers is no simple task, so take your time, and don't skip the hands-on parts!

The Conventional Commits specification is a lightweight convention on top of commit messages. It provides an easy set of rules for creating an explicit commit history; which makes it easier to write automated tools on top of. This convention dovetails with SemVer, by describing the features, fixes, and breaking changes made in commit messages.

The moral of this story? Don't get trapped by the sunk cost fallacy. If you find yourself passing parameters and adding conditional paths through shared code, the abstraction is incorrect. It may have been right to begin with, but that day has passed. Once an abstraction is proved wrong the best strategy is to re-introduce duplication and let it show you what's right. Although it occasionally makes sense to accumulate a few conditionals to gain insight into what's going on, you'll suffer less pain if you abandon the wrong abstraction sooner rather than later.

When the abstraction is wrong, the fastest way forward is back. This is not retreat, it's advance in a better direction. Do it. You'll improve your own life, and the lives of all who follow.

I want to detail some techniques you can leverage to make your Maven builds faster in this post. The following post will focus on how to do the same inside of Docker.

Typesense is an open source, typo tolerant search engine that is optimized for instant sub-50ms searches, while providing an intuitive developer experience.

This is a list of (Fuzzy) Data Matching software. The software in this list is open source and/or freely available.

The term data matching is used to indicate the procedure of bringing together information from two or more records that are believed to belong to the same entity. Data matching has two applications: (1) to match data across multiple datasets (linkage) and (2) to match data within a dataset (deduplication). See the Wikipedia page about data matching for more information.

Similar terms: record linkage, data matching, deduplication, fuzzy matching, entity resolution

Suppose we want to combine a BERT-based named entity recognition (NER) model with a rule-based NER model built on top of spaCy. Although BERT's NER exhibits extremely high performance, it is usually combined with rule-based approaches for practical purposes. In such cases, what often bothers us is that tokens of spaCy and BERT are different, even if the input sentences are the same. For example, let's say the input sentence is "John Johanson 's house"; BERT tokenizes this sentence like ["john", "johan", "##son", "'", "s", "house"] and spaCy tokenizes it like ["John", "Johanson", "'s", "house"]. To combine the outputs, we need to calculate the correspondence between the two different token sequences. This correspondence is the "alignment".

1–10 (502)   Next >   Last >|