A Simple Near Duplicate Detection Algorithm
Near-duplicate detection (NDD) is the operation whereby two documents are tested for partial similarity. A typical application is a search engine collapsing near-identical web pages in its result set. These sorts of partly similar pages are common nowadays due to syndication, re-posting, media content pools, etc.
The problem with NDD (vs. exact duplicate detection) is that you can’t just hash the document. You have to generate some sort of signature that, when compared with other such signatures, yields a numerical measure of similarity rather than a boolean answer. There are various types of algorithms out there to do this, for example cosine similarity, N-gram-based algorithms, and various algorithms that make assumptions about document content, such as this algorithm developed at Stanford a few years ago, or this one developed at Microsoft.
I implemented a variation of the N-gram approach that has some interesting properties. The signatures are of fixed size, regardless of document size. No assumptions are made about document content. And computing the similarity of two documents is an O(K) operation, where K is the signature length.
The rough outline of the algorithm for generating the signatures is as follows:
- Break the document into N-grams (N=4 or 5 tends to be a good value)
- Hash the string value of each N-Gram
- Toss out duplicate values in the list of hashes
- Sort the resulting set of hashes
- Take the top K values from the sorted set
At this point, similarity is measured by counting the values common to any two fixed-length sorted signatures; since both lists are sorted, this is a simple O(K) merge-style scan.
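The steps above, plus the O(K) comparison, can be sketched in a few lines of Python. To be clear, this is not the NDDSig implementation from the repo; the hash function (MD5 here), N=4, and K=8 are illustrative choices.

```python
import hashlib


def signature(text, n=4, k=8):
    """Fixed-size NDD signature: hash each character n-gram,
    drop duplicates, sort, and keep the first k values."""
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    hashes = sorted(int(hashlib.md5(g.encode("utf-8")).hexdigest(), 16)
                    for g in grams)
    return hashes[:k]


def similarity(sig_a, sig_b):
    """Count values common to two sorted signatures with a single
    merge-style pass, so the comparison is O(k)."""
    i = j = common = 0
    while i < len(sig_a) and j < len(sig_b):
        if sig_a[i] == sig_b[j]:
            common += 1
            i += 1
            j += 1
        elif sig_a[i] < sig_b[j]:
            i += 1
        else:
            j += 1
    return common / max(len(sig_a), len(sig_b))
```

A document compared with itself scores 1.0, and unrelated documents score near 0, with partial overlap falling in between.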
I assume this algorithm has already been published somewhere, as I’d find it hard to believe something so straightforward is actually original. However, I’m not too mathematically inclined, at least in a formal manner, so I haven’t really waded into the mess of Greek letters that is the centerpiece of many an academic paper. If you know where this algorithm was originally proposed, feel free to provide a link in the comments. Here’s a link to the code: https://firstname.lastname@example.org/santanubasu/NDD.git
The NDDSig class shows how to construct and use the signatures. As an interesting example, I used a tricky test case posed at the end of this post on the LingPipe blog (LingPipe is a product of Alias-I, a leading natural language processing software company). The problem described in the blog was that the two pairs of tweets were returning the same similarity measure (50%) even though one pair consisted of actual near duplicates and the other did not.
Here are the actual near duplicate tweets:
Vet, 77, Busted For Obama Death Threat | The Smoking Gun http://t.co/MrTUwxv via @
Vet, 77, Busted For Obama Death Threat http://tinyurl.com/25zyxgp #tcot #tlot #sgp
And here are the somewhat similar but not actual duplicate tweets.
Playing a show in Chicago, IL at 9:00 PM today at LE PASSAGE http://artistdata.com/a/32on
Playing a show in Cape Girardeau, MO at 9:00 PM today at The Venue http://artistdata.com/a/32ow
As described in their blog:
Despite sharing half the words they are clearly not retweets and contain different information. It might be quite difficult to automatically reject ‘obama’ near duplicates at 50% token overlap but retain the ‘today’ near duplicates with 50% overlap
The NDDSig class calculates the similarity of the first pair of tweets as 70%, and the second pair as 50%. Granted, this is not an exhaustive test case. Perhaps in a future post I’ll run some more in-depth tests to see where things stand.
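For readers who want to reproduce the experiment without the repo, here is a self-contained sketch that runs the same pipeline over the two tweet pairs. Again, MD5, N=4, and K=16 are illustrative choices, not NDDSig’s actual parameters, so the exact percentages it prints will differ from the 70%/50% figures above.

```python
import hashlib


def sig(text, n=4, k=16):
    # Hash character n-grams, dedupe, sort, keep the first k values.
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return sorted(int(hashlib.md5(g.encode("utf-8")).hexdigest(), 16)
                  for g in grams)[:k]


def sim(a, b):
    # Fraction of signature values shared between two documents.
    return len(set(a) & set(b)) / max(len(a), len(b))


obama_pair = [
    "Vet, 77, Busted For Obama Death Threat | The Smoking Gun http://t.co/MrTUwxv via @",
    "Vet, 77, Busted For Obama Death Threat http://tinyurl.com/25zyxgp #tcot #tlot #sgp",
]
today_pair = [
    "Playing a show in Chicago, IL at 9:00 PM today at LE PASSAGE http://artistdata.com/a/32on",
    "Playing a show in Cape Girardeau, MO at 9:00 PM today at The Venue http://artistdata.com/a/32ow",
]

print("obama pair:", sim(sig(obama_pair[0]), sig(obama_pair[1])))
print("today pair:", sim(sig(today_pair[0]), sig(today_pair[1])))
```

Note that results are sensitive to the choice of tokenization (character vs. word N-grams), N, and K, which is exactly why a single percentage threshold is hard to tune for cases like these.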