A Simple Near Duplicate Detection Algorithm

Near duplicate detection (NDD) is the operation whereby two documents are tested for partial similarity.  A typical application is filtering near-duplicate web pages out of a search engine result set.  Such partly similar pages are common nowadays due to syndication, re-posting, shared media content pools, and so on.

The problem with NDD (vs. exact duplicate detection) is that you can’t just hash the document.  You have to generate some sort of signature that, when compared with other such signatures, yields a numerical measure of similarity rather than a boolean answer.  There are various types of algorithms out there to do this, for example Cosine Similarity, N-Gram based algorithms, and various algorithms that make assumptions about document content, such as an algorithm developed at Stanford a few years ago, or one developed at Microsoft.

I implemented a variation of the N-Grams approach that has certain interesting properties.  The signatures are of fixed size, no matter what the size of the documents.  No assumptions are made about the document content.  Similarity computation for two documents is O(K) complexity.

The rough outline of the algorithm for generating the signatures is as follows (a code sketch follows the list):

  • Break document into N-Grams (N=4 or 5 tends to be a good value)
  • Hash the string value of each N-Gram
  • Toss out duplicate values in the list of hashes
  • Sort the resulting set of hashes
  • Take the top K values from the sorted set
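
Here is a minimal sketch of those steps in Java.  The class and method names are my own for illustration (they are not the NDDSig API from the linked repo), and character N-grams are assumed; token N-grams would work the same way.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;

    // Illustrative sketch of the signature generation outlined above; names are
    // hypothetical and not taken from the NDD repository.
    public class SignatureSketch {

        // Hash all N-grams, drop duplicates, sort, and keep the first K values.
        // The smallest K hashes are kept here; keeping the largest K works just
        // as well, as long as the choice is consistent across documents.
        public static List<Integer> signature(String document, int n, int k) {
            TreeSet<Integer> hashes = new TreeSet<>();  // deduplicates and sorts
            for (int i = 0; i + n <= document.length(); i++) {
                hashes.add(document.substring(i, i + n).hashCode());
            }
            List<Integer> signature = new ArrayList<>(k);
            for (Integer h : hashes) {
                if (signature.size() == k) {
                    break;
                }
                signature.add(h);
            }
            return signature;
        }
    }

Because at most K hash values are kept, the signature has a fixed upper bound on its size regardless of how long the document is.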

At this point, similarity is measured by counting the common values in any two fixed-length sorted signature sets, which is a simple O(K) operation.
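
A minimal version of that comparison could sit alongside the sketch above (again, a hypothetical method, not the NDDSig code): since both signatures are already sorted, a single merge-style pass counts the shared values.

    // Count hash values common to two sorted signatures in one merge-style pass,
    // then normalize by the larger signature so the result falls in [0, 1].
    public static double similarity(List<Integer> a, List<Integer> b) {
        int i = 0, j = 0, common = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) { common++; i++; j++; }
            else if (cmp < 0) { i++; }
            else { j++; }
        }
        return (double) common / Math.max(a.size(), b.size());
    }

Normalizing by the larger signature is one reasonable choice; the actual NDD code may normalize differently.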

I assume this algorithm has already been published somewhere, as I’d find it hard to believe something so straightforward is actually original.  However, I’m not too mathematically inclined, at least in a formal manner, so I haven’t really waded into the mess of Greek letters that is the centerpiece of many an academic paper.  If you know where this algorithm was originally proposed, feel free to provide a link in the comments.  Here’s a link to the code: https://santanubasu@github.com/santanubasu/NDD.git

The NDDSig class shows how to construct and use the signatures.  As an interesting example, I used a tricky test case posed at the end of a post on the LingPipe blog (LingPipe is a product of Alias-I, a leading natural language processing software company).  The problem described in the blog was that the two pairs of tweets were returning the same similarity measure (50%) even though one pair consists of actual near duplicates and the other does not.

Here are the actual near duplicate tweets:

Vet, 77, Busted For Obama Death Threat | The Smoking Gun http://t.co/MrTUwxv via @

Vet, 77, Busted For Obama Death Threat http://tinyurl.com/25zyxgp #tcot #tlot #sgp

And here are the somewhat similar, but not actually duplicate, tweets:

Playing a show in Chicago, IL at 9:00 PM today at LE PASSAGE http://artistdata.com/a/32on

Playing a show in Cape Girardeau, MO at 9:00 PM today at The Venue http://artistdata.com/a/32ow

As described in their blog:

Despite sharing half the words they are clearly not retweets and contain different information. It might be quite difficult to automatically reject ‘obama’ near duplicates at 50% token overlap but retain the ‘today’ near duplicates with 50% overlap

The NDDSig class calculates a similarity of 70% for the first pair of tweets and 50% for the second.  Granted, this is not an exhaustive test case.  Perhaps in a future post I’ll run some more in-depth tests to see where things stand.
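
For illustration, the sketch from earlier could be exercised on the two tweet pairs roughly as follows.  The exact scores depend on N, K, and the hash function, so this is not guaranteed to reproduce the 70% and 50% figures above; the parameter values are examples only.

    // Usage sketch, assuming this main method lives in the SignatureSketch class.
    public static void main(String[] args) {
        String t1 = "Vet, 77, Busted For Obama Death Threat | The Smoking Gun http://t.co/MrTUwxv via @";
        String t2 = "Vet, 77, Busted For Obama Death Threat http://tinyurl.com/25zyxgp #tcot #tlot #sgp";
        String t3 = "Playing a show in Chicago, IL at 9:00 PM today at LE PASSAGE http://artistdata.com/a/32on";
        String t4 = "Playing a show in Cape Girardeau, MO at 9:00 PM today at The Venue http://artistdata.com/a/32ow";

        int n = 4, k = 64;  // example parameters; the post suggests N = 4 or 5
        System.out.println(similarity(signature(t1, n, k), signature(t2, n, k)));  // near duplicate pair
        System.out.println(similarity(signature(t3, n, k), signature(t4, n, k)));  // similar but distinct pair
    }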

2 Comments

  1. I’m afraid your approach isn’t O(K). One simple way to see this is that at one step you sort a set larger than K and then take its top K elements. Generating all the doc signatures will take O(D * M * log M) where D is the number of docs and M > K is the maximum number of n-grams in any doc.

    Character and token N-grams are commonly applied to the general problem of string-matching-based deduplication (or record linkage).

    The trick, though, is doing all N*N comparisons efficiently when you have N documents. That’s way too expensive for a big doc collection, so you need some kind of quick-filter indexing on the signatures.

    For text, look up shingling and min-hash algorithms. The shingles are often tokens or sequences of tokens and they’re often sorted by frequency (or by inverse document frequency if you can afford global stats) and only the middle elements retained.

  2. Santanu

    Thanks for the pointers, Bob. What you’re saying is obviously correct, but the O(K) only refers to the similarity comparison of two documents (via their signatures), not to the generation of the signatures themselves. That’s a one-time operation, so the O(M * log M) complexity (per doc) isn’t too problematic.

    As you point out, doing N*N signature comparisons across an entire document collection clearly won’t be efficient. The use case I had in mind was accurate detection of near duplicates in a result set, which is itself obtained from a document collection that has been created using a quick filtering approach as you mention. Alternatively, result sets merged from multiple individually de-duplicated sub-indices might be good candidates for a secondary de-duplication pass prior to being served to users.
