On Tuesday Digg announced that they were finally rolling out some major updates to their duplicate detection technology, and let’s be honest – it’s about time! The technology they had in place before was hardly reliable.
The way that it used to work is that you’d go ahead and insert your URL and all the details for the story you were submitting, and then after you had done all that, Digg would ask you if you were sure it wasn’t a duplicate. Along with that it would show you a list of stories that it thought could be similar to yours. Many were not the slightest bit related, and some were submitted days, weeks, or even months ago. Nevertheless, if you assured Digg that your link was in fact NOT a duplicate (even if it was), you could proceed and submit your link.
According to Digg and Brent Csutoras, they’ve updated the way that their software looks for duplicates. They said that the most common duplicates being submitted were the same stories from the same site, but with different URLs. To solve this problem, they devised a solution that identifies these duplicates using a document similarity algorithm. In other words, it is now capable of identifying identical content from the same source.
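Digg hasn’t said which document similarity algorithm they’re using, so here is only a rough sketch of one common approach: break each page’s text into overlapping word groups ("shingles") and compare the two sets with Jaccard similarity. The function names, the three-word shingle size, and the 0.9 threshold are all assumptions made up for illustration, not anything Digg has published.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Too short to shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def same_content(doc_a, doc_b, threshold=0.9):
    """Guess whether two documents carry the same content (threshold is an assumption)."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

With something like this, the same article fetched from two different URLs on one site would score at or near 1.0, while unrelated articles would score near 0.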
Another issue is the same or similar story covered on different sites. Here’s where things get a little trickier. Digg claims that they’ve worked on doing a better job of detecting duplicates with similar descriptive information. Their software will now match stories with similar titles and descriptions with a higher level of accuracy. This doesn’t sound like it’s the perfect solution, but any improvement is better than what they had before.
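To give a feel for what matching on similar titles might look like, here is a minimal sketch using Python’s standard difflib module. This is not Digg’s method (they haven’t shared the details); the helper names and the 0.8 cutoff are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def title_similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_duplicates(new_title, existing_titles, threshold=0.8):
    """Return existing titles that score at or above the (assumed) threshold."""
    return [t for t in existing_titles
            if title_similarity(new_title, t) >= threshold]
```

A character-level comparison like this catches small rewordings, but it would miss two sites covering the same story with very different headlines, which is probably why this part of the problem is the trickier one.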
The order of submitting information has also been altered. Before, Digg would not check for duplicates until you had entered your URL and all the descriptive information, so if there were duplicates you wouldn’t find out until you’d wasted several minutes of your time. Now it will check for duplicates immediately after you enter your URL, before you enter any descriptive information.
These changes are still being perfected, so during the pilot period Digg will continue to block only submissions of the exact same URL within a 30-day period. They will also monitor when Digg users bypass high-confidence duplicates.
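The exact-URL rule is simple enough to sketch. The following assumes a hypothetical history lookup that maps each URL to the time it was last submitted; how Digg actually stores this, and whether day 30 itself counts as inside the window, are not stated, so both are assumptions here.

```python
from datetime import datetime, timedelta

# Assumed window, taken from the 30-day period mentioned above.
BLOCK_WINDOW = timedelta(days=30)


def is_blocked(url, submitted_at, history):
    """Return True if the exact same URL was submitted within the window.

    history is a hypothetical dict mapping URL -> datetime of last submission.
    """
    previous = history.get(url)
    if previous is None:
        return False
    # Treating the boundary as inclusive is an assumption.
    return submitted_at - previous <= BLOCK_WINDOW
```

Note that this only catches byte-for-byte identical URLs, so the same story under a slightly different URL would sail through, which is exactly the gap the similarity checks are meant to close.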
If it helps fight the never-ending barrage of spam, progress is a good thing.