Three different answers, for different levels of one question: how
similar are these documents?

If they have the same exact bytes, the Solr/Lucene deduplication
technique will work, and is very fast. (I don't remember if it is a
Lucene or Solr feature.)
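
A minimal sketch of the exact-bytes case, in Python rather than Java to
keep it short. This is not Solr's or Lucene's code, only the general
idea: hash each document's bytes and keep the first document seen for
each digest. The function name `dedupe_exact` and the sample documents
are made up for illustration.

```python
import hashlib


def dedupe_exact(docs):
    """Yield (doc_id, data) for the first occurrence of each distinct byte string.

    Works one document at a time; only the set of digests is kept in memory.
    """
    seen = set()
    for doc_id, data in docs:
        digest = hashlib.sha256(data).digest()
        if digest in seen:
            continue  # byte-for-byte duplicate of an earlier document
        seen.add(digest)
        yield doc_id, data


# Hypothetical sample input: "b" is an exact copy of "a", "c" differs by one byte.
docs = [("a", b"hello world"), ("b", b"hello world"), ("c", b"hello world!")]
kept_ids = [doc_id for doc_id, _ in dedupe_exact(docs)]
```

Here only "b" is dropped. A single changed byte (as in "c") is enough
to defeat this kind of check, which is why it only covers the first
level.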

If they have "minor text changes", different metadata etc., the
Nutch/Hadoop job may work.
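
To show what a hash that tolerates minor text changes might look like,
here is a rough sketch. It is loosely inspired by the idea of a text
profile (hash the frequent words, ignore the rare ones), but it is not
the Nutch TextProfileSignature algorithm; the function name
`fuzzy_signature` and the parameters `quant` and `min_len` are
assumptions for illustration only.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, quant=2, min_len=2):
    """Hash a coarse word-frequency profile instead of the raw bytes."""
    # Lowercase and keep alphanumeric runs, so case and punctuation do not matter.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) >= min_len]
    counts = Counter(tokens)
    # Round each count down to a multiple of `quant`; words seen fewer than
    # `quant` times are dropped, so small edits rarely change the profile.
    profile = sorted(
        (word, (n // quant) * quant) for word, n in counts.items() if n >= quant
    )
    encoded = " ".join(f"{word}:{n}" for word, n in profile)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


# Hypothetical sample texts: b is a lightly edited copy of a, c is unrelated.
a = "The cat sat on the mat. The cat slept."
b = "the cat sat on the mat!  The cat slept today."
c = "A dog ran in the park. A dog barked."

same_ab = fuzzy_signature(a) == fuzzy_signature(b)
same_ac = fuzzy_signature(a) == fuzzy_signature(c)
```

With these inputs `a` and `b` share a signature and `c` does not. The
weakness of this toy version is that very short documents with no
repeated words all collapse to the same empty profile, so a real job
would need a guard for that case.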

If they are rearranged, plagiarized, etc. the Mahout LSA/LSI tools
(can't find LSH as an acronym) are the most useful.

Order of execution: the Solr/Lucene deduplication feature can run
one document at a time, almost entirely in memory. I don't know about
the Nutch/Hadoop idea. The LSA/LSI tools very definitely need all (or
most) of the documents to build a model, and then test each document
against the model. Since this is a numerical comparison, there will be
a failure rate, both ways: false positives and false negatives. False
positives throw away valid documents.
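
A sketch of the two-phase shape described above. To stay within the
standard library this is not LSA/LSI and not Mahout: it swaps in plain
TF-IDF vectors with cosine similarity, which has the same structure
(one pass over the whole corpus to build a model, then a numerical
comparison against a cutoff). The names `build_idf`, `vectorize`,
`cosine` and the `THRESHOLD` value are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(corpus):
    """Pass 1: needs the whole corpus to count document frequencies."""
    n = len(corpus)
    df = Counter()
    for text in corpus:
        df.update(set(tokenize(text)))
    # Smoothed IDF, so a word present in every document keeps a weight above 0.
    return {word: math.log((1 + n) / (1 + count)) + 1.0 for word, count in df.items()}


def vectorize(text, idf):
    """Pass 2: turn one document into a sparse TF-IDF vector using the model."""
    counts = Counter(tokenize(text))
    return {w: c * idf[w] for w, c in counts.items() if w in idf}


def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical corpus: the second document is a rearranged copy of the first.
corpus = [
    "solr can remove duplicate documents at index time",
    "at index time solr can remove duplicate documents",
    "mahout builds a model from the whole corpus",
]
idf = build_idf(corpus)
vectors = [vectorize(text, idf) for text in corpus]

THRESHOLD = 0.9  # arbitrary cutoff; pairs scoring above it are treated as duplicates
sim_rearranged = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

The rearranged copy scores above the cutoff and the unrelated document
scores below it. Lowering `THRESHOLD` catches more rewritten copies but
starts discarding valid documents (false positives); raising it does
the opposite (false negatives).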



On 7/28/11, Ted Dunning <[email protected]> wrote:
> Mahout also has an LSH implementation that can help with this.
>
> On Thu, Jul 28, 2011 at 9:37 AM, Ken Krugler
> <[email protected]> wrote:
>
>>
>> On Jul 28, 2011, at 8:49am, Rich Heimann wrote:
>>
>> > All,
>> >
>> > I am curious if Lucene and/or Mahout can identify duplicate documents? I am
>> > having trouble with many redundant docs in my corpus, which is causing
>> > inflated values and an expense on users to process and reprocess much of the
>> > material. Can the redundancy be removed or managed in some sense my either
>> > Lucene at ingestion or Mahout at post-processing? The Vector Space Model
>> > seems to be notional similar to PCA or Factor Analysis, which both have
>> > similar ambitions. Thoughts???
>>
>> Nutch has a TextProfileSignature class that creates a hash which is
>> somewhat resilient to minor text changes between documents.
>>
>> Assuming you have such a hash, then it's trivial to use a Hadoop workflow
>> to remove duplicates.
>>
>> Or Solr supports removing duplicates as well - see
>> http://wiki.apache.org/solr/Deduplication
>>
>> -- Ken
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> custom data mining solutions
>>
>


-- 
Lance Norskog
[email protected]
