Hi Markus,

 

As you can see, I have used the "Digest" field as the source field for
the processor and then stored the digest in a newly added field in the
schema called "sig".  I am still testing my index to make sure that what
I did does not create any unintended results.  OTOH, is there a way I
can tell the processor to just *use* the "digest" field for the dedupe
process without me having to create a new "sig" field to store the
digest of the "Digest" field?

 

Thanks for your continued help

Raj

 

 

________________________________

From: Markus Jelsma [mailto:[email protected]] 
Sent: Sunday, September 26, 2010 1:31 PM
To: [email protected]; Nemani, Raj
Subject: RE: Duplicate URLs

 

Nutch has a fuzzy hashing algorithm for generating digests for a
document.  Solr incorporates the TextProfileSignature that comes from
Nutch.  I'm not sure if the digest field is generated by this algorithm;
if it is, it makes sense to use that for deduplication.  If the digest
field is generated by an exact hashing algorithm such as MD5, it won't
allow you to use the TextProfileSignature algorithm in Solr for fuzzy
matching.
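To illustrate the exact-versus-fuzzy distinction, here is a rough Python sketch.  This is not Nutch's actual TextProfileSignature implementation; the word-profile quantization below is a deliberately simplified stand-in.  An exact hash changes on any edit, while a profile of frequent words can survive small edits:

```python
import hashlib
import re

def exact_signature(text):
    # Exact hashing: any change to the text yields a different digest.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def fuzzy_signature(text, min_len=2):
    # Simplified word-profile signature (loosely in the spirit of
    # Nutch's TextProfileSignature, NOT its actual algorithm):
    # count word frequencies, quantize them, hash the sorted profile.
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if len(w) >= min_len]
    freqs = {}
    for w in words:
        freqs[w] = freqs.get(w, 0) + 1
    max_freq = max(freqs.values(), default=1)
    quant = max(1, max_freq // 2)  # crude quantization threshold
    profile = sorted((w, f // quant) for w, f in freqs.items()
                     if f // quant > 0)
    return hashlib.md5(repr(profile).encode("utf-8")).hexdigest()

a = "the flu briefing the flu briefing pages describe the flu outbreak"
b = "the flu briefing the flu briefing pages describe the flu outbreak!"

print(exact_signature(a) == exact_signature(b))  # exact digests differ
print(fuzzy_signature(a) == fuzzy_signature(b))  # fuzzy profiles still match
```

The point of the sketch: with an exact digest (MD5 and friends), only byte-identical pages dedupe; a profile-style signature can also catch near-duplicates.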
 

        -----Original message-----
        From: Nemani, Raj <[email protected]>
        Sent: Fri 24-09-2010 23:18
        To: [email protected]; Markus Jelsma
<[email protected]>; 
        Subject: RE: Duplicate URLs
        
        So in the end I used Solr deduping, by configuring Solr for deduping
        in SolrConfig.xml.  Here is what I ended up doing.  I noticed that
        the digest field generated by Nutch for the two URLs I mentioned is
        the same.  So I used that as the field and created a new signature
        field in the schema.xml.  Here are my config changes from
        SolrConfig.xml.  It does feel weird to use the digest field for this
        purpose.  Does this make sense?
        
        SolrConfig.xml
        ---------------------------
        
        
        
        <updateRequestProcessorChain name="dedupe">
          <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
            <bool name="enabled">true</bool>
            <str name="signatureField">sig</str>
            <bool name="overwriteDupes">true</bool>
            <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
            <str name="fields">digest</str>
          </processor>
          <processor class="solr.LogUpdateProcessorFactory" />
          <processor class="solr.RunUpdateProcessorFactory" />
        </updateRequestProcessorChain>
        
        
        <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
          <lst name="defaults">
            <str name="update.processor">dedupe</str>
          </lst>
        </requestHandler>
        
        Schema.xml
        --------------------
        
        <field name="sig" type="string" stored="true" indexed="true" multiValued="true" />
        
        -----Original Message-----
        From: Markus Jelsma [mailto:[email protected]] 
        Sent: Friday, September 24, 2010 7:29 AM
        To: [email protected]
        Subject: Re: Duplicate URLs
        
        
        On Friday 24 September 2010 00:33:54 Nemani, Raj wrote:
        > My Solr index has sources other than the data generated from Nutch
        > crawls.  What this means is that when I do solrDedup from Nutch,
        > the dedup process will happen across the entire Solr index, not
        > just on the documents generated and submitted by Nutch.  Am I
        > correct?
        
        Correct.
        
        > 
        > Is there a way I can have the deduping done on the Nutch side
        > before sending the data set to Solr, even if it means I need to
        > generate the Nutch index?  Just to reiterate, my dupes are based
        > on the content, not on the URL.
        
        I'm not sure.  You'll need a Nutch index to deduplicate first.  But
        it's the index that will be deduplicated, not the parsed segments.
        Sending stuff to Solr then would not be very helpful.
        
        > 
        > On the other hand, it looks like you have to supply the Nutch
        > index directory to the Nutch dedup command, not the segments
        > directory.  Here are the Hadoop log entries.  Could the
        > documentation be wrong?  Note that I have not generated the Nutch
        > index.  After merging the segments and inverting the links, I just
        > called dedup on my segments directory.  It did not seem to do
        > anything.  Do I have to build the Nutch index and then call dedup
        > on the segments directory?
        
        The Nutch dedup command requires a parameter pointing to an index;
        you'll need an index in Nutch to dedup.
        
        > 
        > 2010-09-23 17:42:39,673 INFO  indexer.DeleteDuplicates - Dedup: starting at 2010-09-23 17:42:39
        > 2010-09-23 17:42:39,698 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/segments
        > 2010-09-23 17:42:40,792 WARN  mapred.FileInputFormat - Can't open index at file:/C:/projects/OpenSource/branch-1.2/crawl/segments/20100923174134:0+2147483647, skipping. (no segments* file found in org.apache.nutch.indexer.fsdirect...@file:/C:/projects/OpenSource/branch-1.2/crawl/segments/20100923174134: files: [content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text])
        > 2010-09-23 17:42:45,200 INFO  indexer.DeleteDuplicates - Dedup: finished at 2010-09-23 17:42:45, elapsed: 00:00:05
        
        What is the segments* file doing there?  It shouldn't be.
        
        > 
        > Thanks for all your help
        > Raj
        > 
        > 
        > 
        > -----Original Message-----
        > From: Markus Jelsma [mailto:[email protected]]
        > Sent: Thursday, September 23, 2010 4:52 PM
        > To: [email protected]
        > Subject: RE: Duplicate URLs
        > 
        > bin/nutch solrdedup
        > Usage: SolrDeleteDuplicates <solr url>
        > 
        >  
        > 
        > You could also handle deduplication in your Solr configuration.
        > It exposes more options and lets you mark duplicates (documents
        > with identical signatures) or overwrite them (deduplicate).
        > 
        >  
        > 
        > http://wiki.apache.org/solr/Deduplication
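[For reference, the mark-versus-overwrite choice described above comes down to the overwriteDupes flag on the signature update processor.  A sketch, with illustrative field names, loosely following the wiki page:]

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <!-- false: duplicates are only marked (they share a signature value,
         so you can filter or collapse on it at query time);
         true: a later document with the same signature overwrites the
         earlier one (deduplication proper) -->
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```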
        >  
        > -----Original message-----
        > From: Nemani, Raj <[email protected]>
        > Sent: Thu 23-09-2010 22:48
        > To: [email protected];
        > Subject: RE: Duplicate URLs
        > 
        > Thanks again.  One final question.  I do not create a Nutch index.
        > I just push the crawl segments to Solr using the following command
        > line:
        > 
        > bin/nutch solrindex $solr_endpoint crawl/crawldb crawl/linkdb
        > crawl/segments/*
        > 
        > Do I need to create a Nutch index to get the dedup going?  I saw
        > an online script that submits the Nutch index directory to the
        > dedup command.  Can I just pass in the segments directory (as
        > shown in the document from the link you sent) without having to
        > build the Nutch index?
        > 
        > I am going to try both ways in the mean time.
        > 
        > Thanks so much again
        > Raj
        > 
        > 
        > -----Original Message-----
        > From: Markus Jelsma [mailto:[email protected]]
        > Sent: Thursday, September 23, 2010 4:33 PM
        > To: [email protected]
        > Subject: RE: Duplicate URLs
        > 
        > Deduplication is a mechanism where a hash is generated based on
        > the contents of some field (title and/or content, as usual).  It
        > can be as simple as an MD5 hash or a more fuzzy match.  Nutch can
        > deduplicate itself by using that command line option.  You can
        > also use Nutch to deduplicate whatever you pushed to a Solr index,
        > and you can configure Solr to deduplicate as well.
        > 
        >  
        > 
        > http://wiki.apache.org/nutch/CommandLineOptions
        > 
        >  
        > 
        > 
        >  
        > -----Original message-----
        > From: Nemani, Raj <[email protected]>
        > Sent: Thu 23-09-2010 22:26
        > To: [email protected];
        > Subject: RE: Duplicate URLs
        > 
        > Markus,
        > 
        > Thanks so much.
        > Is there any link that outlines the steps to take that you can
        > forward, or could you just explain if you can?  I appreciate your
        > help.  I will keep looking online in the meantime.
        > 
        > Thanks
        > Raj
        > 
        > 
        > -----Original Message-----
        > From: Markus Jelsma [mailto:[email protected]]
        > Sent: Thursday, September 23, 2010 4:20 PM
        > To: [email protected]
        > Subject: RE: Duplicate URLs
        > 
        > Use deduplication.
        >  
        > -----Original message-----
        > From: Nemani, Raj <[email protected]>
        > Sent: Thu 23-09-2010 22:12
        > To: [email protected];
        > Subject: Duplicate URLs
        > 
        > All,
        > 
        > 
        > 
        > I just wanted to see if there is a way we can tell Nutch to treat
        > the following URLs as the same.
        > 
        > 
        > 
        > 
        > 
        > http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm
        > 
        > http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm
        > 
        > 
        > 
        > 
        > 
        > As you know, you can set up web servers such that both of the URLs
        > above resolve to the same endpoint.  In other words, the two URLs
        > are actually the *same* even though they are physically different.
        > Is there any way I can tell Nutch to treat these URLs as the same?
        > 
        > I cannot use filtering to ignore one or the other (either with
        > DOMAINNAME or without) because I need to allow both patterns for
        > genuine URLs.
        > 
        > 
        > 
        > Thanks
        > 
        > Raj
        > 
        
        Markus Jelsma - Technisch Architect - Buyways BV
        http://www.linkedin.com/in/markus17
        050-8536620 / 06-50258350
