I have just committed this patch to trunk, so you can use trunk directly if you don't want to apply the patch yourself. Just run

  svn co https://svn.apache.org/repos/asf/nutch/trunk nutch-trunk

to pull the contents of trunk.
Julien

On 11 November 2013 13:50, <[email protected]> wrote:
> Thank you so much, Talat. I will try this out. Hopefully this will fix my
> problem :)
>
> Madhvi
>
> On 11/11/13 8:45 AM, "Talat UYARER" <[email protected]> wrote:
>
>> Hi Madhvi,
>>
>> If you have data from outside Nutch in Solr, you are right: you need
>> this patch :)
>>
>> Here is how to apply it on Linux:
>> - Download the Nutch source code from
>>   http://nutch.apache.org/downloads.html
>> - Extract the source zip file.
>> - Download the patch file next to (not inside) the source code
>>   directory.
>> - Go to the source directory in a terminal.
>> - Run this command:
>>   patch -p0 < ../NUTCH-1100-1.6-1.patch
>> - If everything is OK, you can rebuild Nutch with ant. Detailed build
>>   instructions are at https://wiki.apache.org/nutch/HowToContribute
>>
>> That's all,
>> Talat
>>
>> On 11-11-2013 15:02, [email protected] wrote:
>>> No, I am not truncating the page. The url-validator checks that the
>>> URLs crawled by Nutch are valid, so this plugin will probably not fix
>>> my problem, though I have added it to my list of plugins because you
>>> said it's good to have.
>>>
>>> I think the problem here is that my Solr index contains data from both
>>> Nutch and outside Nutch. The data indexed outside Nutch does not have
>>> a digest field, so while executing SolrDeleteDuplicates, Nutch throws
>>> an exception on not finding the digest field. Nutch needs to skip a
>>> document if it has no digest field.
>>>
>>> I think this patch will probably fix my problem. Am I correct? If yes,
>>> how do I apply the patch, or is there a jar with the fix available
>>> that I can download?
>>> I could also add the digest field to the outside-Nutch data. What
>>> should be the value of that field?
>>>
>>> Thanks,
>>> Madhvi
>>>
>>> On 11/9/13 6:30 AM, "Talat UYARER" <[email protected]> wrote:
>>>
>>>> Hi Madhvi,
>>>>
>>>> After adding urlfilter-validator in nutch-site.xml, do you truncate
>>>> your webpage?
>>>>
>>>> Talat
>>>>
>>>> On 09-11-2013 00:35, [email protected] wrote:
>>>>> Hi Talat,
>>>>>
>>>>> I can re-create this exception. It starts happening as soon as I
>>>>> index from outside Nutch. SolrDeleteDuplicates works fine as long as
>>>>> the whole Solr index came from Nutch.
>>>>> I haven't found out yet which field might be causing it, but looking
>>>>> at the issue below, it might be because the digest field is not
>>>>> there:
>>>>> https://issues.apache.org/jira/browse/NUTCH-1100
>>>>>
>>>>> Can it be some other field?
>>>>>
>>>>> Also, there is a patch for the digest field. How should I apply it?
>>>>> Any help will be great!
>>>>>
>>>>> Madhvi
>>>>>
>>>>> On 11/6/13 2:19 PM, "Talat UYARER" <[email protected]> wrote:
>>>>>
>>>>>> You wrote it wrong. You should write it like this:
>>>>>>
>>>>>> <property>
>>>>>>   <name>plugin.includes</name>
>>>>>>   <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>> </property>
>>>>>>
>>>>>> Put this in nutch-site.xml, and after that rebuild with
>>>>>> ant clean runtime
>>>>>>
>>>>>> Talat
>>>>>>
>>>>>> [email protected] wrote:
>>>>>>
>>>>>>> Hi Talat,
>>>>>>> No, I am not using the urlfilter-validator plugin.
>>>>>>> Here is my list of plugins:
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>plugin.includes</name>
>>>>>>>   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> Do I just need to change this to:
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>plugin.includes</name>
>>>>>>>   <value>protocol-http|urlfilter-regex|parse|validator-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> Thank you so much,
>>>>>>>
>>>>>>> Madhvi
>>>>>>>
>>>>>>> On 11/6/13 1:08 PM, "Talat UYARER" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Madhvi,
>>>>>>>>
>>>>>>>> Can you tell me what the active plugins in your nutch-site.xml
>>>>>>>> are? I am not sure, but we have an issue similar to this: if your
>>>>>>>> Solr returns null, it will be because of this issue. Please check
>>>>>>>> the data your Solr returns.
>>>>>>>>
>>>>>>>> You can look at https://issues.apache.org/jira/browse/NUTCH-1100
>>>>>>>>
>>>>>>>> If yours is the same, you should use the urlfilter-validator
>>>>>>>> plugin.
>>>>>>>>
>>>>>>>> urlfilter-validator has lots of benefits; I described them in
>>>>>>>> http://mail-archives.apache.org/mod_mbox/nutch-user/201310.mbox/%[email protected]%3e
>>>>>>>>
>>>>>>>> Talat
>>>>>>>>
>>>>>>>> [email protected] wrote:
>>>>>>>>
>>>>>>>>> I am going to start my own thread rather than posting under
>>>>>>>>> javozzo's thread :)!
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am using Nutch 1.5.1 and Solr 3.6 and having a problem with
>>>>>>>>> the SolrDeleteDuplicates command. Looking at the Hadoop logs, I
>>>>>>>>> am getting this error:
>>>>>>>>>
>>>>>>>>> java.lang.NullPointerException
>>>>>>>>>     at org.apache.hadoop.io.Text.encode(Text.java:388)
>>>>>>>>>     at org.apache.hadoop.io.Text.set(Text.java:178)
>>>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>>>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>>>>>>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>>>>>>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
>>>>>>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>>>>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>>>>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>>>>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>>>>>>
>>>>>>>>> I also had another question, about updating Nutch to 1.6 or 1.7.
>>>>>>>>> I had tried updating to a newer version of Nutch but got an
>>>>>>>>> exception while deleting duplicates in Solr. After a lot of
>>>>>>>>> research online I found that a field had changed. Some said the
>>>>>>>>> digest field, and others said that the url field is no longer
>>>>>>>>> there.
>>>>>>>>> So here are my questions:
>>>>>>>>> 1: Is there a newer Solr mapping file that needs to be used?
>>>>>>>>> 2: Can the Solr index from 1.5.1 and the index from a newer
>>>>>>>>>    version co-exist, or do we need to re-index from one version
>>>>>>>>>    of Nutch?
>>>>>>>>>
>>>>>>>>> I will really appreciate any help with this.
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>> Madhvi
>>>>>>>>>
>>>>>>>>> Madhvi Arora
>>>>>>>>> AutomationDirect
>>>>>>>>> The #1 Best Mid-Sized Company to work for in Atlanta
>>>>>>>>> <http://www.ajc.com/business/topworkplaces/automationdirect-com-top-midsize-1421260.html>
>>>>>>>>> 2012

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
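For readers following along: the `patch -p0` step Talat describes earlier in the thread works the same way on any file tree. Here is a self-contained sketch of the mechanics, using a throwaway file and a tiny unified diff (the file name and diff are illustrative stand-ins, not the actual NUTCH-1100 patch):

```shell
# Demonstrate how `patch -p0` applies a unified diff.
# Everything here is a toy stand-in for the real Nutch source tree.
workdir=$(mktemp -d)
cd "$workdir"

printf 'hello\n' > greeting.txt

# A minimal unified diff, similar in shape to NUTCH-1100-1.6-1.patch.
cat > fix.patch <<'EOF'
--- greeting.txt
+++ greeting.txt
@@ -1 +1 @@
-hello
+hello, nutch
EOF

# -p0 keeps the full path from the diff header, which is why the thread
# says to run patch from the directory the diff's paths are relative to.
patch -p0 < fix.patch

cat greeting.txt   # now prints: hello, nutch
```

For the real fix, the same invocation from next to the extracted source tree (patch -p0 < ../NUTCH-1100-1.6-1.patch, run inside the source directory) applies the patch, after which `ant clean runtime` rebuilds Nutch.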
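On Madhvi's question earlier in the thread about what value the digest field should hold: in Nutch 1.x the default signature implementation is, as far as I know, MD5-based (check `db.signature.class` in your configuration), storing an MD5 hex digest of the raw page content. Under that assumption, a digest for externally indexed documents could be produced like this (the content string is just an example):

```shell
# Sketch: MD5 hex digest over raw content bytes, matching what an
# MD5-based signature would store (an assumption -- verify against
# your Nutch configuration before relying on it).
content='abc'   # stand-in for the raw page content
printf '%s' "$content" | md5sum | cut -d ' ' -f 1
# prints: 900150983cd24fb0d6963f7d28e17f72
```

Alternatively, as noted at the top of the thread, the fix committed to trunk makes SolrDeleteDuplicates tolerate documents without a digest field, so adding the field by hand may be unnecessary.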

