Thanks Julien. I will try out one of these options later today.

Madhvi






On 11/11/13 11:03 AM, "Julien Nioche" <[email protected]>
wrote:

>I have just committed this patch to trunk so you can use the latter if you
>don't want to have to apply the patch.
>Just use
>svn co https://svn.apache.org/repos/asf/nutch/trunk nutch-trunk
>to pull the content of the trunk
>
>Julien
>
>
>
>On 11 November 2013 13:50, <[email protected]> wrote:
>
>> Thank you so much Talat. I will try this out. Hopefully this will fix my
>> problem :)
>>
>> Madhvi
>>
>>
>>
>>
>>
>>
>> On 11/11/13 8:45 AM, "Talat UYARER" <[email protected]> wrote:
>>
>> >Hi Madhvi,
>> >
>> >If your Solr index contains data from outside Nutch, you are right: you
>> >need this patch :)
>> >
>> >I can explain how to do this on Linux.
>> >- Download the Nutch source code from
>> >http://nutch.apache.org/downloads.html
>> >- Extract the source zip file.
>> >- Download the patch file next to (not inside) the source code directory.
>> >- Go to the source directory in a terminal.
>> >- Run this command:
>> >patch -p0 < ../NUTCH-1100-1.6-1.patch
>> >- If everything is OK, you can rebuild Nutch with ant. Detailed build
>> >instructions are at
>> >https://wiki.apache.org/nutch/HowToContribute
>> >
>> >That's all
>> >Talat
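A minimal sketch of how the patch step above behaves, assuming GNU patch is installed. The directory, file, and patch names below are made up for illustration and are not part of Nutch; the point is only that a -p0 patch uses paths relative to the source root, which is why the patch file sits next to the source directory and is referenced as ../file.patch.

```shell
#!/bin/sh
# Sketch only: all paths and file names here are hypothetical stand-ins.
set -e

work=$(mktemp -d)
mkdir -p "$work/src-tree/conf"
printf 'digest=missing\n' > "$work/src-tree/conf/example.txt"

# A tiny unified diff whose paths are relative to the source root,
# which is what -p0 (strip zero path components) expects.
cat > "$work/example.patch" <<'EOF'
--- conf/example.txt
+++ conf/example.txt
@@ -1 +1 @@
-digest=missing
+digest=handled
EOF

# Run patch from inside the source directory; the patch file is one level up.
cd "$work/src-tree"
# --dry-run reports whether the patch would apply, without changing files.
patch -p0 --dry-run < ../example.patch
patch -p0 < ../example.patch

patched=$(cat conf/example.txt)
echo "$patched"
```

Running the dry run first is optional, but it is a cheap way to find out that the patch does not match the source version before any file is modified.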
>> >
>> >On 11-11-2013 15:02, [email protected] wrote:
>> >> No, I am not truncating the page. The url-validator checks that the urls
>> >> crawled by Nutch are valid. So this plugin will probably not fix my
>> >> problem, though I have added it to my list of plugins because you said
>> >> that it's good to have.
>> >>
>> >> I think that here the problem is that my SOLR index contains data from
>> >> Nutch and from outside Nutch. The data indexed outside Nutch does not
>> >> have a digest field. So while executing SolrDeleteDuplicates, Nutch
>> >> throws an exception on not finding the digest field. Nutch needs to skip
>> >> the document if there is no digest field.
>> >>
>> >> I think that this patch will probably fix my problem. Am I correct? If
>> >> yes, then how do I apply the patch, or is there a jar with the fix
>> >> available that I can download?
>> >> I can also add the digest field to the outside Nutch data. What should
>> >> the value in that field be?
>> >>
>> >> Thanks,
>> >> Madhvi
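The thread does not say what value the digest field should hold, so the following is only a sketch under an assumption: that the digest is a hex-encoded MD5 hash of the document content (worth checking against the db.signature.class setting of your Nutch install). The helper name digest_of is hypothetical, and md5sum is assumed to be available (GNU coreutils).

```shell
# Hypothetical helper: print the hex MD5 of a string as a stand-in digest.
# Assumption: the index expects an MD5-style hex digest; verify this against
# your own Nutch signature configuration before relying on it.
digest_of() {
  # printf '%s' avoids adding a trailing newline to the hashed content.
  printf '%s' "$1" | md5sum | cut -d' ' -f1
}

digest_of 'hello'
```

Any stable per-document value would let SolrDeleteDuplicates read the field without a null, but only a content hash would make the duplicate detection meaningful for those documents.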
>> >>
>> >>
>> >>
>> >>
>> >> On 11/9/13 6:30 AM, "Talat UYARER" <[email protected]> wrote:
>> >>
>> >>> Hi Madhvi,
>> >>>
>> >>> After adding urlfilter-validator in nutch-site.xml, do you truncate
>> >>> your webpage?
>> >>>
>> >>> Talat
>> >>>
>> >>>
>> >>> On 09-11-2013 00:35, [email protected] wrote:
>> >>>> Hi Talat,
>> >>>>
>> >>>> I can re-create this exception. This exception starts happening as soon
>> >>>> as I index from outside Nutch. SolrDeleteDuplicates works fine as long
>> >>>> as the whole Solr index came from Nutch.
>> >>>> I haven't found out yet specifically which field might be causing it.
>> >>>> But looking at the issue below, it might be because of the digest field
>> >>>> not being there.
>> >>>>    https://issues.apache.org/jira/browse/NUTCH-1100
>> >>>>
>> >>>> Can it be some other field?
>> >>>>
>> >>>> Also, there is a patch for the digest field. How should I apply it? Any
>> >>>> help will be great!
>> >>>>
>> >>>>
>> >>>> Madhvi
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On 11/6/13 2:19 PM, "Talat UYARER" <[email protected]> wrote:
>> >>>>
>> >>>>> What you wrote is wrong. You should write it like this:
>> >>>>>
>> >>>>> <property>
>> >>>>> <name>plugin.includes</name>
>> >>>>> <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> >>>>> </property>
>> >>>>>
>> >>>>> Put this in nutch-site.xml, and after that rebuild with
>> >>>>> ant clean runtime
>> >>>>>
>> >>>>> Talat
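To see why the placement of "validator" matters, here is a rough check. It assumes that the plugin.includes value is a regular expression that must match a whole plugin id, and it uses grep -Ex as a stand-in for the Java regex engine. The function names is_included and report are hypothetical helpers, not part of Nutch.

```shell
# Hypothetical check: does a plugin id match a plugin.includes value in full?
# Assumption: the value is a regex matched against the entire plugin id.
is_included() {
  # $1 = plugin.includes value, $2 = plugin id
  # -E: extended regex, -x: the whole line must match, -q: no output
  printf '%s\n' "$2" | grep -Eqx -- "$1"
}

report() {
  # $1 = label, $2 = plugin.includes value, $3 = plugin id
  if is_included "$2" "$3"; then
    echo "$1: $3 included"
  else
    echo "$1: $3 NOT included"
  fi
}

# The value suggested in the reply above.
suggested='protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)'
# The value from the question, with "validator" placed after "parse|".
attempted='protocol-http|urlfilter-regex|parse|validator-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)'

report suggested "$suggested" urlfilter-validator
report attempted "$attempted" urlfilter-validator
report attempted "$attempted" parse-html
```

Under that assumption, the attempted value not only fails to enable urlfilter-validator, it also stops matching parse-html, because "parse" and "validator-(html|...)" become two separate alternatives.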
>> >>>>>
>> >>>>> [email protected] wrote:
>> >>>>>
>> >>>>>> Hi Talat,
>> >>>>>> No, I am not using the urlfilter-validator plugin. Here is my list of
>> >>>>>> plugins:
>> >>>>>>
>> >>>>>> <property>
>> >>>>>>    <name>plugin.includes</name>
>> >>>>>>    <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> >>>>>> </property>
>> >>>>>>
>> >>>>>>
>> >>>>>> Do I just need to change this to:
>> >>>>>>
>> >>>>>> <property>
>> >>>>>> <name>plugin.includes</name>
>> >>>>>> <value>protocol-http|urlfilter-regex|parse|validator-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> >>>>>> </property>
>> >>>>>>
>> >>>>>> Thank you so much,
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Madhvi
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On 11/6/13 1:08 PM, "Talat UYARER" <[email protected]> wrote:
>> >>>>>>
>> >>>>>>> Hi Madhvi,
>> >>>>>>>
>> >>>>>>> Can you tell me which plugins are active in your nutch-site.xml? I am
>> >>>>>>> not sure, but we have an issue similar to this. If your Solr returns
>> >>>>>>> null, it will be because of this issue. Please check the data your
>> >>>>>>> Solr returns.
>> >>>>>>>
>> >>>>>>> You can look at https://issues.apache.org/jira/browse/NUTCH-1100
>> >>>>>>>
>> >>>>>>> If yours is the same, you should use the urlfilter-validator plugin.
>> >>>>>>>
>> >>>>>>> Urlfilter-validator has lots of benefits. I described them in
>> >>>>>>>
>> >>>>>>> http://mail-archives.apache.org/mod_mbox/nutch-user/201310.mbox/%3c5265BC2[email protected]%3e
>> >>>>>>>
>> >>>>>>> Talat
>> >>>>>>>
>> >>>>>>> [email protected] wrote:
>> >>>>>>>
>> >>>>>>>> I am going to start my own thread rather than being under javozzo's
>> >>>>>>>> thread :)!
>> >>>>>>>>
>> >>>>>>>> Hi,
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> I am using Nutch 1.5.1 and Solr 3.6 and having a problem with the
>> >>>>>>>> SolrDeleteDuplicates command. Looking at the Hadoop logs, I am
>> >>>>>>>> getting this error:
>> >>>>>>>>
>> >>>>>>>> java.lang.NullPointerException
>> >>>>>>>> at org.apache.hadoop.io.Text.encode(Text.java:388)
>> >>>>>>>> at org.apache.hadoop.io.Text.set(Text.java:178)
>> >>>>>>>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>> >>>>>>>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>> >>>>>>>> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>> >>>>>>>> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
>> >>>>>>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>> >>>>>>>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>> >>>>>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>> >>>>>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> I also had another question about updating Nutch to 1.6 and 1.7. I
>> >>>>>>>> had tried updating to a newer version of Nutch but got an exception
>> >>>>>>>> while deleting duplicates in SOLR. After a lot of research online, I
>> >>>>>>>> found that a field had changed. A few said the digest field and
>> >>>>>>>> others said that the url field is no longer there. So here are my
>> >>>>>>>> questions:
>> >>>>>>>> 1: Is there a newer solr mapping file that needs to be used?
>> >>>>>>>> 2: Can the SOLR index from 1.5.1 and the index from a newer version
>> >>>>>>>> co-exist, or do we need to re-index from one version of Nutch?
>> >>>>>>>>
>> >>>>>>>> I will really appreciate any help with this.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Thanks in advance,
>> >>>>>>>> Madhvi
>> >>>>>>>>
>> >>>>>>>> Madhvi Arora
>> >>>>>>>> AutomationDirect
>> >>>>>>>> The #1 Best Mid-Sized Company to work for in
>> >>>>>>>> Atlanta<http://www.ajc.com/business/topworkplaces/automationdirect-com-top-midsize-1421260.html> 2012
>> >>>>>>>>
>> >>>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>>
>
>
>-- 
>
>Open Source Solutions for Text Engineering
>
>http://digitalpebble.blogspot.com/
>http://www.digitalpebble.com
>http://twitter.com/digitalpebble
