No, I am not truncating the page. The urlfilter-validator plugin checks
that the URLs crawled by Nutch are valid, so it will probably not fix my
problem, though I have added it to my list of plugins because you said
it's good to have.

I think the problem here is that my Solr index contains data from both
Nutch and outside Nutch. The data indexed outside Nutch does not have a
digest field, so while executing SolrDeleteDuplicates, Nutch throws an
exception when it cannot find the digest field. Nutch needs to skip a
document if it has no digest field; see the sketch below.
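
To illustrate what I mean, here is roughly the kind of guard I would expect
inside the record reader in SolrDeleteDuplicates. This is only my sketch
based on the stack trace, not the actual NUTCH-1100 patch, and the variable
names (solrDocs, currentDoc, value) are made up for illustration; "digest"
is the field name from the stock Nutch schema:

  // Hypothetical guard, not the real patch: skip documents whose
  // "digest" field is absent instead of handing null to Text.set(),
  // which is what raises the NullPointerException.
  while (currentDoc < solrDocs.size()) {
    SolrDocument doc = solrDocs.get(currentDoc++);  // org.apache.solr.common.SolrDocument
    Object digest = doc.getFieldValue("digest");
    if (digest == null) {
      continue;  // indexed outside Nutch: no digest, so leave it alone
    }
    value.getDigest().set(digest.toString());  // safe now
    return true;
  }
  return false;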

I think that this patch will probably fix my problem. Am I correct? If so,
how do I apply the patch, or is there a jar with the fix available that I
can download?
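
From what I have read so far, patches like this are applied to a Nutch
source checkout and the runtime is then rebuilt. Assuming the file from the
JIRA issue is saved as NUTCH-1100.patch in the source root, I would expect
the steps to be something like:

  cd apache-nutch-1.5.1
  patch -p0 < NUTCH-1100.patch
  ant clean runtime

Please correct me if that is wrong, or if -p1 is needed because of how the
paths inside the patch are rooted.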
Alternatively, I could add the digest field to the outside-Nutch data
myself. What should the value of that field be?
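
My understanding is that the digest is the page signature computed by
Nutch's default MD5Signature, i.e. an MD5 hash of the raw content bytes
stored as a hex string. Under that assumption, I could fill the field for
the outside-Nutch documents with something like this sketch:

  import java.security.MessageDigest;

  // Hex-encoded MD5 of the raw document bytes, mirroring what Nutch's
  // default MD5Signature stores in the digest field. Any scheme works
  // for deduplication as long as identical content hashes identically.
  static String digestFor(byte[] content) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    StringBuilder hex = new StringBuilder();
    for (byte b : md5.digest(content)) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

Does that sound right, or is there more to the value than a content hash?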

Thanks,
Madhvi




On 11/9/13 6:30 AM, "Talat UYARER" <[email protected]> wrote:

>Hi Madhvi,
>
>After adding urlfilter-validator to nutch-site.xml, is your webpage being
>truncated?
>
>Talat
>
>
>On 09-11-2013 00:35, [email protected] wrote:
>> Hi Talat,
>>
>> I can re-create this exception. It starts happening as soon as I index
>> from outside Nutch. SolrDeleteDuplicates works fine as long as the whole
>> Solr index came from Nutch.
>> I haven't found out yet which specific field might be causing it. But
>> looking at the issue below, it might be because the digest field is not
>> there.
>>   https://issues.apache.org/jira/browse/NUTCH-1100
>>
>> Can it be some other field?
>>
>> Also, there is a patch for the digest field. How should I apply it? Any
>> help would be great!
>>
>>
>> Madhvi
>>
>>
>>
>>
>>
>>
>> On 11/6/13 2:19 PM, "Talat UYARER" <[email protected]> wrote:
>>
>>> You wrote it wrong. It should be like this:
>>>
>>> <property>
>>> <name>plugin.includes</name>
>>> <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>> </property>
>>>
>>> Put this in nutch-site.xml, and after that rebuild with ant clean
>>> runtime.
>>>
>>> Talat
>>>
>>> [email protected] şunu yazdı:
>>>
>>>> Hi Talat,
>>>> No, I am not using the urlfilter-validator plugin. Here is my list of
>>>> plugins:
>>>>
>>>> <property>
>>>>   <name>plugin.includes</name>
>>>>   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>> </property>
>>>>
>>>>
>>>> Do I just need to change this to:
>>>>
>>>> <property>
>>>> <name>plugin.includes</name>
>>>> <value>protocol-http|urlfilter-regex|parse|validator-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>> </property>
>>>>
>>>> Thank you so much,
>>>>
>>>>
>>>>
>>>> Madhvi
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 11/6/13 1:08 PM, "Talat UYARER" <[email protected]> wrote:
>>>>
>>>>> Hi Madhvi,
>>>>>
>>>>> Can you tell me which plugins are active in your nutch-site.xml? I am
>>>>> not sure, but we have an issue similar to this. If your Solr returns
>>>>> null, it will be because of this issue. Please check the data your
>>>>> Solr returns.
>>>>>
>>>>> You can look at https://issues.apache.org/jira/browse/NUTCH-1100
>>>>>
>>>>> If yours is the same, you should use the urlfilter-validator plugin.
>>>>>
>>>>> urlfilter-validator has lots of benefits. I described them in
>>>>> http://mail-archives.apache.org/mod_mbox/nutch-user/201310.mbox/%[email protected]%3e
>>>>>
>>>>> Talat
>>>>>
>>>>> [email protected] şunu yazdı:
>>>>>
>>>>>> I am going to start my own thread rather than being under javozzo's
>>>>>> thread :)!
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I am using Nutch 1.5.1 and Solr 3.6, and I am having a problem with
>>>>>> the SolrDeleteDuplicates command. Looking at the Hadoop logs, I am
>>>>>> getting this error:
>>>>>>
>>>>>> java.lang.NullPointerException
>>>>>> at org.apache.hadoop.io.Text.encode(Text.java:388)
>>>>>> at org.apache.hadoop.io.Text.set(Text.java:178)
>>>>>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>>>>>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>>>>>> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>>>>>> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
>>>>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>>>
>>>>>>
>>>>>> Also, I had another question, about updating Nutch to 1.6 or 1.7. I
>>>>>> had tried updating to a newer version of Nutch but got an exception
>>>>>> while deleting duplicates in Solr. After a lot of research online, I
>>>>>> found that a field had changed: a few said the digest field, and
>>>>>> others said that the url field is no longer there. So here are my
>>>>>> questions:
>>>>>> 1: Is there a newer Solr mapping file that needs to be used?
>>>>>> 2: Can the Solr index from 1.5.1 and the index from a newer version
>>>>>> co-exist, or do we need to re-index from one version of Nutch?
>>>>>>
>>>>>> I would really appreciate any help with this.
>>>>>>
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Madhvi
>>>>>>
>>>>>> Madhvi Arora
>>>>>> AutomationDirect
>>>>>> The #1 Best Mid-Sized Company to work for in
>>>>>> Atlanta <http://www.ajc.com/business/topworkplaces/automationdirect-com-top-midsize-1421260.html> 2012
>>>>>>
>>>>
>>
>
