Thanks Julien. I will try out one of these options later today.

Madhvi

On 11/11/13 11:03 AM, "Julien Nioche" <[email protected]> wrote:

> I have just committed this patch to trunk, so you can use trunk if you
> don't want to apply the patch yourself.
> Just use
> svn co https://svn.apache.org/repos/asf/nutch/trunk nutch-trunk
> to pull the content of the trunk.
>
> Julien
>
> On 11 November 2013 13:50, <[email protected]> wrote:
>
>> Thank you so much Talat. I will try this out. Hopefully this will fix my
>> problem :)
>>
>> Madhvi
>>
>> On 11/11/13 8:45 AM, "Talat UYARER" <[email protected]> wrote:
>>
>>> Hi Madhvi,
>>>
>>> If you have non-Nutch data in Solr, you are right: you need this
>>> patch :)
>>>
>>> I can explain how to apply it on Linux.
>>> - Download the Nutch source code from
>>>   http://nutch.apache.org/downloads.html
>>> - Extract the source zip file.
>>> - Download the patch file next to (not inside) the source code directory.
>>> - Go to the source directory in a terminal.
>>> - Run this command:
>>>   patch -p0 < ../NUTCH-1100-1.6-1.patch
>>> - If everything is OK, you can rebuild Nutch with ant. Detailed build
>>>   instructions are at
>>>   https://wiki.apache.org/nutch/HowToContribute
>>>
>>> That's all.
>>> Talat
>>>
>>> On 11-11-2013 at 15:02, [email protected] wrote:
>>>> No, I am not truncating the page. The url-validator checks that the urls
>>>> crawled by Nutch are valid. So this plugin will probably not fix my
>>>> problem, though I have added it to my list of plugins because you said
>>>> that it's good to have.
>>>>
>>>> I think that here the problem is that my SOLR index contains data from
>>>> Nutch and from outside Nutch. The data indexed from outside Nutch does
>>>> not have a digest field. So while executing SolrDeleteDuplicates, Nutch
>>>> throws an exception on not finding the digest field. Nutch needs to skip
>>>> the document if there is no digest field.
>>>>
>>>> I think that this patch will probably fix my problem. Am I correct? If
>>>> yes, then how do I apply the patch, or is there a jar with the fix
>>>> available that I can download?
>>>> I can also add the digest field to the data from outside Nutch. What
>>>> should be the value in that field?
>>>>
>>>> Thanks,
>>>> Madhvi
>>>>
>>>> On 11/9/13 6:30 AM, "Talat UYARER" <[email protected]> wrote:
>>>>
>>>>> Hi Madhvi,
>>>>>
>>>>> After adding urlfilter-validator in nutch-site.xml, do you truncate
>>>>> your webpage?
>>>>>
>>>>> Talat
>>>>>
>>>>> On 09-11-2013 at 00:35, [email protected] wrote:
>>>>>> Hi Talat,
>>>>>>
>>>>>> I can re-create this exception. It starts happening as soon as I index
>>>>>> from outside Nutch. SolrDeleteDuplicates works fine as long as the
>>>>>> whole Solr index came from Nutch.
>>>>>> I haven't found out yet specifically which field might be causing it.
>>>>>> But looking at the issue below, it might be because the digest field
>>>>>> is not there.
>>>>>> https://issues.apache.org/jira/browse/NUTCH-1100
>>>>>>
>>>>>> Can it be some other field?
>>>>>>
>>>>>> Also, there is a patch for the digest field. How should I apply it?
>>>>>> Any help will be great!
>>>>>>
>>>>>> Madhvi
>>>>>>
>>>>>> On 11/6/13 2:19 PM, "Talat UYARER" <[email protected]> wrote:
>>>>>>
>>>>>>> You wrote it wrong. It should look like this:
>>>>>>>
>>>>>>> <property>
>>>>>>> <name>plugin.includes</name>
>>>>>>> <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> Put this in nutch-site.xml, and after that rebuild with
>>>>>>> ant clean runtime
>>>>>>>
>>>>>>> Talat
>>>>>>>
>>>>>>> [email protected] wrote:
>>>>>>>
>>>>>>>> Hi Talat,
>>>>>>>> No, I am not using the urlfilter-validator plugin. Here is my list
>>>>>>>> of plugins:
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>plugin.includes</name>
>>>>>>>> <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> Do I just need to change this to:
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>plugin.includes</name>
>>>>>>>> <value>protocol-http|urlfilter-regex|parse|validator-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> Thank you so much,
>>>>>>>>
>>>>>>>> Madhvi
>>>>>>>>
>>>>>>>> On 11/6/13 1:08 PM, "Talat UYARER" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Madhvi,
>>>>>>>>>
>>>>>>>>> Can you tell me which plugins are active in your nutch-site.xml?
>>>>>>>>> I am not sure, but we have an issue similar to this. If your Solr
>>>>>>>>> returns null, it will be because of this issue. Please check the
>>>>>>>>> data your Solr returns.
>>>>>>>>>
>>>>>>>>> You can look at https://issues.apache.org/jira/browse/NUTCH-1100
>>>>>>>>>
>>>>>>>>> If yours is the same, you should use the urlfilter-validator plugin.
>>>>>>>>>
>>>>>>>>> Urlfilter-validator has lots of benefits. I described them in
>>>>>>>>> http://mail-archives.apache.org/mod_mbox/nutch-user/201310.mbox/%3c5265BC2[email protected]%3e
>>>>>>>>>
>>>>>>>>> Talat
>>>>>>>>>
>>>>>>>>> [email protected] wrote:
>>>>>>>>>
>>>>>>>>>> I am going to start my own thread rather than being under
>>>>>>>>>> javozzo's thread :)!
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am using Nutch 1.5.1 and Solr 3.6 and am having a problem with
>>>>>>>>>> the SolrDeleteDuplicates command. Looking at the Hadoop logs, I am
>>>>>>>>>> getting this error:
>>>>>>>>>>
>>>>>>>>>> java.lang.NullPointerException
>>>>>>>>>> at org.apache.hadoop.io.Text.encode(Text.java:388)
>>>>>>>>>> at org.apache.hadoop.io.Text.set(Text.java:178)
>>>>>>>>>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>>>>>>>>>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>>>>>>>>>> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>>>>>>>>>> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
>>>>>>>>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>>>>>>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>>>>>>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>>>>>>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>>>>>>>
>>>>>>>>>> I also had another question about updating Nutch to 1.6 and 1.7.
>>>>>>>>>> I had tried updating to a newer version of Nutch but got an
>>>>>>>>>> exception while deleting duplicates in SOLR. After a lot of
>>>>>>>>>> research online I found that a field had changed. A few said the
>>>>>>>>>> digest field, and others said that the url field is no longer
>>>>>>>>>> there. So here are my questions:
>>>>>>>>>> 1: Is there a newer Solr mapping file that needs to be used?
>>>>>>>>>> 2: Can the SOLR index from 1.5.1 and an index from a newer version
>>>>>>>>>> co-exist, or do we need to re-index from one version of Nutch?
>>>>>>>>>>
>>>>>>>>>> I will really appreciate any help with this.
>>>>>>>>>>
>>>>>>>>>> Thanks in advance,
>>>>>>>>>> Madhvi
>>>>>>>>>>
>>>>>>>>>> Madhvi Arora
>>>>>>>>>> AutomationDirect
>>>>>>>>>> The #1 Best Mid-Sized Company to work for in
>>>>>>>>>> Atlanta<http://www.ajc.com/business/topworkplaces/automationdirect-com-top-midsize-1421260.html> 2012
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
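
For anyone following the thread, here is a minimal sketch of the guard described above: skip any document that has no digest value instead of handing a null to the deduplication step. This is not the NUTCH-1100 patch and not Nutch code. The class name DigestGuardSketch, the method withDigest, and the use of plain maps to stand in for Solr documents are all assumptions made for illustration, so that the example runs without SolrJ or Hadoop.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not code from Nutch or from the NUTCH-1100 patch.
class DigestGuardSketch {

    // Returns only the documents that carry a non-null "digest" value.
    // Documents without one (for example, indexed from outside Nutch) are skipped.
    static List<Map<String, Object>> withDigest(List<Map<String, Object>> docs) {
        List<Map<String, Object>> kept = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            if (doc.get("digest") == null) {
                continue; // no digest: skip rather than fail on a null value
            }
            kept.add(doc);
        }
        return kept;
    }

    public static void main(String[] args) {
        // A document as Nutch might index it (field values are made up).
        Map<String, Object> fromNutch = new HashMap<>();
        fromNutch.put("id", "http://example.com/");
        fromNutch.put("digest", "abc123");

        // A document indexed from outside Nutch: no digest field at all.
        Map<String, Object> fromElsewhere = new HashMap<>();
        fromElsewhere.put("id", "external-1");

        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(fromNutch);
        docs.add(fromElsewhere);

        System.out.println(withDigest(docs).size() + " of " + docs.size() + " documents kept");
    }
}
```

The point of the sketch is only the null check before the value is used. Whether the real fix skips such documents or handles them some other way is for the patch on the JIRA issue to say.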

