I have just committed this patch to trunk, so you can use trunk directly if you don't want to apply the patch yourself. Just run

  svn co https://svn.apache.org/repos/asf/nutch/trunk nutch-trunk

to pull the contents of trunk.
Julien

On 11 November 2013 13:50, <[email protected]> wrote:
> Thank you so much, Talat. I will try this out. Hopefully this will fix my
> problem :)
>
> Madhvi
>
> On 11/11/13 8:45 AM, "Talat UYARER" <[email protected]> wrote:
>
>> Hi Madhvi,
>>
>> If you have data from outside Nutch in Solr, you are right: you need
>> this patch :)
>>
>> Here is how to apply it on Linux:
>> - Download the Nutch source code from
>>   http://nutch.apache.org/downloads.html
>> - Extract the source zip file.
>> - Download the patch file next to (not inside) the source code
>>   directory.
>> - Go to the source directory in a terminal.
>> - Run this command:
>>   patch -p0 < ../NUTCH-1100-1.6-1.patch
>> - If everything is OK, you can rebuild Nutch with ant. Detailed build
>>   instructions are at https://wiki.apache.org/nutch/HowToContribute
>>
>> That's all,
>> Talat
>>
>> On 11-11-2013 15:02, [email protected] wrote:
>>> No, I am not truncating the page. The url-validator checks that the
>>> URLs crawled by Nutch are valid, so this plugin will probably not fix
>>> my problem, though I have added it to my list of plugins because you
>>> said it's good to have.
>>>
>>> I think the problem here is that my Solr index contains data from both
>>> Nutch and outside Nutch. The data indexed outside Nutch does not have
>>> a digest field, so while executing SolrDeleteDuplicates, Nutch throws
>>> an exception on not finding the digest field. Nutch needs to skip a
>>> document if it has no digest field.
>>>
>>> I think this patch will probably fix my problem. Am I correct? If yes,
>>> how do I apply the patch, or is there a jar with the fix available
>>> that I can download?
>>> I could also add the digest field to the outside-Nutch data. What
>>> should be the value of that field?
>>>
>>> Thanks,
>>> Madhvi
>>>
>>> On 11/9/13 6:30 AM, "Talat UYARER" <[email protected]> wrote:
>>>
>>>> Hi Madhvi,
>>>>
>>>> After adding urlfilter-validator in nutch-site.xml, do you truncate
>>>> your webpage?
>>>>
>>>> Talat
>>>>
>>>> On 09-11-2013 00:35, [email protected] wrote:
>>>>> Hi Talat,
>>>>>
>>>>> I can re-create this exception. It starts happening as soon as I
>>>>> index from outside Nutch. SolrDeleteDuplicates works fine as long as
>>>>> the whole Solr index came from Nutch.
>>>>> I haven't found out yet which field might be causing it, but looking
>>>>> at the issue below, it might be because the digest field is not
>>>>> there:
>>>>> https://issues.apache.org/jira/browse/NUTCH-1100
>>>>>
>>>>> Can it be some other field?
>>>>>
>>>>> Also, there is a patch for the digest field. How should I apply it?
>>>>> Any help will be great!
>>>>>
>>>>> Madhvi
>>>>>
>>>>> On 11/6/13 2:19 PM, "Talat UYARER" <[email protected]> wrote:
>>>>>
>>>>>> You wrote it wrong. You should write it like this:
>>>>>>
>>>>>> <property>
>>>>>>   <name>plugin.includes</name>
>>>>>>   <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>> </property>
>>>>>>
>>>>>> Put this in nutch-site.xml, and after that rebuild with
>>>>>> ant clean runtime
>>>>>>
>>>>>> Talat
>>>>>>
>>>>>> [email protected] wrote:
>>>>>>
>>>>>>> Hi Talat,
>>>>>>> No, I am not using the urlfilter-validator plugin.
>>>>>>> Here is my list of plugins:
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>plugin.includes</name>
>>>>>>>   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> Do I just need to change this to:
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>plugin.includes</name>
>>>>>>>   <value>protocol-http|urlfilter-regex|parse|validator-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> Thank you so much,
>>>>>>>
>>>>>>> Madhvi
>>>>>>>
>>>>>>> On 11/6/13 1:08 PM, "Talat UYARER" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Madhvi,
>>>>>>>>
>>>>>>>> Can you tell me what the active plugins in your nutch-site.xml
>>>>>>>> are? I am not sure, but we have an issue similar to this: if your
>>>>>>>> Solr returns null, it will be because of this issue. Please check
>>>>>>>> the data your Solr returns.
>>>>>>>>
>>>>>>>> You can look at https://issues.apache.org/jira/browse/NUTCH-1100
>>>>>>>>
>>>>>>>> If yours is the same, you should use the urlfilter-validator
>>>>>>>> plugin.
>>>>>>>>
>>>>>>>> urlfilter-validator has lots of benefits; I described them in
>>>>>>>> http://mail-archives.apache.org/mod_mbox/nutch-user/201310.mbox/%[email protected]%3e
>>>>>>>>
>>>>>>>> Talat
>>>>>>>>
>>>>>>>> [email protected] wrote:
>>>>>>>>
>>>>>>>>> I am going to start my own thread rather than posting under
>>>>>>>>> javozzo's thread :)!
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am using Nutch 1.5.1 and Solr 3.6 and having a problem with
>>>>>>>>> the SolrDeleteDuplicates command. Looking at the Hadoop logs, I
>>>>>>>>> am getting this error:
>>>>>>>>>
>>>>>>>>> java.lang.NullPointerException
>>>>>>>>>     at org.apache.hadoop.io.Text.encode(Text.java:388)
>>>>>>>>>     at org.apache.hadoop.io.Text.set(Text.java:178)
>>>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>>>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>>>>>>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>>>>>>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
>>>>>>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>>>>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>>>>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>>>>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>>>>>>
>>>>>>>>> I also had another question, about updating Nutch to 1.6 or 1.7.
>>>>>>>>> I had tried updating to a newer version of Nutch but got an
>>>>>>>>> exception while deleting duplicates in Solr. After a lot of
>>>>>>>>> research online I found that a field had changed. Some said the
>>>>>>>>> digest field, and others said that the url field is no longer
>>>>>>>>> there.
>>>>>>>>> So here are my questions:
>>>>>>>>> 1: Is there a newer Solr mapping file that needs to be used?
>>>>>>>>> 2: Can the Solr index from 1.5.1 and the index from a newer
>>>>>>>>>    version co-exist, or do we need to re-index from one version
>>>>>>>>>    of Nutch?
>>>>>>>>>
>>>>>>>>> I will really appreciate any help with this.
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>> Madhvi
>>>>>>>>>
>>>>>>>>> Madhvi Arora
>>>>>>>>> AutomationDirect
>>>>>>>>> The #1 Best Mid-Sized Company to work for in Atlanta
>>>>>>>>> <http://www.ajc.com/business/topworkplaces/automationdirect-com-top-midsize-1421260.html>
>>>>>>>>> 2012

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
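For readers following along: the `patch -p0` step Talat describes earlier in the thread works the same way on any file tree. Here is a self-contained sketch of the mechanics, using a throwaway file and a tiny unified diff (the file name and diff are illustrative stand-ins, not the actual NUTCH-1100 patch):

```shell
# Demonstrate how `patch -p0` applies a unified diff.
# Everything here is a toy stand-in for the real Nutch source tree.
workdir=$(mktemp -d)
cd "$workdir"

printf 'hello\n' > greeting.txt

# A minimal unified diff, similar in shape to NUTCH-1100-1.6-1.patch.
cat > fix.patch <<'EOF'
--- greeting.txt
+++ greeting.txt
@@ -1 +1 @@
-hello
+hello, nutch
EOF

# -p0 keeps the full path from the diff header, which is why the thread
# says to run patch from the directory the diff's paths are relative to.
patch -p0 < fix.patch

cat greeting.txt   # now prints: hello, nutch
```

For the real fix, the same invocation from next to the extracted source tree (patch -p0 < ../NUTCH-1100-1.6-1.patch, run inside the source directory) applies the patch, after which `ant clean runtime` rebuilds Nutch.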
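On Madhvi's question earlier in the thread about what value the digest field should hold: in Nutch 1.x the default signature implementation is, as far as I know, MD5-based (check `db.signature.class` in your configuration), storing an MD5 hex digest of the raw page content. Under that assumption, a digest for externally indexed documents could be produced like this (the content string is just an example):

```shell
# Sketch: MD5 hex digest over raw content bytes, matching what an
# MD5-based signature would store (an assumption -- verify against
# your Nutch configuration before relying on it).
content='abc'   # stand-in for the raw page content
printf '%s' "$content" | md5sum | cut -d ' ' -f 1
# prints: 900150983cd24fb0d6963f7d28e17f72
```

Alternatively, as noted at the top of the thread, the fix committed to trunk makes SolrDeleteDuplicates tolerate documents without a digest field, so adding the field by hand may be unnecessary.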

