Yash, just to be sure: you are using the original URL (the one in the ParseResult passed as a parameter to the filter) in the ParseResult constructor, right?
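
For illustration, here is a minimal, self-contained sketch of the arrangement discussed in this thread: one result object constructed with the original URL, and one text/data pair per segment stored under that segment's internal URL. The classes SegmentText, SegmentData and SketchParseResult are hypothetical stand-ins that only mirror the roles of Nutch's ParseText, ParseData and ParseResult; they are not the real API. The `originalUrl#anchor` key format and the use of the anchor as the title are also assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the role of ParseText: the stripped text of one segment.
class SegmentText {
    final String text;
    SegmentText(String text) { this.text = text; }
}

// Hypothetical stand-in for the role of ParseData: title and outlinks of one segment.
class SegmentData {
    final String title;
    final String[] outlinks;
    SegmentData(String title, String[] outlinks) {
        this.title = title;
        this.outlinks = outlinks;
    }
}

// Hypothetical stand-in for the role of ParseResult: built once with the
// original URL, then filled with one entry per segment.
class SketchParseResult {
    private final String originalUrl;
    private final Map<String, SegmentText> texts = new LinkedHashMap<>();
    private final Map<String, SegmentData> datas = new LinkedHashMap<>();

    SketchParseResult(String originalUrl) { this.originalUrl = originalUrl; }

    void put(String key, SegmentText text, SegmentData data) {
        texts.put(key, text);
        datas.put(key, data);
    }

    String getOriginalUrl() { return originalUrl; }
    Set<String> keys() { return texts.keySet(); }
    SegmentText textFor(String key) { return texts.get(key); }
    SegmentData dataFor(String key) { return datas.get(key); }
}

class InternalLinksSketch {
    // segmentsByAnchor maps an anchor name to the already-extracted segment text.
    static SketchParseResult split(String originalUrl, Map<String, String> segmentsByAnchor) {
        // The result is constructed with the ORIGINAL page URL ...
        SketchParseResult result = new SketchParseResult(originalUrl);
        for (Map.Entry<String, String> e : segmentsByAnchor.entrySet()) {
            // ... while each segment is stored under its own internal URL
            // (assumed here to be originalUrl + "#" + anchor).
            String internalUrl = originalUrl + "#" + e.getKey();
            result.put(internalUrl,
                       new SegmentText(e.getValue()),
                       new SegmentData(e.getKey(), new String[0]));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> segments = new LinkedHashMap<>();
        segments.put("intro", "Introduction text");
        segments.put("usage", "Usage text");
        SketchParseResult r = split("http://example.com/page.html", segments);
        for (String key : r.keys()) {
            System.out.println(key + " -> " + r.textFor(key).text);
        }
    }
}
```

In a real parse filter the same shape would presumably be expressed with the Nutch classes themselves, reusing the status and metadata of the inbound ParseResult as described in the quoted messages.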
> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: 07 March 2018 12:36
> To: email@example.com
> Subject: Re: Regarding Internal Links
>
> Hi,
>
> that needs to be fixed. It's because there is no CrawlDb entry for the
> partial documents. It may also happen after NUTCH-2456. Could you open a
> Jira issue to address the problem? Thanks!
>
> As a quick work-around:
> - either disable scoring-opic while indexing
> - or check dbDatum for null in scoring-opic indexerScore(...)
>
> Thanks,
> Sebastian
>
> On 03/07/2018 11:13 AM, Yash Thenuan Thenuan wrote:
> > Thanks Yossi, I am now able to parse the data successfully, but I am
> > getting an error at the time of indexing.
> > Below are the hadoop logs for indexing.
> >
> > ElasticRestIndexWriter
> >   elastic.rest.host : hostname
> >   elastic.rest.port : port
> >   elastic.rest.index : elastic index command
> >   elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250)
> >   elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
> >
> >
> > 2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
> > 2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
> > 2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20180307130959
> > 2018-03-07 15:41:53,677 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> > 2018-03-07 15:41:54,861 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> > 2018-03-07 15:41:55,168 INFO client.AbstractJestClient - Setting server pool to a list of 1 servers: [http://localhost:9200]
> > 2018-03-07 15:41:55,170 INFO client.JestClientFactory - Using multi thread/connection supporting pooling connection manager
> > 2018-03-07 15:41:55,238 INFO client.JestClientFactory - Using default GSON instance
> > 2018-03-07 15:41:55,238 INFO client.JestClientFactory - Node Discovery disabled...
> > 2018-03-07 15:41:55,238 INFO client.JestClientFactory - Idle connection reaping disabled...
> > 2018-03-07 15:41:55,282 INFO elasticrest.ElasticRestIndexWriter - Processing remaining requests [docs = 1, length = 210402, total docs = 1]
> > 2018-03-07 15:41:55,361 INFO elasticrest.ElasticRestIndexWriter - Processing to finalize last execute
> > 2018-03-07 15:41:55,458 INFO elasticrest.ElasticRestIndexWriter - Previous took in ms 175, including wait 97
> > 2018-03-07 15:41:55,468 WARN mapred.LocalJobRunner - job_local1561152089_0001
> > java.lang.Exception: java.lang.NullPointerException
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> > Caused by: java.lang.NullPointerException
> >     at org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScoringFilter.java:171)
> >     at org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.java:120)
> >     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:296)
> >     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
> >     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
> >     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> >     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >     at java.lang.Thread.run(Thread.java:748)
> > 2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
> >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
> >     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
> >     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
> >
> >
> > On Wed, Mar 7, 2018 at 12:30 AM, Yossi Tamari <yossi.tam...@pipl.com> wrote:
> >
> >> Regarding the configuration parameter, your Parse Filter should
> >> expose a setConf method that receives a conf parameter. Keep that as
> >> a member variable and pass it where necessary.
> >> Regarding parsestatus, contentmeta and parsemeta, you're going to
> >> have to look at them yourself (probably in a debugger), but as a
> >> baseline, you can probably just use the values in the inbound
> >> ParseResult (of the whole document).
> >> More specifically, parsestatus is an indication of whether parsing
> >> was successful. Unless your parsing may fail even when the whole
> >> document parsing was successful, you don't need to change it.
> >> contentmeta is all the information that was gathered about this page
> >> before parsing, so again, you probably just want to keep it, and
> >> finally parsemeta is the metadata that was gathered during parsing
> >> and may be useful for indexing, so passing the metadata from the
> >> original ParseResult makes sense, or just using the constructor that
> >> does not require it if you don't care about the metadata.
> >> This should all be easier to understand if you look at what the HTML
> >> Parser does with each of these fields.
> >>
> >>> -----Original Message-----
> >>> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> >>> Sent: 06 March 2018 20:17
> >>> To: firstname.lastname@example.org
> >>> Subject: RE: Regarding Internal Links
> >>>
> >>> I am able to get the parsetext data structure, but I am having
> >>> trouble with parseData, as its constructor asks for parsestatus,
> >>> outlinks, contentmeta and parsemeta.
> >>> Outlinks I can get from outlinkExtractor, but what about the other
> >>> parameters?
> >>> Also, getoutlinks asks for a configuration, and I don't know where I
> >>> can get it from.
> >>>
> >>> On 6 Mar 2018 18:32, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> >>>
> >>>> You should go over each segment, and for each one produce a
> >>>> ParseText and a ParseData. This is basically what the HTML Parser
> >>>> does for the whole document, which is why I suggested you should
> >>>> dive into its code.
> >>>> A ParseText is basically just a String containing the actual
> >>>> content of the segment (after stripping the HTML tags). This is
> >>>> usually the document you want to index.
> >>>> The ParseData structure is a little more complex, but the main
> >>>> things it contains are the title of this segment, and the outlinks
> >>>> from the segment (for further crawling). Take a look at the code of
> >>>> both classes and it should be relatively clear.
> >>>> Finally, you need to build one ParseResult object, with the
> >>>> original URL, and for each of the ParseText/ParseData pairs, call
> >>>> the put method, with the internal URL of the segment as the key.
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> >>>>> Sent: 06 March 2018 14:45
> >>>>> To: email@example.com
> >>>>> Subject: RE: Regarding Internal Links
> >>>>>
> >>>>>> I am able to get the content corresponding to each internal link
> >>>>>> by writing a parse filter plugin. Now I am not sure how to
> >>>>>> proceed. How can I parse them as separate documents, and
> >>>>>> what should my ParseResult filter return?
> >>>>
> >>
> >
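
The second work-around suggested in the thread (checking dbDatum for null in the scoring-opic indexerScore(...)) could look roughly like the sketch below. This is not the Nutch source: DatumSketch is a hypothetical stand-in for the CrawlDb entry, the power-scaling formula and the exponent value are assumptions made for the example, and falling back to the incoming score is just one plausible choice, not necessarily what the eventual fix does.

```java
// Hypothetical stand-in for the CrawlDb entry; only the score is modelled.
class DatumSketch {
    private final float score;
    DatumSketch(float score) { this.score = score; }
    float getScore() { return score; }
}

class OpicNullGuardSketch {
    // Illustrative exponent, assumed for this example only.
    static final float SCORE_POWER = 0.5f;

    // Returns the score to use for indexing. dbDatum may be null when the
    // document has no CrawlDb entry, e.g. a partial document stored under
    // an internal URL.
    static float indexerScore(DatumSketch dbDatum, float initScore) {
        if (dbDatum == null) {
            // Assumed fallback: keep the incoming score instead of
            // dereferencing null and failing the whole indexing job.
            return initScore;
        }
        return (float) Math.pow(dbDatum.getScore(), SCORE_POWER) * initScore;
    }

    public static void main(String[] args) {
        System.out.println(indexerScore(null, 1.0f));
        System.out.println(indexerScore(new DatumSketch(4.0f), 2.0f));
    }
}
```

The other work-around from the thread, disabling scoring-opic while indexing, avoids the null dereference without any code change.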