Hi, that needs to be fixed. It's because there is no CrawlDb entry for the partial documents. This may also happen after NUTCH-2456. Could you open a Jira issue to address the problem? Thanks!
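Since the partial documents have no CrawlDb entry, the dbDatum handed to the scoring filter's indexerScore(...) would be null for them. Below is a minimal, self-contained sketch of a null guard, not the actual plugin code. CrawlDatum here is a reduced stand-in for the Nutch class, the method signature is shortened, and both the score formula (CrawlDb score raised to a configured power, times the incoming score) and the fallback to initScore are assumptions.

```java
// Stand-alone sketch; nothing here is taken from the Nutch sources.
class OpicNullGuardSketch {

  /** Hypothetical stand-in for Nutch's CrawlDatum; only the score is modelled. */
  static final class CrawlDatum {
    private final float score;

    CrawlDatum(float score) {
      this.score = score;
    }

    float getScore() {
      return score;
    }
  }

  /**
   * Shortened indexerScore: the real method takes more arguments.
   * scorePower is passed in directly here; in a plugin it would
   * normally come from the configuration (assumption).
   */
  static float indexerScore(CrawlDatum dbDatum, float initScore, float scorePower) {
    if (dbDatum == null) {
      // No CrawlDb entry (e.g. a partial document added by a parse filter):
      // keep the incoming score instead of dereferencing null.
      return initScore;
    }
    return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
  }

  public static void main(String[] args) {
    // Document without a CrawlDb entry: no NullPointerException is thrown.
    System.out.println(indexerScore(null, 1.0f, 0.5f));
    // Document with a CrawlDb entry: score is computed as before.
    System.out.println(indexerScore(new CrawlDatum(4.0f), 1.0f, 0.5f));
  }
}
```

Returning initScore unchanged is only one possible choice; a fixed default score would work just as well for the purpose of avoiding the exception.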
As a quick work-around:
- either disable scoring-opic while indexing
- or check dbDatum for null in scoring-opic indexerScore(...)

Thanks,
Sebastian

On 03/07/2018 11:13 AM, Yash Thenuan Thenuan wrote:
> Thanks Yossi, I am now able to parse the data successfully but I am getting
> Error at the time of indexing.
> Below are the hadoop logs for indexing.
>
> ElasticRestIndexWriter
> elastic.rest.host : hostname
> elastic.rest.port : port
> elastic.rest.index : elastic index command
> elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250)
> elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
>
>
> 2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
> 2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
> 2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20180307130959
> 2018-03-07 15:41:53,677 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2018-03-07 15:41:54,861 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> 2018-03-07 15:41:55,168 INFO client.AbstractJestClient - Setting server pool to a list of 1 servers: [http://localhost:9200]
> 2018-03-07 15:41:55,170 INFO client.JestClientFactory - Using multi thread/connection supporting pooling connection manager
> 2018-03-07 15:41:55,238 INFO client.JestClientFactory - Using default GSON instance
> 2018-03-07 15:41:55,238 INFO client.JestClientFactory - Node Discovery disabled...
> 2018-03-07 15:41:55,238 INFO client.JestClientFactory - Idle connection reaping disabled...
> 2018-03-07 15:41:55,282 INFO elasticrest.ElasticRestIndexWriter - Processing remaining requests [docs = 1, length = 210402, total docs = 1]
> 2018-03-07 15:41:55,361 INFO elasticrest.ElasticRestIndexWriter - Processing to finalize last execute
> 2018-03-07 15:41:55,458 INFO elasticrest.ElasticRestIndexWriter - Previous took in ms 175, including wait 97
> 2018-03-07 15:41:55,468 WARN mapred.LocalJobRunner - job_local1561152089_0001
> java.lang.Exception: java.lang.NullPointerException
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.NullPointerException
>     at org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScoringFilter.java:171)
>     at org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.java:120)
>     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:296)
>     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
>     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
>     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
>
>
> On Wed, Mar 7, 2018 at 12:30 AM, Yossi Tamari <yossi.tam...@pipl.com> wrote:
>
>> Regarding the configuration parameter, your Parse Filter should expose a
>> setConf method that receives a conf parameter. Keep that as a member
>> variable and pass it where necessary.
>> Regarding parsestatus, contentmeta and parsemeta, you're going to have to
>> look at them yourself (probably in a debugger), but as a baseline, you can
>> probably just use the values in the inbound ParseResult (of the whole
>> document).
>> More specifically, parsestatus is an indication of whether parsing was
>> successful. Unless your parsing may fail even when the whole document
>> parsing was successful, you don't need to change it. contentmeta is all the
>> information that was gathered about this page before parsing, so again, you
>> probably just want to keep it, and finally parsemeta is the metadata that
>> was gathered during parsing and may be useful for indexing, so passing the
>> metadata from the original ParseResult makes sense, or just using the
>> constructor that does not require it if you don't care about the metadata.
>> This should all be easier to understand if you look at what the HTML
>> Parser does with each of these fields.
>>
>>> -----Original Message-----
>>> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
>>> Sent: 06 March 2018 20:17
>>> To: user@nutch.apache.org
>>> Subject: RE: Regarding Internal Links
>>>
>>> I am able to get parsetext data structure.
>>> But having trouble with parseData as its constructor is asking for
>>> parsestatus, outlinks, contentmeta and parsemeta.
>>> Outlinks I can get from outlinkExtractor but what about other parameters?
>>> And again getoutlinks is asking for configuration, and I don't know
>>> where I can get it from.
>>>
>>> On 6 Mar 2018 18:32, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
>>>
>>>> You should go over each segment, and for each one produce a ParseText
>>>> and a ParseData. This is basically what the HTML Parser does for the
>>>> whole document, which is why I suggested you should dive into its code.
>>>> A ParseText is basically just a String containing the actual content
>>>> of the segment (after stripping the HTML tags). This is usually the
>>>> document you want to index.
>>>> The ParseData structure is a little more complex, but the main things
>>>> it contains are the title of this segment, and the outlinks from the
>>>> segment (for further crawling). Take a look at the code of both
>>>> classes and it should be relatively clear.
>>>> Finally, you need to build one ParseResult object, with the original
>>>> URL, and for each of the ParseText/ParseData pairs, call the put
>>>> method, with the internal URL of the segment as the key.
>>>>
>>>>> -----Original Message-----
>>>>> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
>>>>> Sent: 06 March 2018 14:45
>>>>> To: user@nutch.apache.org
>>>>> Subject: RE: Regarding Internal Links
>>>>>
>>>>>> I am able to get the content corresponding to each internal link
>>>>>> by writing a parse filter plugin. Now I am not sure how to
>>>>>> proceed further. How can I parse them as separate documents and
>>>>>> what should my ParseResult filter return?
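The steps described in the thread (one ParseText and one ParseData per segment, all put into a single ParseResult keyed by the segment's internal URL) can be sketched roughly as follows. This is a self-contained illustration, not Nutch code: ParseText, ParseData and ParseResult are reduced stand-ins for the Nutch classes of the same names, Segment and buildResult are hypothetical helpers, and the URLs are made up. The real ParseData constructor also takes the parsestatus and metadata arguments discussed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SegmentParseSketch {

  /** Stand-in for ParseText: the segment content after stripping HTML tags. */
  static final class ParseText {
    final String text;
    ParseText(String text) { this.text = text; }
  }

  /** Stand-in for ParseData, reduced to title and outlinks. */
  static final class ParseData {
    final String title;
    final List<String> outlinks;
    ParseData(String title, List<String> outlinks) {
      this.title = title;
      this.outlinks = new ArrayList<>(outlinks);
    }
  }

  /** One ParseText/ParseData pair stored under a key. */
  static final class Parse {
    final ParseText text;
    final ParseData data;
    Parse(ParseText text, ParseData data) { this.text = text; this.data = data; }
  }

  /** Stand-in for ParseResult: the original URL plus parses keyed by URL. */
  static final class ParseResult {
    final String originalUrl;
    final Map<String, Parse> parses = new LinkedHashMap<>();
    ParseResult(String originalUrl) { this.originalUrl = originalUrl; }
    void put(String key, ParseText text, ParseData data) {
      parses.put(key, new Parse(text, data));
    }
  }

  /** Hypothetical holder for what a parse filter extracted for one internal link. */
  static final class Segment {
    final String internalUrl;
    final String title;
    final String strippedText;
    final List<String> outlinks;
    Segment(String internalUrl, String title, String strippedText, List<String> outlinks) {
      this.internalUrl = internalUrl;
      this.title = title;
      this.strippedText = strippedText;
      this.outlinks = outlinks;
    }
  }

  /** Builds one result for the page, with one entry per segment. */
  static ParseResult buildResult(String originalUrl, List<Segment> segments) {
    ParseResult result = new ParseResult(originalUrl);
    for (Segment s : segments) {
      ParseText text = new ParseText(s.strippedText);
      ParseData data = new ParseData(s.title, s.outlinks);
      // The internal URL of the segment is the key, as suggested in the thread.
      result.put(s.internalUrl, text, data);
    }
    return result;
  }

  public static void main(String[] args) {
    List<Segment> segments = new ArrayList<>();
    segments.add(new Segment("http://example.com/page#intro", "Intro",
        "Introductory text", Arrays.asList("http://example.com/other")));
    segments.add(new Segment("http://example.com/page#details", "Details",
        "Detailed text", Collections.<String>emptyList()));

    ParseResult result = buildResult("http://example.com/page", segments);
    for (Map.Entry<String, Parse> e : result.parses.entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue().data.title);
    }
  }
}
```

In a real parse filter the entries would more likely be added to the inbound ParseResult of the whole document rather than to a fresh object, so that the parse of the original page is kept as well.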