Yash, just to be sure: you are using the original URL (the one that was in the 
ParseResult passed as a parameter to the filter) in the ParseResult constructor, 
right?

> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: 07 March 2018 12:36
> To: user@nutch.apache.org
> Subject: Re: Regarding Internal Links
> 
> Hi,
> 
> that needs to be fixed. It happens because there is no CrawlDb entry for the
> partial documents. It may also happen after NUTCH-2456. Could you open a Jira
> issue to address the problem? Thanks!
> 
> As a quick work-around:
> - either disable scoring-opic while indexing
> - or check dbDatum for null in scoring-opic indexerScore(...), as in the
>   sketch below
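> 
> (For the first option, it should be enough to temporarily remove
> scoring-opic from plugin.includes in nutch-site.xml while indexing.)
> 
> For the second option, a minimal untested sketch of the null check in
> OPICScoringFilter.indexerScore(...), based on the 1.x sources (the
> scorePower field may differ in your version):
> 
>   @Override
>   public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
>       CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
>       throws ScoringFilterException {
>     if (dbDatum == null) {
>       // no CrawlDb entry for this (partial) document: keep the initial score
>       return initScore;
>     }
>     return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
>   }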
> 
> Thanks,
> Sebastian
> 
> On 03/07/2018 11:13 AM, Yash Thenuan Thenuan wrote:
> > Thanks Yossi, I am now able to parse the data successfully, but I am
> > getting an error at the time of indexing.
> > Below are the Hadoop logs for indexing.
> >
> > ElasticRestIndexWriter
> > elastic.rest.host : hostname
> > elastic.rest.port : port
> > elastic.rest.index : elastic index command
> > elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250)
> > elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
> >
> >
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20180307130959
> > 2018-03-07 15:41:53,677 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
> > 2018-03-07 15:41:54,861 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> > 2018-03-07 15:41:55,168 INFO  client.AbstractJestClient - Setting server pool to a list of 1 servers: [http://localhost:9200]
> > 2018-03-07 15:41:55,170 INFO  client.JestClientFactory - Using multi thread/connection supporting pooling connection manager
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Using default GSON instance
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Node Discovery disabled...
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Idle connection reaping disabled...
> > 2018-03-07 15:41:55,282 INFO  elasticrest.ElasticRestIndexWriter - Processing remaining requests [docs = 1, length = 210402, total docs = 1]
> > 2018-03-07 15:41:55,361 INFO  elasticrest.ElasticRestIndexWriter - Processing to finalize last execute
> > 2018-03-07 15:41:55,458 INFO  elasticrest.ElasticRestIndexWriter - Previous took in ms 175, including wait 97
> > 2018-03-07 15:41:55,468 WARN  mapred.LocalJobRunner - job_local1561152089_0001
> > java.lang.Exception: java.lang.NullPointerException
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> > Caused by: java.lang.NullPointerException
> >     at org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScoringFilter.java:171)
> >     at org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.java:120)
> >     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:296)
> >     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
> >     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
> >     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> >     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >     at java.lang.Thread.run(Thread.java:748)
> > 2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
> >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
> >     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
> >     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
> >
> >
> > On Wed, Mar 7, 2018 at 12:30 AM, Yossi Tamari <yossi.tam...@pipl.com> wrote:
> >
> >> Regarding the configuration parameter, your Parse Filter should
> >> expose a setConf method that receives a conf parameter. Keep that as
> >> a member variable and pass it where necessary.
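> >>
> >> As a minimal sketch (the filter class name here is made up):
> >>
> >>   public class MyParseFilter implements HtmlParseFilter {
> >>     private Configuration conf;
> >>
> >>     @Override
> >>     public void setConf(Configuration conf) {
> >>       this.conf = conf;   // keep the conf as a member variable
> >>     }
> >>
> >>     @Override
> >>     public Configuration getConf() {
> >>       return conf;
> >>     }
> >>
> >>     // ... filter(...) goes here; pass this.conf to
> >>     // OutlinkExtractor.getOutlinks(...) where needed
> >>   }
> >>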
> >> Regarding parsestatus, contentmeta and parsemeta, you're going to
> >> have to look at them yourself (probably in a debugger), but as a
> >> baseline, you can probably just use the values in the inbound
> >> ParseResult (of the whole document).
> >> More specifically, parsestatus is an indication of whether parsing
> >> was successful. Unless your parsing may fail even when the whole
> >> document parsing was successful, you don't need to change it.
> >> contentmeta is all the information that was gathered about this page
> >> before parsing, so again, you probably just want to keep it, and
> >> finally parsemeta is the metadata that was gathered during parsing
> >> and may be useful for indexing, so passing the metadata from the
> >> original ParseResult makes sense, or just using the constructor that
> >> does not require it if you don't care about the metadata.
> >> This should all be easier to understand if you look at what the HTML
> >> Parser does with each of these fields.
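> >>
> >> Putting that together, an untested sketch (segmentTitle and
> >> segmentOutlinks stand for whatever your filter extracted; content and
> >> parseResult are the parameters passed to the filter):
> >>
> >>   Parse whole = parseResult.get(content.getUrl());
> >>   ParseData wholeData = whole.getData();
> >>   ParseData segmentData = new ParseData(
> >>       wholeData.getStatus(),       // whole-document parse status
> >>       segmentTitle,                // title of this segment
> >>       segmentOutlinks,             // outlinks found in the segment
> >>       wholeData.getContentMeta(),  // pre-parse info about the page
> >>       wholeData.getParseMeta());   // or the constructor without it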
> >>
> >>> -----Original Message-----
> >>> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> >>> Sent: 06 March 2018 20:17
> >>> To: user@nutch.apache.org
> >>> Subject: RE: Regarding Internal Links
> >>>
> >>> I am able to get the ParseText data structure, but I am having
> >>> trouble with ParseData, as its constructor asks for a parsestatus,
> >>> outlinks, contentmeta and parsemeta.
> >>> Outlinks I can get from the OutlinkExtractor, but what about the
> >>> other parameters?
> >>> And again, getOutlinks asks for a Configuration, and I don't know
> >>> where I can get it from.
> >>>
> >>> On 6 Mar 2018 18:32, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> >>>
> >>>> You should go over each segment, and for each one produce a
> >>>> ParseText and a ParseData. This is basically what the HTML Parser
> >>>> does for the whole document, which is why I suggested you should
> >>>> dive into its code.
> >>>> A ParseText is basically just a String containing the actual
> >>>> content of the segment (after stripping the HTML tags). This is
> >>>> usually the document you want to index.
> >>>> The ParseData structure is a little more complex, but the main
> >>>> things it contains are the title of this segment, and the outlinks
> >>>> from the segment (for further crawling). Take a look at the code of
> >>>> both classes and it should be relatively clear.
> >>>> Finally, you need to build one ParseResult object, with the
> >>>> original URL, and for each of the ParseText/ParseData pairs, call
> >>>> the put method, with the internal URL of the segment as the key.
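> >>>>
> >>>> Roughly, as an untested sketch (Segment is a hypothetical holder
> >>>> for what your filter extracted; status and the metadata are the
> >>>> values discussed in my previous mail):
> >>>>
> >>>>   ParseResult result = new ParseResult(content.getUrl());
> >>>>   for (Segment s : segments) {
> >>>>     ParseText text = new ParseText(s.strippedText);
> >>>>     ParseData data = new ParseData(status, s.title, s.outlinks,
> >>>>         contentMeta, parseMeta);
> >>>>     result.put(s.internalUrl, text, data);  // key = internal URL
> >>>>   }
> >>>>   return result;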
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> >>>>> Sent: 06 March 2018 14:45
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: RE: Regarding Internal Links
> >>>>>
> >>>>>> I am able to get the content corresponding to each internal link
> >>>>>> by writing a parse filter plugin. Now I am not sure how to
> >>>>>> proceed further. How can I parse them as separate documents, and
> >>>>>> what should my ParseResult filter return?
> >>>>
> >>>>
> >>
> >>
> >

