Re: Regarding Internal Links

Sebastian Nagel Wed, 07 Mar 2018 02:36:58 -0800

Hi,

that needs to be fixed. It happens because there is no CrawlDb entry for
the partial documents. It may also happen after NUTCH-2456. Could you open
a Jira issue to address the problem? Thanks!


As a quick work-around:
- either disable scoring-opic while indexing
- or check dbDatum for null in scoring-opic indexerScore(...)
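
For the second option, a minimal sketch of the null check could look like
the code below. It is not the actual plugin source: DatumStub is a
hypothetical stand-in for CrawlDatum so that the snippet compiles on its
own, the method takes fewer parameters than the real indexerScore(...), and
falling back to initScore for a missing CrawlDb entry is an assumption about
what a sensible default would be.

```java
/** Hypothetical stand-in for CrawlDatum; only the score is modelled here. */
final class DatumStub {
    private final float score;

    DatumStub(float score) {
        this.score = score;
    }

    float getScore() {
        return score;
    }
}

/** Sketch of a null-safe indexer score, simplified from the scoring-opic idea. */
final class NullSafeOpicScore {

    private NullSafeOpicScore() {
    }

    /**
     * @param dbDatum    CrawlDb entry for the URL, or null if none exists
     *                   (e.g. a partial document added by a parse filter)
     * @param initScore  score passed in by the indexer
     * @param scorePower exponent applied to the CrawlDb score (assumed parameter)
     */
    static float indexerScore(DatumStub dbDatum, float initScore, float scorePower) {
        if (dbDatum == null) {
            // No CrawlDb entry: keep the initial score instead of
            // dereferencing null, which is what triggers the NPE.
            return initScore;
        }
        return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
    }
}
```

With this guard, documents without a CrawlDb entry are still indexed and
simply keep the score they were given.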

Thanks,
Sebastian

On 03/07/2018 11:13 AM, Yash Thenuan Thenuan wrote:
> Thanks Yossi, I am now able to parse the data successfully, but I am getting
> an error at indexing time.
> Below are the Hadoop logs for indexing.
> 
> ElasticRestIndexWriter
> elastic.rest.host : hostname
> elastic.rest.port : port
> elastic.rest.index : elastic index command
> elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250)
> elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500
> ~2.5MB)
> 
> 
> 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
> crawldb: crawl/crawldb
> 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
> linkdb: crawl/linkdb
> 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: crawl/segments/20180307130959
> 2018-03-07 15:41:53,677 INFO  anchor.AnchorIndexingFilter - Anchor
> deduplication is: off
> 2018-03-07 15:41:54,861 INFO  indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> 2018-03-07 15:41:55,168 INFO  client.AbstractJestClient - Setting server
> pool to a list of 1 servers: [http://localhost:9200]
> 2018-03-07 15:41:55,170 INFO  client.JestClientFactory - Using multi
> thread/connection supporting pooling connection manager
> 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Using default GSON
> instance
> 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Node Discovery
> disabled...
> 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Idle connection
> reaping disabled...
> 2018-03-07 15:41:55,282 INFO  elasticrest.ElasticRestIndexWriter -
> Processing remaining requests [docs = 1, length = 210402, total docs = 1]
> 2018-03-07 15:41:55,361 INFO  elasticrest.ElasticRestIndexWriter -
> Processing to finalize last execute
> 2018-03-07 15:41:55,458 INFO  elasticrest.ElasticRestIndexWriter - Previous
> took in ms 175, including wait 97
> 2018-03-07 15:41:55,468 WARN  mapred.LocalJobRunner -
> job_local1561152089_0001
> java.lang.Exception: java.lang.NullPointerException
> at
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.NullPointerException
> at
> org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScoringFilter.java:171)
> at
> org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.java:120)
> at
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:296)
> at
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> at
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer:
> java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
> 
> 
> On Wed, Mar 7, 2018 at 12:30 AM, Yossi Tamari <yossi.tam...@pipl.com> wrote:
> 
>> Regarding the configuration parameter, your Parse Filter should expose a
>> setConf method that receives a conf parameter. Keep that as a member
>> variable and pass it where necessary.
>> Regarding parsestatus, contentmeta and parsemeta, you're going to have to
>> look at them yourself (probably in a debugger), but as a baseline, you can
>> probably just use the values in the inbound ParseResult (of the whole
>> document).
>> More specifically, parsestatus is an indication of whether parsing was
>> successful. Unless your parsing may fail even when the whole document
>> parsing was successful, you don't need to change it. contentmeta is all the
>> information that was gathered about this page before parsing, so again, you
>> probably just want to keep it, and finally parsemeta is the metadata that
>> was gathered during parsing and may be useful for indexing, so passing the
>> metadata from the original ParseResult makes sense, or just using the
>> constructor that does not require it if you don't care about the metadata.
>> This should all be easier to understand if you look at what the HTML
>> Parser does with each of these fields.
>>
>>> -----Original Message-----
>>> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
>>> Sent: 06 March 2018 20:17
>>> To: user@nutch.apache.org
>>> Subject: RE: Regarding Internal Links
>>>
>>> I am able to get parsetext data structure.
>>> But I am having trouble with ParseData, as its constructor is asking for
>> parsestatus,
>>> outlinks, contentmeta and parsemeta.
>>> Outlinks I can get from outlinkExtractor but what about other parameters?
>>> And again, getoutlinks is asking for a configuration and I don't know from
>> where I
>>> can get it?
>>>
>>> On 6 Mar 2018 18:32, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
>>>
>>>> You should go over each segment, and for each one produce a ParseText
>>>> and a ParseData. This is basically what the HTML Parser does for the
>>>> whole document, which is why I suggested you should dive into its code.
>>>> A ParseText is basically just a String containing the actual content
>>>> of the segment (after stripping the HTML tags). This is usually the
>>>> document you want to index.
>>>> The ParseData structure is a little more complex, but the main things
>>>> it contains are the title of this segment, and the outlinks from the
>>>> segment (for further crawling). Take a look at the code of both
>>>> classes and it should be relatively clear.
>>>> Finally, you need to build one ParseResult object, with the original
>>>> URL, and for each of the ParseText/ParseData pairs, call the put
>>>> method, with the internal URL of the segment as the key.
>>>>
>>>>> -----Original Message-----
>>>>> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
>>>>> Sent: 06 March 2018 14:45
>>>>> To: user@nutch.apache.org
>>>>> Subject: RE: Regarding Internal Links
>>>>>
>>>>>> I am able to get the content corresponding to each internal link
>>>>>> by writing a parse filter plugin. Now I do not understand how to
>>>>>> proceed further. How can I parse them as separate documents, and
>>>>>> what should my ParseResult filter return?
>>>>
>>>>
>>
>>
>
