RE: Regarding Internal Links

2018-03-07 Thread Yossi Tamari
1. Go to https://issues.apache.org/jira/projects/NUTCH
2. Click "Log-In" (upper right corner). Create a user if needed and log in.
3. Click "Create" (in the top banner).
4. Fill in the fields. They are mostly self-explanatory, and those that you 
don't understand can probably be ignored. The important thing is to provide as 
much relevant information as possible - in this case what your Parse Plugin 
does, and the error that happens in the Index phase (these go in the 
Description field). Include the same log you provided here, either in the 
Description field (using the formatting options) or as an attachment.
5. Click "Create" at the bottom of the dialog, and you're done!

> -Original Message-
> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> Sent: 07 March 2018 12:51
> To: user@nutch.apache.org
> Subject: RE: Regarding Internal Links
> 
> Yossi, I tried with both the original URL and the newer one, but it didn't 
> work.
> However, for now I have disabled scoring-opic as suggested by Sebastian, and
> that works.
> I will open a Jira issue, but I am new to the open-source world, so could you 
> please help me with this?
> Thanks a lot, Yossi and Sebastian.
> 
> On 7 Mar 2018 16:11, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> 
> Yas, just to be sure, you are using the original URL (the one that was in the
> ParseResult passed as parameter to the filter) in the ParseResult constructor,
> right?
> 
> > -Original Message-
> > From: Sebastian Nagel <wastl.na...@googlemail.com>
> > Sent: 07 March 2018 12:36
> > To: user@nutch.apache.org
> > Subject: Re: Regarding Internal Links
> >
> > Hi,
> >
> > that needs to be fixed. It's because there is no CrawlDb entry for the partial
> > documents. It may also happen after NUTCH-2456. Could you open a Jira issue
> > to address the problem? Thanks!
> >
> > As a quick work-around:
> > - either disable scoring-opic while indexing
> > - or check dbDatum for null in scoring-opic indexerScore(...)
> >
> > Thanks,
> > Sebastian
> >
> > On 03/07/2018 11:13 AM, Yash Thenuan Thenuan wrote:
> > > Thanks Yossi, I am now able to parse the data successfully, but I am
> > > getting an error at indexing time.
> > > Below are the Hadoop logs for indexing.
> > >
> > > ElasticRestIndexWriter
> > > elastic.rest.host : hostname
> > > elastic.rest.port : port
> > > elastic.rest.index : elastic index command
> > > elastic.rest.max.bulk.docs
> > > : elastic bulk index doc counts. (default 250)
> > > elastic.rest.max.bulk.size : elastic bulk index length. (default
> > > 2500500
> > > ~2.5MB)
> > >
> > >
> > > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> > IndexerMapReduce:
> > > crawldb: crawl/crawldb
> > > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> > IndexerMapReduce:
> > > linkdb: crawl/linkdb
> > > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> > IndexerMapReduces:
> > > adding segment: crawl/segments/20180307130959
> > > 2018-03-07 15:41:53,677 INFO  anchor.AnchorIndexingFilter - Anchor
> > > deduplication is: off
> > > 2018-03-07 15:41:54,861 INFO  indexer.IndexWriters - Adding
> > > org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> > > 2018-03-07 15:41:55,168 INFO  client.AbstractJestClient - Setting
> > > server pool to a list of 1 servers: [http://localhost:9200]
> > > 2018-03-07 15:41:55,170 INFO  client.JestClientFactory - Using multi
> > > thread/connection supporting pooling connection manager
> > > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Using
> > > default GSON instance
> > > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Node
> > > Discovery disabled...
> > > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Idle
> > > connection reaping disabled...
> > > 2018-03-07 15:41:55,282 INFO  elasticrest.ElasticRestIndexWriter -
> > > Processing remaining requests [docs = 1, length = 210402, total docs
> > > = 1]
> > > 2018-03-07 15:41:55,361 INFO  elasticrest.ElasticRestIndexWriter -
> > > Processing to finalize last execute
> > > 2018-03-07 15:41:55,458 INFO  elasticrest.ElasticRestIndexWriter -
> > > Previous took in ms 175, including wait 97
> > > 2018-03-07 15:41:55,468 WARN  mapred.LocalJobRunner -
> > > job_local1561152089_0001
> > > java.lang.Exception: java.lang.NullPointerExcep

RE: Regarding Internal Links

2018-03-07 Thread Yash Thenuan Thenuan
Yossi, I tried with both the original URL and the newer one, but it didn't
work.
However, for now I have disabled scoring-opic as suggested by Sebastian, and
that works.
I will open a Jira issue, but I am new to the open-source world, so could you
please help me with this?
Thanks a lot, Yossi and Sebastian.
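
If anyone needs the other workaround Sebastian mentioned (checking dbDatum for
null in scoring-opic's indexerScore), a rough sketch of that guard is below. It
assumes the Nutch 1.x ScoringFilter.indexerScore signature and is only an
illustration, not an actual patch:

// At the top of OPICScoringFilter.indexerScore(...):
public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
    CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
    throws ScoringFilterException {
  if (dbDatum == null) {
    // Sub-documents added by a parse filter have no CrawlDb entry,
    // so don't dereference dbDatum; just keep the initial score.
    return initScore;
  }
  // ... the existing OPIC scoring logic that reads dbDatum follows here ...
  return initScore; // placeholder for the existing computation
}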

On 7 Mar 2018 16:11, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:

Yas, just to be sure, you are using the original URL (the one that was in
the ParseResult passed as parameter to the filter) in the ParseResult
constructor, right?

> -Original Message-
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: 07 March 2018 12:36
> To: user@nutch.apache.org
> Subject: Re: Regarding Internal Links
>
> Hi,
>
> that needs to be fixed. It's because there is no CrawlDb entry for the partial
> documents. It may also happen after NUTCH-2456. Could you open a Jira issue
> to address the problem? Thanks!
>
> As a quick work-around:
> - either disable scoring-opic while indexing
> - or check dbDatum for null in scoring-opic indexerScore(...)
>
> Thanks,
> Sebastian
>
> On 03/07/2018 11:13 AM, Yash Thenuan Thenuan wrote:
> > Thanks Yossi, I am now able to parse the data successfully, but I am
> > getting an error at indexing time.
> > Below are the Hadoop logs for indexing.
> >
> > ElasticRestIndexWriter
> > elastic.rest.host : hostname
> > elastic.rest.port : port
> > elastic.rest.index : elastic index command elastic.rest.max.bulk.docs
> > : elastic bulk index doc counts. (default 250)
> > elastic.rest.max.bulk.size : elastic bulk index length. (default
> > 2500500
> > ~2.5MB)
> >
> >
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> IndexerMapReduce:
> > crawldb: crawl/crawldb
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> IndexerMapReduce:
> > linkdb: crawl/linkdb
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/segments/20180307130959
> > 2018-03-07 15:41:53,677 INFO  anchor.AnchorIndexingFilter - Anchor
> > deduplication is: off
> > 2018-03-07 15:41:54,861 INFO  indexer.IndexWriters - Adding
> > org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> > 2018-03-07 15:41:55,168 INFO  client.AbstractJestClient - Setting
> > server pool to a list of 1 servers: [http://localhost:9200]
> > 2018-03-07 15:41:55,170 INFO  client.JestClientFactory - Using multi
> > thread/connection supporting pooling connection manager
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Using default
> > GSON instance
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Node
> > Discovery disabled...
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Idle
> > connection reaping disabled...
> > 2018-03-07 15:41:55,282 INFO  elasticrest.ElasticRestIndexWriter -
> > Processing remaining requests [docs = 1, length = 210402, total docs =
> > 1]
> > 2018-03-07 15:41:55,361 INFO  elasticrest.ElasticRestIndexWriter -
> > Processing to finalize last execute
> > 2018-03-07 15:41:55,458 INFO  elasticrest.ElasticRestIndexWriter -
> > Previous took in ms 175, including wait 97
> > 2018-03-07 15:41:55,468 WARN  mapred.LocalJobRunner -
> > job_local1561152089_0001
> > java.lang.Exception: java.lang.NullPointerException at
> > org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.ja
> > va:462) at
> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:52
> > 9) Caused by: java.lang.NullPointerException at
> > org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScori
> > ngFilter.java:171)
> > at
> > org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.ja
> > va:120)
> > at
> > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java
> > :296)
> > at
> > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java
> > :57) at
> > org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
> > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> > at
> >
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(Loc
> > alJobRunner.java:319) at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511
> > ) at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
> > ava:1149)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
> > java:624) at j

RE: Regarding Internal Links

2018-03-07 Thread Yossi Tamari
Yas, just to be sure, you are using the original URL (the one that was in the 
ParseResult passed as parameter to the filter) in the ParseResult constructor, 
right?

> -Original Message-
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: 07 March 2018 12:36
> To: user@nutch.apache.org
> Subject: Re: Regarding Internal Links
> 
> Hi,
> 
> that needs to be fixed. It's because there is no CrawlDb entry for the partial
> documents. It may also happen after NUTCH-2456. Could you open a Jira issue
> to address the problem? Thanks!
> 
> As a quick work-around:
> - either disable scoring-opic while indexing
> - or check dbDatum for null in scoring-opic indexerScore(...)
> 
> Thanks,
> Sebastian
> 
> On 03/07/2018 11:13 AM, Yash Thenuan Thenuan wrote:
> > Thanks Yossi, I am now able to parse the data successfully, but I am
> > getting an error at indexing time.
> > Below are the Hadoop logs for indexing.
> >
> > ElasticRestIndexWriter
> > elastic.rest.host : hostname
> > elastic.rest.port : port
> > elastic.rest.index : elastic index command elastic.rest.max.bulk.docs
> > : elastic bulk index doc counts. (default 250)
> > elastic.rest.max.bulk.size : elastic bulk index length. (default
> > 2500500
> > ~2.5MB)
> >
> >
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> IndexerMapReduce:
> > crawldb: crawl/crawldb
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> IndexerMapReduce:
> > linkdb: crawl/linkdb
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/segments/20180307130959
> > 2018-03-07 15:41:53,677 INFO  anchor.AnchorIndexingFilter - Anchor
> > deduplication is: off
> > 2018-03-07 15:41:54,861 INFO  indexer.IndexWriters - Adding
> > org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> > 2018-03-07 15:41:55,168 INFO  client.AbstractJestClient - Setting
> > server pool to a list of 1 servers: [http://localhost:9200]
> > 2018-03-07 15:41:55,170 INFO  client.JestClientFactory - Using multi
> > thread/connection supporting pooling connection manager
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Using default
> > GSON instance
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Node
> > Discovery disabled...
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Idle
> > connection reaping disabled...
> > 2018-03-07 15:41:55,282 INFO  elasticrest.ElasticRestIndexWriter -
> > Processing remaining requests [docs = 1, length = 210402, total docs =
> > 1]
> > 2018-03-07 15:41:55,361 INFO  elasticrest.ElasticRestIndexWriter -
> > Processing to finalize last execute
> > 2018-03-07 15:41:55,458 INFO  elasticrest.ElasticRestIndexWriter -
> > Previous took in ms 175, including wait 97
> > 2018-03-07 15:41:55,468 WARN  mapred.LocalJobRunner -
> > job_local1561152089_0001
> > java.lang.Exception: java.lang.NullPointerException at
> > org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.ja
> > va:462) at
> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:52
> > 9) Caused by: java.lang.NullPointerException at
> > org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScori
> > ngFilter.java:171)
> > at
> > org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.ja
> > va:120)
> > at
> > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java
> > :296)
> > at
> > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java
> > :57) at
> > org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
> > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> > at
> >
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(Loc
> > alJobRunner.java:319) at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511
> > ) at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
> > ava:1149)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
> > java:624) at java.lang.Thread.run(Thread.java:748)
> > 2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer:
> > java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
> > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
> > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)

Re: Regarding Internal Links

2018-03-07 Thread Sebastian Nagel
 successful, you don't need to change it. contentmeta is all the
>> information that was gathered about this page before parsing, so again, you
>> probably just want to keep it, and finally parsemeta is the metadata that
>> was gathered during parsing and may be useful for indexing, so passing the
>> metadata from the original ParseResult makes sense, or just using the
>> constructor that does not require it if you don't care about the metadata.
>> This should all be easier to understand if you look at what the HTML
>> Parser does with each of these fields.
>>
>>> -Original Message-
>>> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
>>> Sent: 06 March 2018 20:17
>>> To: user@nutch.apache.org
>>> Subject: RE: Regarding Internal Links
>>>
>>> I am able to get the ParseText data structure,
>>> but I'm having trouble with ParseData, as its constructor asks for parsestatus,
>>> outlinks, contentmeta and parsemeta.
>>> Outlinks I can get from OutlinkExtractor, but what about the other parameters?
>>> And getOutlinks asks for a Configuration, and I don't know where I can get
>>> that from.
>>>
>>> On 6 Mar 2018 18:32, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
>>>
>>>> You should go over each segment, and for each one produce a ParseText
>>>> and a ParseData. This is basically what the HTML Parser does for the
>>>> whole document, which is why I suggested you should dive into its code.
>>>> A ParseText is basically just a String containing the actual content
>>>> of the segment (after stripping the HTML tags). This is usually the
>>>> document you want to index.
>>>> The ParseData structure is a little more complex, but the main things
>>>> it contains are the title of this segment, and the outlinks from the
>>>> segment (for further crawling). Take a look at the code of both
>>>> classes and it should be relatively clear.
>>>> Finally, you need to build one ParseResult object, with the original
>>>> URL, and for each of the ParseText/ParseData pairs, call the put
>>>> method, with the internal URL of the segment as the key.
>>>>
>>>>> -Original Message-
>>>>> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
>>>>> Sent: 06 March 2018 14:45
>>>>> To: user@nutch.apache.org
>>>>> Subject: RE: Regarding Internal Links
>>>>>
>>>>>> I am able to get the content corresponding to each internal link
>>>>>> by writing a parse filter plugin. Now I am not sure how to
>>>>>> proceed further. How can I parse them as separate documents, and
>>>>>> what should my ParseResult filter return?
>>>>
>>>>
>>
>>
> 



Re: Regarding Internal Links

2018-03-07 Thread Yash Thenuan Thenuan
Thanks Yossi, I am now able to parse the data successfully, but I am getting
an error at indexing time.
Below are the Hadoop logs for indexing.

ElasticRestIndexWriter
elastic.rest.host : hostname
elastic.rest.port : port
elastic.rest.index : elastic index command
elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500
~2.5MB)


2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
crawldb: crawl/crawldb
2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
linkdb: crawl/linkdb
2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20180307130959
2018-03-07 15:41:53,677 INFO  anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2018-03-07 15:41:54,861 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
2018-03-07 15:41:55,168 INFO  client.AbstractJestClient - Setting server
pool to a list of 1 servers: [http://localhost:9200]
2018-03-07 15:41:55,170 INFO  client.JestClientFactory - Using multi
thread/connection supporting pooling connection manager
2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Using default GSON
instance
2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Node Discovery
disabled...
2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Idle connection
reaping disabled...
2018-03-07 15:41:55,282 INFO  elasticrest.ElasticRestIndexWriter -
Processing remaining requests [docs = 1, length = 210402, total docs = 1]
2018-03-07 15:41:55,361 INFO  elasticrest.ElasticRestIndexWriter -
Processing to finalize last execute
2018-03-07 15:41:55,458 INFO  elasticrest.ElasticRestIndexWriter - Previous
took in ms 175, including wait 97
2018-03-07 15:41:55,468 WARN  mapred.LocalJobRunner -
job_local1561152089_0001
java.lang.Exception: java.lang.NullPointerException
at
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.NullPointerException
at
org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScoringFilter.java:171)
at
org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.java:120)
at
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:296)
at
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at
org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer:
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)


On Wed, Mar 7, 2018 at 12:30 AM, Yossi Tamari <yossi.tam...@pipl.com> wrote:

> Regarding the configuration parameter, your Parse Filter should expose a
> setConf method that receives a conf parameter. Keep that as a member
> variable and pass it where necessary.
> Regarding parsestatus, contentmeta and parsemeta, you're going to have to
> look at them yourself (probably in a debugger), but as a baseline, you can
> probably just use the values in the inbound ParseResult (of the whole
> document).
> More specifically, parsestatus indicates whether parsing was successful;
> unless your parsing can fail even when parsing of the whole document
> succeeded, you don't need to change it. contentmeta is all the information
> that was gathered about the page before parsing, so again you probably just
> want to keep it. Finally, parsemeta is the metadata gathered during parsing
> that may be useful for indexing, so passing on the metadata from the original
> ParseResult makes sense; alternatively, use the constructor that does not
> require it if you don't care about the metadata.
> This should all be easier to understand if you look at what the HTML
> Parser does with each of these fields.
>
> > -Original Message-
> > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > Sent: 06 March 2018 20:17
> > To: user@nutch.apache.org
> > Subject: RE: Regarding Internal Links
> >

RE: Regarding Internal Links

2018-03-06 Thread Yossi Tamari
Regarding the configuration parameter, your Parse Filter should expose a 
setConf method that receives a conf parameter. Keep that as a member variable 
and pass it where necessary.
Regarding parsestatus, contentmeta and parsemeta, you're going to have to look 
at them yourself (probably in a debugger), but as a baseline, you can probably 
just use the values in the inbound ParseResult (of the whole document).
More specifically, parsestatus indicates whether parsing was successful; unless 
your parsing can fail even when parsing of the whole document succeeded, you 
don't need to change it. contentmeta is all the information that was gathered 
about the page before parsing, so again you probably just want to keep it. 
Finally, parsemeta is the metadata gathered during parsing that may be useful 
for indexing, so passing on the metadata from the original ParseResult makes 
sense; alternatively, use the constructor that does not require it if you don't 
care about the metadata.
This should all be easier to understand if you look at what the HTML Parser 
does with each of these fields.
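
In code, the gist of the above might look like this sketch (a fragment, not a
complete plugin; "whole" stands for the whole-document Parse taken from the
inbound ParseResult, e.g. parseResult.get(content.getUrl()), and the other
names are illustrative):

private Configuration conf;

@Override
public void setConf(Configuration conf) {
  this.conf = conf;                                   // keep it as a member variable
}

@Override
public Configuration getConf() {
  return conf;
}

// ... later, inside filter(...), when building the ParseData for one section:
ParseData sectionData = new ParseData(
    whole.getData().getStatus(),                      // parsestatus of the whole document
    sectionTitle,                                     // title of this section
    OutlinkExtractor.getOutlinks(sectionText, conf),  // the conf kept above
    whole.getData().getContentMeta(),                 // contentmeta: keep as-is
    whole.getData().getParseMeta());                  // parsemeta: keep as-is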

> -Original Message-
> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> Sent: 06 March 2018 20:17
> To: user@nutch.apache.org
> Subject: RE: Regarding Internal Links
> 
> I am able to get the ParseText data structure,
> but I'm having trouble with ParseData, as its constructor asks for parsestatus,
> outlinks, contentmeta and parsemeta.
> Outlinks I can get from OutlinkExtractor, but what about the other parameters?
> And getOutlinks asks for a Configuration, and I don't know where I can get
> that from.
> 
> On 6 Mar 2018 18:32, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> 
> > You should go over each segment, and for each one produce a ParseText
> > and a ParseData. This is basically what the HTML Parser does for the
> > whole document, which is why I suggested you should dive into its code.
> > A ParseText is basically just a String containing the actual content
> > of the segment (after stripping the HTML tags). This is usually the
> > document you want to index.
> > The ParseData structure is a little more complex, but the main things
> > it contains are the title of this segment, and the outlinks from the
> > segment (for further crawling). Take a look at the code of both
> > classes and it should be relatively clear.
> > Finally, you need to build one ParseResult object, with the original
> > URL, and for each of the ParseText/ParseData pairs, call the put
> > method, with the internal URL of the segment as the key.
> >
> > > -----Original Message-
> > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > Sent: 06 March 2018 14:45
> > > To: user@nutch.apache.org
> > > Subject: RE: Regarding Internal Links
> > >
> > > > I am able to get the content corresponding to each internal link
> > > > by writing a parse filter plugin. Now I am not sure how to
> > > > proceed further. How can I parse them as separate documents, and
> > > > what should my ParseResult filter return?
> >
> >



RE: Regarding Internal Links

2018-03-06 Thread Yossi Tamari
You should go over each segment, and for each one produce a ParseText and a 
ParseData. This is basically what the HTML Parser does for the whole document, 
which is why I suggested you should dive into its code.
A ParseText is basically just a String containing the actual content of the 
segment (after stripping the HTML tags). This is usually the document you want 
to index.
The ParseData structure is a little more complex, but the main things it 
contains are the title of this segment, and the outlinks from the segment (for 
further crawling). Take a look at the code of both classes and it should be 
relatively clear.
Finally, you need to build one ParseResult object, with the original URL, and 
for each of the ParseText/ParseData pairs, call the put method, with the 
internal URL of the segment as the key.  
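
In code, that last step is essentially one call per segment (a sketch; the
variable names are made up):

// one ParseText/ParseData pair per segment, keyed by the segment's internal URL
parseResult.put(originalUrl + "#" + anchorName,   // e.g. .../NutchTutorial#Table_of_Contents
    new ParseText(segmentText),
    segmentData);
// the entry for originalUrl itself (the whole document) stays in the ParseResult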

> -Original Message-
> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> Sent: 06 March 2018 14:45
> To: user@nutch.apache.org
> Subject: RE: Regarding Internal Links
> 
> > I am able to get the content corresponding to each internal link by
> > writing a parse filter plugin. Now I am not sure how to proceed
> > further. How can I parse them as separate documents, and what should
> > my ParseResult filter return?



RE: Regarding Internal Links

2018-03-06 Thread Yash Thenuan Thenuan
> I am able to get the content corresponding to each internal link by
> writing a parse filter plugin. Now I am not sure how to proceed
> further. How can I parse them as separate documents, and what should
> my ParseResult filter return?


RE: Regarding Internal Links

2018-03-05 Thread Yossi Tamari
You will need to write an HTML Parser Filter plugin. It receives the DOM of the 
document as a parameter; you will have to scan it, isolate the relevant 
sections, and extract the content of those sections (probably copying code 
from the HTML parser). Your filter returns a ParseResult, which is really a map 
from a URL (with the anchor, in your case) to a Parse object, which the HTML 
parser creates. You can have as many of these as you want, as long as the URLs 
are different.

This is going to require that you dive into the code of the HTML Parser or the 
Tika Parser, there is no way around this.
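
As a starting point, a bare-bones filter along those lines could look like the
sketch below. It assumes each section of interest is an element carrying an id
attribute (i.e. a potential #anchor target); the class name, the way sections
are detected, and the titles are all illustrative, not the HTML parser's actual
code:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class InternalLinkParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    String url = content.getUrl();
    Parse whole = parseResult.get(url);       // the parse of the whole document
    addSections(doc, url, whole, parseResult);
    return parseResult;                       // the whole-document entry stays in the map
  }

  // Walk the DOM and add one extra entry for every element that has an id attribute.
  private void addSections(Node node, String url, Parse whole, ParseResult result) {
    if (node instanceof Element) {
      Element el = (Element) node;
      String id = el.getAttribute("id");
      if (id != null && !id.isEmpty()) {
        ParseData data = new ParseData(
            whole.getData().getStatus(),        // reuse the whole-document status
            id,                                 // crude section title
            whole.getData().getOutlinks(),      // or extract outlinks per section
            whole.getData().getContentMeta(),
            whole.getData().getParseMeta());
        result.put(url + "#" + id,              // key: URL plus the internal anchor
            new ParseText(el.getTextContent()), // section text without markup
            data);
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      addSections(children.item(i), url, whole, result);
    }
  }
}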

> -Original Message-
> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> Sent: 05 March 2018 13:59
> To: user@nutch.apache.org
> Subject: Re: Regarding Internal Links
> 
> Please help me out regarding this.
> It's urgent.
> 
> On 5 Mar 2018 15:41, "Yash Thenuan Thenuan" <rit2014...@iiita.ac.in> wrote:
> 
> > How can I achieve this in nutch 1.x?
> >
> > On 1 Mar 2018 22:30, "Sebastian Nagel" <wastl.na...@googlemail.com>
> wrote:
> >
> >> Hi,
> >>
> >> Yes, that's possible but only for Nutch 1.x:
> >> a ParseResult [1] may contain multiple ParseData objects each
> >> accessible by a separate URL.
> >> This feature is not available for 2.x [2].
> >>
> >> It's used by the feed parser plugin to add a single entry for every
> >> feed item.  Afaik, that's not supported out of the box for sections
> >> of a page (e.g., split by anchors or h1/h2/h3). You would need to
> >> write a parse-filter plugin to achieve this.
> >>
> >> I've once used it to index parts of a page identified by XPath
> >> expressions.
> >>
> >> Best,
> >> Sebastian
> >>
> >> [1] https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/
> >> nutch/parse/ParseResult.html
> >> [2] https://nutch.apache.org/apidocs/apidocs-2.3.1/org/apache/
> >> nutch/parse/Parse.html
> >>
> >>
> >> On 03/01/2018 08:02 AM, Yash Thenuan Thenuan wrote:
> >> > Hi there,
> >> > For example we have a url
> >> > https://wiki.apache.org/nutch/NutchTutorial#Table_of_Contents
> >> > here #Table_of_Contents is an internal link.
> >> > I want to separate the contents of the page on the basis of internal
> >> > links.
> >> > Is this possible in Nutch?
> >> > I want to index the contents of each internal link separately.
> >> >
> >>
> >>



Re: Regarding Internal Links

2018-03-05 Thread Yash Thenuan Thenuan
Please help me out regarding this.
It's urgent.

On 5 Mar 2018 15:41, "Yash Thenuan Thenuan"  wrote:

> How can I achieve this in nutch 1.x?
>
> On 1 Mar 2018 22:30, "Sebastian Nagel"  wrote:
>
>> Hi,
>>
>> Yes, that's possible but only for Nutch 1.x:
>> a ParseResult [1] may contain multiple ParseData objects
>> each accessible by a separate URL.
>> This feature is not available for 2.x [2].
>>
>> It's used by the feed parser plugin to add a single
>> entry for every feed item.  Afaik, that's not supported
>> out of the box for sections of a page (e.g., split by
>> anchors or h1/h2/h3). You would need to write a
>> parse-filter plugin to achieve this.
>>
>> I've once used it to index parts of a page identified
>> by XPath expressions.
>>
>> Best,
>> Sebastian
>>
>> [1] https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/
>> nutch/parse/ParseResult.html
>> [2] https://nutch.apache.org/apidocs/apidocs-2.3.1/org/apache/
>> nutch/parse/Parse.html
>>
>>
>> On 03/01/2018 08:02 AM, Yash Thenuan Thenuan wrote:
>> > Hi there,
>> > For example we have a url
>> > https://wiki.apache.org/nutch/NutchTutorial#Table_of_Contents
>> > here #Table_of_Contents is an internal link.
>> > I want to separate the contents of the page on the basis of internal
>> > links.
>> > Is this possible in Nutch?
>> > I want to index the contents of each internal link separately.
>> >
>>
>>


Re: Regarding Internal Links

2018-03-05 Thread Yash Thenuan Thenuan
How can I achieve this in Nutch 1.x?

On 1 Mar 2018 22:30, "Sebastian Nagel"  wrote:

> Hi,
>
> Yes, that's possible but only for Nutch 1.x:
> a ParseResult [1] may contain multiple ParseData objects
> each accessible by a separate URL.
> This feature is not available for 2.x [2].
>
> It's used by the feed parser plugin to add a single
> entry for every feed item.  Afaik, that's not supported
> out of the box for sections of a page (e.g., split by
> anchors or h1/h2/h3). You would need to write a
> parse-filter plugin to achieve this.
>
> I've once used it to index parts of a page identified
> by XPath expressions.
>
> Best,
> Sebastian
>
> [1] https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/parse/
> ParseResult.html
> [2] https://nutch.apache.org/apidocs/apidocs-2.3.1/org/
> apache/nutch/parse/Parse.html
>
>
> On 03/01/2018 08:02 AM, Yash Thenuan Thenuan wrote:
> > Hi there,
> > For example we have a url
> > https://wiki.apache.org/nutch/NutchTutorial#Table_of_Contents
> > here #Table_of_Contents is an internal link.
> > I want to separate the contents of the page on the basis of internal
> > links.
> > Is this possible in Nutch?
> > I want to index the contents of each internal link separately.
> >
>
>


Re: Regarding Internal Links

2018-03-01 Thread Sebastian Nagel
Hi,

Yes, that's possible but only for Nutch 1.x:
a ParseResult [1] may contain multiple ParseData objects
each accessible by a separate URL.
This feature is not available for 2.x [2].

It's used by the feed parser plugin to add a single
entry for every feed item.  Afaik, that's not supported
out of the box for sections of a page (e.g., split by
anchors or h1/h2/h3). You would need to write a
parse-filter plugin to achieve this.

I once used it to index parts of a page identified
by XPath expressions.

Best,
Sebastian

[1] 
https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/parse/ParseResult.html
[2] 
https://nutch.apache.org/apidocs/apidocs-2.3.1/org/apache/nutch/parse/Parse.html
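
For the XPath approach mentioned above, the part-selection step inside such a
filter might look roughly like this (a sketch using javax.xml.xpath against the
DocumentFragment that Nutch passes to the filter; the expression and variable
names are made up for illustration):

// 'doc' is the DocumentFragment handed to HtmlParseFilter.filter(...)
XPath xpath = XPathFactory.newInstance().newXPath();
try {
  // hypothetical expression: every element with an id attribute,
  // i.e. a potential #anchor target within the page
  NodeList parts = (NodeList) xpath.evaluate("//*[@id]", doc, XPathConstants.NODESET);
  for (int i = 0; i < parts.getLength(); i++) {
    Node part = parts.item(i);
    String anchor = part.getAttributes().getNamedItem("id").getNodeValue();
    String text = part.getTextContent();    // the content to index for this part
    // build a ParseText/ParseData pair and put it into the ParseResult
    // under url + "#" + anchor, as described earlier in the thread
  }
} catch (XPathExpressionException e) {
  // fall back to the unmodified whole-document parse
}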


On 03/01/2018 08:02 AM, Yash Thenuan Thenuan wrote:
> Hi there,
> For example we have a url
> https://wiki.apache.org/nutch/NutchTutorial#Table_of_Contents
> here #Table_of_Contents is an internal link.
> I want to separate the contents of the page on the basis of internal links.
> Is this possible in Nutch?
> I want to index the contents of each internal link separately.
>