Re: problems extracting outlinks

2017-08-09 Thread Carlos Pérez Miguel
Hi Sebastian,

Thank you for your answer. I am using Nutch 1.12, with the same plugins as you. I am
using this old version because I use a modified build (the modifications are not in
those plugins). I guess something changed in the parse-html plugin since my version.

Anyway, I think I found a clue about what is happening. This page is in
Catalan, a language in which single quotes (apostrophes) are very common. Most
of the attributes in the HTML code are surrounded by single quotes, and some of
those attribute values contain single quotes as well, so I think the parser is
confused by that. For example, on line 278 of that page we can see this tag:



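As an illustration of the pattern (a hypothetical tag, not the literal one from
the page), a single-quoted attribute value that itself contains an apostrophe can
end the attribute early and throw off the rest of the tag:

<!-- hypothetical example: the apostrophe in d'estalvi closes the single-quoted
     title value after the d, leaving stray text inside the tag -->
<a href='https://www.example.com/assegurances' title='Assegurança d'estalvi'>Assegurança d'estalvi</a>
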
Thanks,
Carlos

Carlos Pérez Miguel

2017-08-09 18:47 GMT+02:00 Sebastian Nagel :

> Hi Carlos,
>
> sorry but I'm not able to reproduce the problem using Nutch 1.14-SNAPSHOT
> and the call
>
> $ bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' \
>   https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio
>
> Could you tell us which Nutch version is used and also which plugins are
> enabled?
>
> Thanks,
> Sebastian
>
>
> On 08/09/2017 12:09 PM, Carlos Pérez Miguel wrote:
> > Hi,
> >
> > While crawling a site, I found that the crawl stopped earlier than expected
> > because lots of the URLs being downloaded were of the form:
> >
> > http://www.domain.com/something/"http://www.domain.com;
> >
> > After reading the HTML of the pages containing those outlinks I found that
> > the outlinks are not included in the source code, so I guess there may be
> > something incorrect in the page content or in the parse made by Nutch.
> > How can I know which problem it is? I am a little lost with this one.
> >
> > In order to see the problem:
> >
> > $ bin/nutch parsechecker
> > https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio
> >
> > And within the results we can see this particular outlink:
> >  outlink: toUrl:
> > https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/
> > "http://www.seguroscatalanaoccidente.com; anchor:
> > www.seguroscatalanaoccidente.com
> >
> > Is there any way to solve or avoid this? Maybe with the regex-urlfilter
> > file?
> >
> > Thanks
> >
> > Carlos Pérez Miguel
> >
>
>


Re: Custom IndexWriter never called on index command

2017-08-09 Thread Barnabás Balázs
Small follow-up tidbit:

The reduce function of IndexerMapReduce only receives CrawlDatums (line 198);
parseData/parseText is always null, so the function returns at line 261.

So the main question now is: why does the Indexer receive only CrawlDatums, when
the Parse step that runs before the Indexer creates the ParseData perfectly?
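
One way I can think of to narrow this down (a sketch; the paths are placeholders,
not my real crawl directories) is to check whether the segment handed to the index
job actually contains the parse output:

# placeholder paths, for illustration only
$ hadoop fs -ls /user/nutch/crawl/segments/20170809123456
# a fully parsed segment should list:
#   content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text
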
On 2017. 08. 09. 19:00:27, Barnabás Balázs  
wrote:
Dear community!

I'm relatively new to Nutch 1.x and got stumped on an indexing issue.
I have a local Java application that sends Nutch jobs to a remote Hadoop 
deployment for execution. The jobs are sent in the following order:
Inject -> Generate -> Fetch -> Parse -> Index -> Update -> Invertlinks
Once a round is finished it starts over. The commands are of course configured 
based on the previous one's results (when necessary).

This setup seems to work; for example, I can see that fetch gathers the correct
URLs. The problem is the Index stage. I implemented a custom IndexWriter that
should send data to Couchbase buckets and Kafka producers. However, even though
the plugin seems to be constructed correctly (I can see Kafka producer setup
records in the reduce log), the open/write/update functions are never called. I
put logging in each of them and also used remote debugging to make sure that they
are really never called.
I also used a debugger inside the IndexerMapReduce class and, to be honest, I'm
not sure where the IndexWriter is used, but the job definitely receives data (I
saw the fetched URLs).

I should mention that I also created an HTMLParseFilter plugin and that one 
works perfectly, so plugin deployment shouldn't be the issue. Also in the logs 
I can see the following:
Registered Plugins: ... Couchbase indexer (indexer-couchbase) ... 
org.apache.nutch.indexer.IndexWriters: Adding correct.package.Indexer
I've been stuck on this issue for a few days now; any help or ideas on why my
IndexWriter is never called when running an Indexer job would be appreciated.

Best,
Barnabas

Re: fetching pdfs from our website

2017-08-09 Thread d.ku...@technisat.de
Hey Sebastian,

Thanks a lot. I already increased it to around 65 MB. All our PDFs are about 3 to 8 MB big.
Any other suggestions?
;)
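
If it helps, a typical parse-plugins.xml mapping for PDFs looks roughly like this
(a sketch of the usual form, not a copy of my exact file):

<!-- sketch: map application/pdf to the Tika parser in parse-plugins.xml -->
<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>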



Thanks
David

> On 09.08.2017 at 18:50, Sebastian Nagel wrote:
> 
> Hi David,
> 
> for PDFs you usually need to increase the following property:
> 
> 
> <property>
>   <name>http.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content using the http
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
> 
> If in doubt, also set the equivalent properties ftp.content.limit and
> file.content.limit.
> 
> Best,
> Sebastian
> 
>> On 08/08/2017 03:00 PM, d.ku...@technisat.de wrote:
>> Hey,
>> 
>> currently we are on Nutch 2.3.1 and using it to crawl our websites.
>> One of our goals is to get all the PDFs on our website crawled. Links on
>> different websites look like: https://assets0.mysite.com/asset/DB_product.pdf
>> I tried different things:
>> In the configuration I removed every occurrence of pdf in
>> regex-urlfilter.txt and added the download URL, added parse-tika to the
>> plugins in nutch-site.xml, added application/pdf to http.accept in
>> default-site.xml, and added pdf to parse-plugins.xml.
>> But still no PDF link is being fetched.
>> 
>> regex-urlfilter.txt
>> +https://assets.*.mysite.com/asset
>> 
>> parse-plugins.xml
>> 
>> nutch-site.xml
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
>> </property>
>> 
>> default-site.xml
>> <property>
>>   <name>http.accept</name>
>>   <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
>>   <description>Value of the "Accept" request header field.</description>
>> </property>
>> 
>> Is there anything else I have to configure?
>> 
>> Thanks
>> 
>> David
>> 
>> 
>> 
> 


Custom IndexWriter never called on index command

2017-08-09 Thread Barnabás Balázs
Dear community!

I'm relatively new to Nutch 1.x and got stumped on an indexing issue.
I have a local Java application that sends Nutch jobs to a remote Hadoop 
deployment for execution. The jobs are sent in the following order:
Inject -> Generate -> Fetch -> Parse -> Index -> Update -> Invertlinks
Once a round is finished it starts over. The commands are of course configured 
based on the previous one's results (when necessary).

This setup seems to work; for example, I can see that fetch gathers the correct
URLs. The problem is the Index stage. I implemented a custom IndexWriter that
should send data to Couchbase buckets and Kafka producers. However, even though
the plugin seems to be constructed correctly (I can see Kafka producer setup
records in the reduce log), the open/write/update functions are never called. I
put logging in each of them and also used remote debugging to make sure that they
are really never called.
I also used a debugger inside the IndexerMapReduce class and, to be honest, I'm
not sure where the IndexWriter is used, but the job definitely receives data (I
saw the fetched URLs).
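
For comparison, the same step run from the command line is typically invoked
roughly like this (the paths are placeholders; my application builds the
equivalent job programmatically):

# placeholder paths, for illustration only: the Indexer reads the crawldb,
# optionally the linkdb, and one or more parsed segments
$ bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/20170809123456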

I should mention that I also created an HTMLParseFilter plugin and that one 
works perfectly, so plugin deployment shouldn't be the issue. Also in the logs 
I can see the following:
Registered Plugins: ... Couchbase indexer (indexer-couchbase) ... 
org.apache.nutch.indexer.IndexWriters: Adding correct.package.Indexer
I've been stuck on this issue for a few days now; any help or ideas on why my
IndexWriter is never called when running an Indexer job would be appreciated.

Best,
Barnabas

Re: fetching pdfs from our website

2017-08-09 Thread Sebastian Nagel
Hi David,

for PDFs you usually need to increase the following property:


<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>


If in doubt, also set the equivalent properties ftp.content.limit and
file.content.limit.
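
For example, an override in nutch-site.xml could look roughly like this (the value
is only an example; a negative value disables truncation entirely):

<!-- example override in nutch-site.xml: -1 disables truncation,
     or pick a size comfortably above the largest expected PDF -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>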

Best,
Sebastian

On 08/08/2017 03:00 PM, d.ku...@technisat.de wrote:
> Hey,
> 
> currently we are on Nutch 2.3.1 and using it to crawl our websites.
> One of our goals is to get all the PDFs on our website crawled. Links on
> different websites look like: https://assets0.mysite.com/asset/DB_product.pdf
> I tried different things:
> In the configuration I removed every occurrence of pdf in regex-urlfilter.txt
> and added the download URL, added parse-tika to the plugins in nutch-site.xml,
> added application/pdf to http.accept in default-site.xml, and added pdf to
> parse-plugins.xml.
> But still no PDF link is being fetched.
> 
> regex-urlfilter.txt
> +https://assets.*.mysite.com/asset
> 
> parse-plugins.xml
> 
> nutch-site.xml
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
> </property>
> 
> default-site.xml
> <property>
>   <name>http.accept</name>
>   <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
>   <description>Value of the "Accept" request header field.</description>
> </property>
> 
> Is there anything else I have to configure?
> 
> Thanks
> 
> David
> 
> 
> 



Re: problems extracting outlinks

2017-08-09 Thread Sebastian Nagel
Hi Carlos,

sorry but I'm not able to reproduce the problem using Nutch 1.14-SNAPSHOT and 
the call

$ bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' \
  
https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio

Could you tell us which Nutch version is used and also which plugins are 
enabled?

Thanks,
Sebastian


On 08/09/2017 12:09 PM, Carlos Pérez Miguel wrote:
> Hi,
> 
> While crawling a site, I found that the crawl stopped earlier than expected
> because lots of the URLs being downloaded were of the form:
> 
> http://www.domain.com/something/"http://www.domain.com;
> 
> After reading the HTML of the pages containing those outlinks I found that
> the outlinks are not included in the source code, so I guess there may be
> something incorrect in the page content or in the parse made by Nutch.
> How can I know which problem it is? I am a little lost with this one.
> 
> In order to see the problem:
> 
> $ bin/nutch parsechecker
> https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio
> 
> And within the results we can see this particular outlink:
>  outlink: toUrl:
> https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/
> "http://www.seguroscatalanaoccidente.com; anchor:
> www.seguroscatalanaoccidente.com
> 
> Is there any way to solve or avoid this? Maybe with the regex-urlfilter
> file?
> 
> Thanks
> 
> Carlos Pérez Miguel
> 



problems extracting outlinks

2017-08-09 Thread Carlos Pérez Miguel
Hi,

While crawling a site, I found that the crawl stopped earlier than expected
because lots of the URLs being downloaded were of the form:

http://www.domain.com/something/"http://www.domain.com;

After reading the HTML of the pages containing those outlinks I found that
the outlinks are not included in the source code, so I guess there may be
something incorrect in the page content or in the parse made by Nutch.
How can I know which problem it is? I am a little lost with this one.

In order to see the problem:

$ bin/nutch parsechecker
https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio

And within the results we can see this particular outlink:
 outlink: toUrl:
https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/
"http://www.seguroscatalanaoccidente.com; anchor:
www.seguroscatalanaoccidente.com

Is there any way to solve or avoid this? Maybe with the regex-urlfilter
file?
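
For instance, I wonder whether a rule like the following in regex-urlfilter.txt
would work as a workaround (just a sketch, placed before the final accept rule):

# sketch: skip any URL that contains a double-quote character
-["]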

Thanks

Carlos Pérez Miguel