Re: readseg dump and non-ASCII characters

2017-11-15 Thread Michael Coffey
Thanks for the note, Sebastian. Yes, it is the fetched HTML that I parse using 
python-based tools after getting it from readseg. This is an alternative I 
decided to use after having struggled with raw-binary-content and solr.
I figured the problem was readseg either decoding or encoding improperly, but 
I didn't know which. Your point #3 seems to say it's the decode that goes wrong 
because it doesn't consider the encoding of the fetched page.
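
To make the suspected failure concrete, here is a minimal Python sketch of what 
seems to happen when UTF-8 bytes pass through a default platform encoding that 
is not UTF-8. It is my own illustration, not Nutch code; the function name 
dump_roundtrip and the sample headline are hypothetical.

```python
def dump_roundtrip(raw: bytes, platform_encoding: str) -> bytes:
    """Imitate a dump that decodes bytes with the platform encoding and
    writes them back out with the same encoding. This is an assumption
    about the failure mode, not the actual readseg implementation."""
    # Bytes the codec cannot decode become U+FFFD, one per offending byte.
    text = raw.decode(platform_encoding, errors="replace")
    # U+FFFD has no ASCII representation, so each one is written as "?".
    return text.encode(platform_encoding, errors="replace")


headline = "It\u2019s here".encode("utf-8")  # U+2019 is a curly apostrophe

# With an ASCII platform encoding the 3-byte apostrophe is lost.
print(dump_roundtrip(headline, "ascii"))
# With a UTF-8 platform encoding the bytes survive unchanged.
print(dump_roundtrip(headline, "utf-8") == headline)
```

A curly apostrophe is three bytes in UTF-8, which would explain why a single 
character shows up as three question marks in the dump.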

A follow-up: I don't quite understand how "LC_ALL=en_US.utf8" would apply 
to a Hadoop job. Does it somehow propagate to all nodes in the cluster? Would 
it work just as well, or better, to use "-Dfile.encoding=UTF8" in the bin/nutch 
command?

  From: Sebastian Nagel 
 To: user@nutch.apache.org 
 Sent: Wednesday, November 15, 2017 5:18 AM
 Subject: Re: readseg dump and non-ASCII characters
   
Hi Michael,

from the arguments I guess you're interested in the raw/binary HTML content, 
right?
After a closer look I have no simple answer:

 1. HTML has no fixed encoding - it could be anything; pageA may have a
    different encoding than pageB.

 2. That's different for parsed text: it's a Java String internally

 3. "readseg dump" converts all data to a Java String using the default platform
    encoding. On Linux, with these locales installed, you may get different
    results for:
      LC_ALL=en_US.utf8  ./bin/nutch readseg -dump
      LC_ALL=en_US       ./bin/nutch readseg -dump
      LC_ALL=ru_RU       ./bin/nutch readseg -dump
    If in doubt, try setting your platform encoding to UTF-8. Most pages
    nowadays are UTF-8.
    Btw., this behavior isn't ideal; it should be fixed as part of NUTCH-1807.

 4. A more reliable solution would require detecting the HTML encoding (the
    code is available in Nutch) and then converting the byte[] content using
    the right encoding.

Best,
Sebastian



On 11/15/2017 02:20 AM, Michael Coffey wrote:
> Greetings Nutchlings,
> I have been using readseg-dump successfully to retrieve content crawled by 
> nutch, but I have one significant problem: many non-ASCII characters appear 
> as '???' in the dumped text file. This happens fairly frequently in the 
> headlines of news sites that I crawl, for things like quotes, apostrophes, 
> and dashes.
> Am I doing something wrong, or is this a known bug? I use a python utf8 
> decoder, so it would be nice if everything were UTF8.
> Here is the command that I use to dump each segment (using nutch 1.12):
> bin/nutch readseg -dump segPath destPath -noparse -noparsedata -noparsetext -nogenerate
> It is so close to working perfectly!
> 



   

Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Michael Coffey
Also, try the boilerpipe demo online at https://boilerpipe-web.appspot.com/


From: Markus Jelsma 
To: "user@nutch.apache.org"  
Sent: Wednesday, November 15, 2017 2:06 PM
Subject: RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling



As I remember, the DefaultExtractor gives the same results as ArticleExtractor, 
which is fine for contiguous regions of text. CanolaExtractor must be used if 
you expect lots of non-contiguous regions of text. The latter is also more 
prone to picking up the boilerplate text you want to avoid in the first place.


By the way, if you intend to extract CJK websites you need to manually modify 
Boilerpipe to take into account the different character-to-information ratio, 
or try Canola.



RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Markus Jelsma
As I remember, the DefaultExtractor gives the same results as ArticleExtractor, 
which is fine for contiguous regions of text. CanolaExtractor must be used if 
you expect lots of non-contiguous regions of text. The latter is also more 
prone to picking up the boilerplate text you want to avoid in the first place.

By the way, if you intend to extract CJK websites you need to manually modify 
Boilerpipe to take into account the different character-to-information ratio, 
or try Canola.
 
-Original message-
> From:Michael Coffey 
> Sent: Wednesday 15th November 2017 23:00
> To: user@nutch.apache.org
> Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while 
> crawling
> 
> I found a lot of detail about the boilerpipe algorithm in 
> http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
> 
> 
> Seems like very short paragraphs can be a problem, since one of the primary 
> features used for determining boilerplate is the length of a given text block.
> 
> I would also look into the tika.extractor.boilerpipe.algorithm setting. It 
> can be DefaultExtractor, ArticleExtractor or CanolaExtractor. I don't know 
> what the differences are, but I bet ArticleExtractor (the default algorithm ) 
> inserts the Title.

Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Michael Coffey
I found a lot of detail about the boilerpipe algorithm in 
http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf


Seems like very short paragraphs can be a problem, since one of the primary 
features used for determining boilerplate is the length of a given text block.
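
To show why short paragraphs are at risk, here is a toy Python sketch of a 
length-based block classifier. It is only my illustration of the idea, not 
Boilerpipe's actual code; the TextBlock class, the is_content function and both 
thresholds are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    linked_words: int = 0  # number of words that sit inside <a> anchors


def is_content(block: TextBlock, min_words: int = 15,
               max_link_density: float = 0.33) -> bool:
    """Keep a block only if it is long enough and not dominated by links.
    Both thresholds are made-up illustration values."""
    words = len(block.text.split())
    if words == 0:
        return False
    link_density = block.linked_words / words
    return words >= min_words and link_density <= max_link_density


caption = TextBlock("Photo: the new office opened on Monday.")
article = TextBlock(" ".join(["word"] * 40))
menu = TextBlock(" ".join(["link"] * 20), linked_words=20)

for name, block in [("caption", caption), ("article", article), ("menu", menu)]:
    print(name, is_content(block))
```

With rules of this kind, a page that holds only an image and one or two lines 
of text has no block long enough to be kept, which would match the symptom 
reported in this thread.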

I would also look into the tika.extractor.boilerpipe.algorithm setting. It can 
be DefaultExtractor, ArticleExtractor or CanolaExtractor. I don't know what the 
differences are, but I bet ArticleExtractor (the default algorithm ) inserts 
the Title.




From: Markus Jelsma 
To: "user@nutch.apache.org"  
Sent: Wednesday, November 15, 2017 1:38 PM
Subject: RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling



Boilerpipe is a crude tool but cheap and effective enough for many sorts of 
websites. It does have a problem with pages with little text, just as all 
extractors have some degree of trouble with little text.


I believe Boilerpipe adds the title hardcoded, or it is TikaParser doing it. I 
am not sure, but remember you can get rid of it by removing some lines of code. 
See TikaParser.java, i think it is there.


Regards,

Markus


> non-open source contribution, you could try our extractor if you want, there 
> is a (low speed) test available at 
> https://www.openindex.io/saas/data-extraction/ . It is not free or open 
> source but available and actively developed, and does much more than just 
> text extraction.





RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Markus Jelsma
Boilerpipe is a crude tool but cheap and effective enough for many sorts of 
websites. It does have a problem with pages with little text, just as all 
extractors have some degree of trouble with little text.

I believe Boilerpipe adds the title hardcoded, or it is TikaParser doing it. I 
am not sure, but remember you can get rid of it by removing some lines of code. 
See TikaParser.java, i think it is there.

Regards,
Markus

> non-open source contribution, you could try our extractor if you want, there 
> is a (low speed) test available at 
> https://www.openindex.io/saas/data-extraction/ . It is not free or open 
> source but available and actively developed, and does much more than just 
> text extraction.


 
-Original message-
> From:Rushikesh K 
> Sent: Wednesday 15th November 2017 22:21
> To: user@nutch.apache.org; eru...@uci.cu
> Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while 
> crawling
> 
> Hello, 
> 
> 
> Eyeris - Thanks for your response. I was able to get tika boilerpipe working, 
> but now I have a different problem: some of the crawled pages don't have the 
> expected data. 
> For some pages it brings back only the Title and skips all the content. I am 
> not sure in which special cases it does this, but in my case I have two 
> problems now: 
> 1. When my page has an image and 1 or 2 lines of text, it doesn't get those 
> lines of data (the data is in the  tag). 
> 2. Why is it adding the Title to the start of the content? Is there a way 
> not to include that? 
> 
> For example, see the following image: for the first URL it came back without 
> any data. 

Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Rushikesh K
Hello,
Eyeris - Thanks for your response. I was able to get tika boilerpipe working,
but now I have a different problem: some of the crawled pages don't have the
expected data.
For some pages it brings back only the *Title* and skips all the content. I am
not sure in which special cases it does this, but in my case I have two
problems now:
1. When my page has an image and 1 or 2 lines of text, it doesn't get those
lines of data (the data is in the  tag).
2. Why is it adding the *Title* to the start of the *content*? Is there a way
not to include that?

For example, see the following image: for the first URL it came back without
any data.

[image: Inline image 1]


RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Markus Jelsma
You could do that, but you would need to fiddle around in TikaParser.java. 
Using TeeContentHandler you can add both the normal ContentHandler, and the 
Boilerpipe version.
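
The fan-out idea can be sketched in Python with the standard xml.sax module. 
This is only an analogue of the approach, not Tika or Nutch code: TeeHandler, 
FullText, MainText and extract_both are hypothetical names, and the tag-based 
MainText filter is merely a stand-in for what Boilerpipe would do.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TeeHandler(ContentHandler):
    """Forward every SAX event to all wrapped handlers."""

    def __init__(self, *handlers):
        super().__init__()
        self.handlers = handlers

    def startElement(self, name, attrs):
        for handler in self.handlers:
            handler.startElement(name, attrs)

    def endElement(self, name):
        for handler in self.handlers:
            handler.endElement(name)

    def characters(self, content):
        for handler in self.handlers:
            handler.characters(content)


class FullText(ContentHandler):
    """Collect all character data (the usual "content" field)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


class MainText(ContentHandler):
    """Collect text outside nav/header/footer (stand-in for Boilerpipe)."""

    SKIP = {"nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def startElement(self, name, attrs):
        if name in self.SKIP:
            self.skip_depth += 1

    def endElement(self, name):
        if name in self.SKIP:
            self.skip_depth -= 1

    def characters(self, content):
        if self.skip_depth == 0:
            self.chunks.append(content)


def extract_both(xhtml: str) -> dict:
    """Parse once and return both versions of the text."""
    full, main = FullText(), MainText()
    xml.sax.parseString(xhtml.encode("utf-8"), TeeHandler(full, main))
    return {"content": "".join(full.chunks), "plaintext": "".join(main.chunks)}


page = ("<html><body><nav>Home About</nav><p>Real article text.</p>"
        "<footer>Copyright</footer></body></html>")
print(extract_both(page))
```

The document is parsed once and each handler keeps its own output, so both 
texts could then be stored as separate fields.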

 
 
-Original message-
> From:Michael Coffey 
> Sent: Wednesday 15th November 2017 20:34
> To: user@nutch.apache.org
> Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while 
> crawling
> 
> I am curious, is it possible to send boilerpipe output to Solr as a separate 
> "plaintext" field, in addition to the usual "content" field (rather than 
> replacing it)? If so, would someone please give an overview of how to do it?
> 


Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Michael Coffey
I am curious, is it possible to send boilerpipe output to Solr as a separate 
"plaintext" field, in addition to the usual "content" field (rather than 
replacing it)? If so, would someone please give an overview of how to do it?


Why do I only get 28 records when I crawl the tutorial example of nutch.apache.org?

2017-11-15 Thread Sol Lederman
If I google for site:nutch.apache.org I get ~12,500 results. When I crawl
the site via nutch I get 28 records in the solr index.

Here's the relevant piece of my regex-urlfilter.txt file. It's just the
default that comes with nutch.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
# +.
+^http://([a-z0-9]*\.)*nutch.apache.org/


I'm sure I can find a number of examples of files that should be crawled
and aren't. Here's one example.

https://nutch.apache.org/javadoc.html has links to a number of
apidocs pages that are picked up by nutch. But, this page,
https://nutch.apache.org/miredot/1.12/index.html, is not picked up. It's
referenced like this:

1.13 (1.X branch)

I wouldn't imagine that relative links would be a problem as other relative
links are handled fine. And, I did click on that link and it doesn't stray
from nutch.apache.org.

I thought the problem might have to do with http vs. https. So, I changed
the last line of the filter to be this:

+^(http|https)://([a-z0-9]*\.)*nutch.apache.org/


When I did that then the /miredot/ url got fetched and parsed but the
urls indexed into Solr were the same as before including https.
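
The effect of the scheme change can be checked outside Nutch with a small 
Python sketch. It assumes the filter works the way I understand it: rules are 
tried in order, the first matching pattern decides, and a URL that matches no 
rule is rejected. The helper names make_rules and accepts are hypothetical.

```python
import re

SUFFIXES = (r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|"
            r"wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|"
            r"mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$")


def make_rules(accept_pattern):
    """Return the rules from the filter file above with a given accept line."""
    return [
        ("-", r"^(file|ftp|mailto):"),
        ("-", SUFFIXES),
        ("-", r"[?*!@=]"),
        ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
        ("+", accept_pattern),
    ]


HTTP_ONLY = make_rules(r"^http://([a-z0-9]*\.)*nutch.apache.org/")
HTTP_AND_HTTPS = make_rules(r"^(http|https)://([a-z0-9]*\.)*nutch.apache.org/")


def accepts(url, rules):
    """First matching rule wins; no match means the URL is dropped."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


url = "https://nutch.apache.org/miredot/1.12/index.html"
print(accepts(url, HTTP_ONLY))        # rejected: no rule matches an https URL
print(accepts(url, HTTP_AND_HTTPS))   # accepted once https is allowed
```

Under that assumption, the default accept line drops every https link, which 
is consistent with the /miredot/ page only being fetched after the change.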

What am I missing?

Thanks.

Sol


Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Eyeris Rodriguez Rueda
Hello.

I am using tika boilerpipe with good results on approximately 2000 websites. 
Rushikesh, if tika boilerpipe is not working for you, maybe it is because you 
are not parsing documents with tika. Please check this configuration
and tell us.

Make sure that the tika plugin is activated in the plugin.included property, then check:

***
Use tika parser instead of parse-html.

parse-plugins.xml








***

***
nutch-site.xml

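<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, 
  ArticleExtractor or CanolaExtractor.
  </description>
</property>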













- Original message -
From: "Markus Jelsma" 
To: user@nutch.apache.org
Sent: Tuesday, 14 November 2017 17:40:08
Subject: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Hello Rushikesh - why is Boilerpipe not working for you? Are you having trouble 
getting it configured? It is really just setting a boolean value. Or does it 
work, but not to your satisfaction?

The Bayan solution should work, theoretically, but just with a lot of tedious 
manual per-site configuration.

Regards,
Markus

-Original message-
> From:Rushikesh K 
> Sent: Tuesday 14th November 2017 23:30
> To: user@nutch.apache.org
> Cc: Sebastian Nagel ; betancourt.jo...@gmail.com
> Subject: Re: Removing header,Footer and left menus while crawling
> 
> Hello,
> 
> *Jorge*
> Thanks for the response, and sorry for the confusion. I am using Nutch 1.13,
> and I also tried configuring Tika boilerpipe with this version, but it
> doesn't work for me. You suggested writing my own parser, but I am not a
> Java developer.
> If you or anyone in the community has a working file, it would be
> great if you could share it, because many people are facing this
> issue (I learned this when I googled).
> 
> Mark Vega
> We also tried the Bayan Group extractor plugin with Nutch 1.13, but this is
> also not working. We followed the same steps. I can share the changes if you
> want to take a look.
> 
> I appreciate your quick suggestions!
> 
> Thanks
> Rushikesh
> 
> On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <
> betancourt.jo...@gmail.com> wrote:
> 
> > Hello Rushikesh,
> >
> > Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
> > could use the Tika boilerpipe implementation, on the nutch-site.xml you
> > need to enable this feature with:
> >
> > <property>
> >   <name>tika.extractor</name>
> >   <value>boilerpipe</value>
> >   <description>
> >   Which text extraction algorithm to use. Valid values are: boilerpipe or
> >   none.
> >   </description>
> > </property>
> >
> > And configure the proper extractor with
> > the tika.extractor.boilerpipe.algorithm setting.
> >
> > This is not a perfect solution, but I've used it successfully in the past.
> > Of course, your results will depend on the structure (markup) of the
> > website.
> >
> > Another option could be to implement your own parser if you need more
> > control over what to include/exclude from the HTML. You can take a look at
> > this issue https://issues.apache.org/jira/browse/NUTCH-585 which contains
> > some info and old patches.
> >
> > Best Regards,
> > Jorge
> >
> > On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K 
> > wrote:
> >
> > > Hello Sebastian,
> > > We are excited to be using Nutch 1.3 (with Solr 6.4) for crawling
> > > our website, and we are happy with the search results, but we have a
> > > requirement to skip the header, footer, left menus and some other parts
> > > of the page. Can you please advise how we can exclude those parts? I
> > > tried various approaches found on Google, but nothing works for me.
> > >
> > > I appreciate your help in advance!
> > > --
> > > Regards
> > > Rushikesh M
> > > .Net Developer
> > >
> >
> 
> 
> 
> -- 
> Regards
> Rushikesh M
> .Net Developer
> 


Re: readseg dump and non-ASCII characters

2017-11-15 Thread Sebastian Nagel
Hi Michael,

from the arguments I guess you're interested in the raw/binary HTML content, 
right?
After a closer look I have no simple answer:

 1. HTML has no fixed encoding - it could be anything; pageA may have a
    different encoding than pageB.

 2. That's different for parsed text: it's a Java String internally

 3. "readseg dump" converts all data to a Java String using the default platform
    encoding. On Linux, with these locales installed, you may get different
    results for:
      LC_ALL=en_US.utf8  ./bin/nutch readseg -dump
      LC_ALL=en_US       ./bin/nutch readseg -dump
      LC_ALL=ru_RU       ./bin/nutch readseg -dump
    If in doubt, try setting your platform encoding to UTF-8. Most pages
    nowadays are UTF-8.
    Btw., this behavior isn't ideal; it should be fixed as part of NUTCH-1807.

 4. A more reliable solution would require detecting the HTML encoding (the
    code is available in Nutch) and then converting the byte[] content using
    the right encoding.
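
Since the dumped content is post-processed with Python tools anyway, point 4 
could be approximated on that side. The following is a rough sketch under my 
own assumptions, not the detection code that ships with Nutch: sniff_encoding 
and decode_page are hypothetical names, and the order of checks (byte order 
mark, then HTTP charset, then a meta tag, then a UTF-8 default) is just one 
reasonable choice.

```python
import codecs
import re

# Looks for charset=... inside a <meta> tag near the top of the page.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def _known(name):
    """Return the name if Python has a codec for it, else None."""
    if not name:
        return None
    try:
        codecs.lookup(name)
    except LookupError:
        return None
    return name


def sniff_encoding(raw: bytes, http_charset=None, default="utf-8") -> str:
    """Guess the encoding of raw HTML bytes."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"            # this codec also strips the BOM
    if _known(http_charset):
        return http_charset           # charset from the Content-Type header
    match = _META_CHARSET.search(raw[:4096])
    if match:
        declared = match.group(1).decode("ascii")
        if _known(declared):
            return declared
    return default


def decode_page(raw: bytes, http_charset=None) -> str:
    """Convert the raw bytes to text using the sniffed encoding."""
    return raw.decode(sniff_encoding(raw, http_charset), errors="replace")


raw = ('<html><head><meta charset="windows-1251"></head>'
       '<body>\u041f\u0440\u0438\u0432\u0435\u0442</body></html>').encode("cp1251")
print(sniff_encoding(raw))
print(decode_page(raw))
```

This only works if the raw bytes reach Python untouched, so it does not help 
with a dump that has already been converted using the wrong platform encoding.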

Best,
Sebastian



On 11/15/2017 02:20 AM, Michael Coffey wrote:
> Greetings Nutchlings,
> I have been using readseg-dump successfully to retrieve content crawled by 
> nutch, but I have one significant problem: many non-ASCII characters appear 
> as '???' in the dumped text file. This happens fairly frequently in the 
> headlines of news sites that I crawl, for things like quotes, apostrophes, 
> and dashes.
> Am I doing something wrong, or is this a known bug? I use a python utf8 
> decoder, so it would be nice if everything were UTF8.
> Here is the command that I use to dump each segment (using nutch 1.12):
> bin/nutch readseg -dump segPath destPath -noparse -noparsedata -noparsetext -nogenerate
> It is so close to working perfectly!
>