Re: nutch adds %20 in urls instead of spaces

2024-01-09 Thread Markus Jelsma
Hello Steve,

Having those spaces normalized/encoded is expected behaviour with
urlnormalizer-basic active. I would recommend keeping it this way and having
all URLs in Solr properly encoded. Having spaces in Solr IDs is also not
recommended, as it can lead to unexpected behaviour.

If you really don't want them encoded, disable urlnormalizer-basic in your
configuration.
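For example, if your plugin.includes currently lists the normalizers as
urlnormalizer-(pass|regex|basic), dropping the basic one would look roughly
like this (the rest of the plugin list below is only an illustration, keep
your own):

 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex)</value>
 </property>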

Regards,
Markus

On Tue, Jan 9, 2024 at 19:20 Steve Cohen wrote:

> Hello,
>
> I am updating a nutch crawl that read files in directories that have
> spaces. The urls show %20 instead of spaces. This doesn't seem to be what
> the behavior was in the past.
>
> In nutch 1.10 I get these results
>
> Nutch 1.10
>
>
>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /nycor/10-15-2018 and on - Scanned
> Outlinks: 4
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2018/ anchor:
> 2018/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2019/ anchor:
> 2019/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2022/ anchor:
> 2022/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/Shipment Date
> Unknown/ anchor: Shipment Date Unknown/
>
> in Nutch 1.19, I get this
>
>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /nycor/10-15-2018 and on - Scanned
> Outlinks: 4
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/
> anchor: 2018/
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2019/
> anchor: 2019/
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2022/
> anchor: 2022/
>   outlink: toUrl:
> file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/
> anchor: Shipment Date Unknown/
>
> We are uploading to solr and the links aren't right with the %20s in the
> url. How do I remove the %20s?
>
> Thanks,
> Steve Cohen
>


Re: Nutch - Restriction by content type

2023-11-16 Thread Markus Jelsma
Hello,

You can skip certain types of documents based on their file extension using
the urlfilter-suffix plugin. It only filters known suffixes. Filtering based
on content type is not possible, because determining the content type
requires fetching (and parsing) the documents first.

You can skip specific content types when indexing using the Jexl indexing
filter.
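As a sketch, to drop common binary formats by URL suffix you would enable
urlfilter-suffix in plugin.includes and list the unwanted extensions in
conf/suffix-urlfilter.txt, roughly like this (the extensions are only
examples; see the comments in the file shipped with Nutch for the exact rule
syntax and mode flags):

 # conf/suffix-urlfilter.txt (sketch)
 .pdf
 .zip
 .jpg
 .png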

Regards,
Markus

On Thu, Nov 16, 2023 at 14:56 Raj Chidara wrote:

> Hello
>   Can we control crawling of web pages by its content type through any
> configuration setting?  For example, I want to crawl only pages whose
> content type is text/html from a website and does not want to crawl other
> pages/files.
>
>
>
> Thanks and Regards
>
> Raj Chidara


Re: Re[2]: Site is not crawling

2023-08-13 Thread Markus Jelsma
Hello Raj,

I see. Unfortunately, turning on Javascript-supporting protocol plugins such
as HtmlUnit or Selenium does not always solve the problem.

Maybe you can ask at the Selenium project about this. They are the experts
on that particular problem.

Regards,
Markus

On Tue, Aug 1, 2023 at 19:38 Raj Chidara wrote:

> Hello Markus
>   Now, I have removed all other protocol-* and given only
> protocol-selenium.  Now it crawled few pages.  However, there is no content
> read from pages.  All pages are shown as only with text *Home*
>
> Thanks and Regards
> Raj Chidara
>
>
>
>  On Mon, 30 Jan 2023 18:35:06 +0530 *Markus Jelsma
> >* wrote ---
>
> Yes, remove the other protocol-* plugins from the configuration. With all
> three active it is not always determined which one is going to do the
> work.
>
> Op ma 30 jan. 2023 om 12:50 schreef Raj Chidara :
>
>
> >
> > Hello Markus
> > Sorry for duplicate question. I added selenium plugin in
> > conf/nutch-default.xml and included following
> >
> > plugin.includes
> >
> >
> protocol-http|protocol-httpclient|protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> >
> > Still the site is not crawling. Are there any additional steps to be
> > followed for installation of selenium. Please suggest
> >
> >
> > Thanks and Regards
> >
> > Raj Chidara
> >
> > - Original Message -
> > From: Markus Jelsma (markus.jel...@openindex.io)
> > Date: 30-01-2023 16:26
> > To: user@nutch.apache.org
> > Subject: Re: Site is not crawling
> >
> > Hello Raj,
> >
> > I think the same question about the same site was asked here some time
> ago.
> > Anyway, this site loads its content via Javascript. You will need a
> > protocol plugin that supports it, either protocol-htmlunit, or
> > protocol-selenium, instead of protocol-http or any other.
> >
> > Change the configuration for plugin.includes, and it should work.
> >
> > Markus
> >
> > Op ma 30 jan. 2023 om 10:39 schreef Raj Chidara <
> raj.chid...@ddismart.com
> > >:
> >
> > >
> > > Hello,
> > >
> > > Nutch is not able crawl this site. Are there any nutch configuration
> > > changes required for this site?
> > >
> > > https://www.ich.org/
> > >
> > >
> > > Thanks and Regards
> > >
> > > Raj Chidara
> > >
> > >
> > >
> >
> >
>
>
>
>


Re: Nutch Exception

2023-07-24 Thread Markus Jelsma
Hello,

Please check the logs for more information.
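In local mode the details behind a FAILED job usually end up in the Hadoop
log under the Nutch runtime directory, e.g. (path may differ per setup):

 $ tail -n 200 runtime/local/logs/hadoop.log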

Regards,
Markus

On Mon, Jul 24, 2023 at 19:05 Raj Chidara wrote:

> Hi
>
>   Nutch 1.19 compiled with ant without any errors and when running
> Injector, getting an error that
>
>
>
> 19:20:25.055 [main] ERROR org.apache.nutch.crawl.Injector - Injector job
> did not succeed, job id: job_local952809651_0001, job status: FAILED,
> reason: NA
>
> Exception in thread "main" java.lang.RuntimeException: Injector job did
> not succeed, job id: job_local952809651_0001, job status: FAILED, reason: NA
>
> at org.apache.nutch.crawl.Injector.inject(Injector.java:442)
>
> at org.apache.nutch.crawl.Injector.inject(Injector.java:365)
>
> at org.apache.nutch.crawl.Injector.inject(Injector.java:360)
>
> at org.apache.nutch.crawl.Crawl.run(Crawl.java:249)
>
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
>
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:146)
>
>
>
>
>
> Thanks and Regards
>
> Raj Chidara


Re: Nutch 1.19/Hadoop compatible

2023-03-07 Thread Markus Jelsma
Hello Mike,

> Is nutch 1.19 compatible with Hadoop 3.3.4?
Yes!

Regards,
Markus

On Tue, Mar 7, 2023 at 17:37 Mike wrote:

> Hello!
>
> Is nutch 1.19 compatible with Hadoop 3.3.4?
>
>
> Thanks!
>
> mike
>


Re: Capture and index match count on regex

2023-02-26 Thread Markus Jelsma
Hello Joe,

> Now I'd like to capture and index the count of forward slash characters
'/'

It seems you are trying to do that with the subcollection plugin; I don't
think that is going to work.

Instead, I would suggest writing a simple indexing plugin that does the
counting and adds the number of slashes to a field of the NutchDocument
object that is available there.

Check out the index-basic plugin as an example.
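A rough sketch of such a filter, assuming the IndexingFilter interface of
current Nutch 1.x (class, package and field name are hypothetical; you still
need to wrap it in a plugin with its own plugin.xml and add it to
plugin.includes):

 package org.example.nutch.indexer;           // hypothetical package

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.Text;
 import org.apache.nutch.crawl.CrawlDatum;
 import org.apache.nutch.crawl.Inlinks;
 import org.apache.nutch.indexer.IndexingException;
 import org.apache.nutch.indexer.IndexingFilter;
 import org.apache.nutch.indexer.NutchDocument;
 import org.apache.nutch.parse.Parse;

 /** Adds the number of '/' characters in the URL as a document field. */
 public class SlashCountIndexingFilter implements IndexingFilter {

   private Configuration conf;

   @Override
   public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
       CrawlDatum datum, Inlinks inlinks) throws IndexingException {
     // count the forward slashes in the URL and add the sum as a field
     long slashes = url.toString().chars().filter(c -> c == '/').count();
     doc.add("slashcount", String.valueOf(slashes)); // field name is an example
     return doc;
   }

   @Override
   public void setConf(Configuration conf) {
     this.conf = conf;
   }

   @Override
   public Configuration getConf() {
     return conf;
   }
 }

The field then still has to exist in your Solr schema, or be mapped in the
index writer configuration.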

Regards,
Markus

On Sun, Feb 26, 2023 at 00:57 Gilvary, Joseph wrote:

> Happy Saturday/Sunday,
>
> I parse some values with index-replace to get some strings I want to
> index, like:
>
>   id:dirsubcollection="https?:\/\/(.*?)([^\/]*)$"$1"
>   dirsubcollection="^[a-zA-Z0-9\.-]*\/"
>
>   id:lastsubdir="https?:\/\/(.*?)([^\/]*)$"$1"
>   lastsubdir="\/$"
>   lastsubdir="[a-zA-Z0-9\._-]*\/"
>
> Now I'd like to capture and index the count of forward slash characters
> '/' but I don't see a way to pull that from this plugin. Is there some
> other plugin I should look at? I appreciate any suggestions to solve this.
>
>  Thanks, stay safe, stay healthy,
>
>  Joe
>


Re: Re[2]: Site is not crawling

2023-01-30 Thread Markus Jelsma
Yes, remove the other protocol-* plugins from the configuration. With all
three active it is not deterministic which one is going to do the work.
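For example, starting from the value you posted, keeping only the Selenium
protocol plugin would look roughly like this (the rest of the plugin list is
taken from your mail, adjust it to your needs):

 <property>
   <name>plugin.includes</name>
   <value>protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 </property>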

On Mon, Jan 30, 2023 at 12:50 Raj Chidara wrote:

>
> Hello Markus
>   Sorry for duplicate question.  I added selenium plugin in
> conf/nutch-default.xml and included following
>
> plugin.includes
>
> protocol-http|protocol-httpclient|protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> Still the site is not crawling.  Are there any additional steps to be
> followed for installation of selenium. Please suggest
>
>
> Thanks and Regards
>
> Raj Chidara
>
> - Original Message -
> From: Markus Jelsma (markus.jel...@openindex.io)
> Date: 30-01-2023 16:26
> To: user@nutch.apache.org
> Subject: Re: Site is not crawling
>
> Hello Raj,
>
> I think the same question about the same site was asked here some time ago.
> Anyway, this site loads its content via Javascript. You will need a
> protocol plugin that supports it, either protocol-htmlunit, or
> protocol-selenium, instead of protocol-http or any other.
>
> Change the configuration for plugin.includes, and it should work.
>
> Markus
>
> Op ma 30 jan. 2023 om 10:39 schreef Raj Chidara  >:
>
> >
> > Hello,
> >
> >   Nutch is not able crawl this site.  Are there any nutch configuration
> > changes required for this site?
> >
> > https://www.ich.org/
> >
> >
> > Thanks and Regards
> >
> > Raj Chidara
> >
> >
> >
>
>


Re: Site is not crawling

2023-01-30 Thread Markus Jelsma
Hello Raj,

I think the same question about the same site was asked here some time ago.
Anyway, this site loads its content via Javascript. You will need a
protocol plugin that supports it, either protocol-htmlunit, or
protocol-selenium, instead of protocol-http or any other.

Change the configuration for plugin.includes, and it should work.

Markus

On Mon, Jan 30, 2023 at 10:39 Raj Chidara wrote:

>
> Hello,
>
>   Nutch is not able crawl this site.  Are there any nutch configuration
> changes required for this site?
>
> https://www.ich.org/
>
>
> Thanks and Regards
>
> Raj Chidara
>
>
>


Re: Nutch/Hadoop Cluster

2023-01-14 Thread Markus Jelsma
Hello Mike,

> would it pay off for me to put a hadoop cluster on top of the 3 servers.

Yes, for all the reasons Hadoop exists. It can be tedious to set up the
first time, and there are many components. But at least you have three
servers, which is roughly what ZooKeeper, which you will also need,
requires.

Ideally you would have some additional VMs to run the controlling Hadoop
programs and perhaps the Hadoop client nodes on. The workers can run on
bare metal.

> 1.) a server would not be integrated directly into the crawl process as a
master.

What do you mean? Can you elaborate?

> 2.) can I run multiple crawl jobs on one server?

Yes! Just have separate instances of Nutch home dirs on your Hadoop client
nodes, each having their own configuration.
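A minimal sketch of what that can look like (paths and crawl script options
are examples; check bin/crawl in your version for the exact arguments):

 # two independent Nutch installs, each with its own conf/nutch-site.xml
 /opt/nutch-crawlA/conf/nutch-site.xml   # settings for crawl A
 /opt/nutch-crawlB/conf/nutch-site.xml   # settings for crawl B

 # each crawl uses its own seed list, CrawlDB and segments
 cd /opt/nutch-crawlA && bin/crawl -i -s urls crawl_a 5
 cd /opt/nutch-crawlB && bin/crawl -i -s urls crawl_b 5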

Regards,
Markus

On Sat, Jan 14, 2023 at 18:42 Mike wrote:

> Hi!
>
> I am now crawling the internet in local mode in parallel with up to 10
> instances on 3 computers. would it pay off for me to put a hadoop cluster
> on top of the 3 servers.
>
> 1.) a server would not be integrated directly into the crawl process as a
> master.
> 2.) can I run multiple crawl jobs on one server?
>
> Thanks
>


Re: Not able to crawl ich

2022-12-17 Thread Markus Jelsma
Hello Raj,

This site loads its content via Javascript, so you need a protocol plugin
that supports it. HtmlUnit does not seem to work with this site, but
Selenium does. Please change your protocol plugin accordingly in your
plugin.includes configuration directive.

I tested it with our own parser, as I have no Nutch here at the moment. But
it has support for Selenium, so it should work, even though the version is a
bit outdated.

Regards,
Markus

On Sat, Dec 17, 2022 at 10:28 Raj Chidara wrote:

>
> Hi
>   I am not able to crawl this site https://www.ich.org/.  Can any one
> suggest a solution for this.  This site does not has robots.txt file.  When
> I try to check robots.txt, site is shown as under construction and
> returning response status 200.  Could it be any reason for issue?
>
>
>
> Thanks and Regards
>
> Raj Chidara
>
>
>
>
>
>
>


Re: CSV indexer file data overwriting

2022-11-25 Thread Markus Jelsma
Hi Paul, the account has been created. You should receive an email from
Jira in your inbox or spam box.

Thanks,
Markus

On Fri, Nov 25, 2022 at 14:01 Paul Escobar <paul.escobar.mos...@gmail.com> wrote:

> Hello Markus,
>
> I'm very comfortable with your proposal, open source projects must take
> advantage of any little contribution no matter the way.
>
> Best,
>
> El vie, 25 nov 2022 a las 7:21, Markus Jelsma ( >)
> escribió:
>
> > Hello Paul,
> >
> > > I tried to comment on this jira issue, but I don't have access,
> > unfortunately I don't know how to do it.
> >
> > Due to too much spam, it is no longer possible to create an account for
> > yourself, but we can do that for you if you wish
> >
> > Regards,
> > Markus
> >
> > Op do 24 nov. 2022 om 22:46 schreef Paul Escobar <
> > paul.escobar.mos...@gmail.com>:
> >
> > > Hello Sebastian,
> > >
> > > I got it, csv indexer needs one task to run properly, I tested it and
> it
> > > worked. Thank you for the advice.
> > >
> > > I tried to comment on this jira issue, but I don't have access,
> > > unfortunately I don't know how to do it.
> > >
> > > I think if a commiter changed CSVIndexerWriter.java:
> > >
> > > if (fs.exists(csvLocalOutFile)) {
> > >// clean-up
> > >LOG.warn("Removing existing output path {}", csvLocalOutFile);
> > >fs.delete(csvLocalOutFile, true);
> > > }
> > >
> > > Trying to append data instead of delete and create the file, the issue
> > > would be fixed in local mode, at least.
> > >
> > > Thanks again,
> > >
> > >
> > > El jue, 24 nov 2022 a las 7:38, Sebastian Nagel (<
> > > wastl.na...@googlemail.com>)
> > > escribió:
> > >
> > > > Hi Paul,
> > > >
> > > >  > the indexer was writing the
> > > >  > documents info in the file (nutch.csv) twice,
> > > >
> > > > Yes, I see. And now I know what I've overseen:
> > > >
> > > >   .../bin/nutch index -Dmapreduce.job.reduces=2
> > > >
> > > > You need to run the CSV indexer with only a single reducer.
> > > > In order to do so, please pass the option
> > > >--num-tasks 1
> > > > to the script bin/crawl.
> > > >
> > > > Alternatively, you could change
> > > >NUM_TASKS=2
> > > > in bin/crawl to
> > > >NUM_TASKS=1
> > > >
> > > > This is related to why at now you can't run the CSV indexer
> > > > in (pseudo)distributed mode, see my previous note:
> > > >
> > > >  > A final note: the CSV indexer only works in local mode, it does
> not
> > > yet
> > > >  > work in distributed mode (on a real Hadoop cluster). It was
> > initially
> > > >  > thought for debugging, not for larger production set up.
> > > >
> > > > The issue is described here:
> > > >https://issues.apache.org/jira/browse/NUTCH-2793
> > > >
> > > > It's a though one because a solution requires a change of the
> > IndexWriter
> > > > interface. Index writers are plugins and do not know from which
> reducer
> > > > task they are run and to which path on a distributed or parallelized
> > > system
> > > > they have to write. On Hadoop the writing the output is done in two
> > > steps:
> > > > write to a local file and then "commit" the output to the final
> > location
> > > > on the
> > > > distributed file system.
> > > >
> > > > But yes, should have a look again at this issue which is stalled
> since
> > > > quite
> > > > some time. Also because, it's now clear that you might run into
> issues
> > > even
> > > > in local mode.
> > > >
> > > > Thanks for reporting the issue! If you can, please also comment on
> the
> > > > Jira issue!
> > > >
> > > > Best,
> > > > Sebastian
> > > >
> > > >
> > > >
> > > >
> > >
> > > --
> > > Paul Escobar Mossos
> > > skype: paulescom
> > > telefono: +57 1 3006815404
> > >
> >
>
>
> --
> Paul Escobar Mossos
> skype: paulescom
> telefono: +57 1 3006815404
>


Re: CSV indexer file data overwriting

2022-11-25 Thread Markus Jelsma
Hello Paul,

> I tried to comment on this jira issue, but I don't have access,
unfortunately I don't know how to do it.

Due to too much spam, it is no longer possible to create an account for
yourself, but we can do that for you if you wish.

Regards,
Markus

On Thu, Nov 24, 2022 at 22:46 Paul Escobar <paul.escobar.mos...@gmail.com> wrote:

> Hello Sebastian,
>
> I got it, csv indexer needs one task to run properly, I tested it and it
> worked. Thank you for the advice.
>
> I tried to comment on this jira issue, but I don't have access,
> unfortunately I don't know how to do it.
>
> I think if a commiter changed CSVIndexerWriter.java:
>
> if (fs.exists(csvLocalOutFile)) {
>// clean-up
>LOG.warn("Removing existing output path {}", csvLocalOutFile);
>fs.delete(csvLocalOutFile, true);
> }
>
> Trying to append data instead of delete and create the file, the issue
> would be fixed in local mode, at least.
>
> Thanks again,
>
>
> El jue, 24 nov 2022 a las 7:38, Sebastian Nagel (<
> wastl.na...@googlemail.com>)
> escribió:
>
> > Hi Paul,
> >
> >  > the indexer was writing the
> >  > documents info in the file (nutch.csv) twice,
> >
> > Yes, I see. And now I know what I've overseen:
> >
> >   .../bin/nutch index -Dmapreduce.job.reduces=2
> >
> > You need to run the CSV indexer with only a single reducer.
> > In order to do so, please pass the option
> >--num-tasks 1
> > to the script bin/crawl.
> >
> > Alternatively, you could change
> >NUM_TASKS=2
> > in bin/crawl to
> >NUM_TASKS=1
> >
> > This is related to why at now you can't run the CSV indexer
> > in (pseudo)distributed mode, see my previous note:
> >
> >  > A final note: the CSV indexer only works in local mode, it does not
> yet
> >  > work in distributed mode (on a real Hadoop cluster). It was initially
> >  > thought for debugging, not for larger production set up.
> >
> > The issue is described here:
> >https://issues.apache.org/jira/browse/NUTCH-2793
> >
> > It's a though one because a solution requires a change of the IndexWriter
> > interface. Index writers are plugins and do not know from which reducer
> > task they are run and to which path on a distributed or parallelized
> system
> > they have to write. On Hadoop the writing the output is done in two
> steps:
> > write to a local file and then "commit" the output to the final location
> > on the
> > distributed file system.
> >
> > But yes, should have a look again at this issue which is stalled since
> > quite
> > some time. Also because, it's now clear that you might run into issues
> even
> > in local mode.
> >
> > Thanks for reporting the issue! If you can, please also comment on the
> > Jira issue!
> >
> > Best,
> > Sebastian
> >
> >
> >
> >
>
> --
> Paul Escobar Mossos
> skype: paulescom
> telefono: +57 1 3006815404
>


Re: Few websites not crawling

2022-11-23 Thread Markus Jelsma
Hello,

The German site is crawlable, but it does produce awful URLs with a
;jsessionid=<...> suffix attached to them. The Chinese site is all
Javascript; it requires the HtmlUnit or Selenium protocol plugin to work at
all, and there is no guarantee that it will.
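If those ;jsessionid suffixes get in the way, a rule along these lines in
conf/regex-normalize.xml (with urlnormalizer-regex enabled) should strip
them; recent Nutch versions ship a similar default rule, so check first
whether it is already there:

 <regex>
   <pattern>(?i);jsessionid=[^?&amp;#]*</pattern>
   <substitution></substitution>
 </regex>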

Regards,
Markus

On Wed, Nov 23, 2022 at 11:07 Raj Chidara wrote:

>
> I am not able to crawl these websites.  They do not have robots.txt file.
> Can any one suggest a solution for this
>
> https://www.cmde.org.cn/
>
> https://www.bfarm.de/EN/Home/_node.html
>
>
> Thanks and Regards
>
> Raj Chidara
>
>
>
>


Re: Incomplete TLD List

2022-11-08 Thread Markus Jelsma
Hello Mike,

You can try adding the TLD to conf/domain-suffixes.xml and see if it works.
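As a sketch, mirroring the layout of the existing entries in that file (copy
the exact element names from an entry that is already there):

 <tld domain="google">
   <status>IN_USE</status>
   <description>Google brand gTLD</description>
 </tld>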

Regards,
Markus

On Tue, Nov 8, 2022 at 11:16 Mike wrote:

> Hi!
> Some of the new TLDs are wrongly indexed by Nutch, is it possible to extend
> the TLD list?
>
> "url":"https://about.google/intl/en_FR/how-our-business-works/;,
> "tstamp":"2022-11-06T17:22:14.808Z",
> "domain":"google",
> "digest":"3b9a23d42f200392d12a697bbb8d4d87",
>
>
> Thanks
>
> Mike
>


Re: How should the headings plugin be configured?

2022-10-31 Thread Markus Jelsma
Hmmm, using a clean current Nutch I can get it to work with:

 
 <property>
   <name>http.agent.name</name>
   <value>NutchTest</value>
 </property>
 <property>
   <name>index.parse.md</name>
   <value>h1,h2</value>
 </property>
 <property>
   <name>plugin.includes</name>
   <value>headings|protocol-http|parse-tika|index-metadata</value>
 </property>


$ bin/nutch indexchecker https://nutch.apache.org/
digest :13584e71e6e09a71071936feb97892b8
h1 :Apache Nutch™
id :https://nutch.apache.org/

Can you check your configuration? Is a plugin name misspelled? Is the
headings plugin active during fetch/parse? Is the index-metadata plugin
active?

Regards,
Markus


On Mon, Oct 31, 2022 at 14:14 Mike wrote:

> Hello Markus!
>
> Thank you for taking care of my problem!
>
> I removed the metatag.h# fron index.parse.md but ntuch indexchecker do not
> show me still the fields.
>
> Am Mo., 31. Okt. 2022 um 12:56 Uhr schrieb Markus Jelsma <
> markus.jel...@openindex.io>:
>
> > Hello Mike,
> >
> > Please remove the metatag.* prefix in the index.parse.md config and i
> > think
> > you should be fine.
> >
> > Regards,
> > Markus
> >
> > Op ma 31 okt. 2022 om 12:32 schreef Mike :
> >
> > > Yes, sorry, I also forgot to post this setting:
> > >
> > > <property>
> > >   <name>index.parse.md</name>
> > >   <value>metatag.description,metatag.keywords,metatag.rating,metatag.h1,metatag.h2,metatag.h3,metatag.h4,metatag.h5,metatag.h6</value>
> > >   <description>Comma-separated list of keys to be taken from the parse metadata to
> > >   generate fields. Can be used e.g. for 'description' or 'keywords' provided that these
> > >   values are generated by a parser (see parse-metatags plugin)</description>
> > > </property>
> > >
> > > The Nutch parsechecker shows me the fields but the indexchecker
> doesn't.
> > >
> > > Am Mo., 31. Okt. 2022 um 04:51 Uhr schrieb Mike :
> > >
> > > > Hello!
> > > >
> > > > I've tried everything and set everything up and get the nutch
> headings
> > > > plugin working:
> > > >
> > > > nutch-site.xml
> > > >
> > > > protocol-okhttp
> > > >   
> > > >
> > > >
> > >
> >
> protocol-okhttp|...|parse-(html|tika|text|metatags)|index-(basic|anchor|more|metadata)|...|headings|nutch-extensionpoints
> > > > 
> > > >
> > > > schema.xml
> > > >
> > > >
> > > > 
> > > >  > > > multiValued="true"/>
> > > >  > > > multiValued="true"/>
> > > >  > > > multiValued="true"/>
> > > >  > > > multiValued="true"/>
> > > >  > > > multiValued="true"/>
> > > >  > > > multiValued="true"/>
> > > >
> > > > index-writers.xml
> > > >   
> > > >   
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > >   
> > > > ...
> > > >
> > > > After indexing to solr there are no HTML headings tags in my solr
> > index,
> > > > what's missing?
> > > >
> > > > thanks!
> > > >
> > >
> >
>


Re: How should the headings plugin be configured?

2022-10-31 Thread Markus Jelsma
Hello Mike,

Please remove the metatag.* prefix in the index.parse.md config and I think
you should be fine.
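In other words, something along these lines (assuming these are the keys
your parse plugins actually emit; verify with bin/nutch parsechecker):

 <property>
   <name>index.parse.md</name>
   <value>description,keywords,rating,h1,h2,h3,h4,h5,h6</value>
 </property>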

Regards,
Markus

On Mon, Oct 31, 2022 at 12:32 Mike wrote:

> Yes, sorry, I also forgot to post this setting:
>
> <property>
>   <name>index.parse.md</name>
>   <value>metatag.description,metatag.keywords,metatag.rating,metatag.h1,metatag.h2,metatag.h3,metatag.h4,metatag.h5,metatag.h6</value>
>   <description>Comma-separated list of keys to be taken from the parse metadata to
>   generate fields. Can be used e.g. for 'description' or 'keywords' provided that these
>   values are generated by a parser (see parse-metatags plugin)</description>
> </property>
>
> The Nutch parsechecker shows me the fields but the indexchecker doesn't.
>
> Am Mo., 31. Okt. 2022 um 04:51 Uhr schrieb Mike :
>
> > Hello!
> >
> > I've tried everything and set everything up and get the nutch headings
> > plugin working:
> >
> > nutch-site.xml
> >
> > protocol-okhttp
> >   
> >
> >
> protocol-okhttp|...|parse-(html|tika|text|metatags)|index-(basic|anchor|more|metadata)|...|headings|nutch-extensionpoints
> > 
> >
> > schema.xml
> >
> >
> > 
> >  > multiValued="true"/>
> >  > multiValued="true"/>
> >  > multiValued="true"/>
> >  > multiValued="true"/>
> >  > multiValued="true"/>
> >  > multiValued="true"/>
> >
> > index-writers.xml
> >   
> >   
> > 
> > 
> > 
> > 
> > 
> > 
> >   
> > ...
> >
> > After indexing to solr there are no HTML headings tags in my solr index,
> > what's missing?
> >
> > thanks!
> >
>


Re: How should the headings plugin be configured?

2022-10-31 Thread Markus Jelsma
Hello Mike,

I think it should be working just fine with it enabled in plugin.includes.
You can check Nutch's parser output by using:
$ bin/nutch parsechecker <url>

You should see one or more h# output fields present. You can then use the
index-metadata plugin to map the parser output fields to the indexer output
by setting the values for index.parse.md.

Regards,
Markus

On Mon, Oct 31, 2022 at 04:51 Mike wrote:

> Hello!
>
> I've tried everything and set everything up and get the nutch headings
> plugin working:
>
> nutch-site.xml
>
> protocol-okhttp
>   
>
>
> protocol-okhttp|...|parse-(html|tika|text|metatags)|index-(basic|anchor|more|metadata)|...|headings|nutch-extensionpoints
> 
>
> schema.xml
>
>
> 
>  multiValued="true"/>
>  multiValued="true"/>
>  multiValued="true"/>
>  multiValued="true"/>
>  multiValued="true"/>
>  multiValued="true"/>
>
> index-writers.xml
>   
>   
> 
> 
> 
> 
> 
> 
>   
> ...
>
> After indexing to solr there are no HTML headings tags in my solr index,
> what's missing?
>
> thanks!
>


Re: Nutch/Hadoop: Error (FreeGenerator job did not succeed)

2022-10-14 Thread Markus Jelsma
Hello,

You cannot just run Nutch's JAR like that on Hadoop, you need the large
.job file instead. If you build Nutch from source, you will get a
runtime/deploy directory. Upload its contents to a Hadoop client and run
Nutch commands using bin/nutch ... You will then automatically use the
large .job file that is on the same level as the bin directory.
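A rough sketch of that workflow (the HDFS paths are examples; freegen takes
the directory or file with URLs and the segments output directory):

 $ ant runtime          # builds runtime/local and runtime/deploy (.job file)
 $ cd runtime/deploy
 $ bin/nutch freegen /crawl/urls /crawl/segments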

Application log files on Hadoop are scattered all over the place. Select
individual map or reduce subtasks in the job's web UI, click deeper, and
inspect their logs; that is where the application logging ends up.

Good luck!
Markus

On Fri, Oct 14, 2022 at 16:18 Mike wrote:

> Hi!
>
> I've been using Nutch for a while but I'm new to hadoop. got a cluster with
> hadoop 3.2.3 installed.
>
> do i have to install nutch on the hadoop filesystem or can i run it
> "local"? the clients don't need more from nutch than the info on master in
> the command line: hadoop jar /home/debian/nutch40/lib/apache-nutch-1.19.jar
> org.apache.nutch.tools.FreeGenerator -conf /home/debian/
> nutch40/conf/nutch-default.xml
> -Dplugin.folder=/home/debian/nutch40/plugins/
> /crawl/urls//tranco-top350k-20221007.txt /home/debian/crawl/segments/
>
> I get an error on the command:
>
> Exception in thread "main" java.lang.RuntimeException: FreeGenerator job
> did not succeed, job id: job_1665751705815_0007, job status: FAILED,
> reason: Task failed task_1665751705815_0007_m_00
>
>
> Since I'm new I can't find the logs in hadoop properly yet.
>
> Is there a guide how to install Natch (1.19) on Hadoop that I can't find?
>
> Thanks
> Mike
>


Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-29 Thread Markus Jelsma
Hello Sebastian,

No, the JAR isn't present. Multiple JARs are missing, probably because they
are loaded after httpasyncclient. I checked the previously emptied Ivy
cache. The Ivy files are there, but the JAR is missing there too.

markus@midas:~$ ls .ivy2/cache/org.apache.httpcomponents/httpasyncclient/
ivy-4.1.4.xml  ivy-4.1.4.xml.original  ivydata-4.1.4.properties

I manually downloaded the JAR from [1] and added it to the jars/ directory
in the Ivy cache. It still cannot find the JAR, perhaps the Ivy cache needs
some more things than just adding the JAR manually.

The odd thing is that I got the URL below FROM the ivydata-4.1.4.properties
file in the cache.

Since Ralf can compile it without problems, it seems to be an issue on my
machine only. So Nutch seems fine, therefore +1.

Regards,
Markus

[1]
https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/


On Sun, Aug 28, 2022 at 12:05 Sebastian Nagel wrote:

> Hi Ralf,
>
> > It fetches it parses
>
> So a +1 ?
>
> Best,
> Sebastian
>
> On 8/25/22 05:22, BlackIce wrote:
> > nevermind I made a typo...
> >
> > It fetches it parses
> >
> > On Thu, Aug 25, 2022 at 3:42 AM BlackIce  wrote:
> >>
> >> so far... it doesn't select anything when creating segments:
> >> 0 records selected for fetching, exiting
> >>
> >> On Wed, Aug 24, 2022 at 3:02 PM BlackIce  wrote:
> >>>
> >>> I have been able to compile under OpenJDK 11
> >>> Have not done anything further so far
> >>> I'm gonna try to get to it this evening
> >>>
> >>> Greetz
> >>> Ralf
> >>>
> >>> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma
> >>>  wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Everything seems fine, the crawler seems fine when trying the binary
> >>>> distribution. The source won't work because this computer still cannot
> >>>> compile it. Clearing the local Ivy cache did not do much. This is the
> known
> >>>> compiler error with the elastic-indexer plugin:
> >>>> compile:
> >>>> [echo] Compiling plugin: indexer-elastic
> >>>>[javac] Compiling 3 source files to
> >>>> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
> >>>>[javac]
> >>>>
> /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
> >>>> error: package org.apache.http.impl.nio.client does not exist
> >>>>[javac] import
> org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
> >>>>[javac]   ^
> >>>>[javac] 1 error
> >>>>
> >>>>
> >>>> The binary distribution works fine though. I do see a lot of new
> messages
> >>>> when fetching:
> >>>> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters
> [LocalJobRunner
> >>>> Map Task Executor #0] Found 0 extensions at
> >>>> point:'org.apache.nutch.net.URLExemptionFilter'
> >>>>
> >>>> This is also new at start of each task:
> >>>> SLF4J: Class path contains multiple SLF4J bindings.
> >>>> SLF4J: Found binding in
> >>>>
> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> >>>>
> >>>> SLF4J: Found binding in
> >>>>
> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> >>>>
> >>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> >>>> explanation.
> >>>> SLF4J: Actual binding is of type
> >>>> [org.apache.logging.slf4j.Log4jLoggerFactory]
> >>>>
> >>>> And this one at the end of fetcher:
> >>>> log4j:WARN No appenders could be found for logger
> >>>> (org.apache.commons.httpclient.params.DefaultHttpParams).
> >>>> log4j:WARN Please initialize the log4j system properly.
> >>>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
> for
> >>>> more info.
> >>>>
> >>>> I am worried about the indexer-elastic plugin, maybe others have that
> >>>> problem too? Otherwise everything seems fine.
> >>>>
> >>>> Markus
> >>>>
> >>>> Op ma 22 aug. 2022 

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-24 Thread Markus Jelsma
Hi,

Everything seems fine; the crawler works when trying the binary
distribution. The source won't work because this computer still cannot
compile it. Clearing the local Ivy cache did not do much. This is the known
compiler error with the indexer-elastic plugin:
compile:
[echo] Compiling plugin: indexer-elastic
   [javac] Compiling 3 source files to
/home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
   [javac]
/home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
error: package org.apache.http.impl.nio.client does not exist
   [javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
   [javac]   ^
   [javac] 1 error


The binary distribution works fine though. I do see a lot of new messages
when fetching:
2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner
Map Task Executor #0] Found 0 extensions at
point:'org.apache.nutch.net.URLExemptionFilter'

This is also new at start of each task:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in
[jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type
[org.apache.logging.slf4j.Log4jLoggerFactory]

And this one at the end of fetcher:
log4j:WARN No appenders could be found for logger
(org.apache.commons.httpclient.params.DefaultHttpParams).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.

I am worried about the indexer-elastic plugin, maybe others have that
problem too? Otherwise everything seems fine.

Markus

On Mon, Aug 22, 2022 at 17:30 Sebastian Nagel wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.19 release is available at:
>
>https://dist.apache.org/repos/dist/dev/nutch/1.19/
>
> The release candidate is a zip and tar.gz archive of the binary and
> sources in:
>https://github.com/apache/nutch/tree/release-1.19
>
> In addition, a staged maven repository is available here:
>https://repository.apache.org/content/repositories/orgapachenutch-1020
>
> We addressed 87 issues:
>https://s.apache.org/lf6li
>
>
> Please vote on releasing this package as Apache Nutch 1.19.
> The vote is open for the next 72 hours and passes if a majority
> of at least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.19.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Sebastian
> (On behalf of the Nutch PMC)
>
> P.S.
> Here is my +1.
> - tested most of Nutch tools and run a test crawl on a single-node cluster
>   running Hadoop 3.3.4, see
>   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)
>


Re: [DISCUSS] Release 1.19 ?

2022-08-09 Thread Markus Jelsma
Sounds good!

I see we're still at Tika 2.3.0; I'll submit a patch to upgrade to the
current 2.4.1.

Thanks!
Markus

On Tue, Aug 9, 2022 at 09:11 Sebastian Nagel wrote:

> Hi all,
>
> more than 60 issues are done for Nutch 1.19
>
>   https://issues.apache.org/jira/projects/NUTCH/versions/12349580
>
> including
>  - important dependency upgrades
>- Hadoop 3.3.3
>- Any23 2.7
>- Tika 2.3.0
>  - plugin-specific URL stream handlers (NUTCH-2429)
>  - migration
>- from Java/JDK 8 to 11
>- from Log4j 1 to Log4j 2
>
> ... and various other fixes and improvements.
>
> The last release (1.18) happened in January 2021, so it's definitely high
> time
> to release 1.19. As usual, we'll check all remaining issues whether they
> should
> be fixed now or can be done in a later release.
>
> I would be ready to push a release candidate during the next two weeks and
> will start to work through the remaining issues and also check for
> dependency
> upgrades required to address potential vulnerabilities. Please, comment on
> issues you want to get fixed already in 1.19! Reviews of open pull
> requests and
> patches are also welcome!
>
> Thanks,
> Sebastian
>


Re: Does Nutch work with Hadoop Versions greater than 3.1.3?

2022-06-13 Thread Markus Jelsma
To add to Sebastian, it runs on Hadoop 3.3.x very well too. Actually, I have
never had a Hadoop version that could not run Nutch out of the box and
without issues.

On Mon, Jun 13, 2022 at 11:54 Sebastian Nagel wrote:

> Hi Michael,
>
> Nutch (1.18, and trunk/master) should work together with more recent Hadoop
> versions.
>
> At Common Crawl we use a modified Nutch version based on the recent trunk
> running on Hadoop 3.2.2 (soon 3.2.3) and Java 11, even on a mixed Hadoop
> cluster
> with x64 and arm64 AWS EC2 instances.
>
> But I'm sure there are more possible combinations.
>
> One important note: in trunk/master there is a yet unsolved regression
> caused by
> the newly introduced plugin-based URL stream handlers, see NUTCH-2936 and
> NUTCH-2949. Unless these are resolved, you need to undo these commits in
> order
> to run Nutch (built from trunk/master) in distributed mode.
>
> Best,
> Sebastian
>
> On 6/13/22 01:37, Michael Coffey wrote:
> > Do current 1.x versions of Nutch (1.18, and trunk/master) work with
> versions of Hadoop greater than 3.1.3? I ask because Hadoop 3.1.3 is from
> October 2019, and there are many newer versions available. For example,
> 3.1.4 came out in 2020, and there are 3.2.x and 3.3.x versions that came
> out this year.
> >
> > I don’t care about newer features in Hadoop, I just have general
> concerns about stability and security. I am working on reviving an old
> project and would like to put together the best possible infrastructure for
> the future.
> >
> >
>


OkHttp NoClassDefFoundError: okhttp3/Authenticator

2021-07-23 Thread Markus Jelsma
Hello,

With a 1.18 checkout I am trying the okhttp plugin. I couldn't get it to
work on 1.15 due to a NoClassDefFoundError, and now with 1.18 it still
doesn't work and throws another NoClassDefFoundError.

java.lang.NoClassDefFoundError: okhttp3/Authenticator
at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
at
java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3137)
at java.base/java.lang.Class.getConstructor0(Class.java:3342)
at java.base/java.lang.Class.getConstructor(Class.java:2151)
at
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:164)
at
org.apache.nutch.protocol.ProtocolFactory.getProtocolInstanceByExtension(ProtocolFactory.java:177)
at
org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:146)
at
org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:308)
Caused by: java.lang.ClassNotFoundException: okhttp3.Authenticator
at
java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
at
org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:104)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 8 more

Any ideas what's going on?

Thanks,
Markus


Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-01 Thread Markus Jelsma
Hello Sebastian,

We have always used vanilla Apache Hadoop on our own physical servers
running the latest Debian, which also runs on ARM. It will run HDFS and YARN
and any other custom job you can think of. It has snappy compression, which
is a massive improvement for large data-shuffling jobs; it runs on Java 11
and, if necessary, even on AWS, though I dislike that option.

You can easily read/write large files between HDFS and S3 without storing
them on the local filesystem, so it ticks that box too.

I don't know much about Docker, except that I don't like it either, but that
is personal. I do like vanilla Apache Hadoop.

Regards,
Markus



On Tue, Jun 1, 2021 at 16:35 Sebastian Nagel wrote:

> Hi,
>
> does anybody have a recommendation for a free and production-ready Hadoop
> setup?
>
> - HDFS + YARN
> - run Nutch but also other MapReduce and Spark-on-Yarn jobs
> - with native library support: libhadoop.so and compression
>libs (bzip2, zstd, snappy)
> - must run on AWS EC2 instances and read/write to S3
> - including smaller ones (2 vCPUs, 16 GiB RAM)
> - ideally,
>- Hadoop 3.3.0
>- Java 11 and
>- support to run on ARM machines
>
> So far, Common Crawl uses Cloudera CDH but with no free updates
> anymore we consider either to switch to Amazon EMR, a Cloudera
> subscription or to use vanilla Hadoop (esp. since only HDFS and YARN
> are required).
>
> A dockerized setup is also an option (at least, for development and
> testing). So far, I've looked on [1] - the upgrade to Hadoop 3.3.0
> was straight-forward [2]. But native library support is still missing.
>
> Thanks,
> Sebastian
>
> [1] https://github.com/big-data-europe/docker-hadoop
> [2]
> https://github.com/sebastian-nagel/docker-hadoop/tree/2.0.0-hadoop3.3.0-java11
>


Re: Crawling same domain URL's

2021-05-11 Thread Markus Jelsma
Hello Prateek,

You are right, it is limited by the number of CPU cores and how many
threads it can handle, but you can still process a million records per day
if you have a few cores. If you parse as a separate step, it can run even
faster.

Indeed, it won't work if you need to process 10 million records of the same
host every day. If you want to use Hadoop for this, you can opt for a custom
YARN application [1]. We have done that too for some of our distributed
tools; it works very nicely.

Regards,
Markus

[1]
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html

On Tue, May 11, 2021 at 14:54 prateek wrote:

> Hi Markus,
>
> Depending upon the core of the machine, I can only increase the number of
> threads upto a limit. After that performance degradation will come into the
> picture.
> So running a single mapper will still be a bottleneck in this case. I am
> looking for options to distribute the same domain URLs across various
> mappers. Not sure if that's even possible with Nutch or not.
>
> Regards
> Prateek
>
> On Tue, May 11, 2021 at 11:58 AM Markus Jelsma  >
> wrote:
>
> > Hello Prateet,
> >
> > If you want to fetch stuff from the same host/domain as fast as you want,
> > increase the number of threads, and the number of threads per queue. Then
> > decrease all the fetch delays.
> >
> > Regards,
> > Markus
> >
> > Op di 11 mei 2021 om 12:48 schreef prateek :
> >
> > > Hi Lewis,
> > >
> > > As mentioned earlier, it does not matter how many mappers I assign to
> > fetch
> > > tasks. Since all the URLs are of the same domain, everything will be
> > > assigned to the same mapper and all other mappers will have no task to
> > > execute. So I am looking for ways I can crawl the same domain URLs
> > quickly.
> > >
> > > Regards
> > > Prateek
> > >
> > > On Mon, May 10, 2021 at 1:02 AM Lewis John McGibbney <
> lewi...@apache.org
> > >
> > > wrote:
> > >
> > > > Hi Prateek,
> > > > mapred.map.tasks -->mapreduce.job.maps
> > > > mapred.reduce.tasks  -->mapreduce.job.reduces
> > > > You should be able to override in these in nutch-site.xml then
> publish
> > to
> > > > your Hadoop cluster.
> > > > lewismc
> > > >
> > > > On 2021/05/07 15:18:38, prateek  wrote:
> > > > > Hi,
> > > > >
> > > > > I am trying to crawl URLs belonging to the same domain (around
> 140k)
> > > and
> > > > > because of the fact that all the same domain URLs go to the same
> > > mapper,
> > > > > only one mapper is used for fetching. All others are just a waste
> of
> > > > > resources. These are the configurations I have tried till now but
> > it's
> > > > > still very slow.
> > > > >
> > > > > Attempt 1 -
> > > > > fetcher.threads.fetch : 10
> > > > > fetcher.server.delay : 1
> > > > > fetcher.threads.per.queue : 1,
> > > > > fetcher.server.min.delay : 0.0
> > > > >
> > > > > Attempt 2 -
> > > > > fetcher.threads.fetch : 10
> > > > > fetcher.server.delay : 1
> > > > > fetcher.threads.per.queue : 3,
> > > > > fetcher.server.min.delay : 0.5
> > > > >
> > > > > Is there a way to distribute the same domain URLs across all the
> > > > > fetcher.threads.fetch? I understand that in this case crawl delay
> > > cannot
> > > > be
> > > > > reinforced across different mappers but for my use case it's ok to
> > > crawl
> > > > > aggressively. So any suggestions?
> > > > >
> > > > > Regards
> > > > > Prateek
> > > > >
> > > >
> > >
> >
>


Re: Crawling same domain URL's

2021-05-11 Thread Markus Jelsma
Hello Prateek,

If you want to fetch stuff from the same host/domain as fast as you want,
increase the number of threads, and the number of threads per queue. Then
decrease all the fetch delays.
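For example, something like this in nutch-site.xml (the values are only an
illustration, and only advisable against hosts you own or have explicit
permission to hit this hard):

 <property>
   <name>fetcher.threads.fetch</name>
   <value>50</value>
 </property>
 <property>
   <name>fetcher.threads.per.queue</name>
   <value>10</value>
 </property>
 <property>
   <name>fetcher.server.delay</name>
   <value>0.0</value>
 </property>
 <property>
   <name>fetcher.server.min.delay</name>
   <value>0.0</value>
 </property>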

Regards,
Markus

On Tue, May 11, 2021 at 12:48 prateek wrote:

> Hi Lewis,
>
> As mentioned earlier, it does not matter how many mappers I assign to fetch
> tasks. Since all the URLs are of the same domain, everything will be
> assigned to the same mapper and all other mappers will have no task to
> execute. So I am looking for ways I can crawl the same domain URLs quickly.
>
> Regards
> Prateek
>
> On Mon, May 10, 2021 at 1:02 AM Lewis John McGibbney 
> wrote:
>
> > Hi Prateek,
> > mapred.map.tasks -->mapreduce.job.maps
> > mapred.reduce.tasks  -->mapreduce.job.reduces
> > You should be able to override in these in nutch-site.xml then publish to
> > your Hadoop cluster.
> > lewismc
> >
> > On 2021/05/07 15:18:38, prateek  wrote:
> > > Hi,
> > >
> > > I am trying to crawl URLs belonging to the same domain (around 140k)
> and
> > > because of the fact that all the same domain URLs go to the same
> mapper,
> > > only one mapper is used for fetching. All others are just a waste of
> > > resources. These are the configurations I have tried till now but it's
> > > still very slow.
> > >
> > > Attempt 1 -
> > > fetcher.threads.fetch : 10
> > > fetcher.server.delay : 1
> > > fetcher.threads.per.queue : 1,
> > > fetcher.server.min.delay : 0.0
> > >
> > > Attempt 2 -
> > > fetcher.threads.fetch : 10
> > > fetcher.server.delay : 1
> > > fetcher.threads.per.queue : 3,
> > > fetcher.server.min.delay : 0.5
> > >
> > > Is there a way to distribute the same domain URLs across all the
> > > fetcher.threads.fetch? I understand that in this case crawl delay
> cannot
> > be
> > > reinforced across different mappers but for my use case it's ok to
> crawl
> > > aggressively. So any suggestions?
> > >
> > > Regards
> > > Prateek
> > >
> >
>


Re: Nutch getting rid of older segments

2021-04-07 Thread Markus Jelsma
Hello Abhay,

You only need to keep or merge old segments if you 'quickly' need to reindex
the data and are unable to start with a fresh crawl. If you frequently
recrawl all URLs, e.g. every month, then segments older than a month can
safely be removed.

You can also do daily and monthly merges, like we do. This makes it possible
to revisit old data for research, in case websites change layout, or are no
longer a customer and are not being crawled anymore.
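For reference, merging is done with the segment merger, roughly like this
(paths are examples; run bin/nutch mergesegs without arguments to see the
exact options of your version):

 $ bin/nutch mergesegs crawl/merged_segments -dir crawl/segments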

Regards,
Markus

On Tue, Apr 6, 2021 at 21:54 Abhay Ratnaparkhi <abhay.ratnapar...@gmail.com> wrote:

> Hello,
>
> I have a large number of segments occupying disk space. It is a good
> strategy to delete old segments or it's better to merge them.
>
> Thank you
> Abhay
>


Re: Nutch Configure multiple fetch plugins

2021-04-02 Thread Markus Jelsma
Hello Abhay,

You can configure a protocol plugin per host using the
host-protocol-mapping.txt configuration file. Its usage is:
   <hostname> <protocol plugin id>   or   protocol:<protocol> <protocol plugin id>
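A small sketch of such a mapping file (the hosts are hypothetical):

 # conf/host-protocol-mapping.txt
 www.example.com     protocol-selenium
 protocol:http       protocol-http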

Regards,
Markus



On Fri, Apr 2, 2021 at 15:18 Abhay Ratnaparkhi <abhay.ratnapar...@gmail.com> wrote:

> Hello,
>
> I would like to know how  to configure multiple fetch plugins (like
> protocol-selenium for dynamic and protocol-http for static content)? I
> remember seeing this feature before but could't find it.
>
> Thank you
> ~Abhay
>


Re: EXTERNAL: Re: 301 perm redirect pages are still in Solr

2021-03-09 Thread Markus Jelsma
Hello Hany,

Sure, check these commands:

  solrclean   remove HTTP 301 and 404 documents from solr - DEPRECATED, use the clean command instead
  clean       remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
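Running the clean job against your CrawlDB then removes the gone and
redirected documents from the configured index, e.g. (path is an example):

 $ bin/nutch clean crawl/crawldb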

Regards,
Markus

On Tue, Mar 9, 2021 at 08:49 Hany NASR wrote:

> Hello Markus,
>
> I added the property in nutch-site.xml with no luck.
>
> The documents still exist in Solr; any advice?
>
> Regards,
> Hany
>
> From: Markus Jelsma 
> Sent: Monday, March 8, 2021 3:40 PM
> To: user@nutch.apache.org
> Subject: EXTERNAL: Re: 301 perm redirect pages are still in Solr
>
> Hello Hany,
>
> You need to tell the indexer to delete those record. This will help:
>
>   <property>
>     <name>indexer.delete</name>
>     <value>true</value>
>   </property>
> Regards,
> Markus
>
> Op ma 8 mrt. 2021 om 15:31 schreef Hany NASR  hany.n...@hsbc.com>.invalid>:
>
> > Hi All,
> >
> > I'm using Nutch 1.15, and figure out that permeant redirect pages (301)
> > are still indexed and not removed in Solr.
> >
> > When I exported the crawlDB I found the page Status: 5 (db_redir_perm).
> >
> > How can I keep Solr index up to date and make Nutch clean these pages
> > automatically?
> >
> > Regards,
> > Hany
> >


Re: 301 perm redirect pages are still in Solr

2021-03-08 Thread Markus Jelsma
Hello Hany,

You need to tell the indexer to delete those records. This will help:

  
 
  <property>
    <name>indexer.delete</name>
    <value>true</value>
  </property>

Regards,
Markus

On Mon, Mar 8, 2021 at 15:31 Hany NASR wrote:

> Hi All,
>
> I'm using Nutch 1.15, and figure out that permeant redirect pages (301)
> are still indexed and not removed in Solr.
>
> When I exported the crawlDB I found the page Status: 5 (db_redir_perm).
>
> How can I keep Solr index up to date and make Nutch clean these pages
> automatically?
>
> Regards,
> Hany
>


RE: [ANNOUNCE] Apache Nutch 1.17 Release

2020-07-02 Thread Markus Jelsma
Thanks Sebastian!

 
 
-Original message-
> From:Sebastian Nagel 
> Sent: Thursday 2nd July 2020 16:42
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org; annou...@apache.org
> Subject: [ANNOUNCE] Apache Nutch 1.17 Release
> 
> The Apache Nutch team is pleased to announce the release of
> Apache Nutch v1.17.
> 
> Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
> fine grained configuration, relying on Apache Hadoop™ data structures.
> 
> Source and binary distributions are available for download from the
> Apache Nutch download site:
>https://nutch.apache.org/downloads.html
> 
> Please verify signatures using the KEYS file available at the above
> location when downloading the release.
> 
> This release includes more than 60 bug fixes and improvements, the full
> list of changes can be seen in the release report
>   https://s.apache.org/ovhry
> Please also check the changelog for breaking changes:
>   https://apache.org/dist/nutch/1.17/CHANGES.txt
> 
> Thanks to everyone who contributed to this release!
> 


RE: Resolve by IP

2020-04-14 Thread Markus Jelsma
Hello Marcel,

You can use your /etc/hosts file for that purpose, assuming you are on Linux.
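For example, an entry like this (address and host are hypothetical) makes
the host resolve to the IP you want while crawling from that machine:

 # /etc/hosts
 192.0.2.10   www.example.com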

Regards,
Markus
 
-Original message-
> From:Marcel Haazen 
> Sent: Tuesday 14th April 2020 12:12
> To: user@nutch.apache.org
> Subject: Resolve by IP
> 
> Hi,
> I'm trying to crawl a specific domain but I cant use DNS for that ATM due to 
> some difficulties is there an property I can set per site to specify what IP 
> it should resolve to?
> I'm using Nutch 2.3.
> 
> 
> Best Regards,
> 
> Marcel Haazen
> trainee engineer
> 
> 


RE: Extracting XMP metadata from PDF for indexing Nutch 1.15

2019-12-31 Thread Markus Jelsma
Hello Joseph,

> Is there more documentation on having Nutch get what Tika sees into what Solr 
> will see?

No, but I believe you would want to check out the parsechecker and
indexchecker tools. These tools display what Tika sees and what will be sent
to Solr.
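For example (the URL is hypothetical):

 $ bin/nutch parsechecker -dumpText https://www.example.com/document.pdf
 $ bin/nutch indexchecker https://www.example.com/document.pdf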

Regards,
Markus
 
-Original message-
> From:Gilvary, Joseph 
> Sent: Tuesday 31st December 2019 14:19
> To: user@nutch.apache.org
> Subject: Extracting XMP metadata from PDF for indexing Nutch 1.15 
> 
> Happy New Year,
> 
> I've searched the archives and the web as best I can, tinkered with 
> nutch-site.xml and schema.xml, but I can't get XMP metadata that's in the 
> parse metadata into the Solr (7.6) index.
> 
> I want to index stuff like:
> 
> xmp:CreatorTool=PScript5.dll Version 5.2.2
> xmpTPg:NPages=23
> 
> I get the pdf:docinfo:created, pdf:docinfo:modified, etc. fine, but swapping 
> out ':' for '_' isn't working for the xmp stuff.
> 
> Is there more documentation on having Nutch get what Tika sees into what Solr 
> will see?
> 
> Any help appreciated.
> 
> Thanks,
> 
> Joe
> 


RE: Best and economical way of setting hadoop cluster for distributed crawling

2019-11-01 Thread Markus Jelsma
Hello Sachin,

You might want to check out the fetcher.* settings in your configuration.
They control how many threads run in total, how URLs are queued, what the
delay between fetches is, how many threads per queue there are, etc.

Keep in mind that if you do not own the server or have no explicit
permission, it is wise not to overdo it (the default settings are
recommended); you can easily bring down a website using Nutch, even in local
mode.

Regards,
Markus
 
 
-Original message-
> From:Sachin Mittal 
> Sent: Friday 1st November 2019 6:53
> To: user@nutch.apache.org
> Subject: Re: Best and economical way of setting hadoop cluster for 
> distributed crawling
> 
> Hi,
> I understood the point.
> I would also like to run nutch on my local machine.
> 
> So far I am running in standalone mode with default crawl script where
> fetch time limit is 180 minutes.
> What I have observed is that it usually fetches, parses and indexes 1800
> web pages.
> I am basically fetching the entire page and fetch process is one that takes
> maximum time.
> 
> I have a i7 processor with 16GB of RAM.
> 
> How can I increase the throughput here?
> What I have understood here is that in local mode there is only one thread
> doing the fetch?
> 
> I guess I would need multiple threads running in parallel.
> Would running nutch in pseudo distributed mode and answer here?
> It will then run multiple fetchers and I can increase my throughput.
> 
> Please let me know.
> 
> Thanks
> Sachin
> 
> 
> 
> 
> 
> 
> On Thu, Oct 31, 2019 at 2:40 AM Markus Jelsma 
> wrote:
> 
> > Hello Sachin,
> >
> > Nutch can run on Amazon AWS without trouble, and probably on any Hadoop
> > based provider. This is the most expensive option you have.
> >
> > Cheaper would be to rent some servers and install Hadoop yourself, getting
> > it up and running by hand on some servers will take the better part of a
> > day.
> >
> > The cheapest and easiest, and in almost all cases the best option, is not
> > to run Nutch on Hadoop and stay local. A local Nutch can easily handle a
> > couple of million URLs. So unless you want to crawl many different domains
> > and expect 10M+ URLs, stay local.
> >
> > When we first started our business almost a decade ago we rented VPSs
> > first and then physical machines. This ran fine for some years but when we
> > had the option to make some good investments, we bought our own hardware
> > and have been scaling up the cluster ever since. And with the previous and
> > most recent AMD based servers processing power became increasingly cheaper.
> >
> > If you need to scale up for long term, getting your own hardware is indeed
> > the best option.
> >
> > Regards,
> > Markus
> >
> >
> > -Original message-
> > > From:Sachin Mittal 
> > > Sent: Tuesday 22nd October 2019 15:59
> > > To: user@nutch.apache.org
> > > Subject: Best and economical way of setting hadoop cluster for
> > distributed crawling
> > >
> > > Hi,
> > > I have been running nutch in local mode and so far I am able to have a
> > good
> > > understanding on how it all works.
> > >
> > > I wanted to start with distributed crawling using some public cloud
> > > provider.
> > >
> > > I just wanted to know if fellow users have any experience in setting up
> > > nutch for distributed crawling.
> > >
> > > From nutch wiki I have some idea on what hardware requirements should be.
> > >
> > > I just wanted to know which of the public cloud providers (IaaS or PaaS)
> > > are good to setup hadoop clusters on. Basically ones on which it is easy
> > to
> > > setup/manage the cluster and ones which are easy on budget.
> > >
> > > Please let me know if you folks have any insights based on your
> > experiences.
> > >
> > > Thanks and Regards
> > > Sachin
> > >
> >
> 


RE: Nutch not crawling all pages

2019-10-30 Thread Markus Jelsma
Hello,

The CrawlDB does not lie, but you are two pages short of being indexed. That can 
happen for various reasons and is hard to debug. Bruno's point is valid, though: if 
you inject 50k URLs but end up with 39k in the DB, some were filtered out or 
multiple URLs were normalized back to the same entry.

My experience with websites that supposedly generate only valid URLs is that this 
assumption is almost never true. In our case, out of thousands of sites, maybe only 
a few of those with just a dozen URLs are free from errors, i.e. without ambiguous 
URLs, redirects, 404s or otherwise bogus entries.

Markus 
 
 
-Original message-
> From:Bruno Osiek 
> Sent: Wednesday 30th October 2019 23:51
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all pages
> 
> What is the output of the inject command, ie, when you inject the 5
> seeds justo before generating the first segment?
> 
> On Wed, Oct 30, 2019 at 3:18 PM Dave Beckstrom 
> wrote:
> 
> > Hi Markus,
> >
> > Thank you so much for the reply and the help!  The seed URL list is
> > generated from a CMS.  I'm doubtful that many of the urls would be for
> > redirects or missing pages as the CMS only writes out the urls for valid
> > pages.  It's got me stumped!
> >
> > Here is the result of the readdb.  Not sure why the dates are wonky.  The
> > date on the server is correct.  SOLR shows 39148 pages.
> >
> > TOTAL urls: 39164
> > shortest fetch interval:30 days, 00:00:00
> > avg fetch interval: 30 days, 00:07:10
> > longest fetch interval: 45 days, 00:00:00
> > earliest fetch time:Mon Nov 25 07:08:00 EST 2019
> > avg of fetch times: Wed Nov 27 18:46:00 EST 2019
> > latest fetch time:  Sat Dec 14 08:18:00 EST 2019
> > retry 0:39164
> > score quantile 0.01:1.8460402498021722E-4
> > score quantile 0.05:1.8460402498021722E-4
> > score quantile 0.1: 1.8460402498021722E-4
> > score quantile 0.2: 1.8642803479451686E-4
> > score quantile 0.25:1.8642803479451686E-4
> > score quantile 0.3: 1.960784284165129E-4
> > score quantile 0.4: 1.9663813566079454E-4
> > score quantile 0.5: 2.0251113164704293E-4
> > score quantile 0.6: 2.037905069300905E-4
> > score quantile 0.7: 2.1473052038345486E-4
> > score quantile 0.75:2.1473052038345486E-4
> > score quantile 0.8: 2.172968233935535E-4
> > score quantile 0.9: 2.429802336152917E-4
> > score quantile 0.95:2.4354603374376893E-4
> > score quantile 0.99:2.542474209925616E-4
> > min score:  3.0443254217971116E-5
> > avg score:      7.001118352666182E-4
> > max score:  1.3120110034942627
> > status 2 (db_fetched):  39150
> > status 3 (db_gone): 13
> > status 4 (db_redir_temp):   1
> > CrawlDb statistics: done
> >
> >
> >
> > On Wed, Oct 30, 2019 at 4:01 PM Markus Jelsma 
> > wrote:
> >
> > > Hello Dave,
> > >
> > > First you should check the CrawlDB using readdb -stats. My bet is that
> > > your set contains some redirects and gone (404), or transient errors. The
> > > number for fetched and notModified added up should be about the same as
> > the
> > > number of documents indexed.
> > >
> > > Regards,
> > > Markus
> > >
> > >
> > >
> > > -Original message-
> > > > From:Dave Beckstrom 
> > > > Sent: Wednesday 30th October 2019 20:00
> > > > To: user@nutch.apache.org
> > > > Subject: Nutch not crawling all pages
> > > >
> > > > Hi Everyone,
> > > >
> > > > I googled and researched and I am not finding any solutions.  I'm
> > hoping
> > > > someone here can help.
> > > >
> > > > I have txt files with about 50,000 seed urls that are fed to Nutch for
> > > > crawling and then indexing in SOLR.  However, it will not index more
> > than
> > > > about 39,000 pages no matter what I do.   The robots.txt file gives
> > Nutch
> > > > access to the entire site.
> > > >
> > > > This is a snippet of the last Nutch run:
> > > >
> > > > Generator: starting at 2019-10-30 14:44:38
> > > > Generator: Selecting best-scoring urls due for fetch.
> > > > Generator: filtering: false
> > > > Generator: normalizing: true
> > > > Generator: topN: 8
> > > > Generator: 0 records selected for fetching, exiting ...
> > > > Generate returned 1 (no new segments created)
> > > > Esc

RE: Best and economical way of setting hadoop cluster for distributed crawling

2019-10-30 Thread Markus Jelsma
Hello Sachin,

Nutch can run on Amazon AWS without trouble, and probably on any Hadoop based 
provider. This is the most expensive option you have.

Cheaper would be to rent some servers and install Hadoop yourself, getting it 
up and running by hand on some servers will take the better part of a day.

The cheapest and easiest, and in almost all cases the best option, is not to 
run Nutch on Hadoop and stay local. A local Nutch can easily handle a couple of 
million URLs. So unless you want to crawl many different domains and expect 
10M+ URLs, stay local.

When we first started our business almost a decade ago we rented VPSs first and 
then physical machines. This ran fine for some years but when we had the option 
to make some good investments, we bought our own hardware and have been scaling 
up the cluster ever since. And with the previous and most recent AMD based 
servers processing power became increasingly cheaper.

If you need to scale up for long term, getting your own hardware is indeed the 
best option.

Regards,
Markus
 
 
-Original message-
> From:Sachin Mittal 
> Sent: Tuesday 22nd October 2019 15:59
> To: user@nutch.apache.org
> Subject: Best and economical way of setting hadoop cluster for distributed 
> crawling
> 
> Hi,
> I have been running nutch in local mode and so far I am able to have a good
> understanding on how it all works.
> 
> I wanted to start with distributed crawling using some public cloud
> provider.
> 
> I just wanted to know if fellow users have any experience in setting up
> nutch for distributed crawling.
> 
> From nutch wiki I have some idea on what hardware requirements should be.
> 
> I just wanted to know which of the public cloud providers (IaaS or PaaS)
> are good to setup hadoop clusters on. Basically ones on which it is easy to
> setup/manage the cluster and ones which are easy on budget.
> 
> Please let me know if you folks have any insights based on your experiences.
> 
> Thanks and Regards
> Sachin
> 


RE: Nutch not crawling all pages

2019-10-30 Thread Markus Jelsma
Hello Dave,

First you should check the CrawlDB using readdb -stats. My bet is that your set 
contains some redirects and gone (404), or transient errors. The number for 
fetched and notModified added up should be about the same as the number of 
documents indexed.

Regards,
Markus

 
 
-Original message-
> From:Dave Beckstrom 
> Sent: Wednesday 30th October 2019 20:00
> To: user@nutch.apache.org
> Subject: Nutch not crawling all pages
> 
> Hi Everyone,
> 
> I googled and researched and I am not finding any solutions.  I'm hoping
> someone here can help.
> 
> I have txt files with about 50,000 seed urls that are fed to Nutch for
> crawling and then indexing in SOLR.  However, it will not index more than
> about 39,000 pages no matter what I do.   The robots.txt file gives Nutch
> access to the entire site.
> 
> This is a snippet of the last Nutch run:
> 
> Generator: starting at 2019-10-30 14:44:38
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: topN: 8
> Generator: 0 records selected for fetching, exiting ...
> Generate returned 1 (no new segments created)
> Escaping loop: no more URLs to fetch now
> 
> I ran that crawl about 5 or 6  times.  It seems to index about 6,000 pages
> per run.  I planned to keep running it until it hit the 50,000+ page mark
> which would indicate that all of the pages where indexed.  That last run it
> just ended without crawling anything more.
> 
> Below are some of the potentially relevent config settings.  I removed the
> "description" for brevity.
> 
> 
>   http.content.limit
>   -1
> 
> 
>  db.ignore.external.links
>  true
> 
> 
>  db.ignore.external.links.mode
>  byDomain
> 
> 
>   db.ignore.internal.links
>   false
> 
> 
>   db.update.additions.allowed
>   true
>  
>  
>  db.max.outlinks.per.page
>   -1
>  
>  
>   db.injector.overwrite
>   true
>  
> 
> Anyone have any suggestions?  It's odd that when you give Nutch a specific
> list of URLs to be crawled that it wouldn't crawl all of them.
> 
> I appreciate any help you can offer.   Thank you!
> 
> 
> 
> 
> 


RE: Adding specfic query parameters to nutch url filters

2019-10-21 Thread Markus Jelsma
Hello Sachin,

Once a URL gets filtered, by any plugin, it is rejected entirely.

If you want specific queries to pass the regex-urlfilter, you must let them pass 
explicitly above the -[?*!@=] line, e.g. +passThisQuery=

Use bin/nutch filterchecker -stdIn for quick testing.
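For example, in regex-urlfilter.txt (the parameter name "page" is just a 
placeholder for the query parameter you want to keep):

# allow URLs carrying this specific query parameter
+[?&]page=

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

and a quick test:

echo "https://example.com/list?page=2" | bin/nutch filterchecker -stdIn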

Regards,
Markus

-Original message-
> From:Sachin Mittal 
> Sent: Monday 21st October 2019 14:22
> To: user@nutch.apache.org
> Subject: Adding specfic query parameters to nutch url filters
> 
> Hi,
> I have checked the regex-urlfilter and by default I see this line:
> 
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> 
> In my case for a particular url I want to crawl a specific query, so wanted
> to know what file would be the best to make changes to enable this.
> 
> Would it be regex-urlfilter or I also see a filters file suffix-urlfilter
> and fast-urlfilter.
> 
> Would adding filters in any of the later two files would help.
> Any idea why these filters are added, like what would be the potential
> usecase.
> 
> Also say if I add multiple filter plugins backed by these files, then how
> url filtering works? Only those urls which pass all the plugins are
> selected to be fetched or any of the plugin?
> 
> Thanks
> Sachin
> 


Unable to index on Hadoop 3.2.0 with 1.16

2019-10-14 Thread Markus Jelsma
Hello,

We're upgrading our stuff to 1.16 and got a peculiar problem when we started 
indexing:

2019-10-14 13:50:30,586 WARN [main] org.apache.hadoop.mapred.YarnChild: 
Exception running child : java.lang.IllegalStateException: text width is less 
than 1, was <-41>
at org.apache.commons.lang3.Validate.validState(Validate.java:829)
at 
de.vandermeer.skb.interfaces.transformers.textformat.Text_To_FormattedText.transform(Text_To_FormattedText.java:215)
at 
de.vandermeer.asciitable.AT_Renderer.renderAsCollection(AT_Renderer.java:250)
at de.vandermeer.asciitable.AT_Renderer.render(AT_Renderer.java:128)
at de.vandermeer.asciitable.AsciiTable.render(AsciiTable.java:191)
at org.apache.nutch.indexer.IndexWriters.describe(IndexWriters.java:326)
at 
org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:45)
at 
org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.(ReduceTask.java:542)
at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)

The only IndexWriter we use is SolrIndexer, and locally everything is just 
fine. 

Any thoughts?

Thanks,
Markus


RE: [ANNOUNCE] Apache Nutch 1.16 Release

2019-10-14 Thread Markus Jelsma
Thanks Sebastian!
 
-Original message-
> From:Sebastian Nagel 
> Sent: Friday 11th October 2019 17:03
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org; annou...@apache.org
> Subject: [ANNOUNCE] Apache Nutch 1.16 Release
> 
> Hi folks!
> 
> The Apache Nutch [0] Project Management Committee are pleased to announce
> the immediate release of Apache Nutch v1.16. We advise all current users
> and developers to upgrade to this release.
> 
> Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
> fine grained configuration, relying on Apache Hadoop™ [1] data structures,
> which are great for batch processing.
> 
> As usual in the 1.X series, release artifacts are made available as both
> source and binary and also available within Maven Central [2] as a Maven
> dependency. The release is available from our downloads page [3].
> 
> This release includes more than 100 bug fixes and improvements, the full
> list of changes can be seen in the release report [4]. Please also check
> the changelog [5] for breaking changes.
> 
> 
> Thanks to all Nutch contributors which made this release possible,
> Sebastian (on behalf of the Nutch PMC)
> 
> 
> [0] https://nutch.apache.org/
> [1] https://hadoop.apache.org/
> [2]
> https://search.maven.org/search?q=g:org.apache.nutch%20AND%20a:nutch%20AND%20v:1.16
> [3] https://nutch.apache.org/downloads.html
> [4] https://s.apache.org/l2j94
> [5] https://dist.apache.org/repos/dist/release/nutch/1.16/CHANGES.txt
> 


RE: Excluding individual pages?

2019-10-10 Thread Markus Jelsma
Hello Dave,

If you have just one specific page you do not want Nutch to index, or Solr to 
show, you can either create a custom IndexingFilter that returns null 
(rejecting it) for the specified URL, or add an additional filterQuery to Solr, 
fq=-id:, filtering the specific URL from the results.
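For example (the URL is only a placeholder for your seed page, and this assumes the 
common setup where the Solr document id is the page URL):

fq=-id:"https://www.example.com/seed-list.html"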

If there are more than a few URLs you want to exclude from indexing, and they 
have a pattern, you can uses regular expressions in the IndexingFilter or Solr 
filterQuery.

This is manual intervention, and only possible if your set is small enough, and 
does not change frequently. If this is not the case, you need more rigorous 
tools to detect and reject - what we call - hub pages or overview pages.

Regards,
Markus
 
-Original message-
> From:Dave Beckstrom 
> Sent: Thursday 10th October 2019 22:34
> To: user@nutch.apache.org
> Subject: Excluding individual pages?
> 
> Hi Everyone,
> 
> I searched and didn't find an answer.
> 
> Nutch is indexing the content of the page that has the seed urls in it and
> then that page shows up in the SOLR search results.   We don't want that to
> happen.
> 
> Is there a way to have nutch crawl the seed url page but not push that page
> into SOLR?  If not, is there a way to have a particular page excluded from
> the SOLR search results?  Either way I'm trying to not have that page show
> in search results.
> 
> Thank you!
> 
> Dave
> 
> 
> 
> 
> 


RE: [VOTE] Release Apache Nutch 1.16 RC#1

2019-10-03 Thread Markus Jelsma
Hello Sebastian,

All tests pass nicely and i can easily run a crawl.

+1

Thanks,
Markus

By the way, what does this mean:
2019-10-03 12:48:49,696 INFO  crawl.Generator - Generator: number of items 
rejected during selection:
2019-10-03 12:48:49,698 INFO  crawl.Generator - Generator:  1  
SCHEDULE_REJECTED

 
 
-Original message-
> From:Sebastian Nagel 
> Sent: Wednesday 2nd October 2019 19:55
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org
> Subject: [VOTE] Release Apache Nutch 1.16 RC#1
> 
> Hi Folks,
> 
> A first candidate for the Nutch 1.16 release is available at:
> 
>    https://dist.apache.org/repos/dist/dev/nutch/1.16/
> 
> The release candidate is a zip and tar.gz archive of the binary and sources 
> in:
>    https://github.com/apache/nutch/tree/release-1.16
> 
> In addition, a staged maven repository is available here:
>    https://repository.apache.org/content/repositories/orgapachenutch-1017/
> 
> We addressed 104 Issues:
>    https://s.apache.org/l2j94
> 
> Please vote on releasing this package as Apache Nutch 1.16.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Nutch 1.16.
> [ ] -1 Do not release this package because…
> 
> Cheers,
> Sebastian
> (On behalf of the Nutch PMC)
> 
> P.S. Here is my +1.


RE: Nutch NTLM to IIS 8.5 - issues!

2019-04-25 Thread Markus Jelsma
Hello,

The log doesn't say much except that authentication failed, without giving a 
reason. You might want to set the log level to TRACE; the authenticator logs at 
that level. You could also check whether there are any server-side messages.
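For example, assuming the stock conf/log4j.properties, something like this raises 
the relevant categories (the auth.AuthChallengeProcessor entries already visible in 
your log come from this package):

log4j.logger.org.apache.commons.httpclient=TRACE
log4j.logger.org.apache.commons.httpclient.auth=TRACE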

Regards,
Markus
 
 
-Original message-
> From:Larry.Santello 
> Sent: Thursday 25th April 2019 15:28
> To: user@nutch.apache.org
> Subject: Nutch NTLM to IIS 8.5 - issues!
> 
> All -
> 
> I've tried several 1.x versions of Nutch and a variety of configurations and
> simply can NOT get NTLM authentication working with Nutch. I need help
> desperately!
> 
> Here are the relevent configuration points:
> Note: "user", "password", and "ntdomain" are, of course, fillers for real
> values
> 
> httpclient-auth.xml:
> 
>
> 
> 
> nutch-site.xml:
> 
>   plugin.includes
>  
> protocol-(http|httpclient)|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> 
> 
> logged problem (note that, yes, this is from 1.5.1, but 1.15 produces
> similar results):
> 2019-04-25 07:38:47,641 INFO  parse.ParserChecker - fetching:
> http://url.com/crawltest.html
> 2019-04-25 07:38:47,650 INFO  plugin.PluginRepository - Plugins: looking in:
> C:\nutch\apache-nutch-1.5.1\plugins
> 2019-04-25 07:38:47,728 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2019-04-25 07:38:47,729 INFO  plugin.PluginRepository - Registered Plugins:
> 2019-04-25 07:38:47,729 INFO  plugin.PluginRepository -   Html Parse 
> Plug-in
> (parse-html)
> 2019-04-25 07:38:47,729 INFO  plugin.PluginRepository -   HTTP Framework
> (lib-http)
> 2019-04-25 07:38:47,729 INFO  plugin.PluginRepository -   Http / Https
> Protocol Plug-in (protocol-httpclient)
> 2019-04-25 07:38:47,729 INFO  plugin.PluginRepository -   Regex URL Filter
> (urlfilter-regex)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   the nutch core
> extension points (nutch-extensionpoints)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Basic Indexing
> Filter (index-basic)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Anchor Indexing
> Filter (index-anchor)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Tika Parser 
> Plug-in
> (parse-tika)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Basic URL
> Normalizer (urlnormalizer-basic)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Regex URL Filter
> Framework (lib-regex-filter)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Regex URL
> Normalizer (urlnormalizer-regex)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   URL Validator
> (urlfilter-validator)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   CyberNeko HTML
> Parser (lib-nekohtml)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Pass-through URL
> Normalizer (urlnormalizer-pass)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   OPIC Scoring
> Plug-in (scoring-opic)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Http Protocol
> Plug-in (protocol-http)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   HTML Parse 
> Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Nutch Segment 
> Merge
> Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -   Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2019-04-25 07:38:47,761 INFO  httpclient.Http - http.proxy.host = null
> 2019-04-25 07:38:47,762 INFO  httpclient.Http - http.proxy.port = 8080
> 2019-04-25 07:38:47,763 INFO  httpclient.Http - http.timeout = 1
> 2019-04-25 07:38:47,763 INFO  httpclient.Http - http.content.limit = -1
> 2019-04-25 07:38:47,763 INFO  httpclient.Http - http.agent = Ulinenet
> Spider/Nutch-1.5.1
> 2019-04-25 07:38:47,764 INFO  httpclient.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2019-04-25 07:38:47,764 INFO  httpclient.Http - http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2019-04-25 07:38:47,835 DEBUG auth.AuthChallengeProcessor - Supported
> authentication schemes in the order of preference: [ntlm, digest, basic]
> 2019-04-25 

RE: Boilerpipe algorithm is not working as expected

2019-03-20 Thread Markus Jelsma
Hello Hany,

For Boilerpipe you can only select which extractor it should use. By default it 
uses ArticleExtractor, which is the best choice in most cases. However, if 
content is more spread out into separate blocks, CanolaExtractor could be a 
better choice.
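For example, in nutch-site.xml (both properties are documented in nutch-default.xml):

<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>CanolaExtractor</value>
  <!-- DefaultExtractor, ArticleExtractor or CanolaExtractor -->
</property>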

Regards,
Markus
 
-Original message-
> From:hany.n...@hsbc.com.INVALID 
> Sent: Tuesday 19th March 2019 18:06
> To: user@nutch.apache.org
> Subject: Boilerpipe algorithm is not working as expected
> 
> Hello,
> 
> I am using Boilerpipe algorithm in Nutch; however, I noticed the extracted 
> content is almost 5% of the page; main page content is removed.
> 
> How does Boilerpipe is working and based on which criteria is deciding to 
> remove a section or not?
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> 
> 


RE: Limiting Results From Single Domain

2019-03-20 Thread Markus Jelsma
Hello Alexis, see inline.

Regards,
Markus 
 
-Original message-
> From:IZaBEE_Keeper 
> Sent: Wednesday 20th March 2019 1:28
> To: user@nutch.apache.org
> Subject: RE: Limiting Results From Single Domain
> 
> Markus Jelsma-2 wrote
> > Hello Alexis,
> > 
> > This is definately a question for Solr. Regardless of that, you choice is
> > between Solr's Result Grouping component, or FieldCollapsing filter query
> > parser.
> > 
> > Regards,
> > Markus
> 
> Thank you..  
> 
> I kinda figured that I'd need to figure out how to use the FieldCollapsing
> query parser & figure out how to make it work on a per hostname basis from
> the hostname field.. I'm not too sure on how to write the function for it
> but I should be able to figure it out..

fq={!collapse field=host}

Keep in mind that for this to work, documents from the same host must be indexed into the same shard.
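For example, appended to a normal select request (the other parameters are only 
illustrative):

q=content:example
fq={!collapse field=host}
rows=10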
 
> I'm hopeful though that nutch might solve some of this for me as it indexes
> another billion pages.. It seems to be less frequent with more pages added
> to the index from multiple domains..

Nutch, out-of-the-box, can't solve this for you, unless you crawl or index less, or 
get rid of a decent amount of duplicates, which are usually present if you crawl a 
few billion pages.

> 
> Thanks again..  :)
> 
> 
> 
> 
> -
> Bee Keeper at IZaBEE.com
> --
> Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
> 


RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread Markus Jelsma
Hello Hany,

If you deal with large PDF files and you get an OOM with this stack trace, it is 
highly unlikely that Boilerpipe is the cause. Boilerpipe does not run before PDFBox 
is finished, so you should really increase the heap.

Of course, to answer the question, Boilerpipe should not run for non-(X)HTML 
pages anyway, so you can open a ticket. But the resources saved by such a 
change would be minimal at best.

Regards,
Markus
 
-Original message-
> From:hany.n...@hsbc.com.INVALID 
> Sent: Monday 18th March 2019 11:49
> To: user@nutch.apache.org
> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> I found the root cause and it is not related to JVM Heap Size.
> 
> The problem of parsing these pdfs happen when I enable the tika extractor to 
> be boilerpipe.
> 
> Boilerpipe article extractor is working perfectly with other pdfs and pages; 
> when I disable it, Tika is able to parse and index these pdfs.
> 
> Any suggestion/help?
> 
> Kind regards, 
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT 
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __ 
> 
> Tie line: 7148 7689 4698 
> External: +48 123 42 0698 
> Mobile: +48 723 680 278 
> E-mail: hany.n...@hsbc.com 
> __ 
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID] 
> Sent: 14 March 2019 13:06
> To: user@nutch.apache.org
> Subject: Re: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> if running in local mode, it's better passed via ENV to bin/nutch, cf.
> 
> # Environment Variables
> #
> #   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
> #
> #   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
> #   Default is 1000.
> #
> #   NUTCH_OPTS  Extra Java runtime options.
> #   Multiple options must be separated by white space.
> 
> In distributed mode, please read the Hadoop docs about mapper/reducer memory 
> and Java heap space.
> 
> Best,
> Sebastian
> 
> On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
> > I'm changing the mapred.child.java.opts=-Xmx1500m in crawl bash file.
> > 
> > Is it correct?, should I change anywhere else?
> > 
> > 
> > Kind regards,
> > Hany Shehata
> > Enterprise Engineer
> > Green Six Sigma Certified
> > Solutions Architect, Marketing and Communications IT Corporate 
> > Functions | HSBC Operations, Services and Technology (HOST) ul. 
> > Kapelanka 42A, 30-347 Kraków, Poland 
> > __
> > 
> > Tie line: 7148 7689 4698
> > External: +48 123 42 0698
> > Mobile: +48 723 680 278
> > E-mail: hany.n...@hsbc.com
> > __
> > Protect our environment - please only print this if you have to!
> > 
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: 14 March 2019 10:59
> > To: user@nutch.apache.org
> > Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> > 
> > Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have 
> > no choice, either skip large files, or increase memory.
> > 
> > Regards,
> > Markus
> > 
> >  
> >  
> > -Original message-
> >> From:hany.n...@hsbc.com.INVALID 
> >> Sent: Thursday 14th March 2019 10:44
> >> To: user@nutch.apache.org
> >> Subject: OutOfMemoryError: GC overhead limit exceeded
> >>
> >> Hello,
> >>
> >> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while 
> >> trying to parse pdfs that includes 3500 pages.
> >>
> >> I increased the JVM RAM to 1500MB; however, I'm still facing the same 
> >> problem
> >>
> >> Please advise
> >>
> >> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing 
> >> http://domain/-/media/files/attachments/common/voting_disclosure_2014
> >> _ q2.pdf with org.apache.nutch.parse.tika.TikaParser
> >> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC 
> >> overhead limit exceeded
> >> at 
> >> java.util.concurrent

RE: Increasing the number of reducer in UpdateHostDB

2019-03-18 Thread Markus Jelsma
Hello Suraj,

You can safely increase the number of reducers for UpdateHostDB to as many as you 
like.

Regards,
Markus

-Original message-
> From:Suraj Singh 
> Sent: Monday 18th March 2019 11:41
> To: user@nutch.apache.org
> Subject: Increasing the number of reducer in UpdateHostDB
> 
> Hi All,
> 
> Can I increase the number of reducer in UpdateHostDB step? Currently it is 
> running with 1 reducer.
> Will it impact the crawling in any way?
> 
> Current command in crawl script:
> __bin_nutch updatehostdb -crawldb "$CRAWL_PATH"/crawldb -hostdb 
> "$CRAWL_PATH"/hostdb
> 
> Can I update it to:
> __bin_nutch updatehostdb -D mapreduce.job.reduces=32 -crawldb 
> "$CRAWL_PATH"/crawldb -hostdb "$CRAWL_PATH"/hostdb
> 
> Thanks it advance.
> 
> Regards,
> Suraj Singh
> 
> 


RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-14 Thread Markus Jelsma
Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have no 
choice: either skip large files or increase memory.
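Both knobs are in the stock setup; for example (values are illustrative):

export NUTCH_HEAPSIZE=4000   # MB, read by bin/nutch (see the header of that script)

<property>
  <name>http.content.limit</name>
  <value>1048576</value>
  <!-- download at most ~1 MB per document; note that a truncated PDF will
       usually fail to parse rather than being skipped cleanly -->
</property>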

Regards,
Markus

 
 
-Original message-
> From:hany.n...@hsbc.com.INVALID 
> Sent: Thursday 14th March 2019 10:44
> To: user@nutch.apache.org
> Subject: OutOfMemoryError: GC overhead limit exceeded
> 
> Hello,
> 
> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while 
> trying to parse pdfs that includes 3500 pages.
> 
> I increased the JVM RAM to 1500MB; however, I'm still facing the same problem
> 
> Please advise
> 
> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing 
> http://domain/-/media/files/attachments/common/voting_disclosure_2014_q2.pdf 
> with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC 
> overhead limit exceeded
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:206)
> at 
> org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
> at 
> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
> at 
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> at 
> org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:564)
> at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
> at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:138)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:79)
> at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> 
> 


RE: Increasing the number of reducer in Deduplication

2019-02-20 Thread Markus Jelsma
Hello Suraj,

That should be no problem. Duplicates are grouped by their signature, which means 
you can have as many reducers as you would like.
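Note that the property has to be passed through Hadoop's generic -D option, the 
same way the crawl script already passes it to other steps, e.g.:

bin/nutch dedup -D mapreduce.job.reduces=32 "$CRAWL_PATH"/crawldb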

Regards,
Markus
 
 
-Original message-
> From:Suraj Singh 
> Sent: Wednesday 20th February 2019 12:56
> To: user@nutch.apache.org
> Subject: Increasing the number of reducer in Deduplication
> 
> Hi All,
> 
> Can I increase the number of reducer in Deduplication on crawldb? Currently 
> it is running with 1 reducer.
> Will it impact the crawling in any way?
> 
> Current command in crawl script:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb
> 
> Can I update it to:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb mapreduce.job.reduces=32
> 
> Thanks it advance.
> 
> Regards,
> Suraj Singh
> 


RE: Difficulty getting data from Nutch parse data into Solr document

2019-02-13 Thread Markus Jelsma
Hello Tom,

To get parse metadata fields indexed, you need the index-metadata plugin. Use the 
index.parse.md parameter to define the fields you want to have indexed. Use the 
indexchecker tool to test, for example as below.
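A minimal sketch for the date field that shows up in your Parse Metadata below 
(append index-metadata to your existing plugin.includes; the target field must of 
course exist in your Solr schema):

<property>
  <name>plugin.includes</name>
  <value>...|index-(basic|anchor|metadata)|...</value>
</property>
<property>
  <name>index.parse.md</name>
  <value>date,dcterms:created</value>
</property>

bin/nutch indexchecker <url-of-the-pdf>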

Regards,
Markus

 
 
-Original message-
> From:Tom Potter 
> Sent: Wednesday 13th February 2019 11:51
> To: user@nutch.apache.org
> Subject: Difficulty getting data from Nutch parse data into Solr document
> 
> I'm not sure how to get some of the data from a crawled PDF document into
> my Solr index. When I run the parsechecker tool I can see the date I need
> as an attribute of the Content Metadata (date=2018-08-06T14:14:00Z), but
> I'm not sure how I configure the solrindex-mapping.xml to successfully map
> this to a Solr field.
> 
> I tried adding the below mapping, but it didn't work:
> 
> 
> 
> Below is an example of the result of the parsechecker data showing the date
> attribute in the Content Metadata:
> -
> ParseData
> -
> 
> Version: 5
> Status: success(1,0)
> Title: XXX
> Outlinks: 1
>   outlink: toUrl: https://xxx.zzz anchor:
> Content Metadata: Server=Microsoft-IIS/7.5 Connection=close
> Last-Modified=Mon, 06 Aug 2018 15:16:28 GMT Date=Wed, 13 Feb 2019 10:36:52
> GMT nutch.crawl.score=0.0 nutch.fetch.time=1550054216537
> Cache-Control=no-cache, no-store ETag="8727b79f5faf0086a80c86df4cbbac12"
> Content-Disposition=inline; filename=x.pdf" X-AspNet-Version=4.0.30319
> Content-Length=81903 Content-Type=application/pdf X-Powered-By=ASP.NET
> Parse Metadata: date=2018-08-06T14:14:00Z pdf:PDFVersion=1.5
> xmp:CreatorTool=Microsoft Office Word
> access_permission:modify_annotations=true
> access_permission:can_print_degraded=true dc:creator=X
> dcterms:created=2018-08-06T14:14:00Z Last-Modified=2018-08-06T14:14:00Z
> dcterms:modified=2018-08-06T14:14:00Z dc:format=application/pdf;
> version=1.5 Last-Save-Date=2018-08-06T14:14:00Z
> access_permission:fill_in_form=true meta:save-date=2018-08-06T14:14:00Z
> pdf:encrypted=false dc:title= modified=2018-08-06T14:14:00Z
> Content-Type=application/pdf creator=XX meta:author=X
> meta:creation-date=2018-08-06T14:14:00Z created=Mon Aug 06 15:14:00 BST
> 2018 access_permission:extract_for_accessibility=true
> access_permission:assemble_document=true xmpTPg:NPages=7
> Creation-Date=2018-08-06T14:14:00Z access_permission:extract_content=true
> access_permission:can_print=true Author=XX producer=Aspose.Words for
> .NET 16.2.0.0 access_permission:can_modify=true
> 
> 
> -- 
> 
> 
> *Tom Potter*
> Software Developer  T: 0191 241 3703
> E: tom.pot...@orangebus.co.uk  • W:
> www.orangebus.co.uk •
> [image: Orange Bus]  Orange Bus, Milburn
> House, Dean Street, Newcastle Upon Tyne, NE1 1LE
> 
> -- 
> 
> 
> 


RE: Multiple Reducers for Linkdb

2018-12-18 Thread Markus Jelsma
Hello Suraj,

You can safely run the LinkDB merger with as many reducers as you like.

Regards,
Markus
 
 
-Original message-
> From:Suraj Singh 
> Sent: Tuesday 18th December 2018 15:39
> To: user@nutch.apache.org
> Subject: Multiple Reducers for Linkdb
> 
> Hello,
> 
> Can we run Linkdb(invertlinks) with multiple reducers?
> 
> I am asking this because by default it runs with just one Reducer and it 
> takes good amount of time to complete and it keeps on increasing every 
> subsequent round since the number of Mappers keeps on increasing with every 
> round.
> 
> Of course we can set the number of reducers to more than one but I am afraid 
> it will break something in subsequent steps as I am not sure why it was built 
> to run with just one Reducer considering the overload with every subsequent 
> round.
> 
> Thanks in Advance.
> 
> Regards,
> Suraj Singh
> 
> 


RE: RE: unexpected Nutch crawl interruption

2018-11-19 Thread Markus Jelsma
Hello Yossi,

That should only be the case if the CrawlDB is updated by the generator, which is 
not the default.
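The switch for that is generate.update.crawldb, false by default:

<property>
  <name>generate.update.crawldb</name>
  <value>false</value>
  <!-- when true, the generator writes back to the crawldb so that URLs just
       generated are not selected again by a following generate run -->
</property>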

Regards,
Markus

 
 
-Original message-
> From:Yossi Tamari 
> Sent: Monday 19th November 2018 14:04
> To: user@nutch.apache.org
> Subject: RE: RE: unexpected Nutch crawl interruption
> 
> I think in the case that you interrupt the fetcher, you'll have the problem 
> that URLs that where scheduled to be fetched on the interrupted cycle will 
> never be fetched (because of NUTCH-1842).
> 
>   Yossi.
> 
> > -Original Message-
> > From: Markus Jelsma 
> > Sent: 19 November 2018 14:52
> > To: user@nutch.apache.org
> > Subject: RE: RE: unexpected Nutch crawl interruption
> > 
> > Hello Hany,
> > 
> > That depends. If you interrupt the fetcher, the segment being fetched can be
> > thrown away. But if you interrupt updatedb, you can remove the temp 
> > directory
> > and must get rid of the lock file. The latter is also true if you interrupt 
> > the
> > generator.
> > 
> > Regards,
> > Markus
> > 
> > 
> > 
> > -Original message-
> > > From:hany.n...@hsbc.com 
> > > Sent: Monday 19th November 2018 13:30
> > > To: user@nutch.apache.org
> > > Subject: RE: RE: unexpected Nutch crawl interruption
> > >
> > > This means there is nothing called corrupted db by any mean?
> > >
> > >
> > > Kind regards,
> > > Hany Shehata
> > > Solutions Architect, Marketing and Communications IT Corporate
> > > Functions | HSBC Operations, Services and Technology (HOST) ul.
> > > Kapelanka 42A, 30-347 Kraków, Poland
> > >
> > _
> > _
> > >
> > > Tie line: 7148 7689 4698
> > > External: +48 123 42 0698
> > > Mobile: +48 723 680 278
> > > E-mail: hany.n...@hsbc.com
> > >
> > _
> > _
> > > Protect our environment - please only print this if you have to!
> > >
> > >
> > > -Original Message-
> > > From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> > > Sent: Monday, November 19, 2018 12:59 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: RE: unexpected Nutch crawl interruption
> > >
> > > From the most recent updated crawldb.
> > >
> > >
> > > Sent: Monday, November 19, 2018 at 12:35 PM
> > > From: hany.n...@hsbc.com
> > > To: "user@nutch.apache.org" 
> > > Subject: RE: unexpected Nutch crawl interruption Hello Semyon,
> > >
> > > Does it means that if I re-run crawl command it will continue from where 
> > > it has
> > been stopped from the previous run?
> > >
> > > Kind regards,
> > > Hany Shehata
> > > Solutions Architect, Marketing and Communications IT Corporate
> > > Functions | HSBC Operations, Services and Technology (HOST) ul.
> > > Kapelanka 42A, 30-347 Kraków, Poland
> > >
> > _
> > _
> > >
> > > Tie line: 7148 7689 4698
> > > External: +48 123 42 0698
> > > Mobile: +48 723 680 278
> > > E-mail: hany.n...@hsbc.com
> > >
> > _
> > _
> > > Protect our environment - please only print this if you have to!
> > >
> > >
> > > -Original Message-
> > > From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> > > Sent: Monday, November 19, 2018 12:06 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: unexpected Nutch crawl interruption
> > >
> > > Hi Hany,
> > >
> > > If you open the script code you will reach that line:
> > >
> > > # main loop : rounds of generate - fetch - parse - update for ((a=1; ; 
> > > a++)) with
> > number of break conditions.
> > >
> > > For each iteration it calls n-independent map jobs.
> > > If it breaks it stops.
> > > You should finish the loop either with manual nutch commands, or start 
> > > with
> > the new call of crawl script using the past iteration crawldb.
> > > Semyon.
> > >
> > >
> > >
> > > Sent: Monday, November 19, 2018 at 11:41 AM
> > > From: hany.n...@hsbc.com
> > > To: "user@nutch.apache.org" 
> > > Subject: unexpected Nut

RE: RE: unexpected Nutch crawl interruption

2018-11-19 Thread Markus Jelsma
Hello Hany,

That depends. If you interrupt the fetcher, the segment being fetched can be 
thrown away. But if you interrupt updatedb, you can remove the temp directory 
and must get rid of the lock file. The latter is also true if you interrupt the 
generator.
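If in doubt which files those are, a listing of the crawldb directory shows both; 
in current 1.x the lock is the .locked file and the temporary output is a randomly 
numbered subdirectory left behind by the interrupted job (verify with the listing 
first, the names below are placeholders):

hadoop fs -ls crawl/crawldb
hadoop fs -rm crawl/crawldb/.locked
hadoop fs -rm -r crawl/crawldb/1234567890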

Regards,
Markus

 
 
-Original message-
> From:hany.n...@hsbc.com 
> Sent: Monday 19th November 2018 13:30
> To: user@nutch.apache.org
> Subject: RE: RE: unexpected Nutch crawl interruption
> 
> This means there is nothing called corrupted db by any mean?
> 
> 
> Kind regards, 
> Hany Shehata
> Solutions Architect, Marketing and Communications IT 
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __ 
> 
> Tie line: 7148 7689 4698 
> External: +48 123 42 0698 
> Mobile: +48 723 680 278 
> E-mail: hany.n...@hsbc.com 
> __ 
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com] 
> Sent: Monday, November 19, 2018 12:59 PM
> To: user@nutch.apache.org
> Subject: Re: RE: unexpected Nutch crawl interruption
> 
> From the most recent updated crawldb.
>  
> 
> Sent: Monday, November 19, 2018 at 12:35 PM
> From: hany.n...@hsbc.com
> To: "user@nutch.apache.org" 
> Subject: RE: unexpected Nutch crawl interruption Hello Semyon,
> 
> Does it means that if I re-run crawl command it will continue from where it 
> has been stopped from the previous run?
> 
> Kind regards,
> Hany Shehata
> Solutions Architect, Marketing and Communications IT Corporate Functions | 
> HSBC Operations, Services and Technology (HOST) ul. Kapelanka 42A, 30-347 
> Kraków, Poland 
> __ 
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> Sent: Monday, November 19, 2018 12:06 PM
> To: user@nutch.apache.org
> Subject: Re: unexpected Nutch crawl interruption
> 
> Hi Hany,  
>  
> If you open the script code you will reach that line:
>  
> # main loop : rounds of generate - fetch - parse - update for ((a=1; ; a++)) 
> with number of break conditions.
> 
> For each iteration it calls n-independent map jobs.
> If it breaks it stops.
> You should finish the loop either with manual nutch commands, or start with 
> the new call of crawl script using the past iteration crawldb.
> Semyon.
>  
>  
> 
> Sent: Monday, November 19, 2018 at 11:41 AM
> From: hany.n...@hsbc.com
> To: "user@nutch.apache.org" 
> Subject: unexpected Nutch crawl interruption Hello,
> 
> What will happen if bin/crawl command is forced to be stopped by any reason? 
> Server restart
> 
> Kind regards,
> Hany Shehata
> Solutions Architect, Marketing and Communications IT Corporate Functions | 
> HSBC Operations, Services and Technology (HOST) ul. Kapelanka 42A, 30-347 
> Kraków, Poland 
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> 

RE: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Markus Jelsma
Hello Nicholas,

Your IP might be blocked, or the firewall just drops the connection due to your 
User-Agent name. We have no problems fetching this host.
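If it is the agent name, that is configured in nutch-site.xml, and you can compare 
against a plain client from the command line (the agent value is just an example):

<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
</property>

curl -A "MyCrawler/Nutch-1.15" https://whatdavidread.ca/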

Regards,
Markus

 
 
-Original message-
> From:Nicholas Roberts 
> Sent: Wednesday 14th November 2018 7:58
> To: user@nutch.apache.org
> Subject: Wordpress.com hosted sites fail 
> org.apache.commons.httpclient.NoHttpResponseException
> 
> hi
> 
> I am setting up a new crawler with Nutch 1.15 and am having problems only
> with Wordpress.com hosted sites
> 
> I can crawl other https sites no problems
> 
> Wordpress sites can be crawled on other hosts, but I think there is a
> problem with the SSL certs at Wordpress.com
> 
> I get this error
> 
> FetcherThread 43 fetch of https://whatdavidread.ca/ failed with:
> org.apache.commons.httpclient.NoHttpResponseException: The server
> whatdavidread.ca failed to respond
> FetcherThread 43 has no more work available
> 
> there seems to be two layers of SSL certs
> 
> first there is a Letsencrypt cert, with many domains, including the one
> above, and the tls.auttomatic.com domain
> 
> then, underlying the Lets Encrypt cert, there is a *.wordpress.com cert
> from Comodo
> 
> Certificate chain
>  0 s:/OU=Domain Control Validated/OU=EssentialSSL Wildcard/CN=*.
> wordpress.com
>i:/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO
> RSA Domain Validation Secure Server CA
> 
> I can crawl other https sites no problems
> 
> I have tried the NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR"
> -Djsse.enableSNIExtension=false) and no joy
> 
> my nutch-site.xml
> 
> 
>   plugin.includes
> 
> protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-domainblacklist
>   
>   
> 
> 
> 
> thanks for the consideration
> -- 
> Nicholas Roberts
> www.niccolox.org
> 


RE: Block certain parts of HTML code from being indexed

2018-11-14 Thread Markus Jelsma
Hello Hany,

Using parse-tika as your HTML parser, you can enable Boilerpipe (see 
nutch-default).
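i.e. in nutch-site.xml (the property is described in nutch-default.xml):

<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
</property>

Boilerpipe then strips boilerplate such as navigation, headers and footers during 
parsing; how aggressively depends on the chosen extractor (see the 
tika.extractor.boilerpipe.* properties).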

Regards,
Markus

 
 
-Original message-
> From:hany.n...@hsbc.com 
> Sent: Wednesday 14th November 2018 15:53
> To: user@nutch.apache.org
> Subject: Block certain parts of HTML code from being indexed
> 
> Hello All,
> 
> I am using Nutch 1.15, and wondering if there is a feature for blocking 
> certain parts of HTML code from being indexed (header & footer).
> 
> Kind regards,
> Hany Shehata
> Solutions Architect, Marketing and Communications IT
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> 
> 


RE: Getting Nutch To Crawl Sharepoint Online

2018-10-29 Thread Markus Jelsma
Hello Ashish,

You might want to check out Apache ManifoldCF.

Regards.
Markus

http://manifoldcf.apache.org/

 
 
-Original message-
> From:Ashish Saini 
> Sent: Monday 29th October 2018 18:56
> To: user@nutch.apache.org
> Subject: Getting Nutch To Crawl Sharepoint Online
> 
> We are looking at solutions for crawling and indexing documents in
> Sharepoint Online (Office 365) into Elasticsearch. We already use Nutch
> 1.14 for crawling websites and are looking to extend the solution to crawl
> Sharepoint as well.
> 
> Looking around on the Wiki, it seems adding a custom authentication scheme
> and implementing an AuthScheme interface is a path available for Nutch
> users.
> 
> I just wanted to see if anyone has recently crawled Sharepoint content and
> if there are any caveats or tips to keep in mind.
> 
> Thanks.
> 


RE: Apache Nutch commercial support

2018-10-12 Thread Markus Jelsma
Hello Hany,

There are a few, mine included, mentioned on the Nutch support wiki page [1].

Regards,
Markus

[1] https://wiki.apache.org/nutch/Support

 
 
-Original message-
> From:hany.n...@hsbc.com 
> Sent: Friday 12th October 2018 9:25
> To: user@nutch.apache.org
> Subject: Apache Nutch commercial support
> 
> Hello,
> 
> You know big companies,  always looking for commercial things :(
> 
> Do you know if there is any commercial support for Apache Nutch and Solr? - 
> or external providers?
> 
> Kind regards,
> Hany Shehata
> Solutions Architect, Marketing and Communications IT
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> 
> 


RE: Regex to block some patterns

2018-10-03 Thread Markus Jelsma
Hi Amarnatha,

-^.+(?:modal|exit).*\.html

This will work for all examples given.

You can test regexes really well online [1]. If lookingAt returns true for an 
input, Nutch's regex URL filter will filter out that URL.
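In regex-urlfilter.txt the rule goes before the final catch-all accept rule, and 
you can verify against your own examples, e.g.:

# block modal/exit pages
-^.+(?:modal|exit).*\.html

# accept anything else
+.

echo "https://www.abc.com/exit.html?url=https://www.gear.abc.com/welcome.asp" | bin/nutch filterchecker -stdIn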

Regards,
Markus

[1] https://www.regexplanet.com/advanced/java/index.html
 
 
-Original message-
> From:Amarnatha Reddy 
> Sent: Wednesday 3rd October 2018 15:23
> To: user@nutch.apache.org
> Subject: Regex to block some patterns
> 
> Hi Team,
> 
> 
> 
> I need some assistance to block patterns in my current setup.
> 
> 
> 
> Always my seed url is *https://www.abc.com/ * and
> need to crawl all pages except below patterns in Nutch1.15
> 
> 
> Blocking pattern *modal(.*).html *and *exit.html? *and *exit.html/?*
> 
> Sample pages *modal.html, modal_1123Abc.html, modalaa_12.html* (these could
> be end of the domain)
> 
> 
> 
> Below are the few use case urls'
> 
> 
> https://www.abc.com/abc-editions/2018/test-ask/altitude/feature-pillar/abc/acb-1/modal.html
> 
> https://www.abc.com/2017/ask/exterior/feature_overlay/modalcontainer5.html
> 
> https://www.abc.com/2017/image/exterior/abc/feature_overlay/modalcontainer5_Ab_c.html
> 
> 
> 
> exit.html (here anything like this exit.html? exit.html/?)
> 
> 
> Ask here is after domain (https://www.abc.com/), starts with
> exit.html/exit.html?/exit.html/?  then need to block/exclude crawl.
> 
>  https://www.abc.com/exit.html?url=https://www.gear.abc.com/welcome.asp
> 
> https://www.abc.com/exit.html/?tname=abc_facebook=http://www.facebook.com/abc=true
> 
> 
> *Note: Yes we can directly put - ^(complete url) ,but dont know how many
> are there, so need generic regex rule to apply.*
> 
> 
> i tried below pattern,but it is not working
> 
> ## Blocking pattern ends with 
> 
> -^(?i)\*(modal*|exit*).html
> 
> 
> 
> Kindly help me to setup regex to block my use case.
> 
> 
> 
> Thanks,
> 
> Amarnath
> 
> 
> 
> 
> --
> 
> Thanks and Regards,
> 
> *Amarnath Polu*
> 


RE: Nutch 2.x HBase alternatives

2018-10-03 Thread Markus Jelsma
Hi Benjamin,

If you do not specifically require Nutch 2.x, I would strongly suggest going with 
Nutch 1.x. It doesn't have the added hassle of a DB and DB layer, is much more 
mature, and receives the most commits of the two.

Regards,
Markus

 
 
-Original message-
> From:Benjamin Vachon 
> Sent: Wednesday 3rd October 2018 20:21
> To: user@nutch.apache.org
> Subject: Nutch 2.x HBase alternatives
> 
> Hi,
> 
> I'm doing some research for a Nutch use-case where the HBase storage 
> layer isn't available and setting one up is not an option.
> Does anyone have any experience using 2.x with a Gora backend other than 
> the HBase store?
> If so, which Gora store did you use, why, and have you noticed 
> differences in stability/performance?
> 
> Thank you,
> 
> Ben V.
> 
> 


RE: Nutch Maven support for plugins

2018-08-29 Thread Markus Jelsma
Hello Rustam,

You can use urlnormalizer-slash for this task.

Regards,
Markus

 
 
-Original message-
> From:Rustam 
> Sent: Wednesday 29th August 2018 10:30
> To: user@nutch.apache.org
> Subject: Nutch Maven support for plugins
> 
> It seems Nutch is available in Maven, but without its plugins.
> Would it be possible to publish Nutch plugins in Maven as well?
> Without the plugins it's kind of useless.
> 


RE: [VOTE] Release Apache Nutch 1.15 RC#1

2018-08-01 Thread Markus Jelsma
Hello Sebastian,

It seems to happen only occasionally, and only in the earlier stages of the 
fetcher. I've got a very long-tail fetcher running now with two hosts, and the 
problem doesn't show up. The only real examples I have are our own site, 
multiple times, and sporadic wiki lemmas. I created an issue [1] for the logging.

Regards,
Markus

[1] https://issues.apache.org/jira/browse/NUTCH-2630

 
 
-Original message-
> From:Sebastian Nagel 
> Sent: Wednesday 1st August 2018 12:55
> To: user@nutch.apache.org
> Subject: Re: [VOTE] Release Apache Nutch 1.15 RC#1
> 
> Hi Markus
> 
> > 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 
> > fetching
> https://en.wikipedia.org/wiki/Special:RecentChanges (queue crawl delay=5000ms)
> 
> Ok, non-blocking because of:
> User-agent: *
> Disallow: /wiki/Special:
> 
> > 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 
> > fetching
> https://en.wikipedia.org/wiki/301_redirect (queue crawl delay=5000ms)
> ...
> > 2018-08-01 11:42:11,841 INFO  fetcher.FetcherThread - FetcherThread 45 
> > fetching
> http://en.wikipedia.org/wiki/Internet_media_type (queue crawl delay=5000ms)
> 
> That could be because of NUTCH-2623 (to be fixed in 1.16).
> 
> If you have more examples, let me know. Otherwise, let's re-test if NUTCH-2623
> is fixed and the logging is improved. Could you open an issue for an improved 
> logging?
> 
> Thanks,
> Sebastian
> 
> On 08/01/2018 12:45 PM, Markus Jelsma wrote:
> > Hello Sebastian,
> > 
> > That is unfortunately not the only example:
> > 
> > 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 
> > fetching https://en.wikipedia.org/wiki/Special:RecentChanges (queue crawl 
> > delay=5000ms)
> > 2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 
> > fetching https://en.wikipedia.org/wiki/301_redirect (queue crawl 
> > delay=5000ms)
> > 2018-08-01 11:42:11,289 INFO  fetcher.FetcherThread - FetcherThread 52 
> > fetching https://about.twitter.com/about (queue crawl delay=5000ms)
> > 2018-08-01 11:42:11,313 INFO  fetcher.Fetcher - -activeThreads=10, 
> > spinWaiting=7, fetchQueues.totalSize=151, fetchQueues.getQueueCount=19
> > 2018-08-01 11:42:11,509 INFO  fetcher.FetcherThread - FetcherThread 50 
> > fetching http://www.apache.org/dyn/closer.cgi/nutch/ (queue crawl 
> > delay=4000ms)
> > 2018-08-01 11:42:11,723 INFO  fetcher.FetcherThread - FetcherThread 44 
> > fetching https://mobile.twitter.com/MrOrdnas (queue crawl delay=1000ms)
> > 2018-08-01 11:42:11,841 INFO  fetcher.FetcherThread - FetcherThread 45 
> > fetching http://en.wikipedia.org/wiki/Internet_media_type (queue crawl 
> > delay=5000ms)
> > 
> > I also saw it fetching multiple URLs of our own site within the same 
> > millisecond, on multiple occasions. Wasn't there some work done regarding 
> > crawl delay for 1.15 or is this actually an older problem? 
> > 
> > Regarding the logging, i agree. We already log failed fetches, no reason 
> > not to log skipped fetches too.
> > 
> > Regards,
> > Markus
> >  
> > -Original message-
> >> From:Sebastian Nagel 
> >> Sent: Wednesday 1st August 2018 12:31
> >> To: user@nutch.apache.org
> >> Subject: Re: [VOTE] Release Apache Nutch 1.15 RC#1
> >>
> >> Hi Markus,
> >>
> >> thanks for running a test crawl.
> >>
> >>> i noticed the crawl delay is not always respected
> >>
> >> Do you mean for the host t.co ?
> >>
> >> The host t.co disallows crawling in its robots.txt 
> >> (https://t.co/robots.txt).
> >> The first access fetches the robots.txt, all later fetches do not block 
> >> because the host is not
> >> accessed at all. That's by design.
> >>
> >> But it could be a useful improvement to log this (or in general the status 
> >> of a fetch).
> >> It would double the logged lines but would help to understand what the 
> >> fetcher is doing,
> >> esp. regarding robots denied and redirects.
> >>
> >> Best,
> >> Sebastian
> >>
> >>
> >> On 08/01/2018 11:59 AM, Markus Jelsma wrote:
> >>> However, the test crawl ran/runs fine, in the background, no errors. But 
> >>> just now, watching the fetcher, i noticed the crawl delay is not always 
> >>> respected. The only configuration change i have is the http.agent.* 
> >>> directives to run.
> >>>
> >>> 2018-08-01 11:47:41,256 INFO  fetcher.FetcherT

RE: [VOTE] Release Apache Nutch 1.15 RC#1

2018-08-01 Thread Markus Jelsma
Hello Sebastian,

That is unfortunately not the only example:

2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching 
https://en.wikipedia.org/wiki/Special:RecentChanges (queue crawl delay=5000ms)
2018-08-01 11:42:10,660 INFO  fetcher.FetcherThread - FetcherThread 47 fetching 
https://en.wikipedia.org/wiki/301_redirect (queue crawl delay=5000ms)
2018-08-01 11:42:11,289 INFO  fetcher.FetcherThread - FetcherThread 52 fetching 
https://about.twitter.com/about (queue crawl delay=5000ms)
2018-08-01 11:42:11,313 INFO  fetcher.Fetcher - -activeThreads=10, 
spinWaiting=7, fetchQueues.totalSize=151, fetchQueues.getQueueCount=19
2018-08-01 11:42:11,509 INFO  fetcher.FetcherThread - FetcherThread 50 fetching 
http://www.apache.org/dyn/closer.cgi/nutch/ (queue crawl delay=4000ms)
2018-08-01 11:42:11,723 INFO  fetcher.FetcherThread - FetcherThread 44 fetching 
https://mobile.twitter.com/MrOrdnas (queue crawl delay=1000ms)
2018-08-01 11:42:11,841 INFO  fetcher.FetcherThread - FetcherThread 45 fetching 
http://en.wikipedia.org/wiki/Internet_media_type (queue crawl delay=5000ms)

I also saw it fetching multiple URLs of our own site within the same 
millisecond, on multiple occasions. Wasn't there some work done regarding crawl 
delay for 1.15, or is this actually an older problem? 

Regarding the logging, I agree. We already log failed fetches, so there is no 
reason not to log skipped fetches too.

Regards,
Markus
 
-Original message-
> From:Sebastian Nagel 
> Sent: Wednesday 1st August 2018 12:31
> To: user@nutch.apache.org
> Subject: Re: [VOTE] Release Apache Nutch 1.15 RC#1
> 
> Hi Markus,
> 
> thanks for running a test crawl.
> 
> > i noticed the crawl delay is not always respected
> 
> Do you mean for the host t.co ?
> 
> The host t.co disallows crawling in its robots.txt (https://t.co/robots.txt).
> The first access fetches the robots.txt, all later fetches do not block 
> because the host is not
> accessed at all. That's by design.
> 
> But it could be a useful improvement to log this (or in general the status of 
> a fetch).
> It would double the logged lines but would help to understand what the 
> fetcher is doing,
> esp. regarding robots denied and redirects.
> 
> Best,
> Sebastian
> 
> 
> On 08/01/2018 11:59 AM, Markus Jelsma wrote:
> > However, the test crawl ran/runs fine, in the background, no errors. But 
> > just now, watching the fetcher, i noticed the crawl delay is not always 
> > respected. The only configuration change i have is the http.agent.* 
> > directives to run.
> > 
> > 2018-08-01 11:47:41,256 INFO  fetcher.FetcherThread - FetcherThread 47 
> > fetching https://t.co/rqlNNVQgix (queue crawl delay=5000ms)in general 
> > 2018-08-01 11:47:41,319 INFO  fetcher.FetcherThread - FetcherThread 51 
> > fetching http://planet.apache.org/ (queue crawl delay=5000ms)
> > 2018-08-01 11:47:41,324 INFO  regex.RegexURLNormalizer - can't find rules 
> > for scope 'fetcher', using default
> > 2018-08-01 11:47:41,325 INFO  fetcher.FetcherThread - FetcherThread 48 
> > fetching http://schema.org/Event (queue crawl delay=5000ms)
> > 2018-08-01 11:47:41,515 INFO  fetcher.FetcherThread - FetcherThread 44 
> > fetching http://people.apache.org/~jianhe (queue crawl delay=5000ms)
> > 2018-08-01 11:47:41,532 INFO  regex.RegexURLNormalizer - can't find rules 
> > for scope 'fetcher', using default
> > 2018-08-01 11:47:41,533 INFO  fetcher.FetcherThread - FetcherThread 43 
> > fetching https://en.wikipedia.org/wiki/Internet_marketing (queue crawl 
> > delay=5000ms)
> > 2018-08-01 11:47:41,600 INFO  fetcher.FetcherThread - FetcherThread 44 
> > fetching https://apache.org/dist/nutch/2.3.1/apache-nutch-2.3.1-src.zip.asc 
> > (queue crawl delay=5000ms)
> > 2018-08-01 11:47:41,607863 INFO  regex.RegexURLNormalizer - can't find 
> > rules for scope 'fetcher', using default
> > 2018-08-01 11:47:41,608 INFO  fetcher.FetcherThread - FetcherThread 49 
> > fetching https://twitter.com/i/directory/profiles/5 (queue crawl 
> > delay=5000ms)
> > 2018-08-01 11:47:41,673 INFO  fetcher.FetcherThread - FetcherThread 48 
> > fetching https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Categories 
> > (queue crawl delay=5000ms)
> > 2018-08-01 11:47:41,688 INFO  fetcher.FetcherThread - FetcherThread 52 
> > fetching http://photomatt.net/ (queue crawl delay=5000ms)
> > 2018-08-01 11:47:41,696 INFO  fetcher.FetcherThread - FetcherThread 43 
> > fetching https://cy.wikipedia.org/wiki/Wicipedia:Cysylltwch_%C3%A2_ni 
> > (queue crawl delay=5000ms)
> > 2018-08-01 11:47:41,752 INFO  fetcher.FetcherThread - FetcherThread 48 
> > fetching https://mobile.twitter.com/david_kunz/followers (queue crawl 
> > de

RE: [VOTE] Release Apache Nutch 1.15 RC#1

2018-08-01 Thread Markus Jelsma
All tests pass and the crawler runs fine so far, +1 for 1.15!

Regards,
Markus

 
 
-Original message-
> From:Sebastian Nagel 
> Sent: Thursday 26th July 2018 17:05
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org
> Subject: [VOTE] Release Apache Nutch 1.15 RC#1
> 
> Hi Folks,
> 
> A first candidate for the Nutch 1.15 release is available at:
> 
>   https://dist.apache.org/repos/dist/dev/nutch/1.15/
> 
> The release candidate is a zip and tar.gz archive of the binary and sources 
> in:
>   https://github.com/apache/nutch/tree/release-1.15
> 
> The SHA1 checksum of the archive apache-nutch-1.15-bin.tar.gz is
>555d00ddc0371b05c5958bde7abb2a9db8c38ee2
> 
> In addition, a staged maven repository is available here:
>https://repository.apache.org/content/repositories/orgapachenutch-1015/
> 
> We addressed 119 Issues:
>https://s.apache.org/nczS
> 
> Please vote on releasing this package as Apache Nutch 1.15.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Nutch 1.15.
> [ ] -1 Do not release this package because…
> 
> Cheers,
> Sebastian
> (On behalf of the Nutch PMC)
> 
> P.S. Here is my +1.
> 


RE: Issues while crawling pagination

2018-07-28 Thread Markus Jelsma
Hello,

Yossi's suggestion is excellent if your case is to crawl everything once and 
never again. However, if you need to crawl future articles as well, and have to 
deal with mutations, then let the crawler run continuously without regard for 
depth.

The latter is the usual case, because after all, if you had gotten this task a 
few months ago you wouldn't need to go to a depth of 497342, right?

Regards,
Markus
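
For the seed file approach Yossi suggests below, generating the paginated hub
URLs is simple; a sketch, assuming the page numbers simply run from 1 to
497342:

  mkdir -p seeds
  for i in $(seq 1 497342); do
    echo "https://www.jagran.com/latest-news-page${i}.html"
  done > seeds/seed.txt

followed by the usual bin/nutch inject crawl/crawldb seeds/.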


 
 
-Original message-
> From:Yossi Tamari 
> Sent: Saturday 28th July 2018 23:09
> To: user@nutch.apache.org; shivakarthik...@gmail.com; nu...@lucene.apache.org
> Subject: RE: Issues while crawling pagination
> 
> Hi Shiva,
> 
> My suggestion would be to programmatically generate a seeds file containing 
> these 497342 URLs (since you know them in advance), and then use a very low 
> max-depth (probably 1), and a high number of iterations, since only a small 
> number will be fetched in each iteration, unless you set a very low 
> crawl-delay.
> (Mathematically, If you fetch 1 URL per second from this domain, fetching 
> 497342 URLs will take 138 hours).
> 
>   Yossi.
> 
> > -Original Message-
> > From: ShivaKarthik S 
> > Sent: 28 July 2018 23:20
> > To: nu...@lucene.apache.org; user@nutch.apache.org
> > Subject: Reg: Issues while crawling pagination
> > 
> >  Hi
> > 
> > Can you help me in figuring out the issue while crawling a hub page having
> > pagination. Problem what i am facing is what depth to give and how to handle
> > pagination.
> > I have a hubpage which has a pagination of more than 4.95L.
> > e.g. https://www.jagran.com/latest-news-page497342.html  > the number of pages under the hubpage latest-news>
> > 
> > 
> > --
> > Thanks and Regards
> > Shiva
> 
> 


RE: Sitemap URL's concatenated, causing status 14 not found

2018-06-06 Thread Markus Jelsma
Sebastian, I do not want to be a pain in the arse, but I do not have a GitHub 
account. If you would do the honours of opening a ticket, please do so.

Apologies,
Markus

 
 
-Original message-
> From:Sebastian Nagel 
> Sent: Tuesday 29th May 2018 11:33
> To: user@nutch.apache.org
> Subject: Re: Sitemap URL's concatenated, causing status 14 not found
> 
> > I agree that the this is not the ideal error behaviour, but I guess the 
> > code was written from the
> assumption that the document is valid and conformant.
> 
> Over time the crawler-commons sitemap parser has been extended to get as much 
> as possible from
> non-conforming sitemaps as well. Of course, it's hard to foresee and handle 
> all possible mistakes...
> The equivalent syntax error for sitemaps (missing closing/next element) is
> handled.
> 
> @Markus: Please open an issue for crawler-commons
>   https://github.com/crawler-commons/crawler-commons/issues/
> 
> Thanks,
> Sebastian
> 
> 
> On 05/26/2018 02:57 AM, Yossi Tamari wrote:
> > Hi Markus,
> > 
> > I don’t believe this is a valid sitemapindex. Each  should include 
> > exactly one .
> > See also https://www.sitemaps.org/protocol.html#index and 
> > https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
> > I agree that the this is not the ideal error behaviour, but I guess the 
> > code was written from the assumption that the document is valid and 
> > conformant.
> > 
> > Yossi.
> > 
> >> -Original Message-
> >> From: Markus Jelsma 
> >> Sent: 25 May 2018 23:45
> >> To: User 
> >> Subject: Sitemap URL's concatenated, causing status 14 not found
> >>
> >> Hello,
> >>
> >> We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
> >> Nutch things those two sitemap URL's are actually one consisting of both
> >> concatenated.
> >>
> >> Here is https://www.saxion.nl/sitemap.xml
> >>
> >> 
> >>  >> xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9;>
> >> 
> >> https://www.saxion.nl/opleidingen-sitemap.xml
> >> https://www.saxion.nl/content-sitemap.xml
> >> 
> >> 
> >>
> >> This seems fine, but Nutch attempts, and obviously fails to load:
> >>
> >> 2018-05-25 16:27:50,515 ERROR [Thread-30]
> >> org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap.
> >> Status code: 14 for https://www.saxion.nl/opleidingen-
> >> sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
> >>
> >> What is going on here? Why does Nutch, or CC's sitemap util behave like 
> >> this?
> >>
> >> Thanks,
> >> Markus
> > 
> 
> 


RE: Sitemap URL's concatenated, causing status 14 not found

2018-05-29 Thread Markus Jelsma
Ah, of course, i missed that!

Thanks,
Markus
 
-Original message-
> From:Yossi Tamari 
> Sent: Saturday 26th May 2018 2:57
> To: user@nutch.apache.org
> Subject: RE: Sitemap URL's concatenated, causing status 14 not found
> 
> Hi Markus,
> 
> I don’t believe this is a valid sitemapindex. Each  should include 
> exactly one .
> See also https://www.sitemaps.org/protocol.html#index and 
> https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
> I agree that the this is not the ideal error behaviour, but I guess the code 
> was written from the assumption that the document is valid and conformant.
> 
>   Yossi.
> 
> > -Original Message-
> > From: Markus Jelsma 
> > Sent: 25 May 2018 23:45
> > To: User 
> > Subject: Sitemap URL's concatenated, causing status 14 not found
> > 
> > Hello,
> > 
> > We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
> > Nutch things those two sitemap URL's are actually one consisting of both
> > concatenated.
> > 
> > Here is https://www.saxion.nl/sitemap.xml
> > 
> > 
> >  > xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9;>
> > 
> > https://www.saxion.nl/opleidingen-sitemap.xml
> > https://www.saxion.nl/content-sitemap.xml
> > 
> > 
> > 
> > This seems fine, but Nutch attempts, and obviously fails to load:
> > 
> > 2018-05-25 16:27:50,515 ERROR [Thread-30]
> > org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap.
> > Status code: 14 for https://www.saxion.nl/opleidingen-
> > sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
> > 
> > What is going on here? Why does Nutch, or CC's sitemap util behave like 
> > this?
> > 
> > Thanks,
> > Markus
> 
> 


Sitemap URL's concatenated, causing status 14 not found

2018-05-25 Thread Markus Jelsma
Hello,

We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but 
Nutch thinks those two sitemap URLs are actually one, consisting of both 
concatenated.

Here is https://www.saxion.nl/sitemap.xml


<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
  <ns2:sitemap>
    <ns2:loc>https://www.saxion.nl/opleidingen-sitemap.xml</ns2:loc>
    <ns2:loc>https://www.saxion.nl/content-sitemap.xml</ns2:loc>
  </ns2:sitemap>
</ns2:sitemapindex>

This seems fine, but Nutch attempts, and obviously fails to load:

2018-05-25 16:27:50,515 ERROR [Thread-30] 
org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap. 
Status code: 14 for 
https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml

What is going on here? Why does Nutch, or CC's sitemap util behave like this?

Thanks,
Markus
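
For comparison, a sitemap index that conforms to the sitemaps.org protocol
wraps each location in its own <sitemap> element, roughly:

  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
      <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
    </sitemap>
    <sitemap>
      <loc>https://www.saxion.nl/content-sitemap.xml</loc>
    </sitemap>
  </sitemapindex>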


RE: Having plugin as a separate project

2018-05-07 Thread Markus Jelsma
Hi,

Here are examples using Maven:
https://github.com/ATLANTBH/nutch-plugins/tree/master/nutch-plugins

Regards,
Markus
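
For what it's worth, a rough sketch of building a plugin outside the Nutch
source tree is to depend on the Nutch core artifact from Maven Central (the
version below is just an example) and package the plugin jar yourself:

  <dependency>
    <groupId>org.apache.nutch</groupId>
    <artifactId>nutch</artifactId>
    <version>1.14</version>
    <scope>provided</scope>
  </dependency>

The resulting jar plus its plugin.xml descriptor still has to be dropped into
a plugin folder of the Nutch installation.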
 
 
-Original message-
> From:Yash Thenuan Thenuan 
> Sent: Monday 7th May 2018 11:51
> To: user@nutch.apache.org
> Subject: Re: Having plugin as a separate project
> 
> Hey,
> Thanks for the answer, But my question was can't we write plugin by
> downloading the nutch jar and using it as a dependency, rather than adding
> the code in nutch source code?
> 
> On Fri, May 4, 2018 at 8:08 PM, Jorge Betancourt  > wrote:
> 
> > Usually we tend to develop everything inside the Nutch file structure,
> > specially useful if you need to deploy to a Hadoop cluster later on
> > (because you need to bundle everything in a job file).
> >
> > But, if you really want to develop the plugin in isolation you only need to
> > create a new project in your preferred IDE/maven/ant/gradle and add the
> > dependencies that you need from the lib/ directory (or the global
> > dependencies with the same version).
> >
> > Then just compile everything to a jar and place it in the proper plugin
> > structure in the Nutch installation. Although this should work is not
> > really a smooth development experience.
> > You need to be careful and not bundle all libs inside your jar, etc.
> >
> > The path suggested by Sebastian is much better, in the end while developing
> > you want to have everything, perhaps just compile/test your plugin and
> > later on you can copy the final jar of your plugin to the desired Nutch
> > installation.
> >
> > Best Regards,
> > Jorge
> >
> > On Fri, May 4, 2018 at 4:02 PM narendra singh arya 
> > wrote:
> >
> > > Can we have nutch plugin as a separate project?
> > >
> > > On Fri, 4 May 2018, 19:26 Sebastian Nagel, 
> > > wrote:
> > >
> > > > That's trivial. Just run ant in the plugin's source folder:
> > > >
> > > >   cd src/plugin/urlnormalizer-basic/
> > > >   ant
> > > >
> > > > or to run also the tests
> > > >
> > > >   cd src/plugin/urlnormalizer-basic/
> > > >   ant test
> > > >
> > > > Note: you have to compile the core test classes first by running
> > > >
> > > >   ant compile-core-test
> > > >
> > > > in the Nutch "root" folder.
> > > >
> > > > A little bit slower but guarantees that everything is compiled:
> > > >
> > > >   ant -Dplugin=urlnormalizer-basic test-plugin
> > > >
> > > > Or sometimes it's enough to skip some of the long running tests:
> > > >
> > > >   ant -Dtest.exclude='TestSegmentMerger*' clean runtime test
> > > >
> > > >
> > > > Best,
> > > > Sebastian
> > > >
> > > > On 05/04/2018 01:13 PM, Yash Thenuan Thenuan wrote:
> > > > > Hi all,
> > > > > I want to compile my plugins separately so that I need not compile
> > > > > the whole project again when I make a change in some plugin. How can
> > I
> > > > > achieve that?
> > > > > Thanks
> > > > >
> > > >
> > > >
> > >
> >
> 


RE: RE: random sampling of crawlDb urls

2018-05-01 Thread Markus Jelsma
Ah crap, I got it wrong: >=0.1 should not get 10% but 90% of the records.

If you could add debugging lines that emit the direct output of Math.random() 
and the evaluated expression as well, we might learn more. Maybe Math.random() 
is evaluated just once; I have no idea how Jexl works under the hood.

Again, you might have more luck on the Jexl list, we just implemented it here. 
And there could be a bug somewhere.

Hope you find some answers. Sorry to be of so little help.
Markus
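
As a workaround sketch for getting a random sample of URLs (assuming the dump
fits on local disk; in deploy mode copy it out of HDFS first), you can dump
the CrawlDb in the default text format and sample the record lines with
standard tools:

  bin/nutch readdb /crawls/pop2/data/crawldb -dump /tmp/crawldb-dump
  grep -h '^http' /tmp/crawldb-dump/part-* | cut -f1 | shuf -n 10000 > sample-urls.txt

The grep picks the lines that start a record (URL plus a tab), cut keeps only
the URL, and shuf draws the random sample.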

-Original message-
> From:Michael Coffey <mcof...@yahoo.com.INVALID>
> Sent: Tuesday 1st May 2018 23:18
> To: user@nutch.apache.org
> Subject: Re: RE: random sampling of crawlDb urls
> 
> Just to clarify: .99 does NOT work fine. It should have rejected most of the 
> records when I specified "((Math.random())>=.99)".
>  
> I have used expressions not involving Math.random. For example, I can extract 
> records above a specific score with "score>1.0". But the random thing doesn't 
> work even though I have tried various thresholds.
> 
> On Tuesday, May 1, 2018, 2:00:48 PM PDT, Markus Jelsma 
> <markus.jel...@openindex.io> wrote:  
>  
>  Hello Michael,
> 
> I would think this should work as well. But since you mention .99 works fine, 
> did you try .1 as well to get ~10% output? It seems the expressions itself do 
> work at some level, and since this is a Jexl specific thing, you might want 
> to try the Jexl list as well. I could not find an online Jexl parser to test 
> this question, it would be really helpful! 
> 
> Regards,
> Markus
> 
> -Original message-
> > From:Michael Coffey <mcof...@yahoo.com.INVALID>
> > Sent: Tuesday 1st May 2018 22:47
> > To: User <user@nutch.apache.org>
> > Subject: random sampling of crawlDb urls
> > 
> > I want to extract a random sample of URLS from my big crawldb. I think I 
> > should be able to do this using readdb -dump with a Jexl expression, but I 
> > haven't been able to get it to work.
> > 
> > I have tried several variations of the following command.
> > $NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/pop2/data/crawldb -dump 
> > /crawls/pop2/data/crawldb/pruned/current -format crawldb -expr 
> > "((Math.random())>=0.1)"
> > 
> > 
> > Typically, it produces zero records. I know the expression is getting 
> > through to the CrawlDbReader (without quotes) because I get this message:
> > 18/05/01 13:22:48 INFO crawl.CrawlDbReader: CrawlDb db: expr: 
> > ((Math.random())>=0.1)
> > 
> > Even when I use the expression "((Math.random())>=0.0)" I get zero output 
> > records.
> > 
> > If I use the expression "((Math.random())>=.99)" it lets all records pass 
> > through to the output. I guess it has something to do with the lack of 
> > leading zero on the numeric constant.
> > 
> > Does anyone know a good way to extract a random sample of records from a 
> > crawlDb?
> >   


RE: random sampling of crawlDb urls

2018-05-01 Thread Markus Jelsma
Hello Michael,

I would think this should work as well. But since you mention .99 works fine, 
did you try .1 as well to get ~10% output? It seems the expression itself does 
work at some level, and since this is a Jexl-specific thing, you might want to 
try the Jexl list as well. I could not find an online Jexl parser to test this 
question; one would be really helpful! 

Regards,
Markus

-Original message-
> From:Michael Coffey 
> Sent: Tuesday 1st May 2018 22:47
> To: User 
> Subject: random sampling of crawlDb urls
> 
> I want to extract a random sample of URLS from my big crawldb. I think I 
> should be able to do this using readdb -dump with a Jexl expression, but I 
> haven't been able to get it to work.
> 
> I have tried several variations of the following command.
> $NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/pop2/data/crawldb -dump 
> /crawls/pop2/data/crawldb/pruned/current -format crawldb -expr 
> "((Math.random())>=0.1)"
> 
> 
> Typically, it produces zero records. I know the expression is getting through 
> to the CrawlDbReader (without quotes) because I get this message:
> 18/05/01 13:22:48 INFO crawl.CrawlDbReader: CrawlDb db: expr: 
> ((Math.random())>=0.1)
> 
> Even when I use the expression "((Math.random())>=0.0)" I get zero output 
> records.
> 
> If I use the expression "((Math.random())>=.99)" it lets all records pass 
> through to the output. I guess it has something to do with the lack of 
> leading zero on the numeric constant.
> 
> Does anyone know a good way to extract a random sample of records from a 
> crawlDb?
> 


RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-17 Thread Markus Jelsma
Hello Chip,

I have no clue where the three hour limit could come from. Please take a 
further look at the last few minutes of the logs.

The only thing I can think of is that a webserver would block you after some 
amount of requests per time window, but that would be visible in the logs. It 
is clear Nutch itself terminates the fetcher (the dropping line). That is only 
possible with an imposed time limit, or if you reached some number of 
exceptions (or one other variable I am forgetting).

Regards,
Markus
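
One more thing worth checking, in case the crawl is driven by the bin/crawl
script: the script carries its own fetch time limit and passes it to the
fetcher on the command line, which takes precedence over nutch-site.xml. The
relevant bits look roughly like this (exact wording varies per version):

  # in bin/crawl
  timeLimitFetch=180
  ...
  __bin_nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch \
      "$CRAWL_PATH"/segments/$SEGMENT -noParsing -threads $numThreads

180 minutes is exactly the three hours observed, so lowering or overriding
that variable would be the place to start.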
 
-Original message-
> From:Chip Calhoun 
> Sent: Tuesday 17th April 2018 21:27
> To: user@nutch.apache.org
> Subject: RE: Nutch fetching times out at 3 hours, not sure why.
> 
> I'm on 1.12, and mine also defaulted at -1. It does not fail at the same URL, 
> or even at the same point in a URL's fetcher loop; it really seems to be time 
> based. 
> 
> -Original Message-
> From: Sadiki Latty [mailto:sla...@uottawa.ca] 
> Sent: Tuesday, April 17, 2018 1:43 PM
> To: user@nutch.apache.org
> Subject: RE: Nutch fetching times out at 3 hours, not sure why.
> 
> Which version are you running? That value is defaulted to -1 in my current 
> version (1.14)  so shouldn't be something you should have needed to change. 
> My crawls, by default, go for as much as even 12 hours with little to no 
> tweaking necessary from the nutch-default. Something else is causing it. Is 
> it always the same URL that it fails at?
> 
> -Original Message-
> From: Chip Calhoun [mailto:ccalh...@aip.org] 
> Sent: April-17-18 10:45 AM
> To: user@nutch.apache.org
> Subject: Nutch fetching times out at 3 hours, not sure why.
> 
> I crawl a list of roughly 2600 URLs all on my local server, and I'm only 
> crawling around 1000 of them. The fetcher quits after exactly 3 hours (give 
> or take a few milliseconds) with this message in the log:
> 
> 2018-04-13 15:50:48,885 INFO  fetcher.FetchItemQueues - * queue: 
> https://history.aip.org >> dropping!
> 
> I've seen that 3 hours is the default in some Nutch installations, but I've 
> got my fetcher.timelimit.mins set to -1. I'm sure I'm missing something 
> obvious. Any thoughts would be greatly appreciated. Thank you.
> 
> Chip Calhoun
> Digital Archivist
> Niels Bohr Library & Archives
> American Institute of Physics
> One Physics Ellipse
> College Park, MD  20740-3840  USA
> Tel: +1 301-209-3180
> Email: ccalh...@aip.org
> https://www.aip.org/history-programs/niels-bohr-library
> 
> 


RE: Issues related to Hung threads when crawling more than 15K articles

2018-04-04 Thread Markus Jelsma
That doesn't appear to be the case: the fetcher's time bomb nicely logs when it 
has reached its limit, and it usually runs for longer than the two seconds we 
see here.

What can you find in the logs? There must be some error beyond having hung 
threads. Usually something with a hanging parser or GC issues.

Markus
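
If it does turn out to be a hanging parser, one knob worth checking is the
parser timeout, e.g. in nutch-site.xml (the value below is the default):

  <property>
    <name>parser.timeout</name>
    <value>30</value>
    <description>Timeout in seconds for parsing a single document;
    set to -1 to disable the timeout.</description>
  </property>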

 
 
-Original message-
> From:Yossi Tamari 
> Sent: Wednesday 4th April 2018 11:37
> To: user@nutch.apache.org; shivakarthik...@gmail.com
> Cc: 'Sebastian Nagel' 
> Subject: RE: Issues related to Hung threads when crawling more than 15K 
> articles
> 
> I believe this is normal behaviour. The fetch timeout which you have defined 
> (fetcher.timelimit.mins) has passed, and the fetcher is exiting. In this case 
> one of the fetcher threads is still waiting for a response from a specific 
> URL. This is not a problem, and any URLs which were not fetched because of 
> the timeout will be "generated" again in a future segment.
> You do want to try to match the fetcher timeout and the generated segment 
> size, but you can never be 100% successful, and that's not a problem.
> 
>   Yossi.
> 
> > -Original Message-
> > From: ShivaKarthik S 
> > Sent: 04 April 2018 12:32
> > To: user@nutch.apache.org
> > Cc: Sebastian Nagel 
> > Subject: Reg: Issues related to Hung threads when crawling more than 15K
> > articles
> > 
> > Hi,
> > 
> >I am crawling 25K+ artilces at a time (in single depth), but after 
> > crawling (using
> > nutch-1.11) certain amount of articles am getting error related to Hung 
> > threads
> > and the process gets killed. Can some one suggest me a solution to resolve 
> > this?
> > 
> > *Error am getting is as follows*
> > 
> > Fetcher: throughput threshold: -1
> > Fetcher: throughput threshold retries: 5 -activeThreads=10, spinWaiting=9,
> > fetchQueues.totalSize=2,
> > fetchQueues.getQueueCount=1
> > Aborting with 10 hung threads.
> > Thread #0 hung while processing
> > https://24.kg/sport/29754_kyirgyizstantsyi_vyiigrali_dva_boya_na_litsenzionno
> > m_turnire_po_boksu_v_kitae/
> > Thread #1 hung while processing null
> > Thread #2 hung while processing null
> > Thread #3 hung while processing null
> > Thread #4 hung while processing null
> > Thread #5 hung while processing null
> > Thread #6 hung while processing null
> > Thread #7 hung while processing null
> > Thread #8 hung while processing null
> > Thread #9 hung while processing null
> > Fetcher: finished at 2018-04-04 14:23:45, elapsed: 00:00:02
> > 
> > --
> > Thanks and Regards
> > Shiva
> 
> 


RE: Is there any way to block the hubpages while crawling

2018-03-20 Thread Markus Jelsma
Hello Shiva,

Yes, that is possible, but it (ours) is not a foolproof solution.

We got our first hub classifier years ago in the form of a simple ParseFilter 
backed by an SVM. The model was built solely on the HTML of positive and 
negative examples, with very few features, so it was extremely unreliable for 
sites that weren't part of the training set.

Today we operate a hierarchical set of SVMs that get tons of features from 
pre-analyzed structures in the HTML. It helped a great deal, because first we 
try to figure out what kind of website it is, and only then whether a page is 
a hub. It is much easier to recognize a hub page if you know whether the site 
is a forum, a regular news/blog site, a wiki or a webshop.

I know this is not the answer you are looking for, but if you analyze HTML, get 
data structures out of it and use those as features for SVMs, you are on your 
way. It worked for us at least.

Regards,
Markus
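
For the indexing-filter route Sebastian mentions below (skip documents that
look like hubs because they carry many outlinks), a minimal sketch of such a
plugin could look like this; the class name, property name and threshold are
made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;

public class HubSkipIndexingFilter implements IndexingFilter {

  private Configuration conf;
  private int maxOutlinks;

  // Returning null drops the document from indexing; the URL stays in the crawldb.
  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    Outlink[] outlinks = parse.getData().getOutlinks();
    if (outlinks != null && outlinks.length > maxOutlinks) {
      return null;
    }
    return doc;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // hub.skip.max.outlinks is a made-up property name for this sketch
    this.maxOutlinks = conf.getInt("hub.skip.max.outlinks", 100);
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

The class still needs the usual plugin descriptor (plugin.xml, build.xml) and
an entry in plugin.includes to be picked up.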
 
-Original message-
> From:Sebastian Nagel 
> Sent: Tuesday 20th March 2018 13:21
> To: user@nutch.apache.org
> Subject: Re: Is there any way to block the hubpages while crawling
> 
> Hi,
> 
> > more control over what is being indexed?
> 
> It's possible to enable URL filters for the indexer:
>    bin/nutch index ... -filter
> With little extra effort you can use different URL filter rules
> during the index step, e.g. in local mode by pointing NUTCH_CONF_DIR
> to a different folder.
> 
> >> I can't generalize any rule
> 
> What about to classify hubs by number of outlinks?
> Then you could skip those pages using an indexing-filter, just return
> null if a document shall be skipped.
> For a larger crawl you'll definitely get lost with a URL filter.
> 
> Maybe you can also see this as a ranking problem: if hub pages are
> only penalized you could apply simple but noisy heuristics.
> 
> Best,
> Sebastian
> 
> On 03/18/2018 10:10 AM, BlackIce wrote:
> > Basically what you're saying is that you need more control over what is
> > being indexed?
> > 
> > That's an excellent question!
> > 
> > Greetz!
> > 
> > On Mar 17, 2018 11:46 AM, "ShivaKarthik S" 
> > wrote:
> > 
> >> Hi,
> >>
> >> Is there any way to block the hub pages & index only the articles from the
> >> websites. I wanted to index only the articles & not hubpage. Hub pages will
> >> be crawled & the outlines will be extracted, but while indexing, I needed
> >> only the articles to be indexed.
> >> E.g.
> >> www.abc.com/xyz & www.abc.com/abc are hub pages and www.abc.com/xyz/1.html
> >> & www.abc.com/ABC/1.html is an article.
> >>
> >> In this case I can block all the urls not ending with .html or .aspx or
> >> .JSP or any other extensions. But all the websites need not be following
> >> same format. Some follow . html for hub pages as well as articles & some
> >> follow no extension for both hub pages as well as articles. Considering
> >> these cases, I can't generalize any rule saying that whichever is ending
> >> without extension is hubpage & whichever is ending with extension is
> >> article. Is there any way in nutch 1.x this can be handled?
> >>
> >> Thanks & regards
> >> Shiva
> >>
> >>
> >> --
> >> Thanks and Regards
> >> Shiva
> >>
> > 
> 
> 


RE: Reg: URL Near Duplicate Issues with same content

2018-03-15 Thread Markus Jelsma
About URL Normalizers, you can use:

urlnormalizer-host to normalize between www and non-www hosts, and
urlnormalizer-slash to normalize trailing versus non-trailing slashes per host.

There are no committed tools that automate this, but if your set of sites is 
limited, it is easy to manage by hand.

Regards,
Markus
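
If the set of sites is small, the same effect can also be reached with
urlnormalizer-regex; a sketch of a regex-normalize.xml entry that folds the
non-www host into the www host for one of the sites mentioned (the pattern is
illustrative):

  <regex>
    <pattern>^http://samacharplus\.com/</pattern>
    <substitution>http://www.samacharplus.com/</substitution>
  </regex>

The entry goes inside the <regex-normalize> root element of
conf/regex-normalize.xml.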

 
 
-Original message-
> From:Sebastian Nagel 
> Sent: Thursday 15th March 2018 10:34
> To: user@nutch.apache.org
> Cc: shivakarthik...@gmail.com
> Subject: Re: Reg: URL Near Duplicate Issues with same content
> 
> Hi Shiva,
> 
> 1. you can define URL normalizer rules to rewrite the URLs
>    but it only works for sites where you know which URL is
>    the canonical form.
> 
> 2. you can deduplicate (command "nutch dedup") based on the
>    content checksum: the duplicates are still crawled but deleted
>    afterwards
> 
> It's a frequent problem (plus http:// vs. https://) but there is
> no solution which works for all sites because each site or web server
> behaves different. Well-configured servers wouldn't present variant
> URLs and also could redirect the user or crawler to the canonical page.
> 
> Best,
> Sebastian
> 
> 
> 
> On 03/15/2018 10:12 AM, ShivaKarthik S wrote:
> > Hi,
> > 
> >       I am crawling many websites using Nutch-1.11 or Nutch-1.13 or 1.14. 
> > While crawling am getting
> > near duplicate URLs like the following where the content is exactly the 
> > same 
> > 
> > *_Case1: URLs with and Without WWW_*
> > http://www.samacharplus.com/~samachar/index.php/en/worlds/11-india/24151-nine-crpf-soldiers-martyred-in-naxal-attack-in-sukma
> > http://samacharplus.com/~samachar/index.php/en/worlds/11-india/24151-nine-crpf-soldiers-martyred-in-naxal-attack-in-sukma
> > 
> > *_Case2: URLs ending with and without Slash (/)_*
> >  http://eng.belta.by/news-headers 
> >  http://eng.belta.by/news-headers/  
> > http://eng.belta.by/products
> > http://eng.belta.by/products/
> > 
> > Nutch is not able to handle this and the it is sending as separate document 
> > in each case whereas it
> > is actually duplicate URLs. Can you give me a solution to handles these 
> > kind of pages and treat them
> > as a single one. 
> > 
> > -- 
> > Thanks and Regards
> > Shiva
> 
> 


RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
That is for the LinkDB.

 
 
-Original message-
> From:Yossi Tamari <yossi.tam...@pipl.com>
> Sent: Monday 12th March 2018 13:02
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long 
> links
> 
> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste error...
> 
> > -Original Message-
> > From: Markus Jelsma <markus.jel...@openindex.io>
> > Sent: 12 March 2018 14:01
> > To: user@nutch.apache.org
> > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long 
> > links
> > 
> > scripts/apache-nutch-
> > 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> > maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> > scripts/apache-nutch-
> > 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:int
> > maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> > 
> > 
> > 
> > 
> > -Original message-
> > > From:Yossi Tamari <yossi.tam...@pipl.com>
> > > Sent: Monday 12th March 2018 12:56
> > > To: user@nutch.apache.org
> > > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> > > long links
> > >
> > > Nutch.default contains a property db.max.outlinks.per.page, which I think 
> > > is
> > supposed to prevent these cases. However, I just searched the code and 
> > couldn't
> > find where it is used. Bug?
> > >
> > > > -Original Message-
> > > > From: Semyon Semyonov <semyon.semyo...@mail.com>
> > > > Sent: 12 March 2018 12:47
> > > > To: usernutch.apache.org <user@nutch.apache.org>
> > > > Subject: UrlRegexFilter is getting destroyed for unrealistically
> > > > long links
> > > >
> > > > Dear all,
> > > >
> > > > There is an issue with UrlRegexFilter and parsing. In average,
> > > > parsing takes about 1 millisecond, but sometimes the websites have
> > > > the crazy links that destroy the parsing(takes 3+ hours and destroy the 
> > > > next
> > steps of the crawling).
> > > > For example, below you can see shortened logged version of url with
> > > > encoded image, the real lenght of the link is 532572 characters.
> > > >
> > > > Any idea what should I do with such behavior?  Should I modify the
> > > > plugin to reject links with lenght > MAX or use more comlex
> > > > logic/check extra configuration?
> > > > 2018-03-10 23:39:52,082 INFO [main]
> > > > org.apache.nutch.parse.ParseOutputFormat:
> > > > ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and
> > > > normalization
> > > > 2018-03-10 23:39:52,178 INFO [main]
> > > > org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> > > > ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter
> > > > for url
> > > >
> > :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoNS
> > > >
> > UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> > > >
> > Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> > > >
> > X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> > > >
> > efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> > > > dbnu50253lju... [532572 characters]
> > > > 2018-03-11 03:56:26,118 INFO [main]
> > > > org.apache.nutch.parse.ParseOutputFormat:
> > > > ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
> > > > normalization
> > > >
> > > > Semyon.
> > >
> > >
> 
> 


RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);


 
 
-Original message-
> From:Yossi Tamari 
> Sent: Monday 12th March 2018 12:56
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long 
> links
> 
> Nutch.default contains a property db.max.outlinks.per.page, which I think is 
> supposed to prevent these cases. However, I just searched the code and 
> couldn't find where it is used. Bug? 
> 
> > -Original Message-
> > From: Semyon Semyonov 
> > Sent: 12 March 2018 12:47
> > To: usernutch.apache.org 
> > Subject: UrlRegexFilter is getting destroyed for unrealistically long links
> > 
> > Dear all,
> > 
> > There is an issue with UrlRegexFilter and parsing. In average, parsing takes
> > about 1 millisecond, but sometimes the websites have the crazy links that
> > destroy the parsing(takes 3+ hours and destroy the next steps of the 
> > crawling).
> > For example, below you can see shortened logged version of url with encoded
> > image, the real lenght of the link is 532572 characters.
> > 
> > Any idea what should I do with such behavior?  Should I modify the plugin to
> > reject links with lenght > MAX or use more comlex logic/check extra
> > configuration?
> > 2018-03-10 23:39:52,082 INFO [main]
> > org.apache.nutch.parse.ParseOutputFormat:
> > ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and 
> > normalization
> > 2018-03-10 23:39:52,178 INFO [main]
> > org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> > ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url
> > :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoNS
> > UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> > Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> > X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> > efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> > dbnu50253lju... [532572 characters]
> > 2018-03-11 03:56:26,118 INFO [main]
> > org.apache.nutch.parse.ParseOutputFormat:
> > ParseOutputFormat.Write.filterNormalize 4.4. After filteing and 
> > normalization
> > 
> > Semyon.
> 
> 


RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
Hello - see inline.

Regards,
Markus
 
-Original message-
> From:Semyon Semyonov 
> Sent: Monday 12th March 2018 11:47
> To: usernutch.apache.org 
> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
> 
> Dear all,
> 
> There is an issue with UrlRegexFilter and parsing. In average, parsing takes 
> about 1 millisecond, but sometimes the websites have the crazy links that 
> destroy the parsing(takes 3+ hours and destroy the next steps of the 
> crawling). 

Regarding "destroys the next steps": you mean other jobs then also take a long 
time? In that case you have filtering/normalizing enabled for those jobs, which 
you can safely disable. You already filtered/normalized while parsing, so there 
is no need to do it twice or more (except when you have different filters 
depending on the job).

> For example, below you can see shortened logged version of url with encoded 
> image, the real lenght of the link is 532572 characters.
>  
> Any idea what should I do with such behavior?  Should I modify the plugin to 
> reject links with lenght > MAX or use more comlex logic/check extra 
> configuration?

We skip all URLs longer than 512 characters using -.{512,} as the first rule in 
the regex file. We have not seen any problems with skipping those URLs, nor any 
customer URLs that still make sense but are longer than 512 characters.
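
Placement matters because the first matching rule wins; a sketch of how the
cut-off sits at the top of regex-urlfilter.txt (the remaining rules are the
stock ones, ending with the accept-all line):

  # reject URLs of 512 characters or more
  -.{512,}
  # ... stock rules ...
  +.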

> 2018-03-10 23:39:52,082 INFO [main] org.apache.nutch.parse.ParseOutputFormat: 
> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and 
> normalization 
> 2018-03-10 23:39:52,178 INFO [main] 
> org.apache.nutch.urlfilter.api.RegexURLFilterBase: 
> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url 
> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoNSUhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNudbnu50253lju...
>  [532572 characters]
> 2018-03-11 03:56:26,118 INFO [main] org.apache.nutch.parse.ParseOutputFormat: 
> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and normalization 
> 
> Semyon.
> 


RE: Need Tutorial on Nutch

2018-03-07 Thread Markus Jelsma
Hello,

Yes, we have used headless browsers with and without Nutch. But I am unsure 
which of the mentioned challenges a headless browser is going to help solve, 
except for dealing with sites that serve only AJAXed web pages.

Semyon is right, if you really want this, Nutch and Hadoop can be great tools 
for the job, but none of it is easy and you are going to need plenty of custom 
code. That is, of course, doable, but you also need to bring plenty of 
hardware, infrastructure and time to do the job.

Regards,
Markus
 
 
-Original message-
> From:Eric Valencia <ericlvalen...@gmail.com>
> Sent: Wednesday 7th March 2018 21:51
> To: user@nutch.apache.org
> Subject: Re: Need Tutorial on Nutch
> 
> How about using nutch with a headless browser like CasperJS?  Will this
> work? Have any of you tried this?
> 
> On Tue, Mar 6, 2018 at 1:00 PM Markus Jelsma <markus.jel...@openindex.io>
> wrote:
> 
> > Hi,
> >
> > Yes you are going to need code, and a lot more than just that, probably
> > including dropping the 'every two hour' requirement.
> >
> > For your case you need either site-specific price extraction, which is
> > easy but a lot of work for 500+ sites. Or you need a more complicated
> > generic algorithm, which is a lot of work too. Both can be implemented as
> > Nutch ParseFilter plugins and need Java code to run.
> >
> > Your next problem is daily volume, every product 12x per day for 500+
> > shops times many products. You can ignore bandwidth and processing, that is
> > easy. But you are going to be blocked within a few days by at least a good
> > amount of sites.
> >
> > We once built a price checker crawler too, but the client's requirement
> > for very high interval checks could not be met easily without the use of
> > costly proxies to avoid being blocked, hardware and network costs. They
> > dropped the requirement.
> >
> > Good luck
> > Markus
> >
> > -Original message-
> > > From:Eric Valencia <ericlvalen...@gmail.com>
> > > Sent: Tuesday 6th March 2018 21:17
> > > To: user@nutch.apache.org
> > > Subject: Re: Need Tutorial on Nutch
> > >
> > > Yash, well, I want to monitor the price for every item in the top 500
> > > retail websites every two hours, 24/7/365.  Java is needed?
> > >
> > > On Tue, Mar 6, 2018 at 12:15 PM, Yash Thenuan Thenuan <
> > > rit2014...@iiita.ac.in> wrote:
> > >
> > > > If you want simple crawlung then Not at all.
> > > > But having experience with java will help you to fulfil your personal
> > > > requirements.
> > > >
> > > > On 7 Mar 2018 01:42, "Eric Valencia" <ericlvalen...@gmail.com> wrote:
> > > >
> > > > > Does this require knowing Java proficiently?
> > > > >
> > > > > On Tue, Mar 6, 2018 at 10:51 AM Semyon Semyonov <
> > > > semyon.semyo...@mail.com>
> > > > > wrote:
> > > > >
> > > > > > Here is an unpleasant truth - there is no up to date tutorial for
> > > > Nutch.
> > > > > > To make it even more interesting, sometimes the tutorial can
> > contradict
> > > > > > real behavior of Nutch, because of lately introduced
> > features/bugs. If
> > > > > you
> > > > > > find such cases, please try to fix and contribute to the project.
> > > > > >
> > > > > > Welcome to the open source world.
> > > > > >
> > > > > > Though, my recommendations as a person who started with Nutch less
> > > > then a
> > > > > > year ago :
> > > > > > 1) If you just need a simple crawl, you are in luck. Simply run
> > crawl
> > > > > > script or several steps according to the Nutch crawl tutorial.
> > > > > > 2) If it is bit more comlex you start to face problems either with
> > > > > > configuration or with bugs. Therefore, first have a look at Nutch
> > List
> > > > > > Archive http://nutch.apache.org/mailing_lists.html , if it doesnt
> > work
> > > > > > try to figure out yourself, if that doesnt work ask here or at
> > > > developer
> > > > > > list.
> > > > > > 3) In most cases, you HAVE to open the code and fix/discover
> > something.
> > > > > > Nutch is really complicated system and to understand it properly
> > you
> > > > can
> > > > > >

index-metadata, lowercasing field names?

2018-03-07 Thread Markus Jelsma
Hi,

I've got metadata containing a capital in the field name, but index-metadata 
lowercases its field names:
  parseFieldnames.put(metatag.toLowerCase(Locale.ROOT), metatag);

This means index-metadata is useless if your metadata fields contain uppercase 
characters. Was this done for a reason?

If not, I'll patch it up.

Thanks,
Markus


RE: Need Tutorial on Nutch

2018-03-06 Thread Markus Jelsma
Hi,

Yes, you are going to need code, and a lot more than just that, probably 
including dropping the 'every two hours' requirement.

For your case you need either site-specific price extraction, which is easy but 
a lot of work for 500+ sites, or a more complicated generic algorithm, which is 
a lot of work too. Both can be implemented as Nutch ParseFilter plugins and 
need Java code to run.

Your next problem is daily volume: every product 12x per day for 500+ shops 
times many products. You can ignore bandwidth and processing, that is easy. But 
you are going to be blocked within a few days by at least a good number of 
sites.

We once built a price checker crawler too, but the client's requirement for 
very high-frequency checks could not be met easily without costly proxies to 
avoid being blocked, plus the hardware and network costs. They dropped the 
requirement.

Good luck
Markus
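
To give an idea of the shape of such a ParseFilter, here is a bare-bones
sketch that pulls a price out of the parsed text with a regex and stores it in
the parse metadata; the class name, regex and metadata key are made up, and
real shops will need per-site or structured-data extraction:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class PriceParseFilter implements HtmlParseFilter {

  // Naive pattern, purely for illustration.
  private static final Pattern PRICE = Pattern.compile("\\$\\s?([0-9]+(?:\\.[0-9]{2})?)");

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    if (parse == null) {
      return parseResult;
    }
    Matcher m = PRICE.matcher(parse.getText());
    if (m.find()) {
      // "price" is an arbitrary parse metadata key.
      parse.getData().getParseMeta().set("price", m.group(1));
    }
    return parseResult;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

The "price" metadata key could then be indexed with the index-metadata plugin
via the index.parse.md property.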
 
-Original message-
> From:Eric Valencia 
> Sent: Tuesday 6th March 2018 21:17
> To: user@nutch.apache.org
> Subject: Re: Need Tutorial on Nutch
> 
> Yash, well, I want to monitor the price for every item in the top 500
> retail websites every two hours, 24/7/365.  Java is needed?
> 
> On Tue, Mar 6, 2018 at 12:15 PM, Yash Thenuan Thenuan <
> rit2014...@iiita.ac.in> wrote:
> 
> > If you want simple crawlung then Not at all.
> > But having experience with java will help you to fulfil your personal
> > requirements.
> >
> > On 7 Mar 2018 01:42, "Eric Valencia"  wrote:
> >
> > > Does this require knowing Java proficiently?
> > >
> > > On Tue, Mar 6, 2018 at 10:51 AM Semyon Semyonov <
> > semyon.semyo...@mail.com>
> > > wrote:
> > >
> > > > Here is an unpleasant truth - there is no up to date tutorial for
> > Nutch.
> > > > To make it even more interesting, sometimes the tutorial can contradict
> > > > real behavior of Nutch, because of lately introduced features/bugs. If
> > > you
> > > > find such cases, please try to fix and contribute to the project.
> > > >
> > > > Welcome to the open source world.
> > > >
> > > > Though, my recommendations as a person who started with Nutch less
> > then a
> > > > year ago :
> > > > 1) If you just need a simple crawl, you are in luck. Simply run crawl
> > > > script or several steps according to the Nutch crawl tutorial.
> > > > 2) If it is bit more comlex you start to face problems either with
> > > > configuration or with bugs. Therefore, first have a look at Nutch List
> > > > Archive http://nutch.apache.org/mailing_lists.html , if it doesnt work
> > > > try to figure out yourself, if that doesnt work ask here or at
> > developer
> > > > list.
> > > > 3) In most cases, you HAVE to open the code and fix/discover something.
> > > > Nutch is really complicated system and to understand it properly you
> > can
> > > > easily spend 2-3 months trying to get the full basic understanding of
> > the
> > > > system. It gets even worse if you don't know Hadoop. If you dont I do
> > > > recomend to read "Hadoop. The definitive guide", because, well, Nutch
> > is
> > > > Hadoop.
> > > >
> > > > Here we are, no pain, no gain.
> > > >
> > > >
> > > >
> > > > Sent: Tuesday, March 06, 2018 at 7:42 PM
> > > > From: "Eric Valencia" 
> > > > To: user@nutch.apache.org
> > > > Subject: Re: Need Tutorial on Nutch
> > > > Thank you kindly Yash. Yes, I did try some of the tutorials actually
> > but
> > > > they seem to be missing the complete amount of steps required to
> > > > successfully scrape in nutch.
> > > >
> > > > On Tue, Mar 6, 2018 at 10:37 AM Yash Thenuan Thenuan <
> > > > rit2014...@iiita.ac.in>
> > > > wrote:
> > > >
> > > > > I would suggest to start with the documentation on nutch's website.
> > > > > You can get a Idea about how to start crawling and all.
> > > > > Apart from that there are no proper tutorials as such.
> > > > > Just start crawling if you got stuck somewhere try to find something
> > > > > related to that on Google and nutch mailing list archives.
> > > > > Ask questions if nothing helps.
> > > > >
> > > > > On 7 Mar 2018 00:01, "Eric Valencia" 
> > wrote:
> > > > >
> > > > > I'm a beginner in Nutch and need the best tutorials to get started.
> > Can
> > > > > you guys let me know how you would advise yourselves if starting
> > today
> > > > > (like me)?
> > > > >
> > > > > Eric
> > > > >
> > > >
> > >
> >
> 


RE: Why doesn't hostdb support byDomain mode?

2018-03-05 Thread Markus Jelsma
Ah, well, that is a good one! It took me a while to figure it out, but having 
the check there is an error. We had added the same check in an earlier, 
different Nutch job, where the database could effectively remove its own 
entries just by the rules it emitted when host normalization was enabled.

I simply reused the job setup code and forgot to remove that check. You can 
safely remove that check in HostDB.

Regards,
Markus


-Original message-
> From:Yossi Tamari <yossi.tam...@pipl.com>
> Sent: Monday 5th March 2018 11:30
> To: user@nutch.apache.org
> Subject: RE: Why doesn't hostdb support byDomain mode?
> 
> Thanks Markus, I will open a ticket and submit a patch.
> One follow up question: UpdateHostDb checks and throws an exception if 
> urlnormalizer-host (which can be used to mitigate the problem I mentioned) is 
> enabled. Is that also an internal decision of OpenIndex, and perhaps should 
> be removed now that the code is part of Nutch, or is there a reason this 
> normalizer must not be used with UpdateHostDb?
> 
>   Yossi.
> 
> > -Original Message-
> > From: Markus Jelsma <markus.jel...@openindex.io>
> > Sent: 05 March 2018 12:22
> > To: user@nutch.apache.org
> > Subject: RE: Why doesn't hostdb support byDomain mode?
> > 
> > Hi,
> > 
> > The reason is simple, we (company) needed this information based on
> > hostname, so we made a hostdb. I don't see any downside for supporting a
> > domain mode. Adding support for it through hostdb.url.mode seems like a good
> > idea.
> > 
> > Regards,
> > Markus
> > 
> > -Original message-
> > > From:Yossi Tamari <yossi.tam...@pipl.com>
> > > Sent: Sunday 4th March 2018 12:01
> > > To: user@nutch.apache.org
> > > Subject: Why doesn't hostdb support byDomain mode?
> > >
> > > Hi,
> > >
> > >
> > >
> > > Is there a reason that hostdb provides per-host data even when the
> > > generate/fetch are working by domain? This generates misleading
> > > statistics for servers that load-balance by redirecting to nodes (e.g.
> > photobucket).
> > >
> > > If this is just an oversight, I can contribute a patch, but I'm not
> > > sure if I should use partition.url.mode, generate.count.mode, one of
> > > the other similar properties, or create one more such property
> > hostdb.url.mode.
> > >
> > >
> > >
> > > Yossi.
> > >
> > >
> 
> 


RE: Why doesn't hostdb support byDomain mode?

2018-03-05 Thread Markus Jelsma
Hi,

The reason is simple: we (the company) needed this information based on 
hostname, so we made a hostdb. I don't see any downside to supporting a domain 
mode. Adding support for it through hostdb.url.mode seems like a good idea.

Regards,
Markus

-Original message-
> From:Yossi Tamari 
> Sent: Sunday 4th March 2018 12:01
> To: user@nutch.apache.org
> Subject: Why doesn't hostdb support byDomain mode?
> 
> Hi,
> 
>  
> 
> Is there a reason that hostdb provides per-host data even when the
> generate/fetch are working by domain? This generates misleading statistics
> for servers that load-balance by redirecting to nodes (e.g. photobucket).
> 
> If this is just an oversight, I can contribute a patch, but I'm not sure if
> I should use partition.url.mode, generate.count.mode, one of the other
> similar properties, or create one more such property hostdb.url.mode.
> 
>  
> 
> Yossi.
> 
> 


RE: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-23 Thread Markus Jelsma
Hi,

If you want to steer clear of all 2.x caveats, use Nutch 1.x. If you want the 
most stable and feature-rich version, use 1.x. If you want to limit the number 
of moving parts (Gora as a DB abstraction, running and operating a separate DB 
server), use 1.x. If you do not intend to crawl tens of millions of records, 
you are fine running Nutch 1.x locally. 

Regards,
Markus
 
-Original message-
> From:Kaliyug Antagonist 
> Sent: Friday 23rd February 2018 22:48
> To: user@nutch.apache.org
> Subject: RE: Nutch pointed to Cassandra, yet, asks for Hadoop
> 
> So what's the whole point of supporting Cassandra or other databases(via
> Gora) if Hadoop(HDFS & MR)both are essential? What exactly Cassandra would
> be doing ?
> 
> On 23 Feb 2018 22:41, "Yossi Tamari"  wrote:
> 
> > 1 is not true.
> > 2 is true, if we ignore the second part 
> > Hadoop is made of two parts: distributed storage (HDFS) and a Map/Reduce
> > framework. Nutch is essentially a collection of Map/Reduce tasks. It relies
> > on Hadoop to distribute these tasks to all participating servers. So if you
> > run in local mode, you can only use one server. If you have a single-node
> > Hadoop, Nutch will be able to fully utilize the server, but it will still
> > be limited to crawling from one machine, which is only sufficient for
> > small/slow crawls.
> >
> > > -Original Message-
> > > From: Kaliyug Antagonist [mailto:kaliyugantagon...@gmail.com]
> > > Sent: 23 February 2018 23:16
> > > To: user@nutch.apache.org
> > > Subject: RE: Nutch pointed to Cassandra, yet, asks for Hadoop
> > >
> > > Ohh. I'm a bit confused. What of the following is true in the 'deploy'
> > mode:
> > > 1. Data cannot be stored in Cassandra, HBase is the only way.
> > > 2. Data will be stored in Cassandra but you need a (maybe, just a single
> > > node)Hadoop cluster anyway which won't be storing any data but is there
> > just to
> > > make Nutch happy.
> > >
> > > On 23 Feb 2018 22:08, "Yossi Tamari"  wrote:
> > >
> > > > Hi Kaliyug,
> > > >
> > > > Nutch 2 still requires Hadoop to run, it just allows you to store data
> > > > somewhere other than HDFS.
> > > > The only way to run Nutch without Hadoop is local mode, which is only
> > > > recommended for testing. To do that, run ./runtime/local/bin/crawl.
> > > >
> > > > Yossi.
> > > >
> > > > > -Original Message-
> > > > > From: Kaliyug Antagonist [mailto:kaliyugantagon...@gmail.com]
> > > > > Sent: 23 February 2018 20:26
> > > > > To: user@nutch.apache.org
> > > > > Subject: Nutch pointed to Cassandra, yet, asks for Hadoop
> > > > >
> > > > > Windows 10 Nutch 2.3.1 Cassandra 3.11.1
> > > > >
> > > > > I have extracted and built Nutch under the Cygwin's home directory.
> > > > >
> > > > > I believe that the Cassandra server is working:
> > > > >
> > > > > INFO  [main] 2018-02-23 16:20:41,077 StorageService.java:1442 -
> > > > > JOINING: Finish joining ring
> > > > > INFO  [main] 2018-02-23 16:20:41,820 SecondaryIndexManager.java:509
> > > > > - Executing pre-join tasks for: CFS(Keyspace='test',
> > > > > ColumnFamily='test')
> > > > > INFO  [main] 2018-02-23 16:20:42,161 StorageService.java:2268 - Node
> > > > > localhost/127.0.0.1 state jump to NORMAL INFO  [main] 2018-02-23
> > > > > 16:20:43,049 NativeTransportService.java:75 - Netty using Java NIO
> > > > > event
> > > > loop
> > > > > INFO  [main] 2018-02-23 16:20:43,358 Server.java:155 - Using Netty
> > > > > Version: [netty-buffer=netty-buffer-4.0.44.Final.452812a,
> > > > > netty-codec=netty-codec-4.0.44.Final.452812a,
> > > > > netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a,
> > > > > netty-codec-http=netty-codec-http-4.0.44.Final.452812a,
> > > > > netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a,
> > > > > netty-common=netty-common-4.0.44.Final.452812a,
> > > > > netty-handler=netty-handler-4.0.44.Final.452812a,
> > > > > netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb,
> > > > > netty-transport=netty-transport-4.0.44.Final.452812a,
> > > > > netty-transport-native-epoll=netty-transport-native-epoll-
> > > > 4.0.44.Final.452812a,
> > > > > netty-transport-rxtx=netty-transport-rxtx-4.0.44.Final.452812a,
> > > > > netty-transport-sctp=netty-transport-sctp-4.0.44.Final.452812a,
> > > > > netty-transport-udt=netty-transport-udt-4.0.44.Final.452812a]
> > > > > INFO  [main] 2018-02-23 16:20:43,359 Server.java:156 - Starting
> > > > listening for
> > > > > CQL clients on localhost/127.0.0.1:9042 (unencrypted)...
> > > > > INFO  [main] 2018-02-23 16:20:43,941 CassandraDaemon.java:527 - Not
> > > > > starting RPC server as requested. Use JMX
> > > > > (StorageService->startRPCServer()) or nodetool (enablethrift) to
> > > > > start
> > > > it
> > > > >
> > > > > I did the following check:
> > > > >
> > > > > apache-cassandra-3.11.1\bin>nodetool status
> > > > > Datacenter: datacenter1
> > > > > 
> > > > > Status=Up/Down
> > > 

RE: Search with Accent and without accent Character

2018-02-13 Thread Markus Jelsma
Checked and confirmed: even the Dutch digraph IJ is folded properly, as is the 
upper-case dotless Turkish i, and the Spanish example you provided is folded 
properly as well.

A correction for German (before Nagel corrects me): ö and ü are not normalized 
by the ICU folder according to German rules. Their umlauts are stripped instead 
of being transformed into oe and ue respectively. That makes the case for 
language-specific folders, especially when dealing with Scandinavian languages 
or German. Dutch and Latin can be folded just by removing their accents.

Correct me if I'm wrong!
Markus
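
P.S. For reference, a minimal Solr fieldType along those lines — assuming the ICU analysis jars (analysis-extras contrib) are on Solr's classpath, and using the same analyzer at index and query time:

    <fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- ICU folding also lowercases, so separate lowercase/ASCII-folding filters can be dropped -->
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
    </fieldType>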
 
-Original message-
> From:Markus Jelsma 
> Sent: Tuesday 13th February 2018 22:21
> To: user@nutch.apache.org
> Subject: RE: Search with Accent and without accent Character
> 
> Hi,
> 
> My guess is you haven't reindexed after changing filter configuration, which 
> is required for index-time filters.
> 
> Regarding your fieldType, you can drop the lowercase and ASCII folding 
> filters and just keep the ICU folder, it will work for pretty much any 
> character set. It will normalize case, Scandinavian digraphs (AE), probably 
> Dutch digraphs (IJ) as well. But also deal with German oe ü, ringel s and all 
> regular Latin accents including Spanish tilde ~, circumflex etc.
> 
> If a there is a language specific normalizer/folder, use that instead of ICU 
> because there can be differences in how accents should be normalized across 
> languages.
> 
> And do not forget to reindex and use the same normalizers index- and 
> query-time.
> 
> Regards,
> Markus
> 
>  
>  
> -Original message-
> > From:Rushi 
> > Sent: Tuesday 13th February 2018 19:40
> > To: user@nutch.apache.org
> > Subject: Search with Accent and without accent Character
> > 
> > Hello All,
> > I integrated Nutch with solr ,everything seems to be fine till now, i am
> > having a issue while searching some spanish accent characters,the search
> > results are not same,with accent (Example :investigación) gives correct
> > result  but without accent(example :investigacion) gives zero results.
> > I tried using  various filters but still the issue is same.Here is my
> > configuration on nutch and solr.
> > 
> > 
> >   > positionIncrementGap="100">
> > 
> > 
> > 
> > 
> > 
> >  > maxGramSize="50" side="front"/>
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >   
> > 
> > I would really appreciate if  anyone of you can  tell me what i am missing?
> > -- 
> > Regards
> > Rushikesh M
> > .Net Developer
> > 
> 


RE: Search with Accent and without accent Character

2018-02-13 Thread Markus Jelsma
Hi,

My guess is you haven't reindexed after changing filter configuration, which is 
required for index-time filters.

Regarding your fieldType, you can drop the lowercase and ASCII folding filters 
and just keep the ICU folder, it will work for pretty much any character set. 
It will normalize case, Scandinavian digraphs (AE), probably Dutch digraphs 
(IJ) as well. But also deal with German oe ü, ringel s and all regular Latin 
accents including Spanish tilde ~, circumflex etc.

If there is a language-specific normalizer/folder, use that instead of ICU 
because there can be differences in how accents should be normalized across 
languages.

And do not forget to reindex and use the same normalizers index- and query-time.

Regards,
Markus

 
 
-Original message-
> From:Rushi 
> Sent: Tuesday 13th February 2018 19:40
> To: user@nutch.apache.org
> Subject: Search with Accent and without accent Character
> 
> Hello All,
> I integrated Nutch with solr ,everything seems to be fine till now, i am
> having a issue while searching some spanish accent characters,the search
> results are not same,with accent (Example :investigación) gives correct
> result  but without accent(example :investigacion) gives zero results.
> I tried using  various filters but still the issue is same.Here is my
> configuration on nutch and solr.
> 
> 
>   positionIncrementGap="100">
> 
> 
> 
> 
> 
>  maxGramSize="50" side="front"/>
> 
> 
> 
> 
> 
> 
> 
> 
>   
> 
> I would really appreciate if  anyone of you can  tell me what i am missing?
> -- 
> Regards
> Rushikesh M
> .Net Developer
> 


RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
Hello Lewis,

We do have some weird and complicated rules, but these should not time out at 
450 seconds, i.e. keep the JVM busy for that amount of time. We haven't fully 
investigated yet, so it is possible some sitemap entries are very long and 
complicated. Still, 450 seconds is very odd, and it seems reproducible as it 
happened twice in a row.

The disaster is not that big of a problem thanks to HDFS snapshots.

Thanks,
Markus
 
-Original message-
> From:lewis john mcgibbney <lewi...@apache.org>
> Sent: Wednesday 17th January 2018 17:47
> To: user@nutch.apache.org
> Subject: Re: SitemapProcessor destroyed our CrawlDB
> 
> Hi Markus,
> 
> What a disaster... do/did you have any crazy rules, replacements and/or
> substitutions present in the urlnormalizer-regex configuration?
> Lewis
> 
> On Wed, Jan 17, 2018 at 2:51 AM, <user-digest-h...@nutch.apache.org> wrote:
> 
> >
> > From: Markus Jelsma <markus.jel...@openindex.io>
> > To: User <user@nutch.apache.org>
> > Cc:
> > Bcc:
> > Date: Wed, 17 Jan 2018 10:51:49 +
> > Subject: SitemapProcessor destroyed our CrawlDB
> > Hello,
> >
> > We noticed some abnormalities in our crawl cycle caused by a sudden
> > reduction of our CrawlDB's size. The SitemapProcessor ran, failed (timed
> > out, see below) and left us with a decimated CrawlDB.
> >
> > This is odd because of:
> >
> > } catch (Exception e) {
> >   if (fs.exists(tempCrawlDb))
> > fs.delete(tempCrawlDb, true);
> >
> >   LockUtil.removeLockFile(fs, lock);
> >   throw e;
> > }
> >
> > Any ideas?
> >
> > Thanks,
> > Markus
> >
> > Full thread dump OpenJDK 64-Bit Server VM (25.151-b12 mixed mode):
> >
> > "Thread-52" #74 prio=5 os_prio=0 tid=0x7fe2adc85000 nid=0x6cf8
> > runnable [0x7fe28a86d000]
> >java.lang.Thread.State: RUNNABLE
> > at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3797)
> > at java.util.regex.Pattern$Start.match(Pattern.java:3461)
> > at java.util.regex.Matcher.search(Matcher.java:1248)
> > at java.util.regex.Matcher.find(Matcher.java:637)
> > at java.util.regex.Matcher.replaceAll(Matcher.java:951)
> > at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.
> > regexNormalize(RegexURLNormalizer.java:193)
> > at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(
> > RegexURLNormalizer.java:200)
> > at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:319)
> > at org.apache.nutch.util.SitemapProcessor$SitemapMapper.filterNormalize(
> > SitemapProcessor.java:176)
> > at org.apache.nutch.util.SitemapProcessor$SitemapMapper.
> > generateSitemapUrlDatum(SitemapProcessor.java:225)
> > at org.apache.nutch.util.SitemapProcessor$SitemapMapper.
> > generateSitemapUrlDatum(SitemapProcessor.java:264)
> > at org.apache.nutch.util.SitemapProcessor$SitemapMapper.map(
> > SitemapProcessor.java:154)
> > at org.apache.nutch.util.SitemapProcessor$SitemapMapper.map(
> > SitemapProcessor.java:95)
> > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
> > at org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(
> > MultithreadedMapper.java:273)
> >
> > "SpillThread" #34 daemon prio=5 os_prio=0 tid=0x7fe2ada12000
> > nid=0x6c2f waiting on condition [0x7fe28d2ad000]
> >java.lang.Thread.State: WAITING (parking)
> > at sun.misc.Unsafe.park(Native Method)
> > - parking to wait for  <0xede6dc80> (a java.util.concurrent.locks.
> > AbstractQueuedSynchronizer$ConditionObject)
> > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> > at java.util.concurrent.locks.AbstractQueuedSynchronizer$
> > ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$
> > SpillThread.run(MapTask.java:1530)
> >
> > "org.apache.hadoop.hdfs.PeerCache@1fc0053e" #33 daemon prio=5 os_prio=0
> > tid=0x7fe2ad7fe000 nid=0x6be7 waiting on condition [0x7fe28d3ae000]
> >java.lang.Thread.State: TIMED_WAITING (sleeping)
> > at java.lang.Thread.sleep(Native Method)
> > at org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:253)
> > at org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:46)
> > at org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:124)
> > at java.lang.Thread.run(Thread.java:748)
> >
> > "communication thread" #28 daemon prio=5 os_prio=0 tid=0x7fe2ad975

RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
I'll fix NUTCH-2466 this afternoon. 
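
Roughly, the fix amounts to checking the job's return value as well, something like this (a sketch against the existing SitemapProcessor variables, not the final patch):

    // Treat a false return from waitForCompletion() as failure, not only exceptions,
    // and leave the existing CrawlDB untouched.
    boolean success = job.waitForCompletion(true);
    if (!success) {
      if (fs.exists(tempCrawlDb))
        fs.delete(tempCrawlDb, true);
      LockUtil.removeLockFile(fs, lock);
      throw new RuntimeException("SitemapProcessor job did not succeed");
    }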
 
-Original message-
> From:Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Wednesday 17th January 2018 14:09
> To: user@nutch.apache.org
> Subject: Re: SitemapProcessor destroyed our CrawlDB
> 
> It was finally Omkar who brought NUTCH-2442 forward.
> Time to review the patch of NUTCH-2466!
> 
> On 01/17/2018 01:53 PM, Markus Jelsma wrote:
> > Ah thanks!
> > 
> > I knew you'd fixed some of these, now i know my patch of NUTCH-2466 
> > silently removes your commit!
> > 
> > My bad, thanks!
> > Markus 
> >  
> > -Original message-
> >> From:Sebastian Nagel <wastl.na...@googlemail.com>
> >> Sent: Wednesday 17th January 2018 13:32
> >> To: user@nutch.apache.org
> >> Subject: Re: SitemapProcessor destroyed our CrawlDB
> >>
> >> Hi Markus,
> >>
> >> the problem should be fixed with NUTCH-2442. It wasn't the case with the 
> >> first version of the
> >> sitemap processor. It's mandatory to check also the return value of 
> >> job.waitForCompletion(true),
> >> only checking for exceptions isn't enough!
> >>
> >> Sebastian
> >>
> >> On 01/17/2018 11:51 AM, Markus Jelsma wrote:
> >>> Hello,
> >>>
> >>> We noticed some abnormalities in our crawl cycle caused by a sudden 
> >>> reduction of our CrawlDB's size. The SitemapProcessor ran, failed (timed 
> >>> out, see below) and left us with a decimated CrawlDB.
> >>>
> >>> This is odd because of:
> >>>
> >>>     } catch (Exception e) {
> >>>   if (fs.exists(tempCrawlDb))
> >>>     fs.delete(tempCrawlDb, true);
> >>>
> >>>   LockUtil.removeLockFile(fs, lock);
> >>>   throw e;
> >>>     }
> >>>
> >>> Any ideas?
> >>>
> >>> Thanks,
> >>> Markus
> >>>
> >>> Full thread dump OpenJDK 64-Bit Server VM (25.151-b12 mixed mode):
> >>>
> >>> "Thread-52" #74 prio=5 os_prio=0 tid=0x7fe2adc85000 nid=0x6cf8 
> >>> runnable [0x7fe28a86d000]
> >>>    java.lang.Thread.State: RUNNABLE 
> >>> at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3797) 
> >>> at java.util.regex.Pattern$Start.match(Pattern.java:3461) 
> >>> at java.util.regex.Matcher.search(Matcher.java:1248) 
> >>> at java.util.regex.Matcher.find(Matcher.java:637) 
> >>> at java.util.regex.Matcher.replaceAll(Matcher.java:951) 
> >>> at 
> >>> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.regexNormalize(RegexURLNormalizer.java:193)
> >>>  
> >>> at 
> >>> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(RegexURLNormalizer.java:200)
> >>>  
> >>> at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:319) 
> >>> at 
> >>> org.apache.nutch.util.SitemapProcessor$SitemapMapper.filterNormalize(SitemapProcessor.java:176)
> >>>  
> >>> at 
> >>> org.apache.nutch.util.SitemapProcessor$SitemapMapper.generateSitemapUrlDatum(SitemapProcessor.java:225)
> >>>  
> >>> at 
> >>> org.apache.nutch.util.SitemapProcessor$SitemapMapper.generateSitemapUrlDatum(SitemapProcessor.java:264)
> >>>  
> >>> at 
> >>> org.apache.nutch.util.SitemapProcessor$SitemapMapper.map(SitemapProcessor.java:154)
> >>>  
> >>> at 
> >>> org.apache.nutch.util.SitemapProcessor$SitemapMapper.map(SitemapProcessor.java:95)
> >>>  
> >>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) 
> >>> at 
> >>> org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:273)
> >>>
> >>> "SpillThread" #34 daemon prio=5 os_prio=0 tid=0x7fe2ada12000 
> >>> nid=0x6c2f waiting on condition [0x7fe28d2ad000]
> >>>    java.lang.Thread.State: WAITING (parking) 
> >>> at sun.misc.Unsafe.park(Native Method) 
> >>> - parking to wait for  <0xede6dc80> (a 
> >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) 
> >>> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 
> >>> at 
> >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> >>>  

RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
Ah thanks!

I knew you'd fixed some of these; now I know my patch for NUTCH-2466 silently 
removes your commit!

My bad, thanks!
Markus 
 
-Original message-
> From:Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Wednesday 17th January 2018 13:32
> To: user@nutch.apache.org
> Subject: Re: SitemapProcessor destroyed our CrawlDB
> 
> Hi Markus,
> 
> the problem should be fixed with NUTCH-2442. It wasn't the case with the 
> first version of the
> sitemap processor. It's mandatory to check also the return value of 
> job.waitForCompletion(true),
> only checking for exceptions isn't enough!
> 
> Sebastian
> 
> On 01/17/2018 11:51 AM, Markus Jelsma wrote:
> > Hello,
> > 
> > We noticed some abnormalities in our crawl cycle caused by a sudden 
> > reduction of our CrawlDB's size. The SitemapProcessor ran, failed (timed 
> > out, see below) and left us with a decimated CrawlDB.
> > 
> > This is odd because of:
> > 
> >     } catch (Exception e) {
> >   if (fs.exists(tempCrawlDb))
> >     fs.delete(tempCrawlDb, true);
> > 
> >   LockUtil.removeLockFile(fs, lock);
> >   throw e;
> >     }
> > 
> > Any ideas?
> > 
> > Thanks,
> > Markus
> > 
> > Full thread dump OpenJDK 64-Bit Server VM (25.151-b12 mixed mode):
> > 
> > "Thread-52" #74 prio=5 os_prio=0 tid=0x7fe2adc85000 nid=0x6cf8 runnable 
> > [0x7fe28a86d000]
> >    java.lang.Thread.State: RUNNABLE 
> > at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3797) 
> > at java.util.regex.Pattern$Start.match(Pattern.java:3461) 
> > at java.util.regex.Matcher.search(Matcher.java:1248) 
> > at java.util.regex.Matcher.find(Matcher.java:637) 
> > at java.util.regex.Matcher.replaceAll(Matcher.java:951) 
> > at 
> > org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.regexNormalize(RegexURLNormalizer.java:193)
> >  
> > at 
> > org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(RegexURLNormalizer.java:200)
> >  
> > at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:319) 
> > at 
> > org.apache.nutch.util.SitemapProcessor$SitemapMapper.filterNormalize(SitemapProcessor.java:176)
> >  
> > at 
> > org.apache.nutch.util.SitemapProcessor$SitemapMapper.generateSitemapUrlDatum(SitemapProcessor.java:225)
> >  
> > at 
> > org.apache.nutch.util.SitemapProcessor$SitemapMapper.generateSitemapUrlDatum(SitemapProcessor.java:264)
> >  
> > at 
> > org.apache.nutch.util.SitemapProcessor$SitemapMapper.map(SitemapProcessor.java:154)
> >  
> > at 
> > org.apache.nutch.util.SitemapProcessor$SitemapMapper.map(SitemapProcessor.java:95)
> >  
> > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) 
> > at 
> > org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:273)
> > 
> > "SpillThread" #34 daemon prio=5 os_prio=0 tid=0x7fe2ada12000 nid=0x6c2f 
> > waiting on condition [0x7fe28d2ad000]
> >    java.lang.Thread.State: WAITING (parking) 
> > at sun.misc.Unsafe.park(Native Method) 
> > - parking to wait for  <0xede6dc80> (a 
> > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) 
> > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 
> > at 
> > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> >  
> > at 
> > org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1530)
> > 
> > "org.apache.hadoop.hdfs.PeerCache@1fc0053e" #33 daemon prio=5 os_prio=0 
> > tid=0x7fe2ad7fe000 nid=0x6be7 waiting on condition [0x7fe28d3ae000]
> >    java.lang.Thread.State: TIMED_WAITING (sleeping) 
> > at java.lang.Thread.sleep(Native Method) 
> > at org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:253) 
> > at org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:46) 
> > at org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:124) 
> > at java.lang.Thread.run(Thread.java:748)
> > 
> > "communication thread" #28 daemon prio=5 os_prio=0 tid=0x7fe2ad975800 
> > nid=0x6b9e in Object.wait() [0x7fe28d8b1000]
> >    java.lang.Thread.State: TIMED_WAITING (on object monitor) 
> > at java.lang.Object.wait(Native Method) 
> > at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:799) 
> > - locked <0xede69ae8> (a java.lang.Object) 
> > at java.lang.Thread.run(Thread.java:748)
> > 
> > 

SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
Hello,

We noticed some abnormalities in our crawl cycle caused by a sudden reduction 
of our CrawlDB's size. The SitemapProcessor ran, failed (timed out, see below) 
and left us with a decimated CrawlDB.

This is odd because of:

    } catch (Exception e) {
  if (fs.exists(tempCrawlDb))
    fs.delete(tempCrawlDb, true);

  LockUtil.removeLockFile(fs, lock);
  throw e;
    }

Any ideas?

Thanks,
Markus

Full thread dump OpenJDK 64-Bit Server VM (25.151-b12 mixed mode):

"Thread-52" #74 prio=5 os_prio=0 tid=0x7fe2adc85000 nid=0x6cf8 runnable 
[0x7fe28a86d000]
   java.lang.Thread.State: RUNNABLE 
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3797) 
at java.util.regex.Pattern$Start.match(Pattern.java:3461) 
at java.util.regex.Matcher.search(Matcher.java:1248) 
at java.util.regex.Matcher.find(Matcher.java:637) 
at java.util.regex.Matcher.replaceAll(Matcher.java:951) 
at 
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.regexNormalize(RegexURLNormalizer.java:193)
 
at 
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(RegexURLNormalizer.java:200)
 
at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:319) 
at 
org.apache.nutch.util.SitemapProcessor$SitemapMapper.filterNormalize(SitemapProcessor.java:176)
 
at 
org.apache.nutch.util.SitemapProcessor$SitemapMapper.generateSitemapUrlDatum(SitemapProcessor.java:225)
 
at 
org.apache.nutch.util.SitemapProcessor$SitemapMapper.generateSitemapUrlDatum(SitemapProcessor.java:264)
 
at 
org.apache.nutch.util.SitemapProcessor$SitemapMapper.map(SitemapProcessor.java:154)
 
at 
org.apache.nutch.util.SitemapProcessor$SitemapMapper.map(SitemapProcessor.java:95)
 
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) 
at 
org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:273)

"SpillThread" #34 daemon prio=5 os_prio=0 tid=0x7fe2ada12000 nid=0x6c2f 
waiting on condition [0x7fe28d2ad000]
   java.lang.Thread.State: WAITING (parking) 
at sun.misc.Unsafe.park(Native Method) 
- parking to wait for  <0xede6dc80> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) 
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
 
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1530)

"org.apache.hadoop.hdfs.PeerCache@1fc0053e" #33 daemon prio=5 os_prio=0 
tid=0x7fe2ad7fe000 nid=0x6be7 waiting on condition [0x7fe28d3ae000]
   java.lang.Thread.State: TIMED_WAITING (sleeping) 
at java.lang.Thread.sleep(Native Method) 
at org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:253) 
at org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:46) 
at org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:124) 
at java.lang.Thread.run(Thread.java:748)

"communication thread" #28 daemon prio=5 os_prio=0 tid=0x7fe2ad975800 
nid=0x6b9e in Object.wait() [0x7fe28d8b1000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor) 
at java.lang.Object.wait(Native Method) 
at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:799) 
- locked <0xede69ae8> (a java.lang.Object) 
at java.lang.Thread.run(Thread.java:748)

"client DomainSocketWatcher" #27 daemon prio=5 os_prio=0 tid=0x7fe2ad952000 
nid=0x6b95 runnable [0x7fe28d9b2000]
   java.lang.Thread.State: RUNNABLE 
at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method) 
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(DomainSocketWatcher.java:52)
 
at 
org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:503)
 
at java.lang.Thread.run(Thread.java:748)

"Thread for syncLogs" #26 daemon prio=5 os_prio=0 tid=0x7fe2ad82 
nid=0x6b81 waiting on condition [0x7fe28deb3000]
   java.lang.Thread.State: TIMED_WAITING (parking) 
at sun.misc.Unsafe.park(Native Method) 
- parking to wait for  <0xe7118190> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) 
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) 
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
 
at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
 
at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
 
at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)

"org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner" #24 
daemon prio=5 os_prio=0 tid=0x7fe2ad746800 

RE: [ANNOUNCE] Apache Nutch 1.14 Release

2017-12-25 Thread Markus Jelsma
Thanks Sebastian!

 
 
-Original message-
> From:Sebastian Nagel 
> Sent: Monday 25th December 2017 18:38
> To: user@nutch.apache.org; annou...@apache.org
> Subject: [ANNOUNCE] Apache Nutch 1.14 Release
> 
> Dear Nutch users,
> 
> the Apache Nutch [0] Project Management Committee are pleased to announce
> the immediate release of Apache Nutch v1.14. We advise all current users
> and developers of the 1.X series to upgrade to this release.
> 
> Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
> fine grained configuration, relying on Apache Hadoop™ [1] data structures,
> which are great for batch processing.
> 
> The Nutch DOAP can be found at [2]. An account of the CHANGES in this
> release can be seen in the release report [3].
> 
> As usual in the 1.X series, release artifacts are made available as both
> source and binary and also available within Maven Central [4] as a Maven
> dependency. The release is available from our downloads page [5].
> 
> 
> Thanks,
> Sebastian (on behalf of the Nutch PMC)
> 
> 
> [0] http://nutch.apache.org/
> [1] http://hadoop.apache.org/
> [2] https://svn.apache.org/repos/asf/nutch/cms_site/trunk/content/doap.rdf
> [3] 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680=12340218
> [4] 
> http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.nutch%22%20AND%20a%3A%22nutch%22
> [5] http://nutch.apache.org/downloads.html
> 


RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Markus Jelsma
The DefaultExtractor gives, as I remember, the same results as ArticleExtractor, 
which is fine for contiguous regions of text. CanolaExtractor must be used if 
you expect lots of non-contiguous regions of text. The latter is also more prone 
to picking up the boilerplate text you want to avoid in the first place.

By the way, if you intend to extract CJK websites you need to manually modify 
Boilerpipe to take into account the different character-to-information ratio, 
or try Canola.
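
For completeness, the switch lives in nutch-site.xml; picking Canola would look like this:

    <property>
      <name>tika.extractor.boilerpipe.algorithm</name>
      <value>CanolaExtractor</value>
      <description>Which Boilerpipe algorithm to use. Valid values are:
      DefaultExtractor, ArticleExtractor or CanolaExtractor.</description>
    </property>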
 
-Original message-
> From:Michael Coffey <mcof...@yahoo.com.INVALID>
> Sent: Wednesday 15th November 2017 23:00
> To: user@nutch.apache.org
> Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while 
> crawling
> 
> I found a lot of detail about the boilerpipe algortithm in 
> http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
> 
> 
> Seems like very short paragraphs can be a problem, since one of the primary 
> features used for determining boilerplate is the length of a given text block.
> 
> I would also look into the tika.extractor.boilerpipe.algorithm setting. It 
> can be DefaultExtractor, ArticleExtractor or CanolaExtractor. I don't know 
> what the differences are, but I bet ArticleExtractor (the default algorithm ) 
> inserts the Title.
> 
> 
> 
> 
> From: Markus Jelsma <markus.jel...@openindex.io>
> To: "user@nutch.apache.org" <user@nutch.apache.org> 
> Sent: Wednesday, November 15, 2017 1:38 PM
> Subject: RE: [MASSMAIL]RE: Removing header,Footer and left menus while 
> crawling
> 
> 
> 
> Boilerpipe is a crude tool but cheap and effective enough for many sorts of 
> websites. It does has a problem with pages with little text, just as all 
> extractors have a degree of problems with little text.
> 
> 
> I believe Boilerpipe adds the title hardcoded, or it is TikaParser doing it. 
> I am not sure, but remember you can get rid of it by removing some lines of 
> code. See TikaParser.java, i think it is there.
> 
> 
> Regards,
> 
> Makrus
> 
> 
> > non-open source contribution, you could try our extractor if you want, 
> > there is a (low speed) test available at 
> > https://www.openindex.io/saas/data-extraction/ . It is not free or open 
> > source but available and actively developed, and does much more than just 
> > text extraction.
> 
> 
> 
> 
> -Original message-
> 
> > From:Rushikesh K <rushikeshmod...@gmail.com>
> 
> > Sent: Wednesday 15th November 2017 22:21
> 
> > To: user@nutch.apache.org; eru...@uci.cu
> 
> > Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while 
> > crawling
> 
> > 
> 
> > Hello, 
> 
> > 
> 
> > 
> 
> > Eyeris - Thanks for your response, i was able to make working with tika 
> > boilerpipe but now i have a different problem ,some of the crawled pages 
> > doesnt have the expected data 
> 
> > For some pages it brings back only the Title and skips all the content i am 
> > not sure in what special cases does this do.But in my case i have two 
> > problems now  
> 
> > 1. when my page has a image and 1 or 2 lines of text it doesnt get those 
> > lines of data.(the data is in the  tag) 
> 
> > 2.why is it adding Title to the starting of the content is there a way not 
> > to include that. 
> 
> > 
> 
> > For example see the following image for the first URL it came back with out 
> > any date 
> 
> > 
> 
> > 
> 
> > 
> 
> > On Wed, Nov 15, 2017 at 8:57 AM, Eyeris Rodriguez Rueda <eru...@uci.cu 
> > <mailto:eru...@uci.cu>> wrote:
> 
> > Hello.
> 
> 
> > 
> 
> 
> > I am using tika boilerpipe with good results in aproximately 2000 websites.
> 
> 
> > Rushikesh if tika boilerpipe is not working for you maybe it is because you 
> > don´t are parsing documents with tika. please check this configuration
> 
> 
> > and tell us.
> 
> 
> > 
> 
> 
> > make sure that tika plugin is activated in plugin.included property then 
> > check:
> 
> 
> > 
> 
> 
> > ***
> 
> 
> > Use tika parser instead of parse-html.
> 
> 
> > 
> 
> 
> > parse-plugins.xml
> 
> 
> > 
> 
> 
> > 
> 
> 
> > 
> 
> 
> > 
> 
> 
> > 
> 

RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Markus Jelsma
Boilerpipe is a crude tool but cheap and effective enough for many sorts of 
websites. It does have a problem with pages with little text, but then all 
extractors struggle to some degree with little text.

I believe Boilerpipe adds the title hardcoded, or it is TikaParser doing it. I 
am not sure, but remember you can get rid of it by removing some lines of code. 
See TikaParser.java, I think it is there.

Regards,
Markus

> non-open source contribution, you could try our extractor if you want, there 
> is a (low speed) test available at 
> https://www.openindex.io/saas/data-extraction/ . It is not free or open 
> source but available and actively developed, and does much more than just 
> text extraction.


 
-Original message-
> From:Rushikesh K <rushikeshmod...@gmail.com>
> Sent: Wednesday 15th November 2017 22:21
> To: user@nutch.apache.org; eru...@uci.cu
> Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while 
> crawling
> 
> Hello, 
> 
> 
> Eyeris - Thanks for your response, i was able to make working with tika 
> boilerpipe but now i have a different problem ,some of the crawled pages 
> doesnt have the expected data 
> For some pages it brings back only the Title and skips all the content i am 
> not sure in what special cases does this do.But in my case i have two 
> problems now  
> 1. when my page has a image and 1 or 2 lines of text it doesnt get those 
> lines of data.(the data is in the  tag) 
> 2.why is it adding Title to the starting of the content is there a way not to 
> include that. 
> 
> For example see the following image for the first URL it came back with out 
> any date 
> 
> 
> 
> On Wed, Nov 15, 2017 at 8:57 AM, Eyeris Rodriguez Rueda <eru...@uci.cu 
> <mailto:eru...@uci.cu>> wrote:
> Hello.
 
> 
 
> I am using tika boilerpipe with good results in aproximately 2000 websites.
 
> Rushikesh if tika boilerpipe is not working for you maybe it is because you 
> don´t are parsing documents with tika. please check this configuration
 
> and tell us.
 
> 
 
> make sure that tika plugin is activated in plugin.included property then 
> check:
 
> 
 
> ***
 
> Use tika parser instead of parse-html.
 
> 
 
> parse-plugins.xml
 
> 
 
> 
 
>                 
 
>         
 
> 
 
>         
 
>                 
 
>         
 
> ***
 
> 
 
> ***
 
> nutch-site.xml
 
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
> 
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
 
> 
 
> 
 
> 
 
> 
 
> 
 
> 
 
> 
 
> 
 
> 
 
> 
 
> 
 
> 
 
> 
 
> - Original message -

> From: "Markus Jelsma" <markus.jel...@openindex.io 
> <mailto:markus.jel...@openindex.io>>

> To: user@nutch.apache.org <mailto:user@nutch.apache.org>

> Sent: Tuesday, 14 November 2017 17:40:08

> Subject: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
 
> 
 
> Hello Rushikesh - why is Boilerpipe not working for you. Are you having 
> trouble getting it configured - it is really just setting a boolean value. Or 
> does it work, but not to your satisfaction?
 
> 
 
> The Bayan solution should work, theoretically, but just with a lot of tedious 
> manual per-site configuration.
 
> 
 
> Regards,
 
> Markus
 
> 
 
> -Original message-
 
> > From:Rushikesh K <rushikeshmod...@gmail.com 
> > <mailto:rushikeshmod...@gmail.com>>
 
> > Sent: Tuesday 14th November 2017 23:30
 
> > To: user@nutch.apache.org <mailto:user@nutch.apache.org>
 
> > Cc: Sebastian Nagel <wastl.na...@googlemail.com 
> > <mailto:wastl.na...@googlemail.com>>; betancourt.jo...@gmail.com 
> > <mailto:betancourt.jo...@gmail.com>
 
> > Subject: Re: Removing header,Footer and left menus while crawling
 
> >
 
> > Hello,
 
> >
 
> > *Jorge*
 
> > Thanks for response,Sorry for confusion i am using Nutch 1.13 but also  i
 
> > tried configuring Tika boilerpipe with this version but this doesnt work
 
> > for me.As you suggested to use own parser ,i am not a java developer by
 
> > chance.
 
> > By chance if you or anyone in the community has a working file ,it would be
 
> > gre

RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Markus Jelsma
You could do that, but you would need to fiddle around in TikaParser.java. 
Using TeeContentHandler you can add both the normal ContentHandler, and the 
Boilerpipe version.
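
A self-contained sketch of the idea — this is not the actual TikaParser.java code, just an illustration using handler classes that, as far as I recall, ship with Tika and Boilerpipe; in Nutch you would splice the tee into the handler TikaParser already builds and index mainText as the extra field:

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.html.BoilerpipeContentHandler;
    import org.apache.tika.sax.BodyContentHandler;
    import org.apache.tika.sax.TeeContentHandler;
    import org.xml.sax.ContentHandler;
    import de.l3s.boilerpipe.extractors.ArticleExtractor;
    import java.io.InputStream;

    public class TeeExtraction {
      /** Returns {fullText, boilerpipeText}, both collected in a single parse pass. */
      public static String[] extractBoth(InputStream html) throws Exception {
        BodyContentHandler fullText = new BodyContentHandler(-1);  // -> "content" field
        BodyContentHandler mainText = new BodyContentHandler(-1);  // -> extra "plaintext" field
        ContentHandler tee = new TeeContentHandler(fullText,
            new BoilerpipeContentHandler(mainText, ArticleExtractor.INSTANCE));
        new AutoDetectParser().parse(html, tee, new Metadata(), new ParseContext());
        return new String[] { fullText.toString(), mainText.toString() };
      }
    }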

 
 
-Original message-
> From:Michael Coffey 
> Sent: Wednesday 15th November 2017 20:34
> To: user@nutch.apache.org
> Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while 
> crawling
> 
> I am curious, is it possible to send boilerpipe output to Solr as a separate 
> "plaintext" field, in addition to the usual "content" field (rather than 
> replacing it)? If so, would someone please give an overview of how to do it?
> 


RE: Removing header,Footer and left menus while crawling

2017-11-14 Thread Markus Jelsma
Hello Rushikesh - why is Boilerpipe not working for you? Are you having trouble 
getting it configured - it is really just setting a single configuration value. 
Or does it work, but not to your satisfaction?
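
For reference, the setting in question is:

    <property>
      <name>tika.extractor</name>
      <value>boilerpipe</value>
      <description>Which text extraction algorithm to use. Valid values are: boilerpipe or none.</description>
    </property>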

The Bayan solution should work, theoretically, but just with a lot of tedious 
manual per-site configuration.

Regards,
Markus

-Original message-
> From:Rushikesh K 
> Sent: Tuesday 14th November 2017 23:30
> To: user@nutch.apache.org
> Cc: Sebastian Nagel ; betancourt.jo...@gmail.com
> Subject: Re: Removing header,Footer and left menus while crawling
> 
> Hello,
> 
> *Jorge*
> Thanks for response,Sorry for confusion i am using Nutch 1.13 but also  i
> tried configuring Tika boilerpipe with this version but this doesn't work
> for me.As you suggested to use own parser ,i am not a java developer by
> chance.
> By chance if you or anyone in the community has a working file ,it would be
> great if you can share it because there are many people facing with this
> issue (i came to know this when i googled).
> 
> Mark Vega
> we also tried Bayan Group extractor plugin with Nutch 1.13 but this is also
> not working.we followed the same steps.I can share the changes if you want
> to take a look.
> 
> I appreciate for your quick suggestions!
> 
> Thanks
> Rushikesh
> 
> On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <
> betancourt.jo...@gmail.com> wrote:
> 
> > Hello Rushikesh,
> >
> > Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
> > could use the Tika boilerpipe implementation, on the nutch-site.xml you
> > need to enable this feature with:
> >
> > <property>
> >   <name>tika.extractor</name>
> >   <value>boilerpipe</value>
> >   <description>
> >   Which text extraction algorithm to use. Valid values are: boilerpipe or
> >   none.
> >   </description>
> > </property>
> >
> > And configure the proper extractor with
> > the tika.extractor.boilerpipe.algorithm setting.
> >
> > This is not a perfect solution, but I've used it successfully in the past,
> > of course, your results will depend on how is the structure (markup of the
> > website).
> >
> > Other option could be to implement your own parser if you need to have more
> > control over what to include/exclude from the HTML. You can take a look at
> > this issue https://issues.apache.org/jira/browse/NUTCH-585 which contains
> > some info and old patches.
> >
> > Best Regards,
> > Jorge
> >
> > On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K 
> > wrote:
> >
> > > Hello Sebastian,
> > > we are most excited in using the  Nutch 1.3 (with solr 6.4)  for crawling
> > > our website and we are happy with the search results  but we had
> > > requirement to skip the header footer and left menus and some other parts
> > > of the page, can you please guide how can we exclude those parts.i was
> > > trying various ways on google but nothing works for me.
> > >
> > > Appreciate for your help in Advance!
> > > --
> > > Regards
> > > Rushikesh M
> > > .Net Developer
> > >
> >
> 
> 
> 
> -- 
> Regards
> Rushikesh M
> .Net Developer
> 


FW: Nutch(plugins) and R

2017-11-07 Thread Markus Jelsma
cc list

-Original message-
> From:Markus Jelsma 
> Sent: Wednesday 8th November 2017 0:15
> To: user@nutch.apache.org
> Subject: RE: Nutch(plugins) and R
> 
> Hello - there are no responses, and i don't know what R is, but you are 
> interested in HTML parsing, specifically topic detection, so here are my 
> thoughts.
> 
> We have done topic detection in our custom HTML parser, but in Nutch speak we 
> would do it in a ParseFilter implementation. Get the extracted text - a 
> problem on its own - and feed it into a model builder with annotated data. 
> Use the produced model in the ParseFilter to get the topic.
> 
> In our case we used Mallet, and it produced decent results, although we 
> needed lots of code to facilitate the whole thing and keep stable results 
> between model iterations.
> 
> If R has a Java interface, the ParseFilter is the place to be because there 
> you can feed the text into the model, and get the topic back.
> 
> If R is not Java, i would - and have done - build a simple HTTP daemon around 
> it, and call it over HTTP. It breaks a Hadoop principle of bringing code to 
> data but rules can be broken. On the other hand, topic models are usually 
> very large due to the amount of vocabulary. Not bringing the data with the 
> code each time has its benefits too.
> 
> Regards,
> M.
>  
> -Original message-
> > From:Semyon Semyonov 
> > Sent: Friday 3rd November 2017 16:59
> > To: user@nutch.apache.org
> > Subject: Nutch(plugins) and R
> > 
> > Hello,
> > 
> > I'm looking for a way to use R in Nutch, particularly HTML parser, but 
> > usage in the other parts can be intresting as well. For each parsed 
> > document I would like to run a script and provide the results back to the 
> > system e.g. topic detection of the document.
> >  
> > NB I'm not looking for a way of scaling R to Hadoop or HDFS like Microsoft 
> > R server. This way uses Hadoop as an execution engine after the crawling 
> > process. In other words, first the computationally intensive full crawling 
> > after that another computationally intensive R/Hadoop process.
> >  
> > Instead I'm looking for a way of calling R scripts directly from java code 
> > of map or reduce jobs. Any ideas how to make it? One way to do it is 
> > "Rserve - Binary R server", but I'm looking for alternatives, to compare 
> > efficiency.
> > 
> > Semyon.
> > 


RE: Nutch(plugins) and R

2017-11-07 Thread Markus Jelsma
Hello - there are no responses, and I don't know what R is, but you are 
interested in HTML parsing, specifically topic detection, so here are my 
thoughts.

We have done topic detection in our custom HTML parser, but in Nutch speak we 
would do it in a ParseFilter implementation. Get the extracted text - a problem 
on its own - and feed it into a model builder with annotated data. Use the 
produced model in the ParseFilter to get the topic.

In our case we used Mallet, and it produced decent results, although we needed 
lots of code to facilitate the whole thing and keep stable results between 
model iterations.

If R has a Java interface, the ParseFilter is the place to be because there you 
can feed the text into the model, and get the topic back.

If R is not Java, I would - and have done so before - build a simple HTTP daemon around 
it, and call it over HTTP. It breaks a Hadoop principle of bringing code to 
data but rules can be broken. On the other hand, topic models are usually very 
large due to the amount of vocabulary. Not bringing the data with the code each 
time has its benefits too.

Regards,
M.
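
P.S. To make the ParseFilter route a bit more concrete, a bare-bones sketch — the TopicModel interface here is hypothetical and stands in for whatever wraps your Mallet model or the call to the R daemon; you still need the usual plugin.xml glue to register it as an extension of org.apache.nutch.parse.HtmlParseFilter:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class TopicParseFilter implements HtmlParseFilter {

      /** Hypothetical: anything that maps extracted text to a topic label. */
      public interface TopicModel {
        String predict(String text);
      }

      private Configuration conf;
      private TopicModel model; // wraps Mallet, or an HTTP client talking to R

      @Override
      public ParseResult filter(Content content, ParseResult parseResult,
          HTMLMetaTags metaTags, DocumentFragment doc) {
        Parse parse = parseResult.get(content.getUrl());
        String text = parse.getText();                       // the extracted text
        String topic = model.predict(text);                  // score it against the model
        parse.getData().getParseMeta().set("topic", topic);  // pick it up later in an indexing filter
        return parseResult;
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }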
 
-Original message-
> From:Semyon Semyonov 
> Sent: Friday 3rd November 2017 16:59
> To: user@nutch.apache.org
> Subject: Nutch(plugins) and R
> 
> Hello,
> 
> I'm looking for a way to use R in Nutch, particularly HTML parser, but usage 
> in the other parts can be intresting as well. For each parsed document I 
> would like to run a script and provide the results back to the system e.g. 
> topic detection of the document.
>  
> NB I'm not looking for a way of scaling R to Hadoop or HDFS like Microsoft R 
> server. This way uses Hadoop as an execution engine after the crawling 
> process. In other words, first the computationally intensive full crawling 
> after that another computationally intensive R/Hadoop process.
>  
> Instead I'm looking for a way of calling R scripts directly from java code of 
> map or reduce jobs. Any ideas how to make it? One way to do it is "Rserve - 
> Binary R server", but I'm looking for alternatives, to compare efficiency.
> 
> Semyon.
> 


RE: Incorrect encoding detected

2017-11-02 Thread Markus Jelsma
Hello Sebastian,

I just spotted tika.config.file in the TikaParser, so that's how we can point 
it at a specific config file.

Meanwhile Timothy Allison committed a fix. I will try the nightly build 
tomorrow.

Thanks,
Markus 
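
P.S. I assume it is simply a matter of pointing that property at your own file in nutch-site.xml, something like the below (the exact path resolution is an assumption on my part):

    <property>
      <name>tika.config.file</name>
      <value>tika-config.xml</value>
    </property>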
 
-Original message-
> From:Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Thursday 2nd November 2017 13:32
> To: user@nutch.apache.org
> Subject: Re: Incorrect encoding detected
> 
> I hadn't the time to dig into the problem.
> Neither how to pass a tika-config file nor why
> actually parse-html is detecting the encoding
> although it's also only looking for the first 8192
> characters (see CHUNK_SIZE).
> 
> Just one point: for the MIME detection we also
> pass the Content-Type sent by the web server to Tika.
> Could this also be help to pass it as additional glue?
> In the concrete example the server sends
>   Content-Type: text/html; charset=utf-8
> 
> Sebastian
> 
> On 11/01/2017 07:06 PM, Markus Jelsma wrote:
> > Any ideas?
> > 
> > Thanks!
> > 
> >  
> >  
> > -Original message-
> >> From:Markus Jelsma <markus.jel...@openindex.io>
> >> Sent: Tuesday 31st October 2017 13:14
> >> To: User <user@nutch.apache.org>
> >> Subject: FW: Incorrect encoding detected
> >>
> >> I actually don't know, can we specify a tika-config file in Nutch?
> >>
> >> Thanks,
> >> Markus
> >>  
> >> -Original message-
> >>> From:Allison, Timothy B. <talli...@mitre.org>
> >>> Sent: Tuesday 31st October 2017 13:11
> >>> To: u...@tika.apache.org
> >>> Subject: RE: Incorrect encoding detected
> >>>
> >>> For 1.17, the simplest solution, I think, is to allow users to configure 
> >>> extending the detection limit via our @Field config methods, that is, via 
> >>> tika-config.xml.
> >>>
> >>> To confirm, Nutch will allow users to specify a tika-config file?  Will 
> >>> this work for you and Nutch?
> >>>
> >>> -Original Message-
> >>> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> >>> Sent: Tuesday, October 31, 2017 5:47 AM
> >>> To: u...@tika.apache.org
> >>> Subject: RE: Incorrect encoding detected
> >>>
> >>> Hello Timothy - what would be your preferred solution? Increase detection 
> >>> limit or skip inline styles and possibly other useless head information?
> >>>
> >>> Thanks,
> >>> Markus
> >>>
> >>>  
> >>>  
> >>> -Original message-
> >>>> From:Markus Jelsma <markus.jel...@openindex.io>
> >>>> Sent: Friday 27th October 2017 15:37
> >>>> To: u...@tika.apache.org
> >>>> Subject: RE: Incorrect encoding detected
> >>>>
> >>>> Hi Tim,
> >>>>
> >>>> I have opened TIKA-2485 to track the problem. 
> >>>>
> >>>> Thank you very very much!
> >>>> Markus
> >>>>
> >>>>  
> >>>>  
> >>>> -Original message-
> >>>>> From:Allison, Timothy B. <talli...@mitre.org>
> >>>>> Sent: Friday 27th October 2017 15:33
> >>>>> To: u...@tika.apache.org
> >>>>> Subject: RE: Incorrect encoding detected
> >>>>>
> >>>>> Unfortunately there is no way to do this now.  _I think_ we could make 
> >>>>> this configurable, though, fairly easily.  Please open a ticket.
> >>>>>
> >>>>> The next RC for PDFBox might be out next week, and we'll try to release 
> >>>>> Tika 1.17 shortly after that...so there should be time to get this in.
> >>>>>
> >>>>> -Original Message-
> >>>>> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> >>>>> Sent: Friday, October 27, 2017 9:12 AM
> >>>>> To: u...@tika.apache.org
> >>>>> Subject: RE: Incorrect encoding detected
> >>>>>
> >>>>> Hello Tim,
> >>>>>
> >>>>> Getting rid of script and style contents sounds plausible indeed. But 
> >>>>> to work around the problem for now, can i instruct HTMLEncodingDetector 
> >>>>> from within Nutch to look beyond the limit?
> >>>>>
> >>>>> Thanks!
> >>>>> Markus

RE: sitemap and xml crawl

2017-11-02 Thread Markus Jelsma
Hi - Nutch has a parser for RSS and ATOM on-board:
https://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/feed/FeedParser.html

You must configure it in your plugin.includes to use it.

Regards,
Markus
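
A minimal sketch of the change, assuming the plugin id is "feed": append it to whatever plugin.includes value you already use in nutch-site.xml, e.g.:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|feed|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>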

 
 
-Original message-
> From:Ankit Goel 
> Sent: Thursday 2nd November 2017 10:11
> To: user@nutch.apache.org
> Subject: Re: sitemap and xml crawl
> 
> Hi Yossi,
> I have 2 kinds of rss links which are domain.com/rss/feed.xml 
>  links. One is the standard rss feed that we 
> see, which becomes the starting point for crawling further as we can pull 
> links from it.
> 
> 
> 
> 
> 
> 
> 
> article url
> 
>  date 
> 
> 
> 
> 
> 
> 
> 
> 
> …
> 
> 
> The other one also includes the content within the xml itself, so it doesn’t 
> need further crawling.
> I have standalone xml parsers in java that I can use directly, but obviously, 
> crawling is an important part, because it documents all the links traversed 
> so far.
> 
> What would you advice?
> 
> Regards,
> Ankit Goel
> 
> > On 02-Nov-2017, at 2:04 PM, Yossi Tamari  wrote:
> > 
> > Hi Ankit,
> > 
> > If you are looking for a Sitemap parser, I would suggest moving to 1.14
> > (trunk). I've been using it, and it is probably in better shape than 1.13.
> > If you need to parse your own format, the answer depends on the details. Do
> > you need to crawl pages in this format where each page contains links in XML
> > that you need to crawl? Or is this more like Sitemap where the XML is just
> > the  initial starting point? 
> > In the second case, maybe just write something outside of Nutch that will
> > parse the XML and produce a seed file?
> > In the first case, the link you sent is not relevant. You need to implement
> > a
> > http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/Parser.h
> > tml. I haven't done that myself. My suggestion is that you take a look at
> > the built-in parser at
> > https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/o
> > rg/apache/nutch/parse/html/HtmlParser.java. Google found this article on
> > developing a custom parser, which might be a good starting point:
> > http://www.treselle.com/blog/apache-nutch-with-custom-parser/.
> > 
> > Yossi.
> > 
> > 
> >> -Original Message-
> >> From: Ankit Goel [mailto:ankitgoel2...@gmail.com]
> >> Sent: 02 November 2017 10:24
> >> To: user@nutch.apache.org
> >> Subject: Re: sitemap and xml crawl
> >> 
> >> Hi Yossi,
> >> So I need to make a custom parser. Where do I start? I found this link
> >> https://wiki.apache.org/nutch/HowToMakeCustomSearch
> >> . Is this the right
> >> place, or should I be looking at creating a plugin page. Any advice would
> > be
> >> helpful.
> >> 
> >> Thank you,
> >> Ankit Goel
> >> 
> >>> On 02-Nov-2017, at 1:14 PM, Yossi Tamari  wrote:
> >>> 
> >>> Hi Ankit,
> >>> 
> >>> According to this: https://issues.apache.org/jira/browse/NUTCH-1465,
> >>> sitemap is a 1.14 feature.
> >>> I just checked, and the command indeed exists in 1.14. I did not test
> >>> that it works.
> >>> 
> >>> In general, Nutch supports crawling anything, but you might need to
> >>> write your own parser for custom protocols.
> >>> 
> >>>   Yossi.
> >>> 
>  -Original Message-
>  From: Ankit Goel [mailto:ankitgoel2...@gmail.com]
>  Sent: 01 November 2017 18:55
>  To: user@nutch.apache.org
>  Subject: sitemap and xml crawl
>  
>  Hi,
>  I need to crawl a xml feed, which includes url, title and content of
>  the
> >>> articles on
>  site.
>  
>  The documentation on the site says that bin/nutch sitemap exists, but
>  on
> >>> my
>  nutch 1.13 sitemap is not a command in bin/nutch. So does nutch
>  support crawling sitemaps? Or xml links.
>  
>  Regards,
>  Ankit Goel
> >>> 
> >>> 
> > 
> > 
> 
> 


RE: Incorrect encoding detected

2017-11-01 Thread Markus Jelsma
Any ideas?

Thanks!

 
 
-Original message-
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Tuesday 31st October 2017 13:14
> To: User <user@nutch.apache.org>
> Subject: FW: Incorrect encoding detected
> 
> I actually don't know, can we specify a tika-config file in Nutch?
> 
> Thanks,
> Markus
>  
> -Original message-
> > From:Allison, Timothy B. <talli...@mitre.org>
> > Sent: Tuesday 31st October 2017 13:11
> > To: u...@tika.apache.org
> > Subject: RE: Incorrect encoding detected
> > 
> > For 1.17, the simplest solution, I think, is to allow users to configure 
> > extending the detection limit via our @Field config methods, that is, via 
> > tika-config.xml.
> > 
> > To confirm, Nutch will allow users to specify a tika-config file?  Will 
> > this work for you and Nutch?
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> > Sent: Tuesday, October 31, 2017 5:47 AM
> > To: u...@tika.apache.org
> > Subject: RE: Incorrect encoding detected
> > 
> > Hello Timothy - what would be your preferred solution? Increase detection 
> > limit or skip inline styles and possibly other useless head information?
> > 
> > Thanks,
> > Markus
> > 
> >  
> >  
> > -Original message-
> > > From:Markus Jelsma <markus.jel...@openindex.io>
> > > Sent: Friday 27th October 2017 15:37
> > > To: u...@tika.apache.org
> > > Subject: RE: Incorrect encoding detected
> > > 
> > > Hi Tim,
> > > 
> > > I have opened TIKA-2485 to track the problem. 
> > > 
> > > Thank you very very much!
> > > Markus
> > > 
> > >  
> > >  
> > > -Original message-
> > > > From:Allison, Timothy B. <talli...@mitre.org>
> > > > Sent: Friday 27th October 2017 15:33
> > > > To: u...@tika.apache.org
> > > > Subject: RE: Incorrect encoding detected
> > > > 
> > > > Unfortunately there is no way to do this now.  _I think_ we could make 
> > > > this configurable, though, fairly easily.  Please open a ticket.
> > > > 
> > > > The next RC for PDFBox might be out next week, and we'll try to release 
> > > > Tika 1.17 shortly after that...so there should be time to get this in.
> > > > 
> > > > -Original Message-
> > > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> > > > Sent: Friday, October 27, 2017 9:12 AM
> > > > To: u...@tika.apache.org
> > > > Subject: RE: Incorrect encoding detected
> > > > 
> > > > Hello Tim,
> > > > 
> > > > Getting rid of script and style contents sounds plausible indeed. But 
> > > > to work around the problem for now, can i instruct HTMLEncodingDetector 
> > > > from within Nutch to look beyond the limit?
> > > > 
> > > > Thanks!
> > > > Markus
> > > > 
> > > >  
> > > >  
> > > > -Original message-
> > > > > From:Allison, Timothy B. <talli...@mitre.org>
> > > > > Sent: Friday 27th October 2017 14:53
> > > > > To: u...@tika.apache.org
> > > > > Subject: RE: Incorrect encoding detected
> > > > > 
> > > > > Hi Markus,
> > > > >   
> > > > > My guess is that the ~32,000 characters of mostly ascii-ish  
> > > > > are what is actually being used for encoding detection.  The 
> > > > > HTMLEncodingDetector only looks in the first 8,192 characters, and 
> > > > > the other encoding detectors have similar (but longer?) restrictions.
> > > > >  
> > > > > At some point, I had a dev version of a stripper that removed 
> > > > > contents of  and  before trying to detect the 
> > > > > encoding[0]...perhaps it is time to resurrect that code and integrate 
> > > > > it?
> > > > > 
> > > > > Or, given that HTML has been, um, blossoming, perhaps, more simply, 
> > > > > we should expand how far we look into a stream for detection?
> > > > > 
> > > > > Cheers,
> > > > > 
> > > > >Tim
> > > > > 
> > > > > [0] https://issues.apache.org/jira/browse/TIKA-2038
> > > > >
> > > > > 
> > > > > -Original Message-
> > > > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> > > > > Sent: Friday, October 27, 2017 8:39 AM
> > > > > To: u...@tika.apache.org
> > > > > Subject: Incorrect encoding detected
> > > > > 
> > > > > Hello,
> > > > > 
> > > > > We have a problem with Tika, encoding and pages on this website: 
> > > > > https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> > > > > 
> > > > > Using Nutch and Tika 1.12, but also using Tika 1.16, we found out 
> > > > > that the regular HTML parser does a fine job, but our TikaParser has 
> > > > > a tough job dealing with this HTML. For some reason Tika thinks 
> > > > > Content-Encoding=windows-1252 is what this webpage says it is, 
> > > > > instead the page identifies itself properly as UTF-8.
> > > > > 
> > > > > Of all websites we index, this is so far the only one giving trouble 
> > > > > indexing accents, getting fÃ¥ instead of a regular få.
> > > > > 
> > > > > Any tips to spare? 
> > > > > 
> > > > > Many many thanks!
> > > > > Markus
> > > > > 
> > > > 
> > > 
> > 
> 


FW: Incorrect encoding detected

2017-10-31 Thread Markus Jelsma
I actually don't know. Can we specify a tika-config file in Nutch?

Thanks,
Markus
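
For reference, the tika-config.xml route Tim suggests below might look roughly
like this sketch. It assumes Tika's <encodingDetectors> config section plus the
configurable markLimit that TIKA-2485 proposes to expose on
HtmlEncodingDetector; the parameter name and the 64 KB value are illustrative
until 1.17 ships:

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
      <encodingDetectors>
        <!-- Assumed parameter: markLimit (pending TIKA-2485); the value is arbitrary -->
        <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
          <params>
            <param name="markLimit" type="int">65536</param>
          </params>
        </encodingDetector>
        <!-- Keep a general-purpose detector as a fallback -->
        <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector"/>
      </encodingDetectors>
    </properties>

Whether Nutch's parse-tika plugin can be pointed at such a file is exactly the
open question above.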
 
-Original message-
> From:Allison, Timothy B. <talli...@mitre.org>
> Sent: Tuesday 31st October 2017 13:11
> To: u...@tika.apache.org
> Subject: RE: Incorrect encoding detected
> 
> For 1.17, the simplest solution, I think, is to let users extend the 
> detection limit via our @Field config methods, that is, via 
> tika-config.xml.
> 
> To confirm, Nutch will allow users to specify a tika-config file?  Will this 
> work for you and Nutch?
> 
> -----Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Tuesday, October 31, 2017 5:47 AM
> To: u...@tika.apache.org
> Subject: RE: Incorrect encoding detected
> 
> Hello Timothy - what would be your preferred solution? Increase detection 
> limit or skip inline styles and possibly other useless head information?
> 
> Thanks,
> Markus
> 
>  
>  
> -Original message-
> > From:Markus Jelsma <markus.jel...@openindex.io>
> > Sent: Friday 27th October 2017 15:37
> > To: u...@tika.apache.org
> > Subject: RE: Incorrect encoding detected
> > 
> > Hi Tim,
> > 
> > I have opened TIKA-2485 to track the problem. 
> > 
> > Thank you very very much!
> > Markus
> > 
> >  
> >  
> > -Original message-
> > > From:Allison, Timothy B. <talli...@mitre.org>
> > > Sent: Friday 27th October 2017 15:33
> > > To: u...@tika.apache.org
> > > Subject: RE: Incorrect encoding detected
> > > 
> > > Unfortunately there is no way to do this now.  _I think_ we could make 
> > > this configurable, though, fairly easily.  Please open a ticket.
> > > 
> > > The next RC for PDFBox might be out next week, and we'll try to release 
> > > Tika 1.17 shortly after that...so there should be time to get this in.
> > > 
> > > -Original Message-
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> > > Sent: Friday, October 27, 2017 9:12 AM
> > > To: u...@tika.apache.org
> > > Subject: RE: Incorrect encoding detected
> > > 
> > > Hello Tim,
> > > 
> > > Getting rid of script and style contents sounds plausible indeed. But to 
> > > work around the problem for now, can I instruct HTMLEncodingDetector from 
> > > within Nutch to look beyond the limit?
> > > 
> > > Thanks!
> > > Markus
> > > 
> > >  
> > >  
> > > -Original message-
> > > > From:Allison, Timothy B. <talli...@mitre.org>
> > > > Sent: Friday 27th October 2017 14:53
> > > > To: u...@tika.apache.org
> > > > Subject: RE: Incorrect encoding detected
> > > > 
> > > > Hi Markus,
> > > >   
> > > > My guess is that the ~32,000 characters of mostly ascii-ish <style> 
> > > > are what is actually being used for encoding detection.  The 
> > > > HTMLEncodingDetector only looks in the first 8,192 characters, and the 
> > > > other encoding detectors have similar (but longer?) restrictions.
> > > >  
> > > > At some point, I had a dev version of a stripper that removed contents 
> > > > of <script> and <style> before trying to detect the 
> > > > encoding[0]...perhaps it is time to resurrect that code and integrate 
> > > > it?
> > > > 
> > > > Or, given that HTML has been, um, blossoming, perhaps, more simply, we 
> > > > should expand how far we look into a stream for detection?
> > > > 
> > > > Cheers,
> > > > 
> > > >Tim
> > > > 
> > > > [0] https://issues.apache.org/jira/browse/TIKA-2038
> > > >
> > > > 
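
The stripper Tim mentions above (TIKA-2038) never shipped as-is, but the idea
is easy to sketch. The outline below is not Tika's code, just a rough
illustration of pre-filtering the head of the document before handing it to
the stock HtmlEncodingDetector:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    import org.apache.tika.detect.EncodingDetector;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.html.HtmlEncodingDetector;

    // Sketch only, not Tika's actual implementation.
    public class ScriptStyleStrippingSketch {

        // Drop script/style bodies from the head bytes, then run the normal
        // HTML meta-charset detection on what is left.
        public static Charset detect(byte[] headBytes) throws IOException {
            // ISO-8859-1 maps every byte to exactly one char, so the round
            // trip below does not corrupt the markup.
            String head = new String(headBytes, StandardCharsets.ISO_8859_1);
            String stripped = head.replaceAll(
                    "(?is)<(script|style)[^>]*>.*?</\\1\\s*>", " ");

            EncodingDetector detector = new HtmlEncodingDetector();
            try (InputStream in = new ByteArrayInputStream(
                    stripped.getBytes(StandardCharsets.ISO_8859_1))) {
                // May return null if no meta charset is found in the window.
                return detector.detect(in, new Metadata());
            }
        }
    }

With a page like the one above, whose head carries tens of kilobytes of inline
CSS, the real meta charset declaration should then fall comfortably inside the
8,192-byte detection window.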
> > > > -Original Message-
> > > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> > > > Sent: Friday, October 27, 2017 8:39 AM
> > > > To: u...@tika.apache.org
> > > > Subject: Incorrect encoding detected
> > > > 
> > > > Hello,
> > > > 
> > > > We have a problem with Tika, encoding and pages on this website: 
> > > > https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> > > > 
> > > > Using Nutch and Tika 1.12, but also using Tika 1.16, we found out that 
> > > > the regular HTML parser does a fine job, but our TikaParser has a tough 
> > > > job dealing with this HTML. For some reason Tika thinks 
> > > > Content-Encoding=windows-1252 is what this webpage says it is, instead 
> > > > the page identifies itself properly as UTF-8.
> > > > 
> > > > Of all websites we index, this is so far the only one giving trouble 
> > > > indexing accents, getting fÃ¥ instead of a regular få.
> > > > 
> > > > Any tips to spare? 
> > > > 
> > > > Many many thanks!
> > > > Markus
> > > > 
> > > 
> > 
> 

