Speed of linkDB Merge

2017-04-02 Thread Michael Coffey
In my situation, I find that linkdb merge takes much more time than fetch and 
parse combined, even though fetch is fully polite.

What is the standard advice for making linkdb-merge go faster?

I call invertlinks like this:
__bin_nutch invertlinks "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT

invertlinks seems to call mergelinkdb automatically.

I currently have about 3-6 slaves for fetching, though that will increase soon. 
I am currently using small segment sizes (3000 urls) but I can increase that if 
it would help.
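
If, as observed above, each invertlinks call ends with a merge into the existing linkdb, one option that may be worth trying is to pass several segments to a single call, so that the merge runs once per batch rather than once per small segment. The following is only a sketch under that assumption: invert_batch, NUTCH_BIN and DRY_RUN are hypothetical names made up for this example, and by default the function just prints the command it would run.

```shell
# Sketch: build one invertlinks call covering several segments.
# NUTCH_BIN, DRY_RUN and invert_batch are assumptions for illustration,
# not part of Nutch itself.
invert_batch() {
  crawl_path="$1"
  shift
  # Turn each remaining argument (a segment name) into its full path.
  count=$#
  while [ "$count" -gt 0 ]; do
    set -- "$@" "$crawl_path/segments/$1"
    shift
    count=$((count - 1))
  done
  # Prepend the command and the linkdb path to the segment paths.
  set -- "${NUTCH_BIN:-bin/nutch}" invertlinks "$crawl_path/linkdb" "$@"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run (the default): print the command instead of executing it.
    echo "$*"
  else
    "$@"
  fi
}
```

For example, invert_batch "$CRAWL_PATH" SEG1 SEG2 SEG3 prints a single invertlinks command for three segments; setting DRY_RUN=0 would execute it. The invertlinks usage message also lists a -dir option that takes the whole segments directory, which may be simpler when every segment should be included.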

I have the following properties that may be relevant.

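<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>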

The following props are left as default in nutch-default.xml

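<property>
  <name>db.update.max.inlinks</name>
  <value>1</value>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>
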

[ANNOUNCE] Apache Nutch 1.13 Release

2017-04-02 Thread lewis john mcgibbney
Hello Folks,

The Apache Nutch [0] Project Management Committee are pleased to announce
the immediate release of Apache Nutch v1.13. We advise all current users
and developers of the 1.X series to upgrade to this release.

Nutch is a well-matured, production-ready Web crawler. Nutch 1.x enables
fine-grained configuration, relying on Apache Hadoop™ data structures, which
are great for batch processing.

The Nutch DOAP can be found at [1]. An account of the CHANGES in this
release can be seen in the release report.

As usual in the 1.X series, release artifacts are made available as both
source and binary, and are also available within Maven Central as a Maven
dependency. The release is available from our DOWNLOADS PAGE.
Thank you
Lewis
(On behalf of the Nutch PMC)

[0] http://nutch.apache.org
[1] https://svn.apache.org/repos/asf/nutch/cms_site/trunk/content/doap.rdf
-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


[RESULT] WAS Re: [VOTE] Release Apache Nutch 1.13 RC#1

2017-04-02 Thread lewis john mcgibbney
Hi Folks,
Thank you to everyone who was able to review the RC and VOTE, greatly
appreciated.
72 hours have come and gone; please see below for the RESULTS.

[9] +1 Release this package as Apache Nutch 1.13.
Lewis John McGibbney *
Julien Nioche *
Kevin Ratnasekera
Chris A. Mattmann *
Furkan KAMACI *
Matei Miroslav
Markus Jelsma *
Jorge Luis Betancourt González *
Sebastian Nagel *

[0] -1 Do not release this package because…

* Nutch PMC

The VOTE passes. Thank you to everyone able to contribute towards the Nutch
1.13 release.
Lewis

On Tue, Mar 28, 2017 at 9:20 PM, lewis john mcgibbney wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.13 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/nutch/1.13/
>
> The release candidate is a zip and tar.gz archive of the binary and
> sources in:
> https://github.com/apache/nutch/tree/release-1.13
>
> The SHA1 checksum of the archive is
> bd0da3569aa14105799ed39204d4f0a31c77b42c
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachenutch-1013
>
> We addressed 29 Issues - https://s.apache.org/wq3x
>
> Please vote on releasing this package as Apache Nutch 1.13.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.13.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Lewis
> (On behalf of the Nutch PMC)
>
> P.S. Here is my +1.
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>
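
For anyone checking a downloaded archive against the SHA1 quoted above, a minimal sketch follows. It assumes the sha1sum and awk utilities are available; verify_sha1 is a hypothetical helper name, and the archive file name is not given in this thread, so it must be supplied by the caller.

```shell
# Sketch: compare a file's SHA1 digest with an expected hex string.
# verify_sha1 is a made-up helper for illustration; sha1sum is assumed present.
verify_sha1() {
  file="$1"
  expected="$2"
  # sha1sum prints "<digest>  <filename>"; keep only the digest.
  actual="$(sha1sum "$file" | awk '{print $1}')"
  # Succeeds (exit status 0) only when the digests match.
  [ "$actual" = "$expected" ]
}
```

It would be called with the path to the downloaded archive and the checksum from the vote mail, for example: verify_sha1 path/to/archive bd0da3569aa14105799ed39204d4f0a31c77b42c && echo "checksum OK".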
