Hi,
we have kind of a similar case and we perform the following:
1. put all URLs you want to recrawl in the regex-urlfilter.txt
2. perform a bin/nutch mergedb with the -filter param to strip those URLs from
the crawl-db (see the sketch below) *
3. put the URLs from 1 into a seed file
4. remove the URLs from 1 from the regex-urlfilter.txt
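A minimal command-line sketch of steps 2 and 3 (the crawldb and seed paths are illustrative assumptions, not taken from the original mail):

  # mergedb -filter drops every URL that the current regex-urlfilter.txt
  # rejects, leaving a filtered copy of the crawldb
  bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter

  # after step 4, inject the seed file so the URLs are recrawled
  bin/nutch inject crawl/crawldb_filtered seeds/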
Is this normal?
Regards
Thomas
From: Hannes Carl Meyer [mailto:hannesc...@googlemail.com]
Sent: Tuesday, 30 August 2011 09:25
To: user@nutch.apache.org
Cc: Eggebrecht, Thomas (GfK Marktforschung)
Subject: Re: Parameter tuning or how to accelerate fetching
Hi Thomas,
you could set generate.max.per.host to a reasonable size to prevent this!
In the default configuration it is set to -1, which means unlimited.
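A minimal sketch of the override in conf/nutch-site.xml (the value 100 is only an illustrative assumption, pick what fits your crawl):

  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
    <description>Maximum number of URLs per host in a single
    fetchlist; -1 means no limit.</description>
  </property>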
BR
Hannes
---
Hannes Carl Meyer
www.informera.de
On Fri, Jul 8, 2011 at 2:53 PM, Eggebrecht, Thomas (GfK Marktforschung)
thomas.eggebre...@gfk.com
Sorry, but I don't understand what you are talking about... In my seed
list I only have http://elcorreo.com and I have the filter for it.
Regards
Adelaida.
2011/6/13 Hannes Carl Meyer hannesc...@googlemail.com
Hi,
is this your only filter? You should have at least a filter for the seed
page you are accessing in the very first step!
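For example, with http://elcorreo.com as the seed (as mentioned above), regex-urlfilter.txt would need an accept rule roughly like this sketch:

  # accept everything under the seed host
  +^http://([a-z0-9-]+\.)*elcorreo\.com/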
Regards
Hannes
On Mon, Jun 13, 2011 at 1:10 PM, Adelaida Lejarazu alejar...@gmail.com wrote:
Hello,
I'm new to Nutch and I'm doing some tests to see how it works. I
I would not recommend using the Crawl command for large crawls, because:
1. Tuning Hadoop is not possible at all
2. Incremental crawling is also pretty difficult because you can't control
the different processes/steps (see the sketch below)
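The usual alternative is to script the individual steps yourself; a rough sketch of one Nutch 1.x crawl cycle (directory names and -topN are illustrative):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  # pick up the segment that generate just created
  SEGMENT=$(ls -d crawl/segments/* | tail -1)
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT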
On Sat, Feb 26, 2011 at 9:58 AM, firespin firespin...@gmail.com wrote:
I
Hi,
did you solve the problem yourself?
I'm running into the same issue...
Maybe someone else could help here?
Regards
Hannes
On Wed, Oct 27, 2010 at 12:28 PM, Davide Cavalaglio
davide.cavalag...@desktopsrl.com wrote:
Hi,
I have a problem with the If-Modified-Since option in Nutch.
I want
Hi,
for example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder
-nofetch -nogenerate -noparse -noparsedata -noparsetext
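(Each -no* flag suppresses one part of the segment in the dump, so the command above should leave essentially only the raw fetched content in dump_folder.)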
Regards
Hannes
On Fri, Dec 17, 2010 at 8:32 PM, Paul Lypaczewski
paullypaczew...@yahoo.ca wrote:
Thanks, Markus. I will check it out.
--- On Fri, 12/17/10,
I'm going to give it a try and configure a pseudo-distributed env on our
testing machine (which also has 16 Cores and 24 GB RAM).
I'll get back here after testing it!
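For reference, a pseudo-distributed setup on one box usually only needs the classic overrides in the Hadoop configs of the era Nutch 1.x ships against; the host/port values below are the conventional defaults, adjust as needed:

  <!-- core-site.xml -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>

  <!-- mapred-site.xml -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>

  <!-- hdfs-site.xml: a single node can only hold one replica -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>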
On Sat, Nov 20, 2010 at 10:53 PM, Ken Krugler
kkrugler_li...@transpac.com wrote:
[snip]
During fetching, multiple threads are
machine)
Regards,
Hannes
On Sat, Nov 20, 2010 at 7:06 PM, Ken Krugler kkrugler_li...@transpac.com wrote:
On Nov 20, 2010, at 7:51am, Hannes Carl Meyer wrote:
Thank you for sharing your experiences!
In my case the web servers are pretty stable and we are allowed to perform
intensive
Hi,
I'm using nutch 0.9 to crawl about 400 hosts with an average of 600 pages each.
That makes a volume of 240,000 fetched pages - I want to get all of them.
Can anyone give me advice on the right threads/delay/per-host configuration
in this environment?
My current conf:
<property>
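The knobs that usually matter for this kind of crawl are these nutch-site.xml properties; the values here are illustrative assumptions, not the poster's actual settings:

  <property>
    <name>fetcher.threads.fetch</name>
    <value>40</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>2.0</value>
  </property>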
Hi,
check whether your Working directory (Run - Run Configurations - Tab
Arguments - Working Directory) points to the Nutch base directory (where
your conf/nutch-site.xml is located).
Regards
Hannes
On Mon, Oct 4, 2010 at 11:02 AM, Marseld Dedgjonaj
marseld.dedgjo...@ikubinfo.com wrote:
Hello,
Hi J,
you should check logs/hadoop.log for further error messages!
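For example:

  tail -n 100 logs/hadoop.log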
Best
Hannes
On Mon, Aug 16, 2010 at 6:37 PM, Jay sa...@blastsms.com wrote:
After doing all the steps again, I am now getting this.
Nutch 1.2
Getting closer! (I think)
crawl started in: crawl
rootUrlDir = urls
threads
Hi,
I'm currently using Nutch 1.0 to perform intranet crawls and index html and
pdf contents.
Unfortunately we are using Java 1.5 in our production env, which means I have
to move to Nutch 0.9, since 1.1 and 1.0 require Java 6.
Are there big differences between those versions which may impact
Chris
On 7/16/10 9:10 AM, Hannes Carl Meyer hannesc...@googlemail.com wrote:
Hi,
I'm currently using Nutch 1.0 to perform intranet crawls and index html and
pdf contents.
Unfortunately we are using Java 1.5 in our production env, which means I have
to move to Nutch 0.9 since 1.1 and 1.0
June 2010 15:30, Hannes Carl Meyer hannesc...@googlemail.com wrote:
Yep, did not work, although it displays: URL normalizing: true in the
crawl process...
Also bin/nutch plugin ... does not work!
On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
tried ant
file
On 24 June 2010 16:11, Hannes Carl Meyer hannesc...@googlemail.com wrote:
Nope, that changes nothing. Just checked out my log file:
2010-06-24 17:13:40,410 INFO plugin.PluginRepository - Plugins: looking
in: /~/apache-nutch-1.1-bin/plugins
2010-06-24 17:13:41,439 INFO
</property>
If I remove the newline before </value>, it is OK.
regards
reinhard
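In other words, the whole value has to stay on one line; a sketch of the shape that works (the property name here is just an illustrative guess, not necessarily the one from Reinhard's config):

  <property>
    <name>urlnormalizer.regex.file</name>
    <value>regex-normalize.xml</value>
  </property>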
Hannes Carl Meyer wrote:
Just tried it in nutch-1.0 with the same kind of behavior:
hc.me...@server01:~/nutch-1.0 ./bin/nutch plugin urlnormalizer-regex
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
Hi,
is it possible to run Nutch in a single virtual machine for intranet
crawling? Even inside a Java Application Server?
Normally I'm using custom Nutch crawl scripts started from the OS command
line by cron. In a new project it is required to use a running Virtual
Machine for deployment and