Re: recrawl a single page explicit

2012-04-02 Thread Hannes Carl Meyer
Hi, we have a similar case and do the following: 1. Put all URLs you want to recrawl in regex-urlfilter.txt 2. Run bin/nutch mergedb with the -filter parameter to strip those URLs from the crawl-db * 3. Put the URLs from 1 into a seed file 4. Remove the URLs from 1 from the
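Steps 1 to 3 above could look roughly like the sketch below. It assumes that "putting the URLs in regex-urlfilter.txt" means adding one exclusion rule (a line starting with "-") per URL, so that mergedb -filter drops exactly those entries. The helper name url_to_exclude_rules and the paths in the comments are made up for illustration; step 4 is cut off in the post and is not covered.

```shell
#!/bin/sh
# Hypothetical helper: read URLs (one per line) on stdin and print one
# exclusion rule per URL in regex-urlfilter.txt syntax ("-" + regex).
# Regex metacharacters in the URL are backslash-escaped so that each
# rule matches only that exact URL.
url_to_exclude_rules() {
  sed -e 's/[][\.^$*+?(){}|]/\\&/g' -e 's/^/-^/' -e 's/$/$/'
}

# Example with a placeholder URL:
printf '%s\n' 'http://example.com/a.html?x=1' | url_to_exclude_rules

# The remaining steps, shown as comments because they need a Nutch
# installation (paths are placeholders):
#   1. add the generated rules to conf/regex-urlfilter.txt, above any
#      catch-all "+." rule, since the first matching rule wins
#   2. bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter
#   3. bin/nutch inject crawl/crawldb_filtered recrawl_seeds
```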

Re: Parameter tuning or how to accelerate fetching

2011-08-30 Thread Hannes Carl Meyer
. Is this normal? Regards Thomas From: Hannes Carl Meyer [mailto:hannesc...@googlemail.com] Sent: Tuesday, 30 August 2011 09:25 To: user@nutch.apache.org Cc: Eggebrecht, Thomas (GfK Marktforschung) Subject: Re: Parameter tuning or how to accelerate fetching Hi Thomas

Re: Partitioning selected urls for politeness and scoring

2011-07-08 Thread Hannes Carl Meyer
Hi, you could set generate.max.per.host to a reasonable size to prevent this! In the default configuration this is set to -1, which means unlimited. BR Hannes --- Hannes Carl Meyer www.informera.de On Fri, Jul 8, 2011 at 2:53 PM, Eggebrecht, Thomas (GfK Marktforschung) thomas.eggebre...@gfk.com

Fwd: No Urls to fetch

2011-06-14 Thread Hannes Carl Meyer
first step! Sorry, but I don't understand what you are talking about... In my seed list I only have http://elcorreo.com and I have a filter for it. Regards Adelaida. 2011/6/13 Hannes Carl Meyer hannesc...@googlemail.com Hi, is this your only filter? You should have at least a filter

Re: No Urls to fetch

2011-06-13 Thread Hannes Carl Meyer
Hi, is this your only filter? You should have at least a filter for the seed page you are accessing in the very first step! Regards Hannes On Mon, Jun 13, 2011 at 1:10 PM, Adelaida Lejarazu alejar...@gmail.com wrote: Hello, I'm new to Nutch and I'm doing some tests to see how it works. I

Re: Can I use the Nutch crawl command for large crawls?

2011-02-26 Thread Hannes Carl Meyer
I would not recommend using the Crawl command for large crawls, because: 1. Tuning Hadoop is not possible at all 2. Incremental crawling is also pretty difficult because you can't control the different processes/steps On Sat, Feb 26, 2011 at 9:58 AM, firespin firespin...@gmail.com wrote: I
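For contrast with the all-in-one Crawl command, one crawl cycle run as separate Nutch 1.x steps might look like the sketch below. The paths, the -topN value, and the DRY_RUN switch are assumptions for illustration, not taken from the thread. With DRY_RUN=1 (the default here) the script only prints the commands.

```shell
#!/bin/sh
# Sketch of one crawl cycle as individual steps. All paths and the
# -topN value are placeholder assumptions.
NUTCH="bin/nutch"
CRAWLDB="crawl/crawldb"
SEGMENTS="crawl/segments"
SEEDS="urls"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

crawl_cycle() {
  run "$NUTCH" inject "$CRAWLDB" "$SEEDS"
  run "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN 1000
  if [ "$DRY_RUN" = "1" ]; then
    # No segment exists in a dry run, so use a placeholder name.
    segment="$SEGMENTS/SEGMENT_TIMESTAMP"
  else
    # generate creates a timestamped directory; pick the newest one.
    segment=$(ls -d "$SEGMENTS"/* | tail -1)
  fi
  run "$NUTCH" fetch "$segment"
  # A separate parse step applies only when the fetcher is not
  # configured to parse while fetching (property fetcher.parse).
  run "$NUTCH" parse "$segment"
  run "$NUTCH" updatedb "$CRAWLDB" "$segment"
}

crawl_cycle
```

Because each step is its own command, a cycle can be repeated, skipped, or rerun individually, which is the kind of control the post says the Crawl command does not give.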

Re: If-Modified-Since header with Nutch

2011-01-06 Thread Hannes Carl Meyer
Hi, did you solve the problem yourself? I'm running into the same issue... Maybe someone else could help here? Regards Hannes On Wed, Oct 27, 2010 at 12:28 PM, Davide Cavalaglio davide.cavalag...@desktopsrl.com wrote: Hi, I have a problem with the If-Modified-Since option in Nutch. I want

Re: How to dump the crawled Html pages?

2010-12-17 Thread Hannes Carl Meyer
Hi, for example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext Regards Hannes On Fri, Dec 17, 2010 at 8:32 PM, Paul Lypaczewski paullypaczew...@yahoo.ca wrote: Thanks, Markus. I will check it out. --- On Fri, 12/17/10,
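The readseg call above can be wrapped in a small script, sketched here. The function name dump_cmd is hypothetical, and XXX is the segment placeholder from the post, not a real segment name. The script only prints the command so it can be checked before running it against a real segment.

```shell
#!/bin/sh
# Sketch: build the readseg dump command for one segment.
# The -no* flags switch off every section except the raw fetched
# content, which is why -nocontent is deliberately absent.
NUTCH="bin/nutch"

dump_cmd() {
  # $1 = segment directory, $2 = output directory
  echo "$NUTCH" readseg -dump "$1" "$2" \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
}

# XXX is a placeholder segment name, as in the post.
dump_cmd crawl/segments/XXX dump_folder
```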

Re: Performance Configuration on Focused Web Crawl

2010-11-21 Thread Hannes Carl Meyer
I'm going to give it a try and configure a pseudo-distributed environment on our testing machine (which also has 16 cores and 24 GB RAM). I'll get back here after testing it! On Sat, Nov 20, 2010 at 10:53 PM, Ken Krugler kkrugler_li...@transpac.com wrote: [snip] During fetching, multiple threads are

Re: Performance Configuration on Focused Web Crawl

2010-11-20 Thread Hannes Carl Meyer
machine) Regards, Hannes On Sat, Nov 20, 2010 at 7:06 PM, Ken Krugler kkrugler_li...@transpac.com wrote: On Nov 20, 2010, at 7:51am, Hannes Carl Meyer wrote: Thank you for sharing your experiences! In my case the web servers are pretty stable and we are allowed to perform intensive

Performance Configuration on Focused Web Crawl

2010-11-18 Thread Hannes Carl Meyer
Hi, I'm using Nutch 0.9 to crawl about 400 hosts with an average of 600 pages each. That makes a volume of 240,000 fetched pages, and I want to get all of them. Can anyone give me advice on the right threads/delay/per-host configuration in this environment? My current conf: property

Re: Run crawl from java code

2010-10-04 Thread Hannes Carl Meyer
Hi, check whether your working directory (Run - Run Configurations - Tab Arguments - Working Directory) points to the Nutch base directory (where your conf/nutch-site.xml is located). Regards Hannes On Mon, Oct 4, 2010 at 11:02 AM, Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com wrote: Hello,

Re: Nutch w Eclipse

2010-08-16 Thread Hannes Carl Meyer
Hi J, you should check logs/hadoop.log for further error messages! Bests Hannes On Mon, Aug 16, 2010 at 6:37 PM, Jay sa...@blastsms.com wrote: After doing all the steps again, I am now getting this. Nutch 1.2 Getting closer! (I think) crawl started in: crawl rootUrlDir = urls threads

Differences between 0.9 / 1.0

2010-07-16 Thread Hannes Carl Meyer
Hi, I'm currently using Nutch 1.0 to perform an intranet crawl and index HTML and PDF content. Unfortunately we are using Java 1.5 in our production environment, which means I have to move to Nutch 0.9, since 1.1 and 1.0 require Java 6. Are there big differences between those versions which may impact

Re: Differences between 0.9 / 1.0

2010-07-16 Thread Hannes Carl Meyer
, Chris On 7/16/10 9:10 AM, Hannes Carl Meyer hannesc...@googlemail.com wrote: Hi, I'm currently using Nutch 1.0 to perform an intranet crawl and index HTML and PDF content. Unfortunately we are using Java 1.5 in our production environment, which means I have to move to Nutch 0.9, since 1.1 and 1.0

Re: Question on normalizing urls / RegexURLNormalizer

2010-06-24 Thread Hannes Carl Meyer
June 2010 15:30, Hannes Carl Meyer hannesc...@googlemail.com wrote: Yep, it did not work, although it displays "URL normalizing: true" in the crawl process... Also bin/nutch plugin ... does not work! On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: tried ant

Re: Question on normalizing urls / RegexURLNormalizer

2010-06-24 Thread Hannes Carl Meyer
file On 24 June 2010 16:11, Hannes Carl Meyer hannesc...@googlemail.com wrote: Nope, that changes nothing. Just checked out my log file: 2010-06-24 17:13:40,410 INFO plugin.PluginRepository - Plugins: looking in: /~/apache-nutch-1.1-bin/plugins 2010-06-24 17:13:41,439 INFO

Re: Question on normalizing urls / RegexURLNormalizer

2010-06-24 Thread Hannes Carl Meyer
</property> If I remove the newline before </value>, it is OK. Regards, reinhard Hannes Carl Meyer wrote: Just tried it in nutch-1.0 with the same kind of behavior: hc.me...@server01:~/nutch-1.0 ./bin/nutch plugin urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer

Running Nutch in a single VM

2010-05-21 Thread Hannes Carl Meyer
Hi, is it possible to run Nutch in a single virtual machine for intranet crawling? Even inside a Java application server? Normally I'm using custom Nutch crawl scripts and start them from the OS command line by cron. In a new project it is required to use a running virtual machine for deployment and