Re: Encoding the content got from Fetcher

2009-11-27 Thread Santiago Pérez
Yes, I tried setting the Latin encoding Windows-1250 in that configuration file, but the value of this property does not affect the encoding of the content (I also tried a nonexistent encoding and the result is the same...) <property> <name>parser.character.encoding.default</name>

Re: Encoding the content got from Fetcher

2009-11-27 Thread Andrzej Bialecki
Santiago Pérez wrote: Yes, I tried setting the Latin encoding Windows-1250 in that configuration file, but the value of this property does not affect the encoding of the content (I also tried a nonexistent encoding and the result is the same...) <property>

Re: Encoding the content got from Fetcher

2009-11-27 Thread Santiago Pérez
I had already tried with:

<property>
  <name>parser.character.encoding.default</name>
  <value>UTF-8</value>
  <description>The character encoding to fall back to when no other information is available</description>
</property>

and System.out.println(content.toString()); is still the HTML code with the
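As a side note on the symptom described above, here is a minimal sketch of how the same bytes read differently depending on the charset used to decode them. It uses only the JDK, not the Nutch API; the class name DecodeSketch and the method decode are hypothetical, and the sample string is an illustration rather than the content from the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeSketch {
    // Decode raw page bytes with an explicitly chosen charset instead of
    // relying on whatever the platform default happens to be.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "Pérez" written with a Unicode escape so the source file encoding
        // does not matter; the bytes below are its UTF-8 form.
        byte[] raw = "P\u00e9rez".getBytes(StandardCharsets.UTF_8);

        // Decoded with the charset it was written in: the text round-trips.
        System.out.println(decode(raw, StandardCharsets.UTF_8));

        // Decoded with a different single-byte charset: the two bytes of the
        // accented letter turn into two separate characters.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
    }
}
```

If the printed HTML shows two odd characters where one accented letter is expected, that pattern usually points to UTF-8 bytes being decoded as a single-byte encoding somewhere between fetching and printing.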

Re: 100 fetches per second?

2009-11-27 Thread Andrzej Bialecki
MilleBii wrote: Interesting updates on the current run of 450K URLs:
+ 30 minutes @ 3 Mbit/s
+ drop to 1 Mbit/s (1/X shape)
+ gradual improvement to 1.5 Mbit/s and steady for 7 hours
+ sudden drop to 0.9 Mbit/s and steady for 4 hours
+ up to 1.7 Mbit/s for 1 hour
+ staircasing down to 0.5 Mbit/s

Re: Nutch indexes less pages, then it fetches

2009-11-27 Thread J. Smith
Does anybody know how to solve this problem?

Re: Nutch indexes less pages, then it fetches

2009-11-27 Thread J. Smith
Yes, please. I'll be very grateful. But I'm also curious why this is happening... Maybe someone can explain? caezar wrote: I've solved this problem by modifying the Nutch code. If this solution is acceptable for you, I can provide the details. J. Smith wrote: Does anybody know how to solve this

Re: 100 fetches per second?

2009-11-27 Thread MilleBii
You mean map/reduce tasks ??? Being in pseudo-distributed / single node I only have two maps during the fetch phase... so it would be back to the URLs distribution. 2009/11/27 Andrzej Bialecki a...@getopt.org MilleBii wrote: Interesting updates on the current run of 450K urls : + 30minutes @

Re: 100 fetches per second?

2009-11-27 Thread Andrzej Bialecki
MilleBii wrote: You mean map/reduce tasks ??? Yes. Being in pseudo-distributed / single node I only have two maps during the fetch phase... so it would be back to the URLs distribution. Well, yes, but my explanation is still valid. Which unfortunately doesn't change the situation. Next

Re: Nutch indexes less pages, then it fetches

2009-11-27 Thread J. Smith
The funny thing is that in my case I don't have any redirects, and yet the status is Status: 1 (db_unfetched) even though the content is fetched and successfully parsed. Anyway, thanks for your solution. caezar wrote: If you read up the thread you'll see that the issue is about pages with

Efficient focused crawling

2009-11-27 Thread Eran Zinman
Hi all, I'm trying to figure out ways to improve the efficiency of focused crawling in Nutch. I'm looking for certain pages inside each domain that contain the content I'm after. I can't know whether a given URL contains what I'm looking for unless I parse it and do some analysis on it. Basically

Re: Efficient focused crawling

2009-11-27 Thread MilleBii
Well, what I have created for my own application is a topical-scoring plugin: 1. first I needed to score the pages after parsing, based on my regular expression 2. then I looked into several options for how to boost the score of those pages... I have only found a way to boost the score of the outlinks of these
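The first step described above (scoring parsed text against a regular expression) could look roughly like the sketch below. This is not MilleBii's plugin and does not use the Nutch scoring API: the class TopicalScoreSketch, the methods countMatches and boost, and the "multiply by one plus the match count" rule are all hypothetical choices made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopicalScoreSketch {
    // Count how many times the topic pattern occurs in the parsed page text.
    public static int countMatches(Pattern topic, String text) {
        Matcher m = topic.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Hypothetical boost rule: scale the base score by (1 + number of matches),
    // so a page with no matches keeps its base score unchanged.
    public static float boost(float baseScore, int matches) {
        return baseScore * (1 + matches);
    }

    public static void main(String[] args) {
        // Example topic pattern and text; both are made up for this sketch.
        Pattern topic = Pattern.compile("(?i)\\bnutch\\b");
        String text = "Nutch crawls; nutch indexes. Lucene searches.";
        int matches = countMatches(topic, text);
        System.out.println(matches + " matches, boosted score " + boost(1.0f, matches));
    }
}
```

How such a score is then propagated to a page or to its outlinks depends on the crawler's scoring hooks, which this sketch deliberately leaves out.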

Re: 100 fetches per second?

2009-11-27 Thread MilleBii
My fetch run is getting to the end now, and I have the following logs towards the end:
2009-11-27 19:07:43,866 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=12
2009-11-27 19:07:44,866 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100,
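For anyone wanting to spot this pattern in a long log, here is a small sketch that pulls the three counters out of a status line like the ones quoted above and flags the case where every active thread is spin-waiting while URLs are still queued. The class FetcherStatusSketch and its methods are hypothetical, the regular expression is written against the line format shown in this message only, and the check is a heuristic: threads can also all be waiting for legitimate reasons such as politeness delays.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetcherStatusSketch {
    // Matches the counters in a status line of the form quoted in the message.
    private static final Pattern STATUS = Pattern.compile(
        "activeThreads=(\\d+), spinWaiting=(\\d+), fetchQueues\\.totalSize=(\\d+)");

    // Returns {activeThreads, spinWaiting, totalSize}, or null when the line
    // does not contain all three counters (for example, a truncated line).
    public static int[] parse(String line) {
        Matcher m = STATUS.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3))
        };
    }

    // Heuristic: all active threads are spin-waiting although the queue is not empty.
    public static boolean looksStalled(String line) {
        int[] s = parse(line);
        return s != null && s[0] > 0 && s[0] == s[1] && s[2] > 0;
    }

    public static void main(String[] args) {
        String line = "2009-11-27 19:07:43,866 INFO fetcher.Fetcher - "
            + "-activeThreads=100, spinWaiting=100, fetchQueues.totalSize=12";
        System.out.println(looksStalled(line));
    }
}
```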

RE: recrawl.sh stopped at depth 7/10 without error

2009-11-27 Thread BELLINI ADAM
Hi, this is the main loop of my recrawl.sh:

do
  echo --- Beginning crawl at depth `expr $i + 1` of $depth ---
  $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo runbot: Stopping at depth $depth. No more URLs to

Re: 100 fetches per second?

2009-11-27 Thread Julien Nioche
There is a JIRA issue and a discussion on the mailing list about this. It is a synchronisation problem which has already been reported and patched, but the patch is not yet committed. See https://issues.apache.org/jira/browse/NUTCH-719 J. 2009/11/27 MilleBii mille...@gmail.com My fetch run is getting to the end now I

Re: 100 fetches per second?

2009-11-27 Thread MilleBii
I already applied that patch, which is actually 721; I was part of that discussion at the time. The difference now is that I have moved to a Linux box running pseudo-distributed Hadoop, and I also took a later Nutch snapshot. By the way, I could not apply the Time-Bomb 770 patch; the patch command gives me errors. I