Yes, I tried setting the Latin encoding Windows-1250 in that configuration
file, but the value of this property does not affect the encoding of the
content (I also tried with a nonexistent encoding and the result is the
same...)
<property>
  <name>parser.character.encoding.default</name>
I had already tried with:

<property>
  <name>parser.character.encoding.default</name>
  <value>UTF-8</value>
  <description>The character encoding to fall back to when no other
  information is available</description>
</property>
and the output of System.out.println(content.toString()); is still the HTML
code with the
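For what it's worth, a standalone check like the following (an illustrative
sketch, not Nutch's actual decoding path) shows why the fallback charset
matters once the raw fetched bytes are decoded:

import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        // "Pérez" as a server would send it in Windows-1250 (0xE9 = 'é').
        byte[] raw = { 'P', (byte) 0xE9, 'r', 'e', 'z' };

        // Decoding with the right fallback vs. a wrong one shows whether
        // the configured property is actually honoured downstream.
        System.out.println(new String(raw, Charset.forName("windows-1250"))); // Pérez
        System.out.println(new String(raw, Charset.forName("UTF-8")));        // P�rez
    }
}

If the second, garbled form is what shows up in the parsed content, the
fallback property is not being applied at the point where the bytes are
turned into a String.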
MilleBii wrote:
Interesting updates on the current run of 450K URLs:
+ 30 minutes @ 3 Mbit/s
+ drop to 1 Mbit/s (1/X shape)
+ gradual improvement to 1.5 Mbit/s, then steady for 7 hours
+ sudden drop to 0.9 Mbit/s, steady for 4 hours
+ up to 1.7 Mbit/s for 1 hour
+ staircasing down to 0.5 Mbit/s
Does anybody know how to solve this problem?
Yes, please. I'd be very grateful. But I'm also curious why this is
happening... Maybe someone can explain?
caezar wrote:
I've solved this problem by modifying the Nutch code. If this solution is
acceptable to you, I can provide the details.
J. Smith wrote:
Does anybody know how to solve this
You mean map/reduce tasks ???
Being in pseudo-distributed / single-node mode I only have two maps during
the fetch phase... so it would come back to the URL distribution.
2009/11/27 Andrzej Bialecki a...@getopt.org
MilleBii wrote:
> You mean map/reduce tasks ???

Yes.

> Being in pseudo-distributed / single-node mode I only have two maps during
> the fetch phase... so it would come back to the URL distribution.

Well, yes, but my explanation is still valid, which unfortunately doesn't
change the situation.
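To check whether the URL distribution really is the bottleneck, a quick scan
of a fetch list like the one below (an illustrative sketch, not part of
Nutch; it assumes a plain-text file with one URL per line) shows how strongly
the list skews towards a few hosts:

import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class HostSkew {
    public static void main(String[] args) throws Exception {
        // Count URLs per host; a handful of dominant hosts means the
        // per-host politeness delay caps overall fetch throughput.
        Map<String, Integer> perHost = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            try {
                perHost.merge(new URL(line.trim()).getHost(), 1, Integer::sum);
            } catch (Exception malformed) {
                // skip lines that are not valid URLs
            }
        }
        perHost.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(20)
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}

If the top few hosts hold most of the URLs, the per-host politeness delay
would explain the staircase shape of the throughput as the small queues drain
and only the big hosts remain.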
The funny thing is that in my case I don't have any redirects, and yet the
status is "Status: 1 (db_unfetched)" even though the content is fetched and
successfully parsed.
Anyway, thanks for your solution.
caezar wrote:
If you read up the thread you'll see that the issue is about pages with
Hi all,
I'm trying to figure out ways to improve Nutch's focused-crawling efficiency.
I'm looking for certain pages inside each domain that contain the content I'm
after. I can't tell whether a given URL contains what I'm looking for unless
I parse it and do some analysis on it.
Basically
Well, what I have created for my own application is a topical-scoring plugin
(a sketch of step 1 follows after this list):
1. First I needed to score the pages after parsing, based on my regular
expression.
2. Then I looked for options on how to boost the score of those pages... I
have only found a way to boost the score of the outlinks of these
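For illustration, step 1 could look roughly like this (a hedged sketch; the
class name and the log damping are my own inventions, not the actual plugin
code):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopicalScorer {
    private final Pattern topic;

    public TopicalScorer(String topicRegex) {
        this.topic = Pattern.compile(topicRegex, Pattern.CASE_INSENSITIVE);
    }

    // Returns a boost factor >= 1.0, where 1.0 means "not topical".
    public float score(String parseText) {
        Matcher m = topic.matcher(parseText);
        int hits = 0;
        while (m.find()) {
            hits++;
        }
        // Log damping so a page with hundreds of matches does not
        // dominate the crawl frontier.
        return 1.0f + (float) Math.log1p(hits);
    }
}

The resulting factor could then be multiplied into the page score inside a
scoring-filter plugin and, as in point 2, passed on to the outlinks.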
My fetch run is getting to the end now; I have the following logs towards
the end:
2009-11-27 19:07:43,866 INFO fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=12
2009-11-27 19:07:44,866 INFO fetcher.Fetcher - -activeThreads=100,
spinWaiting=100,
hi,
this is the main loop of my recrawl.sh
do
  echo --- Beginning crawl at depth `expr $i + 1` of $depth ---
  $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN \
    -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo runbot: Stopping at depth $depth. No more URLs to
There is a JIRA issue plus a discussion on the mailing list about this. It is
a synchronisation problem which has already been reported and patched, but
not yet committed. See https://issues.apache.org/jira/browse/NUTCH-719
J.
2009/11/27 MilleBii mille...@gmail.com
I already applied that patch, which is actually NUTCH-721; I was part of that
discussion at the time. The difference now is that I have moved to a Linux
box and am running pseudo-distributed Hadoop, and I also took a later Nutch
snapshot.
By the way, I could not apply the Time-Bomb patch (NUTCH-770); the patch
command gives me errors.
I