Well, I would like to agree with Piotr here, but in current development
(i.e. version 0.8 and onwards) a single-machine Nutch install is not
optimal. There are various Hadoop-related issues, for example
http://issues.apache.org/jira/browse/HADOOP-206
that are important for a single-machine install. I don't think
Nutch Newbie wrote:
Well, I would like to agree with Piotr here, but in current development
(i.e. version 0.8 and onwards) a single-machine Nutch install is not
optimal. There are various Hadoop-related issues, for example
http://issues.apache.org/jira/browse/HADOOP-206
Is it really still a valid issue?
Hi,
What would be the best way to perform crawling with two different
user-agents so as to compare the pages (requested with the two different
agents) returned by a server and accept/reject the URL (for subsequent
parsing/indexing etc.)?
I believe the Google crawler used to do (still does?)
Hi,
Nutch, from version 0.8 on, is really very, very slow at processing data
after crawling on a single machine. Compared with Nutch 0.7.2, I would say,
from my experience indexing about 500,000 pages, that it is roughly 4 to
5 times slower. In addition to that, the possibilities to
carmmello wrote:
So, I think, one of the possibilities for the user of a single machine
is that the Nutch developers could use some of their time to improve the
previous 0.7.2, adding to it some new features, with further releases of
this series. I don't believe that there are many Nutch
Dear Sami Siren,
Thank you for your prompt answer, but my problem with 0.8.1 was not with the
fetching time itself (although your fetching speed is a lot greater than
mine), which is on par with 0.7.2. My problem is with the time for all the
post-fetching processes, which is much longer
Sami Siren wrote:
carmmello wrote:
So, I think, one of the possibilities for the user of a single
machine is that the Nutch developers could use some of their time to
improve the previous 0.7.2, adding to it some new features, with
further releases of this series. I don't believe that there
Here are some general comments:
The problem is in Hadoop, i.e. map-reduce, i.e. processing. Hadoop-206
is not solved. Have a look.
http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00521.html
Well, again, it's wishful thinking to ask for many developers, patch
and bug reporting and
(Sorry for the long post, but I felt this issue needs to be made very
clear ...)
Nutch Newbie wrote:
Here are some general comments:
The problem is in Hadoop, i.e. map-reduce, i.e. processing. Hadoop-206
is not solved. Have a look.
2006/11/13, carmmello [EMAIL PROTECTED]:
Hi,
Nutch, from version 0.8 on, is really very, very slow at processing data
after crawling on a single machine. Compared with Nutch 0.7.2, I would say,
...
this series. I don't believe that there are many Nutch users, in the real
world of
Actually, we are saying the same thing. Sorry, I was not really pointing
any fingers; apologies if it came across that way. I was just stating
why things didn't get solved: as you pointed out, active developers
are on large installs and not on small installs.
However if the ambition of
Hi all,
First an intro. I am another Nutch newbie and am finding 0.7.2 to be
quite an effective single machine crawler.
I am not new to Java or data or the Internet. I run an email list called
'tagdb' for people interested in db problems in creating folksonomy
applications, also a blog called
This is one of the options that I have suggested for our organisation to
adopt.
Anthony May
Web Developer
NZQA
[EMAIL PROTECTED] 14/11/2006 2:05 p.m.
Hi all,
First an intro. I am another Nutch newbie and am finding 0.7.2 to be
quite an effective single machine crawler.
I am not new to Java
Uroš Gruber wrote:
How fast do you need to go?
I did a 1 million page crawl today with the trunk version of Nutch patched
with NUTCH-395 [1]. Total time for fetching was a little over 7 hrs.
How is that even possible?
I have a 3.2 GHz Pentium with 2 GB of RAM. I had the same speed problem;
because of that