Re: Strategic Direction of Nutch

2006-11-13 Thread Nutch Newbie
Well, I would like to agree with Piotr here, but in current development, i.e. version 0.8 and onwards, a single-machine Nutch install is not optimal. There are various Hadoop-related issues, for example http://issues.apache.org/jira/browse/HADOOP-206, that are important for a single-machine install. I don't think

Re: Strategic Direction of Nutch

2006-11-13 Thread Andrzej Bialecki
Nutch Newbie wrote: Well, I would like to agree with Piotr here, but in current development, i.e. version 0.8 and onwards, a single-machine Nutch install is not optimal. There are various Hadoop-related issues, for example http://issues.apache.org/jira/browse/HADOOP-206 Is it really still a valid issue?

Fetching with two different user agents

2006-11-13 Thread e w
Hi, What would be the best way to perform crawling with two different user-agents so as to compare the pages (requested with the two different agents) returned by a server and accept/reject the URL (for subsequent parsing/indexing etc.)? I believe the Google crawler used to do (still does?)

Re: Strategic Direction of Nutch

2006-11-13 Thread carmmello
Hi, Nutch, from version 0.8 on, is really very, very slow at processing data after the crawl when using a single machine. Compared with Nutch 0.7.2 I would say, from my experience indexing about 500,000 pages, that it is roughly 4 to 5 times slower. In addition to that, the possibilities to

Re: Strategic Direction of Nutch

2006-11-13 Thread Sami Siren
carmmello wrote: So, I think, one of the possibilities for the user of a single machine is that the Nutch developers could use some of their time to improve the previous 0.7.2, adding to it some new features, with further releases of this series. I don't believe that there are many Nutch

Re: Strategic Direction of Nutch

2006-11-13 Thread carmmello
Dear Sami Siren, Thank you for your prompt answer, but my problem with 0.8.1 was not with the fetching time itself (although your speed in doing so is a lot greater than mine), which is on par with 0.7.2. My problem is with the time for all the post-fetching processes, which is much longer

Re: Strategic Direction of Nutch

2006-11-13 Thread Uroš Gruber
Sami Siren wrote: carmmello wrote: So, I think, one of the possibilities for the user of a single machine is that the Nutch developers could use some of their time to improve the previous 0.7.2, adding to it some new features, with further releases of this series. I don't believe that there

Re: Strategic Direction of Nutch

2006-11-13 Thread Nutch Newbie
Here are some general comments: The problem is in Hadoop, i.e. map-reduce, i.e. processing. HADOOP-206 is not solved. Have a look: http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00521.html Well, again, it's wishful thinking to ask for many developers, patches and bug reporting and

Re: Strategic Direction of Nutch

2006-11-13 Thread Andrzej Bialecki
(Sorry for the long post, but I felt this issue needs to be made very clear ...) Nutch Newbie wrote: Here are some general comments: The problem is in Hadoop, i.e. map-reduce, i.e. processing. HADOOP-206 is not solved. Have a look.

Re: Strategic Direction of Nutch

2006-11-13 Thread Tomi NA
2006/11/13, carmmello [EMAIL PROTECTED]: Hi, Nutch, from version 0.8 on, is really very, very slow at processing data after the crawl when using a single machine. Compared with Nutch 0.7.2 I would say, ... this series. I don't believe that there are many Nutch users, in the real world of

Re: Strategic Direction of Nutch

2006-11-13 Thread Nutch Newbie
Actually, we are saying the same thing. Sorry, I was not really pointing any fingers; apologies if it came across that way. I was just stating why things didn't get solved: as you pointed out, active developers are on large installs and not on small installs. However if the ambition of

Re: Strategic Direction of Nutch

2006-11-13 Thread Nitin Borwankar
Hi all, First an intro. I am another Nutch newbie and am finding 0.7.2 to be quite an effective single machine crawler. I am not new to Java or data or the Internet. I run an email list called 'tagdb' for people interested in db problems in creating folksonomy applications, also a blog called

Re: Strategic Direction of Nutch

2006-11-13 Thread Anthony May
This is one of the options that I have suggested for our organisation to adopt. Anthony May Web Developer NZQA [EMAIL PROTECTED] 14/11/2006 2:05 p.m. Hi all, First an intro. I am another Nutch newbie and am finding 0.7.2 to be quite an effective single machine crawler. I am not new to Java

Re: Strategic Direction of Nutch

2006-11-13 Thread Sami Siren
Uroš Gruber wrote: How fast do you need to go? I did a 1 million page crawl today with the trunk version of Nutch patched with NUTCH-395 [1]. Total time for fetching was a little over 7 hrs. How is that even possible? I have a 3.2GHz Pentium with 2G RAM. I had the same speed problem, because of that