Re: Reviving Nutch 0.7

2007-01-23 Thread J. Delgado

Nutch Newbie wrote:

Again, I am not really proposing a new project, but rather easier-to-use,
re-usable code. IMHO, Nutch will be the umbrella project for
ala-Google and Solr will be for ala-Enterprise, where Lucene
is the index lib and Hadoop is the MapRed/DFS lib. What is missing is
a common crawler lib, a common indexing lib, etc.


EXACTLY!

-- Joaquin


Re: Reviving Nutch 0.7

2007-01-22 Thread Piotr Kosiorowski

Otis,
Some time ago people on the list said that they are willing to at
least maintain Nutch 0.7 branch. As a committer (not very active
recently) I volunteered to commit patches when they appear - I do not
have enough time at the moment to do active coding. I have created a
0.7.3 release in JIRA so we can start looking at it. So - we are ready
and willing to move Nutch 0.7 forward but it looks like there is no
interest at the moment.
Regards
Piotr

On 1/22/07, Otis Gospodnetic [EMAIL PROTECTED] wrote:

Hi,

I've been meaning to write this message for a while, and Andrzej's 
StrategicGoals made me compose it, finally.

Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, 
it will be even more valuable than it is today.  However, I think there is 
still a need for something much simpler, something like what Nutch 0.7 used to 
be.  Fairly regular nutch-user inquiries confirm this.  Nutch has too few 
developers to maintain and further develop both of these concepts, and the main 
Nutch developers need the more powerful version - 0.8 and beyond.  So, what is 
going to happen to 0.7?  Maintenance mode?

I feel that there is enough need for 0.7-style Nutch that it might be worth at 
least considering and discussing the possibility of somehow branching that 
version into a parallel project that's not just in a maintenance mode, but has 
its own group of developers (not me, no time :( ) that pushes it forward.

Thoughts?

Otis






Re: Reviving Nutch 0.7

2007-01-22 Thread Zaheed Haque

I agree with you that there is a need for 0.7-style Nutch. I wouldn't
say reviving but more dissecting and re-directing :-). Here you go
--- my focus here is 0.7 style, i.e. mid-size, enterprise needs.

Solr could use a good crawler cos it has everything else .. (AFAIK)
probably this is not technically plug and pray :-) also I am not sure
the Solr community wants a crawler, but it could benefit from such a Solr
add-on/snap-on crawler. Furthermore I am sure some of the 0.7 plugins
could be re-factored to fit into Solr.

I will forward the mail to the Solr community to see if there is any interest.

Cheers


RE: Reviving Nutch 0.7

2007-01-22 Thread Alan Tanaman
Hello,

I'm writing this on behalf of both Armel Nene and myself. 

We think that you and those who have responded have a point.  We've been
experiencing quite a number of problems with getting Nutch 0.8 adapted for
our needs, and making changes to support evolving business requirements as
they come up.

So much so, that we've considered replacing the spine of Nutch with our
own programs, which would still be compatible with the Nutch plugins (same
parameters etc.), but would make changes and debugging easier for us.
We've decided to lay out some of our challenges for you to consider.
 
Our major need is the ability to deploy on large enterprise file systems
(1-10 Terabytes, large compared to average file systems, but small compared
to the WWW).  We also need to support http, but only specific web sites,
subscription web sites and so on.  We don't need to replicate a
generic-Google implementation.

The main features we are currently working on relate primarily to
near-real-time crawling, specifically:
- Incremental Crawling, where changes are monitored at the folder level,
which is much faster than fetching every URL and checking for a change.
Note that this is similar to adaptive crawling, but will be even more
efficient.
- Special handling for parsing of large files (possibly farming those out to
dedicated processors a-la Amazon).  Hadoop would be useful here, but we
would consider re-adding this at a later stage.
- Incremental Indexing, where documents are added to or removed from a live
index, instead of rebuilding a new index each time.
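
A minimal sketch of the folder-level change check described in the first
item above, assuming plain directory modification times are a good enough
signal. FolderChangeScanner is a hypothetical helper, not part of Nutch. Note
that on most file systems a folder's modification time changes when entries
are added, removed or renamed, but not when an existing file is edited in
place, so a real implementation would likely need a second check for that.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Nutch class): reports the folders whose
// modification time changed since the previous scan, so that only those
// folders need to be re-fetched.
class FolderChangeScanner {
    // Last modification time (epoch millis) recorded for each folder.
    private final Map<Path, Long> lastSeen = new HashMap<>();

    // Returns root and every folder below it that is new, or whose
    // modification time differs from the one recorded by the previous scan.
    List<Path> scan(Path root) {
        List<Path> changed = new ArrayList<>();
        try {
            visit(root, changed);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return changed;
    }

    private void visit(Path dir, List<Path> changed) throws IOException {
        long mtime = Files.getLastModifiedTime(dir).toMillis();
        Long previous = lastSeen.put(dir, mtime);
        if (previous == null || previous != mtime) {
            changed.add(dir);
        }
        // Recurse into sub-folders only; individual files are never opened.
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                if (Files.isDirectory(child)) {
                    visit(child, changed);
                }
            }
        }
    }
}
```

The first call to scan() reports every folder; later calls report only the
folders that were touched in between.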

We would be happy to join a group of 0.7 developers, if that would enable us
to pursue this enterprise-based direction, which clearly has different
challenges than those facing WWW-crawling.

Best regards,
Alan
_
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: 22 January 2007 06:48
To: Nutch Developer List
Subject: Reviving Nutch 0.7

Hi,

I've been meaning to write this message for a while, and Andrzej's
StrategicGoals made me compose it, finally.

Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop
stabilizes, it will be even more valuable than it is today.  However, I
think there is still a need for something much simpler, something like what
Nutch 0.7 used to be.  Fairly regular nutch-user inquiries confirm this.
Nutch has too few developers to maintain and further develop both of these
concepts, and the main Nutch developers need the more powerful version - 0.8
and beyond.  So, what is going to happen to 0.7?  Maintenance mode?

I feel that there is enough need for 0.7-style Nutch that it might be worth
at least considering and discussing the possibility of somehow branching
that version into a parallel project that's not just in a maintenance mode,
but has its own group of developers (not me, no time :( ) that pushes it
forward.

Thoughts?

Otis






Re: Reviving Nutch 0.7

2007-01-22 Thread Sami Siren


Before doubling (or, after 0.9.0, tripling?) the maintenance/development work,
please consider the following:

One option would be refactoring the code so that the parts that are
usable by other projects, such as protocols(?) and parsers (this was actually
proposed by Jukka Zitting some time last year), are modified to be
independent of Nutch (and Hadoop) code. Yeah, this is easy to say, but it
would require a significant amount of work.

The more focused, smaller chunks of Nutch would probably also get a bigger
audience (perhaps also outside Nutch land) and that way perhaps more people
willing to work on them.

I don't know about others, but at least I would be more willing to work towards
this goal than one where there would be practically many separate
projects, each sharing common functionality but with a different code base.
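
To make the refactoring idea concrete, here is a rough sketch of what a parser
component with no Nutch or Hadoop dependencies could look like. All names
(ContentParser, PlainTextParser, ParserRegistry) are hypothetical and only
illustrate the shape of such a library; they are not existing Nutch classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface that depends only on the JDK, so any project
// (Nutch or not) could reuse its implementations.
interface ContentParser {
    // Turns raw fetched bytes into plain text ready for indexing.
    String parse(byte[] content);
}

// Simplest possible implementation: treats the bytes as UTF-8 text.
class PlainTextParser implements ContentParser {
    public String parse(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }
}

// Looks up a parser by content type, much as a plugin system would.
class ParserRegistry {
    private final Map<String, ContentParser> byType = new HashMap<>();

    void register(String contentType, ContentParser parser) {
        byType.put(contentType, parser);
    }

    ContentParser forType(String contentType) {
        ContentParser parser = byType.get(contentType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser for " + contentType);
        }
        return parser;
    }
}
```

A crawler would register its parsers once and then call
registry.forType(contentType).parse(bytes) for each fetched document.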

--
Sami Siren


Re: Reviving Nutch 0.7

2007-01-22 Thread Chris Mattmann
 
 Before doubling (or, after 0.9.0, tripling?) the maintenance/development work,
 please consider the following:
 
 One option would be refactoring the code so that the parts that are
 usable by other projects, such as protocols(?) and parsers (this was actually
 proposed by Jukka Zitting some time last year), are modified to be
 independent of Nutch (and Hadoop) code. Yeah, this is easy to say, but it
 would require a significant amount of work.
 
 The more focused, smaller chunks of Nutch would probably also get a bigger
 audience (perhaps also outside Nutch land) and that way perhaps more people
 willing to work on them.
 
 I don't know about others, but at least I would be more willing to work towards
 this goal than one where there would be practically many separate
 projects, each sharing common functionality but with a different code base.

+1 ;)

This was actually the project proposed by Jerome Charron and myself, called
Tika. We went so far as to create a project proposal, and send it out to
the nutch-dev list, as well as the Lucene PMC for potential Lucene
sub-project goodness. I could probably dig up the proposal should the need
arise.

Good ol' Jukka then took that effort and created us a project within Google
code, that still lives in there in fact:

http://code.google.com/p/tika/

There hasn't been active development on it because:

1. None of us (I'm speaking for Jerome, and myself here) ended up having the
time to shepherd it going forward

2. There was little, if any, response to the proposal on the nutch-dev
list, and few folks willing to contribute (besides people like Jukka)

3. I think, as you correctly note above, most people thought it too
much of a Herculean effort that wouldn't pay the necessary dividends in the
end


In any case, I think that, if we are going to maintain separate branches of
the source, in fact, really parallel projects, then an undertaking such as
Tika is properly needed ...

Cheers,
   Chris




 




Re: Reviving Nutch 0.7

2007-01-22 Thread Sami Siren
Chris Mattmann wrote:
 In any case, I think that, if we are going to maintain separate branches of
 the source, in fact, really parallel projects, then an undertaking such as
 Tika is properly needed ...

I still don't think we need a separate project to start with; IMO the right
frame of mind is enough to get going. If people think this is the right
direction and it goes beyond talk, then perhaps after that we could start
talking about a separate project.


--
 Sami Siren




Re: Reviving Nutch 0.7

2007-01-22 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

Yes, certainly, anything that can be shared and decoupled from pieces that make 
each branch (not SVN/CVS branch) different, should be decoupled.  But I was 
really curious about whether people think this is a valid idea/direction, not 
necessarily immediately how things should be implemented.  In my mind, one 
branch is the branch that runs on top of Hadoop, with NameNode, DataNode, HDFS, 
etc.  That's the branch that's in the trunk.  The other branch is a simpler 
branch without all that Hadoop stuff, for folks who need to fetch, index, and 
search a few hundred thousand or a few million or even a few tens of millions 
of pages, and don't need replication, etc. that comes with Hadoop.  That branch 
could be based off of 0.7.  I also know that a lot of people are trying to use 
Nutch to build vertical search engines, so there is also a need for a focused 
fetcher.  Kelvin Tan brought this up a few times, too, I believe.


Branching doesn't sound like the right solution here.

First, one doesn't need to run any Hadoop daemons to use Nutch: 
everything should run fine in a single process by default.  If there are 
bugs in this they should be logged, folks who care should submit 
high-quality, back-compatible, generally useful patches, and committers 
should work to get these patches committed to the trunk.


Second, if there are to be two modes of operation, wouldn't they best be 
developed in a common source tree, so that they share as much as 
possible and diverge as little as possible?  It seems to me that a good 
architecture would be to agree on a common high-level API, then use two 
different runtimes underneath, one to support distributed operation, and 
one to support standalone operation.  Hey!  That's what Hadoop already 
does!  Maybe it's not perfect and someone can propose a better way to 
share maximal amounts of code, but the code split should probably be 
into different classes and packages in a single source tree maintained 
by a single community of developers, not by branching a single source 
tree in revision control and splitting the developers.
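
As an illustration of the "one high-level API, two runtimes" idea, here is a
small sketch. It is not Hadoop's API: JobRunner, InProcessRunner, PooledRunner
and PageLengths are made-up names, and a thread pool merely stands in for a
cluster. The point is only that the calling code is identical for both
runtimes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical common API: callers describe the work, the runtime decides
// where it runs.
interface JobRunner {
    <I, O> List<O> mapAll(List<I> inputs, Function<I, O> task);
}

// Standalone runtime: everything runs in the calling thread.
class InProcessRunner implements JobRunner {
    public <I, O> List<O> mapAll(List<I> inputs, Function<I, O> task) {
        List<O> out = new ArrayList<>();
        for (I input : inputs) {
            out.add(task.apply(input));
        }
        return out;
    }
}

// Stand-in for a distributed runtime: a thread pool plays the cluster.
class PooledRunner implements JobRunner {
    private final int workers;

    PooledRunner(int workers) {
        this.workers = workers;
    }

    public <I, O> List<O> mapAll(List<I> inputs, Function<I, O> task) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I input : inputs) {
                Callable<O> call = () -> task.apply(input);
                futures.add(pool.submit(call));
            }
            // Collect results in submission order so both runtimes agree.
            List<O> out = new ArrayList<>();
            for (Future<O> future : futures) {
                out.add(future.get());
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}

// Application code is written once against the interface.
class PageLengths {
    static List<Integer> lengths(JobRunner runner, List<String> pages) {
        return runner.mapAll(pages, String::length);
    }
}
```

Switching from standalone to "distributed" operation is then a matter of
passing a different JobRunner, with no change to PageLengths.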


Third, part of the problem seems to be that there are too few 
contributors--that the challenges are big and the resources limited. 
Splitting the project will only spread those resources more thinly.


What really is the issue here?  Are good patches languishing?  Are there 
patches that should be committed (meet coding standards, are 
back-compatible, generally useful, etc.) but are not?  A great patch is 
one that a committer can commit with few worries: it includes new 
unit tests, it passes all existing unit tests, it fixes one thing only, 
etc.  Such patches should not have to wait long for commit.  And once 
someone submits a few such patches, that person should be invited to become 
a committer.


It sounds to me like the problem is that, off-the-shelf, Nutch does not 
yet solve all the problems folks would like it to: e.g., it has never 
done a good job with incremental indexing.  Folks see progress made on 
scalability, but really wish it were making more progress on 
incrementality or something else.  But it's not going to make progress 
on incrementality without someone doing the work.  A fork or a branch 
isn't going to do the work.  I don't see any reason that the work cannot 
be done right now.  It can be done incrementally: e.g., if the web db 
API seems inappropriate for incremental updates, then someone should 
submit a patch that provides an incremental web db API, updating the 
fetcher and indexer to use this.  A design for this on the wiki would be 
a good place to start.
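
As a starting point for such a design, an incremental web db API might look
roughly like the sketch below. This is not the existing Nutch web db API:
IncrementalWebDb, PageRecord and InMemoryWebDb are hypothetical names, and the
in-memory map only stands in for real storage.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// What the db remembers about one page (hypothetical record layout).
final class PageRecord {
    final long fetchTimeMillis;
    final String contentHash;

    PageRecord(long fetchTimeMillis, String contentHash) {
        this.fetchTimeMillis = fetchTimeMillis;
        this.contentHash = contentHash;
    }
}

// Hypothetical API: single-page updates instead of rebuilding the whole db.
interface IncrementalWebDb {
    void upsert(String url, long fetchTimeMillis, String contentHash);
    void remove(String url);
    Optional<PageRecord> get(String url);
    // True when the page is unknown or its content changed since last fetch.
    boolean needsReindex(String url, String contentHash);
}

// In-memory stand-in, useful for unit tests of a fetcher or indexer.
class InMemoryWebDb implements IncrementalWebDb {
    private final Map<String, PageRecord> pages = new TreeMap<>();

    public void upsert(String url, long fetchTimeMillis, String contentHash) {
        pages.put(url, new PageRecord(fetchTimeMillis, contentHash));
    }

    public void remove(String url) {
        pages.remove(url);
    }

    public Optional<PageRecord> get(String url) {
        return Optional.ofNullable(pages.get(url));
    }

    public boolean needsReindex(String url, String contentHash) {
        PageRecord known = pages.get(url);
        return known == null || !known.contentHash.equals(contentHash);
    }
}
```

A fetcher would call needsReindex() after each fetch and hand only the changed
pages to the indexer, then record them with upsert().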


Finally, web crawling, indexing and searching are data-intensive. 
Before long, users will want to index tens or hundreds of millions of 
pages.  Distributed operation is soon required at this scale, and 
batch-mode is an order-of-magnitude faster.  So be careful before you 
throw those features out: you might want them back soon.


Doug




Re: Reviving Nutch 0.7

2007-01-22 Thread AJ Chen

On 1/22/07, Doug Cutting [EMAIL PROTECTED] wrote:



Finally, web crawling, indexing and searching are data-intensive.
Before long, users will want to index tens or hundreds of millions of
pages.  Distributed operation is soon required at this scale, and
batch-mode is an order-of-magnitude faster.  So be careful before you
throw those features out: you might want them back soon.

Doug


As a developer building an application on top of Nutch, my experience is that
I can't go back to version 0.7x because the features in version 0.8/0.9 are
so much needed even for non-distributed crawling/indexing. For example, I
can run crawling/indexing on a Linux server and a Windows laptop separately,
and merge newly crawled databases into the main crawldb. As I remember,
v0.7 can't merge separate crawldbs without lots of customization.
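
To show the idea behind such a merge, here is a toy sketch. It is not the
Nutch merge code: a crawldb is reduced to a map from URL to last fetch time,
and the rule that the newest entry wins is an assumption made for
illustration. CrawlDbMergeSketch is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of merging two crawl databases (hypothetical, not Nutch code).
class CrawlDbMergeSketch {
    // Each db maps a URL to its last fetch time in epoch millis. When a URL
    // appears in both, the more recent fetch time is kept (an assumption).
    static Map<String, Long> merge(Map<String, Long> mainDb, Map<String, Long> newDb) {
        Map<String, Long> merged = new HashMap<>(mainDb);
        for (Map.Entry<String, Long> entry : newDb.entrySet()) {
            merged.merge(entry.getKey(), entry.getValue(), Long::max);
        }
        return merged;
    }
}
```

Neither input map is modified, so the same main db can be merged with
several separately crawled dbs one after another.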

It may take some time to switch from 0.7x to v0.8/0.9, especially if you
have lots of customization code. But, once you get over this one hurdle, you
will enjoy the new and better features in the 0.8/0.9 version.  Also, this may
be the time to re-think the design of your application. For my own project,
I always try to separate my code from Nutch core code as much as possible so
that I can easily upgrade the application to keep up with new Nutch releases.
Keeping away from the newest Nutch version seems somewhat backward to me.

AJ
--
AJ Chen, PhD
Palo Alto, CA
http://web2express.org