Re: Reviving Nutch 0.7
Nutch Newbie wrote: Again, not really proposing a new project but more easy-to-use, reusable code. IMHO, Nutch will be an umbrella project for ala-Google and Solr will be for ala-Enterprise, where Lucene is the index lib, Hadoop is the Mapred/DFS lib ... what is missing is a Common Crawler lib, a Common indexing lib, etc.
EXACTLY!
-- Joaquin
Re: Reviving Nutch 0.7
Otis,
Some time ago people on the list said that they are willing to at least maintain the Nutch 0.7 branch. As a committer (not very active recently) I volunteered to commit patches when they appear - I do not have enough time at the moment to do active coding. I have created a 7.3 release in JIRA so we can start looking at it. So - we are ready and willing to move Nutch 0.7 forward but it looks like there is no interest at the moment.
Regards
Piotr
On 1/22/07, Otis Gospodnetic [EMAIL PROTECTED] wrote:
Hi, I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally. Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today. However, I think there is still a need for something much simpler, something like what Nutch 0.7 used to be. Fairly regular nutch-user inquiries confirm this. Nutch has too few developers to maintain and further develop both of these concepts, and the main Nutch developers need the more powerful version - 0.8 and beyond. So, what is going to happen to 0.7? Maintenance mode? I feel that there is enough need for 0.7-style Nutch that it might be worth at least considering and discussing the possibility of somehow branching that version into a parallel project that's not just in a maintenance mode, but has its own group of developers (not me, no time :( ) that pushes it forward. Thoughts? Otis
Re: Reviving Nutch 0.7
On 1/22/07, Otis Gospodnetic [EMAIL PROTECTED] wrote:
Hi, I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally. Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today. However, I think there is still a need for something much simpler, something like what Nutch 0.7 used to be. Fairly regular nutch-user inquiries confirm this. Nutch has too few developers to maintain and further develop both of these concepts, and the main Nutch developers need the more powerful version - 0.8 and beyond. So, what is going to happen to 0.7? Maintenance mode? I feel that there is enough need for 0.7-style Nutch that it might be worth at least considering and discussing the possibility of somehow branching that version into a parallel project that's not just in a maintenance mode, but has its own group of developers (not me, no time :( ) that pushes it forward. Thoughts?
I agree with you that there is a need for 0.7-style Nutch. I wouldn't say reviving but more dissecting and re-directing :-). Here you go: my focus here is 0.7 style, i.e. the mid-size, enterprise need. Solr could use a good crawler because it has everything else (AFAIK). Probably this is not technically plug and pray :-) and I am not sure the Solr community wants a crawler, but it could benefit from such a Solr add-on/snap-on crawler. Furthermore, I am sure some of the 0.7 plugins could be refactored to fit into Solr. I will forward the mail to the Solr community to see if there is any interest.
Cheers
RE: Reviving Nutch 0.7
Hello, I'm writing this on behalf of both Armel Nene and myself. We think that you and those who have responded have a point. We've been experiencing quite a number of problems with getting Nutch 0.8 adapted for our needs, and making changes to support evolving business requirements as they come up. So much so, that we've considered replacing the spine of Nutch with our own programs, which would still be compatible with the Nutch plugins (same parameters etc.), but that would allow us more ease in making changes and debugging. We've decided to lay out some of our challenges for you to consider.
Our major need is the ability to deploy on large enterprise file systems (1-10 Terabytes, large compared to average file systems, but small compared to the WWW). We also need to support http, but only specific web sites, subscription web sites and so on. We don't need to replicate a generic-Google implementation.
The main features we are currently working on relate primarily to near-real-time crawling, specifically:
- Incremental Crawling, where changes are monitored at the folder level, which is much faster than fetching every URL and checking for a change. Note that this is similar to adaptive crawling, but will be even more efficient.
- Special handling for parsing of large files (possibly farming those out to dedicated processors a-la Amazon). Hadoop would be useful here, but we would consider re-adding this at a later stage.
- Incremental Indexing, where documents are added to or removed from a live index, instead of rebuilding a new index each time.
We would be happy to join a group of 0.7 developers, if that would enable us to pursue this enterprise-based direction, which clearly has different challenges than those facing WWW-crawling.
Best regards,
Alan
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com
-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 22 January 2007 06:48
To: Nutch Developer List
Subject: Reviving Nutch 0.7
Hi, I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally. Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today. However, I think there is still a need for something much simpler, something like what Nutch 0.7 used to be. Fairly regular nutch-user inquiries confirm this. Nutch has too few developers to maintain and further develop both of these concepts, and the main Nutch developers need the more powerful version - 0.8 and beyond. So, what is going to happen to 0.7? Maintenance mode? I feel that there is enough need for 0.7-style Nutch that it might be worth at least considering and discussing the possibility of somehow branching that version into a parallel project that's not just in a maintenance mode, but has its own group of developers (not me, no time :( ) that pushes it forward. Thoughts? Otis
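The folder-level incremental crawling described in the message above can be illustrated with a small sketch. This is not Nutch code and not the authors' implementation: the function names `folder_signatures` and `changed_folders` are hypothetical, and the sketch assumes that file name, size, and modification time are a good enough change signal, so no file content is read.

```python
import os
import tempfile


def folder_signatures(root):
    """Map each folder under root to a signature of its direct children.

    Assumption: (name, size, mtime) per file is enough to detect a change,
    so file contents are never opened.
    """
    signatures = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        entries = []
        for name in sorted(filenames):
            st = os.stat(os.path.join(dirpath, name))
            entries.append((name, st.st_size, st.st_mtime_ns))
        signatures[os.path.relpath(dirpath, root)] = tuple(entries)
    return signatures


def changed_folders(previous, current):
    """Return (added, removed, modified) folder sets between two snapshots."""
    added = set(current) - set(previous)
    removed = set(previous) - set(current)
    modified = {
        folder
        for folder in set(current) & set(previous)
        if current[folder] != previous[folder]
    }
    return added, removed, modified
```

A crawler built this way would keep the previous snapshot between runs and re-parse only the files in folders reported as added or modified, removing index entries for folders reported as removed.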
Re: Reviving Nutch 0.7
2007/1/22, Otis Gospodnetic [EMAIL PROTECTED]:
Hi, I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally. Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today. However, I think there is still a need for something much simpler, something like what Nutch 0.7 used to be. Fairly regular nutch-user inquiries confirm this. Nutch has too few developers to maintain and further develop both of these concepts, and the main Nutch developers need the more powerful version - 0.8 and beyond. So, what is going to happen to 0.7? Maintenance mode? I feel that there is enough need for 0.7-style Nutch that it might be worth at least considering and discussing the possibility of somehow branching that version into a parallel project that's not just in a maintenance mode, but has its own group of developers (not me, no time :( ) that pushes it forward. Thoughts?
Before doubling (or, after 0.9.0, tripling?) the maintenance/development work, please consider the following: one option would be refactoring the code so that the parts that are usable by other projects, such as protocols(?), parsers (this actually was proposed by Jukka Zitting some time last year) and so on, would be modified to be independent of Nutch (and Hadoop) code. Yeah, this is easy to say, but it would require a significant amount of work. The more focused, smaller chunks of Nutch would probably also get a bigger audience (perhaps also outside Nutch land) and that way perhaps more people willing to work on them. I don't know about others, but at least I would be more willing to work towards this goal than the one where there would be practically many separate projects, each sharing common functionality but with a different code base.
-- Sami Siren
Re: Reviving Nutch 0.7
Before doubling (or, after 0.9.0, tripling?) the maintenance/development work, please consider the following: one option would be refactoring the code so that the parts that are usable by other projects, such as protocols(?), parsers (this actually was proposed by Jukka Zitting some time last year) and so on, would be modified to be independent of Nutch (and Hadoop) code. Yeah, this is easy to say, but it would require a significant amount of work. The more focused, smaller chunks of Nutch would probably also get a bigger audience (perhaps also outside Nutch land) and that way perhaps more people willing to work on them. I don't know about others, but at least I would be more willing to work towards this goal than the one where there would be practically many separate projects, each sharing common functionality but with a different code base.
+1 ;) This was actually the project proposed by Jerome Charron and myself, called Tika. We went so far as to create a project proposal, and send it out to the nutch-dev list, as well as the Lucene PMC for potential Lucene sub-project goodness. I could probably dig up the proposal should the need arise. Good ol' Jukka then took that effort and created us a project within Google Code, that still lives in there in fact: http://code.google.com/p/tika/
There hasn't been active development on it because:
1. None of us (I'm speaking for Jerome and myself here) ended up having the time to shepherd it going forward.
2. There was little, if any, response to the proposal on the nutch-dev list from folks willing to contribute (besides people like Jukka).
3. I think, as you correctly note above, most people thought it to be too much of a Herculean effort that wouldn't pay the necessary dividends in the end.
In any case, I think that, if we are going to maintain separate branches of the source, in fact, really parallel projects, then an undertaking such as Tika is properly needed ...
Cheers, Chris
-- Sami Siren
Re: Reviving Nutch 0.7
Chris Mattmann wrote: In any case, I think that, if we are going to maintain separate branches of the source, in fact, really parallel projects, then an undertaking such as Tika is properly needed ...
I still don't think we need a separate project to start with; IMO the right mindset is enough to get going. If people think this is the right direction and it goes beyond talk, then perhaps after that we could start talking about a separate project.
-- Sami Siren
Re: Reviving Nutch 0.7
[EMAIL PROTECTED] wrote: Yes, certainly, anything that can be shared and decoupled from pieces that make each branch (not SVN/CVS branch) different, should be decoupled. But I was really curious about whether people think this is a valid idea/direction, not necessarily immediately how things should be implemented. In my mind, one branch is the branch that runs on top of Hadoop, with NameNode, DataNode, HDFS, etc. That's the branch that's in the trunk. The other branch is a simpler branch without all that Hadoop stuff, for folks who need to fetch, index, and search a few hundred thousand or a few million or even a few tens of millions of pages, and don't need replication, etc. that comes with Hadoop. That branch could be based off of 0.7. I also know that a lot of people are trying to use Nutch to build vertical search engines, so there is also a need for a focused fetcher. Kelvin Tan brought this up a few times, too, I believe. Branching doesn't sound like the right solution here. First, one doesn't need to run any Hadoop daemons to use Nutch: everything should run fine in a single process by default. If there are bugs in this they should be logged, folks who care should submit high-quality, back-compatible, generally useful patches, and committers should work to get these patches committed to the trunk. Second, if there are to be two modes of operation, wouldn't they best be developed in a common source tree, so that they share as much as possible and diverge as little as possible? It seems to me that a good architecture would be to agree on a common high-level API, then use two different runtimes underneath, one to support distributed operation, and one to support standalone operation. Hey! That's what Hadoop already does! 
Maybe it's not perfect and someone can propose a better way to share maximal amounts of code, but the code split should probably be into different classes and packages in a single source tree maintained by a single community of developers, not by branching a single source tree in revision control and splitting the developers. Third, part of the problem seems to be that there are too few contributors--that the challenges are big and the resources limited. Splitting the project will only spread those resources more thinly. What really is the issue here? Are good patches languishing? Are there patches that should be committed (meet coding standards, are back-compatible, generally useful, etc.) but are not? A great patch is one that a committer can commit with few worries: it includes new unit tests, it passes all existing unit tests, it fixes one thing only, etc. Such patches should not have to wait long for commit. And once someone submits a few such patches, that person should be invited to become a committer. It sounds to me like the problem is that, off-the-shelf, Nutch does not yet solve all the problems folks would like it to: e.g., it has never done a good job with incremental indexing. Folks see progress made on scalability, but really wish it were making more progress on incrementality or something else. But it's not going to make progress on incrementality without someone doing the work. A fork or a branch isn't going to do the work. I don't see any reason that the work cannot be done right now. It can be done incrementally: e.g., if the web db API seems inappropriate for incremental updates, then someone should submit a patch that provides an incremental web db API, updating the fetcher and indexer to use this. A design for this on the wiki would be a good place to start. Finally, web crawling, indexing and searching are data-intensive. Before long, users will want to index tens or hundreds of millions of pages. 
Distributed operation is soon required at this scale, and batch-mode is an order-of-magnitude faster. So be careful before you throw those features out: you might want them back soon. Doug
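The architecture suggested in the message above (one common high-level API, with a standalone runtime and a distributed runtime underneath) can be sketched in a few lines. This is only an illustration and not Hadoop's or Nutch's actual API: `JobRunner`, `InProcessRunner`, `PooledRunner`, `host_of`, and `total` are hypothetical names, and a thread pool stands in for a real distributed runtime.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


class JobRunner(ABC):
    """Common high-level API: job code depends only on this interface."""

    @abstractmethod
    def run(self, records, map_fn, reduce_fn):
        """Apply map_fn to each record, group by key, reduce each group."""


def _reduce_pairs(pairs, reduce_fn):
    # Group (key, value) pairs by key, then reduce each group.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


class InProcessRunner(JobRunner):
    """Standalone runtime: everything runs sequentially in one process."""

    def run(self, records, map_fn, reduce_fn):
        pairs = [pair for record in records for pair in map_fn(record)]
        return _reduce_pairs(pairs, reduce_fn)


class PooledRunner(JobRunner):
    """Stand-in for a distributed runtime: map tasks run on a worker pool."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, records, map_fn, reduce_fn):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            chunks = list(pool.map(lambda r: list(map_fn(r)), records))
        pairs = [pair for chunk in chunks for pair in chunk]
        return _reduce_pairs(pairs, reduce_fn)


def host_of(url):
    # Map step: emit (host, 1) for every URL.
    yield urlparse(url).netloc, 1


def total(host, counts):
    # Reduce step: number of URLs seen per host.
    return sum(counts)
```

The job itself (`host_of` and `total`) is written once; choosing `InProcessRunner()` or `PooledRunner()` changes only how it is executed, which is the code-sharing property the message argues for.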
Re: Reviving Nutch 0.7
On 1/22/07, Doug Cutting [EMAIL PROTECTED] wrote: Finally, web crawling, indexing and searching are data-intensive. Before long, users will want to index tens or hundreds of millions of pages. Distributed operation is soon required at this scale, and batch-mode is an order-of-magnitude faster. So be careful before you throw those features out: you might want them back soon. Doug
As a developer building applications on top of Nutch, my experience is that I can't go back to version 0.7x because the features in version 0.8/0.9 are so much needed even for non-distributed crawling/indexing. For example, I can run crawling/indexing on a Linux server and a Windows laptop separately, and merge newly crawled databases into the main crawldb. I remember v0.7 can't merge separate crawldbs without lots of customization. It may take some time to switch from 0.7x to v0.8/0.9, especially if you have lots of customization code. But once you get over this one hurdle, you will enjoy the new and better features in the 0.8/0.9 versions. Also, this may be the time to re-think the design of your application. For my own project, I always try to separate my code from the Nutch core code as much as possible so that I can easily upgrade the application to keep up with new Nutch releases. Keeping away from the newest Nutch version is somewhat backward to me.
AJ
-- AJ Chen, PhD Palo Alto, CA http://web2express.org
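The idea of merging separately crawled databases into a main crawldb, mentioned in the message above, can be shown with a toy model. This is not Nutch's merge code: it assumes each crawldb is a plain mapping from URL to a record with a `fetch_time` field (a hypothetical layout), and that the most recently fetched record should win when the same URL appears in more than one database.

```python
def merge_crawldbs(*dbs):
    """Merge several URL -> record mappings, keeping the newest record per URL.

    Assumption: every record is a dict with a numeric "fetch_time" key.
    Ties keep the record from the database listed first.
    """
    merged = {}
    for db in dbs:
        for url, record in db.items():
            current = merged.get(url)
            if current is None or record["fetch_time"] > current["fetch_time"]:
                merged[url] = record
    return merged


# Example: a main crawldb and a smaller one crawled on another machine.
main_db = {
    "http://example.org/": {"fetch_time": 100, "status": "fetched"},
    "http://example.org/old": {"fetch_time": 90, "status": "fetched"},
}
laptop_db = {
    "http://example.org/": {"fetch_time": 150, "status": "fetched"},
    "http://example.org/new": {"fetch_time": 140, "status": "fetched"},
}
combined = merge_crawldbs(main_db, laptop_db)
```

Here `combined` holds three URLs, with the entry for `http://example.org/` taken from the laptop crawl because its fetch time is later.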