Re: Nutch near future - strategic directions

2009-11-26 Thread Sami Siren
Andrzej Bialecki wrote: Sami Siren wrote: Lots of good thoughts and ideas, easy to agree with. Something for the ease of use category: -allow running on top of plain vanilla hadoop What does it mean plain vanilla here? Do you mean the current DB implementation? That's the idea, we should

Re: Nutch near future - strategic directions

2009-11-20 Thread Andrzej Bialecki
Sami Siren wrote: Lots of good thoughts and ideas, easy to agree with. Something for the ease of use category: -allow running on top of plain vanilla hadoop What does it mean plain vanilla here? Do you mean the current DB implementation? That's the idea, we should aim for an abstract layer

Re: Nutch near future - strategic directions

2009-11-18 Thread Sami Siren
Lots of good thoughts and ideas, easy to agree with. Something for the ease of use category: -allow running on top of plain vanilla hadoop -split into reusable components with nice and clean public api -publish mvn artifacts so developers can directly use mvn, ivy etc to pull required

Re: Nutch near future - strategic directions

2009-11-16 Thread Andrzej Bialecki
Subhojit Roy wrote: Hi, Would it be possible to include in Nutch the ability to crawl/download a page only if the page has been updated since the last crawl? I had read some time back that there were plans to include such a feature. It would be a very useful feature to have IMO. This of course

Re: Nutch near future - strategic directions

2009-11-16 Thread David M. Cole
At 2:44 PM +0100 11/16/09, Andrzej Bialecki wrote: This is already implemented - see the Signature / MD5Signature / TextProfileSignature. OK, then could somebody explain how to implement this feature? Does the initial indexing require a special command-line? Then does the secondary indexing

Re: Nutch near future - strategic directions

2009-11-15 Thread Subhojit Roy
Hi, Would it be possible to include in Nutch the ability to crawl/download a page only if the page has been updated since the last crawl? I had read some time back that there were plans to include such a feature. It would be a very useful feature to have IMO. This of course depends on the last

Nutch near future - strategic directions

2009-11-09 Thread Andrzej Bialecki
Hi all, ApacheCon is over and our 1.0 release has been out for some time, so I think it's a good moment to discuss the next steps in Nutch development. Let me share with you the topics I identified and presented in the ApacheCon slides, and some topics that are worth