Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
On Wed, Apr 7, 2010 at 20:32, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-07 18:54, Doğacan Güney wrote:
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly for
 JIRA so that we can file issues / feature requests on 2.0? Do you think 
 that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.

 Hmm .. this puzzles me, do you think we should port changes from 1.1 to
 nutchbase? I thought we should do it the other way around, i.e. merge
 nutchbase bits to trunk.


Hmm, I am a bit out of touch with the latest changes, but I know that
the differences between trunk and nutchbase are unfortunately rather
large right now. If merging nutchbase back into trunk would be easier,
then sure, let's do that.


 * support for HBase : via ORM or not (see NUTCH-808,
 https://issues.apache.org/jira/browse/NUTCH-808)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

 Again, the advantage of DataNucleus is that we don't have to handcraft
 all the mid- to low-level mappings, just the mid-level ones (JOQL or
 whatever) - the cost of maintenance is lower, and the number of backends
 that are supported out of the box is larger. Of course, this is just
 IMHO - we won't know for sure until we try to use both your custom ORM
 and DataNucleus...

I am obviously a bit biased here, but I have no strong feelings really.
DataNucleus is an excellent project. What I like about the avro-based
approach is the essentially free MapReduce support we get, and the fact
that supporting another language is easy. So we can expose partial HBase
data through a server, and a Python client can easily read/write to it,
thanks to Avro. That being said, I am all for DataNucleus or something
else.


 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Doğacan Güney


Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
Hi,

On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote:
 Just a question ?
 Will the new HBase implementation allow more sophisticated crawling
 strategies than the current score based.

 Give you a few  example of what I'd like to do :
 Define different crawling frequency for different set of URLs, say
 weekly for some url, monthly or more for others.

 Select URLs to re-crawl based on attributes previously extracted. Just
 one example: recrawl urls that contained a certain keyword (or set of)

 Select URLs that have not yet been crawled, at the frontier of the
 crawl therefore


At some point, it would be nice to change the generator so that it is
only a handful of methods plus a Pig (or something else) script. We
would provide most of the functions you may need during generation
(accessing various data), but the actual generation would be a Pig
process. This way, anyone can easily change generation any way they
want (even split it into more than two jobs if they want more complex
schemes).
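
To make this concrete, here is a rough, untested sketch of the kind of
helper we could ship as a Pig UDF (the class and field names below are
made up for illustration; only the EvalFunc/Tuple API is real Pig):

  import java.io.IOException;
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.Tuple;

  // A user's generate "script" might then boil down to something like:
  //   urls = LOAD 'webtable' USING WebTableLoader();
  //   due  = FILTER urls BY DueForFetch(fetch_time, fetch_interval);
  //   STORE due INTO 'fetchlist';
  public class DueForFetch extends EvalFunc<Boolean> {
    // Input tuple: (fetchTime, fetchInterval), both in milliseconds;
    // returns true if the URL is due in this generate cycle.
    public Boolean exec(Tuple input) throws IOException {
      if (input == null || input.size() < 2) return false;
      long fetchTime = (Long) input.get(0);
      long fetchInterval = (Long) input.get(1);
      return fetchTime + fetchInterval <= System.currentTimeMillis();
    }
  }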




 2010/4/7, Doğacan Güney doga...@gmail.com:
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly
 for
 JIRA so that we can file issues / feature requests on 2.0? Do you think
 that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.


 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see NUTCH-808,
 https://issues.apache.org/jira/browse/NUTCH-808)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.


 Agreed. I would like to add support for katta and other indexing
 backends at some point but
 NutchDocument should be our canonical representation. The rest should
 be up to indexing backends.

 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations, We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

 Plus a better handling of redirects, detecting duplicated sites,
 detection of spam cliques, tools to manage the webgraph, etc.


 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?

 Definitely. :)

 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





 --
 Doğacan Güney



 --
 -MilleBii-




-- 
Doğacan Güney


Re: Nutch 2.0 roadmap

2010-04-08 Thread MilleBii
Not sure what you mean by a pig script, but I'd like to be able to make
a multi-criteria selection of URLs for fetching...
The scoring method forces a kind of one-dimensional approach
which is not really easy to deal with.

The regex filters are good, but they assume you want to select URLs on
data which is in the URL itself... pretty limited in fact.

I basically would like to do 'content'-based crawling. Say for
example that I'm interested in topic A:
I'd like to label URLs that match topic A (user-supplied logic).
Later on I would want to crawl topic-A URLs at a certain frequency,
and non-labeled URLs for exploration in a different way.

This looks hard to do right now.

2010/4/8, Doğacan Güney doga...@gmail.com:
 Hi,

 On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote:
 Just a question ?
 Will the new HBase implementation allow more sophisticated crawling
 strategies than the current score based.

 Give you a few  example of what I'd like to do :
 Define different crawling frequency for different set of URLs, say
 weekly for some url, monthly or more for others.

 Select URLs to re-crawl based on attributes previously extracted. Just
 one example: recrawl urls that contained a certain keyword (or set of)

 Select URLs that have not yet been crawled, at the frontier of the
 crawl therefore


 At some point, it would be nice to change generator so that it is only a
 handful
 of methods and a pig (or something else) script. So, we would provide
 most of the functions
 you may need during generation (accessing various data) but actual
 generation would be a pig
 process. This way, anyone can easily change generate any way they want
 (even make it more jobs
 than 2 if they want more complex schemes).




 2010/4/7, Doğacan Güney doga...@gmail.com:
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will
 be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly
 for
 JIRA so that we can file issues / feature requests on 2.0? Do you think
 that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.


 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see NUTCH-808,
 https://issues.apache.org/jira/browse/NUTCH-808)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.


 Agreed. I would like to add support for katta and other indexing
 backends at some point but
 NutchDocument should be our canonical representation. The rest should
 be up to indexing backends.

 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations, We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

 Plus a better handling of redirects, detecting duplicated sites,
 detection of spam cliques, tools to manage the webgraph, etc.


 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?

 

Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
On Thu, Apr 8, 2010 at 21:11, MilleBii mille...@gmail.com wrote:
 Not sure what u mean by pig script, but I'd like to be able to make a
 multi-criteria selection of Url for fetching...

I mean a query language like

http://hadoop.apache.org/pig/

If we expose the data correctly, then you should be able to generate on
any criteria you want.
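
For example, something along these lines (just a sketch: PigServer and
registerQuery are real Pig APIs, but WebTableLoader and the field names
are invented) could cover all three of your criteria in one FILTER:

  import java.io.IOException;
  import org.apache.pig.ExecType;
  import org.apache.pig.PigServer;

  public class GenerateSketch {
    public static void main(String[] args) throws IOException {
      PigServer pig = new PigServer(ExecType.MAPREDUCE);
      pig.registerQuery("urls = LOAD 'webtable' USING WebTableLoader();");
      // Weekly re-crawl of keyword-labeled URLs, plus the frontier
      // (never-fetched) URLs, selected in a single pass.
      long weekAgo = System.currentTimeMillis() - 7L * 24 * 3600 * 1000;
      pig.registerQuery("due = FILTER urls BY "
          + "(label == 'topicA' AND fetch_time < " + weekAgo + "L) "
          + "OR status == 'unfetched';");
      pig.store("due", "fetchlist");
    }
  }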

  The scoring method forces into a kind of mono dimensional approach
 which is not really easy to deal with.

 The regex filters are good but it assumes you want select URLs on data
 which is in the URL... Pretty limited in fact

 I basically would like to do 'content' based crawling. Say for
 example: that I'm interested in topic A.
 I'd'like to label URLs that match Topic A (user supplied logic).
 Later on I would want to crawl topic A urls at a certain frequency
 and non labeled urls for exploring in a different way.

  This looks like hard to do right now

 2010/4/8, Doğacan Güney doga...@gmail.com:
 Hi,

 On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote:
 Just a question ?
 Will the new HBase implementation allow more sophisticated crawling
 strategies than the current score based.

 Give you a few  example of what I'd like to do :
 Define different crawling frequency for different set of URLs, say
 weekly for some url, monthly or more for others.

 Select URLs to re-crawl based on attributes previously extracted. Just
 one example: recrawl urls that contained a certain keyword (or set of)

 Select URLs that have not yet been crawled, at the frontier of the
 crawl therefore


 At some point, it would be nice to change generator so that it is only a
 handful
 of methods and a pig (or something else) script. So, we would provide
 most of the functions
 you may need during generation (accessing various data) but actual
 generation would be a pig
 process. This way, anyone can easily change generate any way they want
 (even make it more jobs
 than 2 if they want more complex schemes).




 2010/4/7, Doğacan Güney doga...@gmail.com:
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will
 be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly
 for
 JIRA so that we can file issues / feature requests on 2.0? Do you think
 that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.


 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see NUTCH-808,
 https://issues.apache.org/jira/browse/NUTCH-808)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.


 Agreed. I would like to add support for katta and other indexing
 backends at some point but
 NutchDocument should be our canonical representation. The rest should
 be up to indexing backends.

 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations, We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

Re: Nutch 2.0 roadmap

2010-04-07 Thread Julien Nioche
Hi,

I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


Yes, maybe we should start the 2.0 branch from 1.1 instead.
Dogacan - what do you think?

BTW I see there is now a 2.0 label under JIRA, thanks to whoever added it


 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.


definitely

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.



I think that separating the parsing filters from the indexing filters can
have its merits, e.g. combining the metadata generated by two or more
different parsing filters into a single field in the NutchDocument,
keeping only a subset of the available information, etc.
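
As a sketch of that (all names invented): suppose one parsing filter
stored anchor keywords and another stored body keywords in the parse
metadata; a single indexing-time step could then fold both into one
field of the NutchDocument:

  import java.util.Map;

  class KeywordsMerger {
    // Returns the value for a single "keywords" field, merged from the
    // metadata two hypothetical parsing filters produced; the caller
    // would then do doc.add("keywords", merged) on the NutchDocument.
    String mergedKeywords(Map<String, String> parseMeta) {
      String anchors = parseMeta.get("anchor.keywords"); // filter #1
      String body = parseMeta.get("body.keywords");      // filter #2
      StringBuilder merged = new StringBuilder();
      if (anchors != null) merged.append(anchors);
      if (body != null) {
        if (merged.length() > 0) merged.append(' ');
        merged.append(body);
      }
      return merged.length() > 0 ? merged.toString() : null;
    }
  }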


 
  I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
  update?


I have created a new page to support the discussion:
http://wiki.apache.org/nutch/Nutch2Roadmap

julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


Re: Nutch 2.0 roadmap

2010-04-07 Thread Doğacan Güney
Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly for
 JIRA so that we can file issues / feature requests on 2.0? Do you think that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


I know... But I still intend to finish it, I just need to schedule
some time for it.

My vote would be to go with nutchbase.


 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see NUTCH-808,
 https://issues.apache.org/jira/browse/NUTCH-808)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


Yeah, there is already a simple ORM within nutchbase that is avro-based
and should be generic enough to also support MySQL, Cassandra and
BerkeleyDB. But any good ORM will be a very good addition.
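
For the curious, the idea is roughly the following (a sketch with
made-up schema and field names, not the actual nutchbase classes): a
row is described once as an Avro schema, and each backend only has to
store and retrieve generic records.

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericRecord;

  public class WebPageRecord {
    // One schema describes the row for every backend.
    static final Schema SCHEMA = Schema.parse(
        "{\"type\":\"record\",\"name\":\"WebPage\",\"fields\":["
        + "{\"name\":\"url\",\"type\":\"string\"},"
        + "{\"name\":\"score\",\"type\":\"float\"},"
        + "{\"name\":\"fetchTime\",\"type\":\"long\"}]}");

    public static GenericRecord newPage(String url) {
      GenericRecord page = new GenericData.Record(SCHEMA);
      page.put("url", url);
      page.put("score", 1.0f);
      page.put("fetchTime", System.currentTimeMillis());
      return page;
    }
  }

Because the record is plain Avro, the same bytes can feed a MapReduce
job or go over Avro RPC to, say, a Python client - that is the "free"
part I mentioned.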

 * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.


Agreed. I would like to add support for Katta and other indexing
backends at some point, but NutchDocument should be our canonical
representation. The rest should be up to the indexing backends.

 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations, We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

 Plus a better handling of redirects, detecting duplicated sites,
 detection of spam cliques, tools to manage the webgraph, etc.


 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?

 Definitely. :)

 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Doğacan Güney


Re: Nutch 2.0 roadmap

2010-04-07 Thread Enis Söztutar

Hi,

On 04/07/2010 07:54 PM, Doğacan Güney wrote:

Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialeckia...@getopt.org  wrote:
   

On 2010-04-06 15:43, Julien Nioche wrote:
 

Hi guys,

I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
based on what is currently referred to as NutchBase. Shall we create a
branch for 2.0 in the Nutch SVN repository and have a label accordingly for
JIRA so that we can file issues / feature requests on 2.0? Do you think that
the current NutchBase could be used as a basis for the 2.0 branch?
   

I'm not sure what is the status of the nutchbase - it's missed a lot of
fixes and changes in trunk since it's been last touched ...

 

I know... But I still intend to finish it, I just need to schedule
some time for it.

My vote would be to go with nutchbase.
   


A suggestion would be to continue with trunk until nutchbase is stable.
Once it is, we can merge the nutchbase branch to trunk (after the 1.1
split), at which point trunk becomes nutchbase plus the other merged
issues. Then, when the time comes, we can fork branch-2.0 and release
when the blockers are done. I strongly advise against having both a
trunk and a 2.0 branch under development.


   

Talking about features, what else would we add apart from :

* support for HBase : via ORM or not (see NUTCH-808,
https://issues.apache.org/jira/browse/NUTCH-808)
   

This IMHO is promising, this could open the doors to small-to-medium
installations that are currently too cumbersome to handle.

 

Yeah, there is already a simple ORM within nutchbase that is
avro-based and should
be generic enough to also support MySQL, cassandra and berkeleydb. But
any good ORM will
be a very good addition.
   
The current ORM code is merged with the nutchbase code, but I think the
sooner we split it out the better, since development will be much clearer
and simpler that way. I have opened NUTCH-808 to explore the
alternatives, but we might as well continue with the current
implementation. I intend to share my findings in a couple of days.


   

* plugin cleanup : Tika only for parsing - get rid of everything else?
   

Basically, yes - keep only stuff like HtmlParseFilters (probably with a
different API) so that we can post-process the DOM created in Tika from
whatever original format.

Also, the goal of the crawler-commons project is to provide APIs and
implementations of stuff that is needed for every open source crawler
project, like: robots handling, url filtering and url normalization, URL
state management, perhaps deduplication. We should coordinate our
efforts, and share code freely so that other projects (bixo, heritrix,
droids) may contribute to this shared pool of functionality, much like
Tika does for the common need of parsing complex formats.

 


So it seems that at some point we need to bite the bullet and refactor
the plugins, dropping backwards compatibility.



* remove index / search and delegate to SOLR
   

+1 - we may still keep a thin abstract layer to allow other
indexing/search backends, but the current mess of indexing/query filters
and competing indexing frameworks (lucene, fields, solr) should go away.
We should go directly from DOM to a NutchDocument, and stop there.

 

Agreed. I would like to add support for katta and other indexing
backends at some point but
NutchDocument should be our canonical representation. The rest should
be up to indexing backends.

   

Regarding search - currently the search API is too low-level, with the
custom text and query analysis chains. This needlessly introduces the
(in)famous Nutch Query classes and Nutch query syntax limitations, We
should get rid of it and simply leave this part of the processing to the
search backend. Probably we will use the SolrCloud branch that supports
sharding and global IDF.

 

* new functionalities e.g. sitemap support, canonical tag etc...
   

Plus a better handling of redirects, detecting duplicated sites,
detection of spam cliques, tools to manage the webgraph, etc.

 

I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
update?
   

Definitely. :)

--
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


 



   




Re: Nutch 2.0 roadmap

2010-04-07 Thread Enis Söztutar
I forgot to say that at Hadoop it is the convention that big issues,
like the ones under discussion, come with a design document, so that a
solid design is agreed upon for the work. We can apply the same pattern
at Nutch.


On 04/07/2010 07:54 PM, Doğacan Güney wrote:

Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialeckia...@getopt.org  wrote:
   

On 2010-04-06 15:43, Julien Nioche wrote:
 

Hi guys,

I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
based on what is currently referred to as NutchBase. Shall we create a
branch for 2.0 in the Nutch SVN repository and have a label accordingly for
JIRA so that we can file issues / feature requests on 2.0? Do you think that
the current NutchBase could be used as a basis for the 2.0 branch?
   

I'm not sure what is the status of the nutchbase - it's missed a lot of
fixes and changes in trunk since it's been last touched ...

 

I know... But I still intend to finish it, I just need to schedule
some time for it.

My vote would be to go with nutchbase.

   

Talking about features, what else would we add apart from :

* support for HBase : via ORM or not (see NUTCH-808,
https://issues.apache.org/jira/browse/NUTCH-808)
   

This IMHO is promising, this could open the doors to small-to-medium
installations that are currently too cumbersome to handle.

 

Yeah, there is already a simple ORM within nutchbase that is
avro-based and should
be generic enough to also support MySQL, cassandra and berkeleydb. But
any good ORM will
be a very good addition.

   

* plugin cleanup : Tika only for parsing - get rid of everything else?
   

Basically, yes - keep only stuff like HtmlParseFilters (probably with a
different API) so that we can post-process the DOM created in Tika from
whatever original format.

Also, the goal of the crawler-commons project is to provide APIs and
implementations of stuff that is needed for every open source crawler
project, like: robots handling, url filtering and url normalization, URL
state management, perhaps deduplication. We should coordinate our
efforts, and share code freely so that other projects (bixo, heritrix,
droids) may contribute to this shared pool of functionality, much like
Tika does for the common need of parsing complex formats.

 

* remove index / search and delegate to SOLR
   

+1 - we may still keep a thin abstract layer to allow other
indexing/search backends, but the current mess of indexing/query filters
and competing indexing frameworks (lucene, fields, solr) should go away.
We should go directly from DOM to a NutchDocument, and stop there.

 

Agreed. I would like to add support for katta and other indexing
backends at some point but
NutchDocument should be our canonical representation. The rest should
be up to indexing backends.

   

Regarding search - currently the search API is too low-level, with the
custom text and query analysis chains. This needlessly introduces the
(in)famous Nutch Query classes and Nutch query syntax limitations, We
should get rid of it and simply leave this part of the processing to the
search backend. Probably we will use the SolrCloud branch that supports
sharding and global IDF.

 

* new functionalities e.g. sitemap support, canonical tag etc...
   

Plus a better handling of redirects, detecting duplicated sites,
detection of spam cliques, tools to manage the webgraph, etc.

 

I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
update?
   

Definitely. :)

--
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


 



   




Re: Nutch 2.0 roadmap

2010-04-07 Thread Andrzej Bialecki
On 2010-04-07 18:54, Doğacan Güney wrote:
 Hey everyone,
 
 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly for
 JIRA so that we can file issues / feature requests on 2.0? Do you think that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...

 
 I know... But I still intend to finish it, I just need to schedule
 some time for it.
 
 My vote would be to go with nutchbase.

Hmm... this puzzles me - do you think we should port changes from 1.1 to
nutchbase? I thought we should do it the other way around, i.e. merge the
nutchbase bits into trunk.


 * support for HBase : via ORM or not (see NUTCH-808,
 https://issues.apache.org/jira/browse/NUTCH-808)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.

 
 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

Again, the advantage of DataNucleus is that we don't have to handcraft
all the mid- to low-level mappings, just the mid-level ones (JDOQL or
whatever) - the cost of maintenance is lower, and the number of backends
that are supported out of the box is larger. Of course, this is just
IMHO - we won't know for sure until we try to use both your custom ORM
and DataNucleus...
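
To illustrate (a sketch against the standard JDO API that DataNucleus
implements; the WebPage class and its fields are made up): we annotate
the class once, query in JDOQL, and the low-level backend mapping comes
from configuration rather than hand-written code.

  import java.util.List;
  import javax.jdo.JDOHelper;
  import javax.jdo.PersistenceManager;
  import javax.jdo.PersistenceManagerFactory;
  import javax.jdo.Query;
  import javax.jdo.annotations.PersistenceCapable;
  import javax.jdo.annotations.PrimaryKey;

  @PersistenceCapable
  class WebPage {
    @PrimaryKey String url;
    float score;
    long fetchTime;
  }

  public class JdoSketch {
    public static void main(String[] args) {
      // Which backend to use (HBase, an RDBMS, ...) lives in the
      // properties file, not in the code.
      PersistenceManagerFactory pmf =
          JDOHelper.getPersistenceManagerFactory("datanucleus.properties");
      PersistenceManager pm = pmf.getPersistenceManager();
      Query q = pm.newQuery(WebPage.class, "score > 0.5"); // JDOQL filter
      List results = (List) q.execute();
      pm.close();
    }
  }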

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch 2.0 roadmap

2010-04-07 Thread Andrzej Bialecki
On 2010-04-07 19:24, Enis Söztutar wrote:

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

  
 
 So, it seems that at some point, we need to bite the bullet, and
 refactor plugins, dropping backwards compatibility.

Right, that was my point - now is the time to break it, with the
cut-over to 2.0, while leaving the 1.1 branch in good shape to serve
well enough in the interim period.


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch 2.0 roadmap

2010-04-07 Thread MilleBii
Just a question: will the new HBase implementation allow more
sophisticated crawling strategies than the current score-based one?

Let me give you a few examples of what I'd like to do:
Define different crawling frequencies for different sets of URLs, say
weekly for some URLs, monthly or longer for others.

Select URLs to re-crawl based on attributes previously extracted. Just
one example: recrawl URLs that contained a certain keyword (or set of
keywords).

Select URLs that have not yet been crawled, i.e. at the frontier of
the crawl.




2010/4/7, Doğacan Güney doga...@gmail.com:
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly
 for
 JIRA so that we can file issues / feature requests on 2.0? Do you think
 that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.


 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see NUTCH-808,
 https://issues.apache.org/jira/browse/NUTCH-808)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.


 Agreed. I would like to add support for katta and other indexing
 backends at some point but
 NutchDocument should be our canonical representation. The rest should
 be up to indexing backends.

 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations, We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

 Plus a better handling of redirects, detecting duplicated sites,
 detection of spam cliques, tools to manage the webgraph, etc.


 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?

 Definitely. :)

 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





 --
 Doğacan Güney



-- 
-MilleBii-


Nutch 2.0 roadmap

2010-04-06 Thread Julien Nioche
Hi guys,

I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be
based on what is currently referred to as NutchBase. Shall we create a
branch for 2.0 in the Nutch SVN repository and have a label accordingly for
JIRA so that we can file issues / feature requests on 2.0? Do you think that
the current NutchBase could be used as a basis for the 2.0 branch?

Talking about features, what else would we add apart from :

* support for HBase : via ORM or not (see NUTCH-808,
https://issues.apache.org/jira/browse/NUTCH-808)
* plugin cleanup : Tika only for parsing - get rid of everything else?
* remove index / search and delegate to SOLR
* new functionalities e.g. sitemap support, canonical tag etc...

I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
update?

I look forward to hearing your thoughts on this.

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


Re: Nutch 2.0 roadmap

2010-04-06 Thread Andrzej Bialecki
On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,
 
 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly for
 JIRA so that we can file issues / feature requests on 2.0? Do you think that
 the current NutchBase could be used as a basis for the 2.0 branch?

I'm not sure what the status of the nutchbase is - it has missed a lot
of fixes and changes in trunk since it was last touched ...

 
 Talking about features, what else would we add apart from :
 
 * support for HBase : via ORM or not (see NUTCH-808,
 https://issues.apache.org/jira/browse/NUTCH-808)

This IMHO is promising, this could open the doors to small-to-medium
installations that are currently too cumbersome to handle.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

Basically, yes - keep only stuff like HtmlParseFilters (probably with a
different API) so that we can post-process the DOM created in Tika from
whatever original format.
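
As a strawman for that "different API" (everything below is invented,
not an existing interface): a post-parse filter would get the DOM that
Tika built and write straight into the document we index.

  import org.w3c.dom.Element;
  import org.w3c.dom.NodeList;

  // Minimal stand-in for Nutch's indexing document class.
  class NutchDocument {
    void add(String field, String value) { /* collect field values */ }
  }

  // Hypothetical replacement for HtmlParseFilter.
  interface DomFilter {
    void filter(Element root, NutchDocument doc);
  }

  // Example: copy <link rel="canonical" href="..."> into a field,
  // which would also cover the canonical-tag item mentioned below.
  class CanonicalTagFilter implements DomFilter {
    public void filter(Element root, NutchDocument doc) {
      NodeList links = root.getElementsByTagName("link");
      for (int i = 0; i < links.getLength(); i++) {
        Element link = (Element) links.item(i);
        if ("canonical".equalsIgnoreCase(link.getAttribute("rel"))) {
          doc.add("canonical", link.getAttribute("href"));
        }
      }
    }
  }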

Also, the goal of the crawler-commons project is to provide APIs and
implementations of stuff that is needed for every open source crawler
project, like: robots handling, url filtering and url normalization, URL
state management, perhaps deduplication. We should coordinate our
efforts, and share code freely so that other projects (bixo, heritrix,
droids) may contribute to this shared pool of functionality, much like
Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

+1 - we may still keep a thin abstract layer to allow other
indexing/search backends, but the current mess of indexing/query filters
and competing indexing frameworks (lucene, fields, solr) should go away.
We should go directly from DOM to a NutchDocument, and stop there.

Regarding search - currently the search API is too low-level, with the
custom text and query analysis chains. This needlessly introduces the
(in)famous Nutch Query classes and Nutch query syntax limitations. We
should get rid of it and simply leave this part of the processing to the
search backend. Probably we will use the SolrCloud branch that supports
sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

Plus a better handling of redirects, detecting duplicated sites,
detection of spam cliques, tools to manage the webgraph, etc.

 
 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?

Definitely. :)

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com