[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227859#comment-13227859
 ] 

Mathijs Homminga edited comment on NUTCH-882 at 3/12/12 8:29 PM:
-----------------------------------------------------------------

Hi guys,

I have second thoughts on implementing the NutchContext concept at this stage.

All Nutch processes are centered around the concept of a WebPage. And I agree, 
many of these processes and their plugins might benefit from additional input 
that is related to, but not directly part of, a WebPage, such as host statistics, 
metadata, or domain information.

The proposed NutchContext solution is elegant in that it makes this 
additional information available to plugins in an extensible way. 
However, it does require a big API break for plugins (since we don't use 
abstract base classes for all the plugins, we can't absorb the change there 
to keep them compatible).
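
To illustrate the compatibility point, here is a rough sketch of how an abstract base class could have absorbed such a signature change if plugins extended one. All names below (WebPageSketch, ContextSketch, IndexingFilterSketch and so on) are invented for the example; they are not the actual Nutch extension points or the NutchContext from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a WebPage; only carries a URL for this example.
final class WebPageSketch {
    final String url;
    WebPageSketch(String url) { this.url = url; }
}

// Hypothetical stand-in for the proposed context object (e.g. host metadata).
final class ContextSketch {
    final Map<String, String> hostMetadata;
    ContextSketch(Map<String, String> hostMetadata) { this.hostMetadata = hostMetadata; }
}

// Hypothetical new-style extension point: the context is an extra argument.
interface IndexingFilterSketch {
    String filter(WebPageSketch page, ContextSketch context);
}

// If plugins extended a base class like this one, it could adapt the new
// two-argument call to the old one-argument method, so existing plugins
// would keep compiling unchanged.
abstract class BaseIndexingFilterSketch implements IndexingFilterSketch {
    // Old-style method that legacy plugins already implement.
    protected abstract String filter(WebPageSketch page);

    @Override
    public String filter(WebPageSketch page, ContextSketch context) {
        // Default bridge: ignore the context and delegate to the old method.
        return filter(page);
    }
}

// A legacy plugin that knows nothing about the context.
final class LegacyPluginSketch extends BaseIndexingFilterSketch {
    @Override
    protected String filter(WebPageSketch page) {
        return "indexed:" + page.url;
    }
}
```

Plugins that implement the interface directly have no such bridge, which is why changing the interface signature breaks them.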

I'm afraid that a patch that tries to implement the Host table and the 
NutchContext at the same time will have a hard time making it into the 
repository ;)

I propose moving the NutchContext approach to a new issue.
Plugins and other components can still access host information by using the 
HostDB class directly to perform efficient host lookups when needed. We can 
then decide later whether to make this part of the NutchContext.
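
As a minimal sketch of what such a direct lookup could look like from a plugin: the classes below (HostInfo, HostDbSketch, PolitenessPluginSketch) and their fields are hypothetical, backed by an in-memory map rather than GORA, and are not the HostDB API from the attached patches.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical per-host record; the real Host schema is whatever the patch defines.
final class HostInfo {
    final int maxFetchThreads;
    final long minDelayMillis;

    HostInfo(int maxFetchThreads, long minDelayMillis) {
        this.maxFetchThreads = maxFetchThreads;
        this.minDelayMillis = minDelayMillis;
    }
}

// Hypothetical stand-in for the HostDB class: lookups keyed by host name.
final class HostDbSketch {
    private final Map<String, HostInfo> byHost = new HashMap<>();

    void put(String host, HostInfo info) {
        byHost.put(host.toLowerCase(Locale.ROOT), info);
    }

    // Extracts the host from the URL and returns its record, if any.
    Optional<HostInfo> lookup(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(byHost.get(host.toLowerCase(Locale.ROOT)));
    }
}

// Hypothetical plugin that consults the host DB itself instead of
// receiving host data through a context object.
final class PolitenessPluginSketch {
    static final long DEFAULT_DELAY_MILLIS = 1000L;

    private final HostDbSketch hostDb;

    PolitenessPluginSketch(HostDbSketch hostDb) {
        this.hostDb = hostDb;
    }

    // Uses the host-specific delay when one is stored, else the default.
    long delayFor(String url) {
        return hostDb.lookup(url)
                .map(h -> h.minDelayMillis)
                .orElse(DEFAULT_DELAY_MILLIS);
    }
}
```

The plugin's own interface stays untouched here; only plugins that need host data take a dependency on the lookup class.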

Agree?





                
> Design a Host table in GORA
> ---------------------------
>
>                 Key: NUTCH-882
>                 URL: https://issues.apache.org/jira/browse/NUTCH-882
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: nutchgora
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: nutchgora
>
>         Attachments: NUTCH-882-v1.patch, hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and 
> domains?) would be very useful for:
> * customising fetching behaviour on a per-host basis, e.g. number of 
> threads, min time between threads, etc.
> * storing stats
> * keeping metadata and possibly propagating it to the webpages 
> * keeping a copy of the robots.txt and possibly using it later to filter the 
> webtable
> * storing sitemap files and updating the webtable accordingly
> I'll try to come up with a GORA schema for such a host table, but any comments 
> are of course already welcome.
