Nutch caches the parsed robots rules internally (it uses a hash map) during
each round, so it fetches the robots file for a particular host only once in
a given round. This model works well. If you create a separate db for the
rules, you have to ensure that it is updated frequently enough to pick up
any changes made on the server.
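
To make the caching idea concrete, here is a rough sketch in Python rather
than Nutch's own Java code. It is not Nutch's implementation. The class name
RobotsCache, the fetch_robots_txt callable and the "my-crawler" agent string
are made up for illustration; only urllib.robotparser is a real library.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Per-round cache of parsed robots rules, keyed by host (illustrative)."""

    def __init__(self, fetch_robots_txt):
        # fetch_robots_txt is a hypothetical callable: host -> robots.txt text.
        self._fetch = fetch_robots_txt
        self._rules = {}  # host -> parsed RobotFileParser

    def is_allowed(self, url, agent="my-crawler"):
        host = urlsplit(url).netloc.lower()
        rules = self._rules.get(host)
        if rules is None:
            # First URL seen for this host in this round: fetch and parse once.
            rules = RobotFileParser()
            rules.parse(self._fetch(host).splitlines())
            self._rules[host] = rules
        return rules.can_fetch(agent, url)

    def start_new_round(self):
        # Dropping the cache each round is what picks up server-side changes.
        self._rules.clear()
```

Because the cache is emptied at the start of every round, a changed
robots.txt is seen on the next round without any extra bookkeeping. A
separate persistent db would need its own refresh policy to get the same
effect.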

On Tue, Mar 5, 2013 at 7:15 AM, Raja Kulasekaran <[email protected]> wrote:

> Hi,
>
> I meant moving the entire crawl process into the client environment,
> creating a "robots.db", and fetching only robots.db as indexed data.
>
> Raja
>
> On Tue, Mar 5, 2013 at 8:27 PM, Tejas Patil <[email protected]> wrote:
>
> > robots.txt is a global standard accepted by everyone. Even Google and
> > Bing use it. I don't think that there is any db file format maintained
> > by web servers for the robots information.
> >
> >
> > On Tue, Mar 5, 2013 at 1:29 AM, Raja Kulasekaran <[email protected]>
> > wrote:
> >
> > > Hi
> > >
> > > Instead of parsing the robots.txt file, why not ask the web hoster or
> > > web administrator to create the complete parsed text in a db file
> > > format at the robots.txt location itself?
> > >
> > > Is there any standard protocol for this? It would be a better idea to
> > > stop transferring data through crawlers.
> > >
> > > Please let me know your thoughts on the same.
> > >
> > > Raja
> > >
> >
>
