Re: support for robot rules that include a wild card

2009-11-19 Thread Ken Krugler

Hi Jason,

I've been spending some time on an improved robots.txt parser, as part  
of my Bixo project.


One aspect is support for Google wildcard extensions.
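
For anyone curious how that could look in code, here's a minimal
sketch (the class and method names are mine, not the actual Bixo or
crawler-commons API): translate a rule path into a regex, where '*'
matches any run of characters and a trailing '$' anchors the rule at
the end of the URL, per the Google extensions.

import java.util.regex.Pattern;

public class WildcardRuleMatcher {

    // Compile a robots.txt rule path into a regex. Per the Google
    // extensions, '*' matches any sequence of characters and a
    // trailing '$' anchors the rule at the end of the URL; all other
    // characters are matched literally.
    public static Pattern compileRule(String rulePath) {
        StringBuilder regex = new StringBuilder("^");
        for (int i = 0; i < rulePath.length(); i++) {
            char c = rulePath.charAt(i);
            if (c == '*') {
                regex.append(".*");
            } else if (c == '$' && i == rulePath.length() - 1) {
                regex.append('$');
            } else {
                // Quote everything else so regex metacharacters
                // like '?' and '.' match literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // As with plain rules, a wildcard rule matches any URL path that
    // starts with it (unless the rule is '$'-anchored, in which case
    // the trailing '$' in the regex forces a match to the end).
    public static boolean matches(String rulePath, String urlPath) {
        return compileRule(rulePath).matcher(urlPath).lookingAt();
    }
}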

I think this will be part of the proposed "crawler-commons" project  
where we'll put components that can/should be shared between Nutch,  
Bixo, Heritrix and Droids.


One thing that would be useful is to collect examples of "advanced"  
robots.txt files, in addition to broken ones.


It would be great if you could open a Jira issue and attach specific  
examples of the above that you know about.


Thanks!

-- Ken


On Nov 19, 2009, at 11:31am, J.G.Konrad wrote:


I'm using nutch-1.0 and have noticed after running some tests that the
robot rules parser does not support wildcards (a.k.a. globbing) in
rules. This means a rule will not work as the person who wrote the
robots.txt file expected.  For example:

User-Agent: *
Disallow: /somepath/*/someotherpath

Even Yahoo has one such rule ( http://m.www.yahoo.com/robots.txt ):
User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?
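
To make that concrete, here's how the two example rules behave under
literal prefix matching versus a wildcard-aware matcher (using the
WildcardRuleMatcher sketch above; the URLs are made up):

// With literal prefix matching, neither wildcard rule ever fires --
// no real URL path starts with "/*?" or contains a literal '*':
System.out.println("/search?p=nutch".startsWith("/*?"));      // false
System.out.println("/somepath/a/b/someotherpath"
    .startsWith("/somepath/*/someotherpath"));                // false

// A wildcard-aware matcher does what the file's author intended:
System.out.println(WildcardRuleMatcher.matches(
    "/*?", "/search?p=nutch"));                               // true
System.out.println(WildcardRuleMatcher.matches(
    "/somepath/*/someotherpath",
    "/somepath/a/b/someotherpath"));                          // true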

With the popularity of the wildcard (*) in robots.txt files these days,
what are the plans/thoughts on adding support for it in Nutch?

Thanks,
 Jason



Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





