On Jul 11, 2011, at 4:22 PM, Markus Jelsma wrote:

> Can't you use a robots.txt instead?
I have no control over the robots.txt. This is a client of ours with a poor URL schema; they would have to put 6 million entries into their robots.txt to exclude our crawler. That leaves <meta> directives as the next logical place to exclude specific pages (CMS templates make this easy).

I would love to not have this problem. But alas, since the client's URL schema is very poor, we cannot effectively use robots.txt or any of the regular expression filters. I am comfortable patching parse-html for our needs. I was just wondering whether someone else has already done this and whether it would be a worthwhile contribution back to the Nutch community at large.

Blessings,
TwP

>> Currently Nutch supports the <meta name="robots"> directive in the head of
>> individual pages. I would like to extend this feature to allow the
>> "http.agent.name" as a valid name in addition to the "robots" directive.
>> For example, if your nutch-site.xml file has the property
>>
>>   <property>
>>     <name>http.agent.name</name>
>>     <value>examplebot</value>
>>   </property>
>>
>> then any pages with the meta directive
>>
>>   <meta name="examplebot" content="noindex,nofollow" />
>>
>> in the head would cause Nutch to not index the page and to not follow
>> links on the page.
>>
>> So, the place to put this code is in the parse-html plugin and the
>> parse-tika plugin. Both of these have an HTMLMetaProcessor class that
>> would need to be updated.
>>
>> 1) Has someone already written this patch?
>>
>> 2) Should the property "http.robots.agents" also be included as is done
>>    in the lib-http RobotRulesParser class?
>>
>> Blessings,
>> TwP
>>
>> P.S. There is precedence for such a change, as the googlebot already
>> parses these types of meta directives
>> &from=61050&rd=1>
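For anyone considering the same patch, the core check is small. Below is a minimal, self-contained sketch of the idea, not the actual Nutch HTMLMetaProcessor code: the class and method names here are invented for illustration, and it assumes the configured agent name (e.g. "examplebot") has already been read from nutch-site.xml. It treats a <meta> tag as a directive source when its name is either "robots" or the agent name, then scans the content attribute for noindex/nofollow tokens.

```java
import java.util.Locale;

// Illustrative sketch only -- not the real HTMLMetaProcessor API.
public class MetaDirectiveCheck {

    /**
     * Decide whether a single meta tag requests noindex and/or nofollow.
     *
     * @param metaName    value of the meta tag's name attribute
     * @param metaContent value of the meta tag's content attribute
     * @param agentName   configured http.agent.name (e.g. "examplebot")
     * @return two booleans: { noIndex, noFollow }
     */
    public static boolean[] directivesFor(String metaName, String metaContent,
                                          String agentName) {
        boolean noIndex = false;
        boolean noFollow = false;
        String name = metaName == null ? "" : metaName.toLowerCase(Locale.ROOT);

        // Honor both the standard "robots" name and the crawler's own agent name.
        if (name.equals("robots") || name.equals(agentName.toLowerCase(Locale.ROOT))) {
            String content = metaContent == null ? "" : metaContent.toLowerCase(Locale.ROOT);
            for (String token : content.split("[,\\s]+")) {
                if (token.equals("noindex")) {
                    noIndex = true;
                } else if (token.equals("nofollow")) {
                    noFollow = true;
                } else if (token.equals("none")) {
                    // "none" is shorthand for noindex,nofollow.
                    noIndex = true;
                    noFollow = true;
                }
            }
        }
        return new boolean[] { noIndex, noFollow };
    }
}
```

A real patch would wire this logic into the DOM walk that the parse-html and parse-tika HTMLMetaProcessor classes already do, and (per question 2 in the original mail) could also match any of the names listed in "http.robots.agents" rather than a single agent name.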

