RE: Crawler submits forms?
Thanks for these various responses.

I agree that I should be checking input more carefully and will do so. In my experience most developers find it useful to allow both GET and POST input, so I would prefer not to deny GET requests. But I do agree with Doug's fix to stop the crawler following POST links, as the recommendation is that POST requests are used where side-effects are likely (see http://www.w3.org/2001/tag/doc/whenToUseGet.html#checklist). I assume this fix will make it into 0.7.2 some time, if I don't want to build from CVS.

I'm not quite sure Jack's response about Stanford's HiWE search engine was a direct answer to my question, but it does raise the question of whether some applications will always think there are valid reasons to submit form POSTs in an effort to discover "the hidden web".

This seems very reminiscent of the Google Web Accelerator saga earlier this year (e.g. see http://www.sitepoint.com/newsletter/viewissue.php?id=3&issue=113&format=html), although that caused problems even with hrefs with side-effects (bad idea!) but usually only when users are logged in.

Andy Read
www.azurite.co.uk
Re: Crawler submits forms?
And please note the mail from Doug on Nov 23.

-
Title: [Fwd: Spider Causing Contact Form Submissions]
Body:
It looks as though Nutch is inadvertently submitting forms.

At DOMContentUtils.java:58 we specify that the "action" parameter of an HTML form should be extracted as a link. Yet we ignore the "method" parameter of the form. I think we should only follow these when the method is "get", not when it is "post". Do others agree?

Doug
---

I think the source code in svn ignores POST URLs now.

/Jack

On 12/14/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi
>
> You can read the article about Stanford's HiWE search engine on www10.org.
> And it is easy to extend Nutch if you are using http-client protocol.
>
> http://www10.org/cdrom/posters/p1049/
>
> Good luck:)
>
> /Jack
>
> On 12/14/05, Andy Read <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I'm using nutch to create a site search facility for a couple of sites.
> >
> > I upgraded from 0.6 to 0.7.1 a few days ago and have just noticed that blank
> > users are being registered on my site at the exact times the cron job runs
> > the crawl tool to re-index the site. This means that the crawler is now
> > submitting a post request from the registration form! Is this a new
> > 'feature' of 0.7 or 0.7.1? I can't find any mention in changes.txt and I
> > can't find any config option referring to it. Surely the crawler should
> > never submit form input?
> >
> > Any help appreciated.
> >
> > Thanks,
> >
> > Andy Read
> >
> > www.azurite.co.uk
> >
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
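Doug's proposal quoted above (extract a form's "action" as a link only when its method is "get") could look roughly like the sketch below. This is not the actual Nutch code: the class and method names are made up for illustration, it uses the JDK's XML parser on well-formed markup instead of Nutch's HTML parser, and it assumes a form with no "method" attribute defaults to GET, as HTML specifies.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not the real DOMContentUtils.
class FormLinkSketch {

    /** A form is followable only when its method is GET (the HTML default). */
    static boolean isFollowableForm(Element form) {
        // getAttribute returns "" when the attribute is absent.
        String method = form.getAttribute("method").trim();
        return method.isEmpty() || method.equalsIgnoreCase("get");
    }

    /** Walks the DOM in document order, collecting outlinks. */
    static void collectLinks(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String tag = el.getTagName();
            if (tag.equalsIgnoreCase("a") && el.hasAttribute("href")) {
                out.add(el.getAttribute("href"));
            } else if (tag.equalsIgnoreCase("form")
                    && el.hasAttribute("action")
                    && isFollowableForm(el)) {
                // POST forms are skipped: they are likely to have side effects.
                out.add(el.getAttribute("action"));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectLinks(children.item(i), out);
        }
    }

    /** Parses well-formed (X)HTML and returns the links a crawler may follow. */
    static List<String> extractLinks(String xhtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            List<String> links = new ArrayList<>();
            collectLinks(root, links);
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"/about\">About</a>"
                + "<form action=\"/search\" method=\"get\"></form>"
                + "<form action=\"/register\" method=\"POST\"></form>"
                + "</body></html>";
        System.out.println(extractLinks(page)); // prints [/about, /search]
    }
}
```

The registration form's action is never queued, while the GET search form and the ordinary href still are.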
Re: Crawler submits forms?
On Tue, 2005-12-13 at 16:57 +, Andy Read wrote:
> Hi,
>
> I'm using nutch to create a site search facility for a couple of sites.
>
> I upgraded from 0.6 to 0.7.1 a few days ago and have just noticed that blank
> users are being registered on my site at the exact times the cron job runs
> the crawl tool to re-index the site. This means that the crawler is now
> submitting a post request from the registration form! Is this a new
> 'feature' of 0.7 or 0.7.1? I can't find any mention in changes.txt and I
> can't find any config option referring to it. Surely the crawler should
> never submit form input?

Nutch follows links. You can argue that it should not extract links from POST style forms (this change has been made), but in the end it doesn't make much of a difference: if you link to that script in any way (a href, etc.) it will be followed and give you the same results.

Your registration form script is broken for accepting invalid input (or GET requests at all), and robots.txt should be used to protect dynamic areas from inadvertent uses.

--
Rod Taylor <[EMAIL PROTECTED]>
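Rod's point about the registration script can be sketched as a small server-side guard. Nothing here comes from Andy's site: the class name, the required field names ("username", "email") and the use of plain HTTP status codes are all assumptions made for illustration. The idea is only that a request with the wrong method or with blank fields is refused before any user record is created.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical guard for a registration handler; all names are assumptions.
class RegistrationGuardSketch {

    // Assumed required form fields.
    static final String[] REQUIRED = {"username", "email"};

    /** Returns the HTTP status the handler should answer with. */
    static int check(String httpMethod, Map<String, String> params) {
        // Registration has side effects, so only POST is accepted.
        if (!"POST".equalsIgnoreCase(httpMethod)) {
            return 405; // Method Not Allowed
        }
        // Reject missing or blank fields instead of creating a blank user.
        for (String field : REQUIRED) {
            String value = params.get(field);
            if (value == null || value.trim().isEmpty()) {
                return 400; // Bad Request
            }
        }
        return 200; // safe to go on and create the user
    }

    public static void main(String[] args) {
        Map<String, String> empty = new HashMap<>();
        // A crawler following the form's action as a plain link sends GET with no fields.
        System.out.println(check("GET", empty));  // prints 405
        // An empty form submission is also refused.
        System.out.println(check("POST", empty)); // prints 400
    }
}
```

A site that wants to keep accepting GET, as Andy prefers, could drop the method check; the blank-field check alone would still have stopped the empty registrations.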
Re: Crawler submits forms?
Hi

You can read the article about Stanford's HiWE search engine on www10.org. And it is easy to extend Nutch if you are using http-client protocol.

http://www10.org/cdrom/posters/p1049/

Good luck:)

/Jack

On 12/14/05, Andy Read <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm using nutch to create a site search facility for a couple of sites.
>
> I upgraded from 0.6 to 0.7.1 a few days ago and have just noticed that blank
> users are being registered on my site at the exact times the cron job runs
> the crawl tool to re-index the site. This means that the crawler is now
> submitting a post request from the registration form! Is this a new
> 'feature' of 0.7 or 0.7.1? I can't find any mention in changes.txt and I
> can't find any config option referring to it. Surely the crawler should
> never submit form input?
>
> Any help appreciated.
>
> Thanks,
>
> Andy Read
>
> www.azurite.co.uk
>

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars