Re: Following form action tags
Chris Schneider wrote:

> Gang,
>
> I had a webmaster complain that our crawler was following his form action links. Although he admits that his use of the GET method is a bit unorthodox, he feels strongly that form submissions with input fields shouldn't be followed by crawlers. Would it make sense to modify the HTML parser so that it checked to see whether such input fields exist before following form action links?

I read through your email exchange, and setting aside all emotional content I think this is a valid request - indeed, as far as I can tell other major crawlers don't follow these links. We could either remove this, or make it optional (default not to use them).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
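
For illustration only, here is a minimal sketch of the check proposed in the message above: follow a form's action link only when the form contains no input fields. It is written in Python with the standard library's `html.parser`, not against Nutch's Java HTML parser. The names `FormActionCollector` and `followable_form_actions`, and the choice to treat `input`, `select`, and `textarea` as "input fields", are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormActionCollector(HTMLParser):
    """Collects form action URLs, skipping forms that contain input fields."""

    # Assumption: these tags are what counts as an "input field".
    FIELD_TAGS = {"input", "select", "textarea"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.followable = []      # actions of forms that have no fields
        self._in_form = False
        self._action = None       # action attribute of the open form
        self._has_fields = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._in_form = True
            self._action = dict(attrs).get("action")
            self._has_fields = False
        elif self._in_form and tag in self.FIELD_TAGS:
            self._has_fields = True

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            # Only keep the action when the form had no input fields.
            if self._action and not self._has_fields:
                self.followable.append(urljoin(self.base_url, self._action))
            self._in_form = False


def followable_form_actions(html, base_url):
    """Return absolute action URLs of forms that contain no input fields."""
    parser = FormActionCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.followable
```

A form that is never closed is simply not followed, which errs on the side of the webmaster's request. Making the behavior optional, as suggested in the reply, would amount to guarding the call with a configuration flag that defaults to off.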
[jira] Created: (NUTCH-271) Meta-data per URL/site/section
Meta-data per URL/site/section
------------------------------

Key: NUTCH-271
URL: http://issues.apache.org/jira/browse/NUTCH-271
Project: Nutch
Type: New Feature
Versions: 0.7.2
Reporter: Stefan Neufeind

We have the need to index sites and attach additional meta-data-tags to them. Afaik this is not yet possible, or is there a workaround I don't see? What I think of is using meta-tags per start-url, only indexing content below that URL, and have the ability to limit searches upon those meta-tags. E.g.

http://www.example1.com/something1/ - meta-tag companybranch1
http://www.example2.com/something2/ - meta-tag companybranch2
http://www.example3.com/something3/ - meta-tag companybranch1
http://www.example4.com/something4/ - meta-tag companybranch3

search for everything in companybranch1 or across 1 and 3 or similar

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
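
To make the request concrete, the scheme described in the issue can be sketched as a lookup from start URL to meta-tag, with searches restricted to a set of tags. This is a plain Python illustration, not Nutch code. The names `START_URL_TAGS`, `tag_for`, and `search` are hypothetical, and longest-prefix matching is an assumption about how "content below that URL" would be decided.

```python
# Start URLs and their meta-tags, copied from the example in the issue.
START_URL_TAGS = {
    "http://www.example1.com/something1/": "companybranch1",
    "http://www.example2.com/something2/": "companybranch2",
    "http://www.example3.com/something3/": "companybranch1",
    "http://www.example4.com/something4/": "companybranch3",
}


def tag_for(url):
    """Return the tag of the longest start URL that prefixes `url`, or None."""
    matches = [prefix for prefix in START_URL_TAGS if url.startswith(prefix)]
    if not matches:
        return None  # URL is not below any start URL, so it is not indexed
    return START_URL_TAGS[max(matches, key=len)]


def search(urls, tags):
    """Keep only the URLs whose start-URL tag is one of `tags`."""
    return [url for url in urls if tag_for(url) in tags]
```

Searching "across 1 and 3" then corresponds to `search(urls, {"companybranch1", "companybranch3"})`.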
[jira] Commented: (NUTCH-271) Meta-data per URL/site/section
[ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412435 ]

Gal Nitzan commented on NUTCH-271:

This functionality is already available in Nutch-0.8
[jira] Commented: (NUTCH-271) Meta-data per URL/site/section
[ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412436 ]

Gal Nitzan commented on NUTCH-271:

Sorry for the short comment. Actually the meta tags functionality is already available in the 0.8 version along with a CrawlDatum object. You can build the required functionality just by developing plugins for parsing, indexing, and querying.

HTH.