Re: Following form action tags

2006-05-18 Thread Andrzej Bialecki

Chris Schneider wrote:

Gang,

I had a webmaster complain that our crawler was following his form action links. 
Although he admits that his use of the GET method is a bit unorthodox, he feels strongly 
that form submissions with input fields shouldn't be followed by crawlers. Would it make 
sense to modify the HTML parser so that it checked to see whether such input fields exist 
before following form action links?

  


I read through your email exchange, and setting aside all the emotional 
content, I think this is a valid request. Indeed, as far as I can tell, 
other major crawlers don't follow these links. We could either remove 
form-action following entirely or make it optional (off by default).
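
To make the proposal concrete, here is a minimal sketch of the check Chris
describes: collect a form's action URL as an outlink only when the form
contains no data-entry fields, with a switch for the optional behaviour.
This is not Nutch's HTML parser; it is a standalone Python illustration.
The names FormActionExtractor, form_outlinks and follow_forms_with_fields
are hypothetical, and treating submit/button/reset/image inputs as
"not input fields" is an assumption of this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: these <input> types carry no user data, so they do not count
# as "input fields" when deciding whether to follow a form action.
NON_DATA_INPUT_TYPES = {"submit", "button", "reset", "image"}


def _is_data_field(tag, attrs):
    """Return True if the element lets a user supply data to the form."""
    if tag in ("select", "textarea"):
        return True
    if tag == "input":
        input_type = (dict(attrs).get("type") or "text").lower()
        return input_type not in NON_DATA_INPUT_TYPES
    return False


class FormActionExtractor(HTMLParser):
    """Collect form action URLs, skipping forms that contain data fields."""

    def __init__(self, base_url, follow_forms_with_fields=False):
        super().__init__()
        self.base_url = base_url
        self.follow_forms_with_fields = follow_forms_with_fields
        self.outlinks = []
        self._in_form = False
        self._action = None
        self._has_fields = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._finish_form()  # tolerate a missing </form>
            self._in_form = True
            self._action = dict(attrs).get("action")
        elif self._in_form and _is_data_field(tag, attrs):
            self._has_fields = True

    def handle_endtag(self, tag):
        if tag == "form":
            self._finish_form()

    def close(self):
        super().close()
        self._finish_form()  # flush a form left open at end of input

    def _finish_form(self):
        # Decide about the form only once all of its children have been seen.
        if self._in_form and self._action:
            if self.follow_forms_with_fields or not self._has_fields:
                self.outlinks.append(urljoin(self.base_url, self._action))
        self._in_form = False
        self._action = None
        self._has_fields = False


def form_outlinks(html, base_url, follow_forms_with_fields=False):
    """Return the form action URLs a crawler would follow for this page."""
    parser = FormActionExtractor(base_url, follow_forms_with_fields)
    parser.feed(html)
    parser.close()
    return parser.outlinks
```

With the default setting, a search form with a text box is skipped while a
bare "Next" button form is still followed; passing
follow_forms_with_fields=True restores the follow-everything behaviour.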


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Created: (NUTCH-271) Meta-data per URL/site/section

2006-05-18 Thread Stefan Neufeind (JIRA)
Meta-data per URL/site/section
--

 Key: NUTCH-271
 URL: http://issues.apache.org/jira/browse/NUTCH-271
 Project: Nutch
Type: New Feature

Versions: 0.7.2
Reporter: Stefan Neufeind


We need to index sites and attach additional meta-data tags to them. 
AFAIK this is not yet possible, or is there a workaround I don't see? What I 
have in mind is assigning meta-tags per start URL, indexing only content below 
that URL, and being able to restrict searches by those meta-tags. E.g.

http://www.example1.com/something1/   - meta-tag companybranch1
http://www.example2.com/something2/   - meta-tag companybranch2
http://www.example3.com/something3/   - meta-tag companybranch1
http://www.example4.com/something4/   - meta-tag companybranch3

Then search for everything in companybranch1, or across branches 1 and 3, or similar.
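
A minimal sketch of the idea, assuming tags are assigned by longest
matching start-URL prefix. It is plain Python, not Nutch code; the names
START_URL_TAGS, tag_for_url and filter_by_tags are hypothetical, and the
mapping simply copies the example URLs listed here.

```python
# Hypothetical start-URL -> meta-tag mapping, taken from the example URLs.
START_URL_TAGS = {
    "http://www.example1.com/something1/": "companybranch1",
    "http://www.example2.com/something2/": "companybranch2",
    "http://www.example3.com/something3/": "companybranch1",
    "http://www.example4.com/something4/": "companybranch3",
}


def tag_for_url(url, start_url_tags=START_URL_TAGS):
    """Return the tag of the longest start URL that prefixes url, or None."""
    matches = [prefix for prefix in start_url_tags if url.startswith(prefix)]
    if not matches:
        return None  # content outside every start URL is left untagged
    return start_url_tags[max(matches, key=len)]


def filter_by_tags(urls, wanted_tags, start_url_tags=START_URL_TAGS):
    """Keep only the URLs whose tag is one of wanted_tags."""
    return [u for u in urls if tag_for_url(u, start_url_tags) in wanted_tags]
```

Searching "across 1 and 3" then amounts to calling filter_by_tags with
{"companybranch1", "companybranch3"}; in a real index the tag would be
stored as a field at indexing time and used as a query filter.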

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-271) Meta-data per URL/site/section

2006-05-18 Thread Gal Nitzan (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412435 ]

Gal Nitzan commented on NUTCH-271:
--

This functionality is already available in Nutch 0.8.





[jira] Commented: (NUTCH-271) Meta-data per URL/site/section

2006-05-18 Thread Gal Nitzan (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412436 ]

Gal Nitzan commented on NUTCH-271:
--

Sorry for the short comment.

Actually, the meta-tag functionality is already available in version 0.8, 
along with a CrawlDatum object.

You can build the required functionality just by developing plugins for 
parsing, indexing, and querying.

HTH.

