[jira] Commented: (NUTCH-800) Generator builds a URL list that is not encoded

2010-03-29 Thread Jesse Campbell (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851089#action_12851089
 ] 

Jesse Campbell commented on NUTCH-800:
--

As it stands, badly encoded URLs cause the crawler to break (with
exceptions).
This tells me that it is not parsing the URL string properly, which makes me
question whether there *could* be code injection...
Where I work, we try to be defensive: anything that comes from an outside
source (in this case, URLs either entered by the user in a text file or scraped
from a website) should be encoded so that code injection isn't possible, or is
at least harder.
I realize we're running Java and not JS, so it would not be quite as simple as
dropping in an alert() call...

I also want it fixed because I don't really like the idea of using a regex
normalizer to fix URLs with spaces in them... regex is also known to have had
vulnerabilities in many languages.
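
For reference, a minimal sketch of escaping spaces without a regex, assuming the input is otherwise a parseable URL. The class and method names are hypothetical and not part of Nutch; it relies only on the multi-argument java.net.URI constructor, which percent-encodes characters that are illegal in each component.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper for illustration; not part of Nutch.
class UriEscapeSketch {
    // Splits the raw string with java.net.URL, then rebuilds it through the
    // multi-argument URI constructor so that illegal characters (such as
    // spaces) are percent-encoded component by component.
    static String escape(String raw) {
        try {
            URL u = new URL(raw);
            URI uri = new URI(u.getProtocol(), u.getAuthority(), u.getPath(),
                              u.getQuery(), u.getRef());
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("cannot escape: " + raw, e);
        }
    }

    public static void main(String[] args) {
        // prints http://example.com/some%20page.html
        System.out.println(escape("http://example.com/some page.html"));
    }
}
```

One caveat: this constructor always quotes '%', so a URL that is already percent-encoded would be encoded a second time. A real fix would have to decide how to treat such input.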

 Generator builds a URL list that is not encoded
 ---

 Key: NUTCH-800
 URL: https://issues.apache.org/jira/browse/NUTCH-800
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.8.2, 0.7.3, 0.9.0, 
 1.0.0, 1.1
Reporter: Jesse Campbell

 The URL string that the generator grabs when creating the fetch list is not
 encoded. This could potentially allow unsafe execution, and it breaks
 reading of improperly encoded URLs from scraped pages.
 Since we a) cannot guarantee that any site we scrape is not malicious, and b)
 likely do not have control over all content providers, we are currently
 forced to use a regex normalizer to perform the same function as a built-in
 Java class (it would be unsafe to leave the URLs alone).
 A quick solution would be to update Generator.java to use the static
 encode method of the java.net.URLEncoder class:
 line 187:
 old: String urlString = url.toString();
 new: String urlString = URLEncoder.encode(url.toString(), "UTF-8");
 line 192:
 old: u = new URL(url.toString());
 new: u = new URL(urlString);
 URLEncoder.encode could also be applied at the updatedb stage.
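
 To make the effect of the proposed call concrete, here is a small standalone sketch. It does not reproduce Generator.java; the class name, method names, and example URL are assumptions made for illustration. It applies URLEncoder.encode to a complete URL string and then checks whether java.net.URL still accepts the result.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLEncoder;

// Standalone illustration; the names here are hypothetical.
class FormEncodeSketch {
    // Applies the proposed encoding to a whole URL string.
    static String formEncode(String url) {
        try {
            return URLEncoder.encode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Returns true if java.net.URL accepts the string.
    static boolean parses(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String raw = "http://example.com/some page.html";
        String encoded = formEncode(raw);
        System.out.println(encoded);         // http%3A%2F%2Fexample.com%2Fsome+page.html
        System.out.println(parses(raw));     // true
        System.out.println(parses(encoded)); // false
    }
}
```

 Because URLEncoder.encode is meant for form data, it also escapes the ':' and '/' delimiters, so passing its output for a full URL to new URL(...) throws MalformedURLException. Any real change would probably need to encode individual components rather than the whole string.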

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-800) Generator builds a URL list that is not encoded

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847071#action_12847071
 ] 

Andrzej Bialecki  commented on NUTCH-800:
-

I'm puzzled by your problem description. Is Nutch affected by potentially
malicious URL data? URL form encoding is just a transport encoding; it doesn't
make a URL inherently safe (or unsafe).
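
The point about transport encoding can be sketched in a few lines. This is only an illustration with a hypothetical class name and a made-up input string: form encoding is fully reversible, so whatever the string contained before encoding is what the receiver gets back after decoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Standalone illustration; the names here are hypothetical.
class RoundTripSketch {
    // Encodes and then decodes the string with the same charset.
    static String roundTrip(String s) {
        try {
            String encoded = URLEncoder.encode(s, "UTF-8");
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String suspicious = "http://example.com/?q=<script>alert(1)</script>";
        // The content survives the round trip unchanged.
        System.out.println(roundTrip(suspicious).equals(suspicious)); // true
    }
}
```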
