[
https://issues.apache.org/jira/browse/NUTCH-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851089#action_12851089
]
Jesse Campbell commented on NUTCH-800:
--
As it stands, badly encoded URLs cause the crawler to break (with
exceptions).
That tells me it is not parsing the URL string properly, which makes me
wonder whether code injection *could* be possible...
Where I work, we try to be defensive... anything that comes from an outside
source (in this case URLs either entered by the user in a text file or scraped
from a website) should be encoded so that code injection isn't possible, or is
at least harder.
I realize we're running Java and not JS, so it would not be quite as simple as
dropping in an alert() call...
I also want it fixed because I don't really like the idea of using a regex
normalizer to fix URLs with spaces in them... regex is also known to have had
multiple vulnerabilities across languages.
Generator builds a URL list that is not encoded
---
Key: NUTCH-800
URL: https://issues.apache.org/jira/browse/NUTCH-800
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.8.2, 0.7.3, 0.9.0,
1.0.0, 1.1
Reporter: Jesse Campbell
The URL string that the generator grabs when creating the fetch list is not
encoded. This could potentially allow unsafe execution, and it breaks the
reading of improperly encoded URLs from scraped pages.
Since we a) cannot guarantee that any site we scrape is not malicious, and b)
likely do not have control over all content providers, we are currently
forced to use a regex normalizer to perform the same function as a built-in
Java class (it would be unsafe to leave the URLs alone).
A quick solution would be to update Generator.java to use the static encode
method of the java.net.URLEncoder class:
line 187:
old: String urlString = url.toString();
new: String urlString = URLEncoder.encode(url.toString(), "UTF-8");
line 192:
old: u = new URL(url.toString());
new: u = new URL(urlString);
URLEncoder.encode could also be applied at the updatedb stage instead.
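To make the proposed change concrete, here is a minimal standalone sketch of what URLEncoder.encode does to a complete URL string. It is not taken from Generator.java. The class name UrlEncodingSketch and its helper methods are made up for illustration, and the per-component variant built on the multi-argument java.net.URI constructor is only one possible alternative, shown for comparison because URLEncoder form-encodes the scheme and path separators as well.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only (not part of Nutch).
final class UrlEncodingSketch {

    // Form-encodes the whole string, as in the proposed one-line change.
    // Note that ':' and '/' are encoded too, and a space becomes '+'.
    static String encodeWhole(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Assumed alternative: split the URL into components and let the
    // multi-argument URI constructor quote illegal characters in each one.
    // Caveat: this constructor always quotes '%', so input that is already
    // percent-encoded would be encoded a second time.
    static String encodeComponents(String raw) {
        try {
            URL u = new URL(raw);
            URI uri = new URI(u.getProtocol(), u.getUserInfo(), u.getHost(),
                    u.getPort(), u.getPath(), u.getQuery(), u.getRef());
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("cannot encode: " + raw, e);
        }
    }

    // Reports whether java.net.URL accepts the string.
    static boolean parsesAsUrl(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String raw = "http://example.com/a b?q=x y";
        System.out.println(encodeWhole(raw));
        System.out.println(parsesAsUrl(encodeWhole(raw)));
        System.out.println(encodeComponents(raw));
    }
}
```

With whole-string encoding the result no longer contains a literal scheme separator, so passing it to new URL(urlString) as in the line 192 change would need to be checked carefully before relying on it.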