[ https://issues.apache.org/jira/browse/NUTCH-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1485:
----------------------------------------
    Fix Version/s: 2.2

> TableUtil reverseURL to keep userinfo part
> ------------------------------------------
>
>                 Key: NUTCH-1485
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1485
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.1
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 2.2
>
>
> The reversed URL key does not contain the userinfo part of a URL (user name
> and password, e.g. {{ftp://user:passw...@ftp.xyz/file.txt}}; cf. [RFC
> 3986|http://tools.ietf.org/html/rfc3986] and
> [http://en.wikipedia.org/wiki/URI_scheme]). Keeping the userinfo would make it
> easy to crawl a fixed list of protected content. However, URLs with userinfo
> can be tricky, e.g.
> [http://cnn.com&story=breaking_news@199.239.136.200/mostpopular], where the
> text before the "@" looks like a host name but is actually userinfo. Removing
> the userinfo by default is therefore fine, but this should be done in the
> default URL normalizers rather than in TableUtil.
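
To illustrate the request, here is a minimal sketch of a reversed-key builder with an option to keep the userinfo. It is not the Nutch implementation or the patch for this issue: the class name ReversedKeySketch, the keepUserInfo flag, and the position of the userinfo in the key (after the scheme and port, introduced by "@") are assumptions made for illustration, as are the example host and credentials in the comments.

```java
import java.net.URI;

/** Hypothetical sketch of a reversed URL key that can keep the userinfo part. */
final class ReversedKeySketch {

  private ReversedKeySketch() {}

  /**
   * Builds a key of the form reversed.host:scheme[:port][@userinfo]/path[?query].
   * The "@userinfo" segment is an assumed layout, not the Nutch key format.
   *
   * @throws IllegalArgumentException if the URL is malformed or has no host
   */
  static String reverseUrl(String url, boolean keepUserInfo) {
    URI uri = URI.create(url);
    String host = uri.getHost();
    if (host == null) {
      throw new IllegalArgumentException("URL has no host: " + url);
    }

    StringBuilder key = new StringBuilder();

    // Reverse the dot-separated host labels: ftp.example.org -> org.example.ftp
    String[] labels = host.split("\\.");
    for (int i = labels.length - 1; i >= 0; i--) {
      key.append(labels[i]);
      if (i > 0) {
        key.append('.');
      }
    }

    key.append(':').append(uri.getScheme());
    if (uri.getPort() != -1) {
      key.append(':').append(uri.getPort());
    }

    // Raw userinfo cannot contain an unescaped '/' or '@' (RFC 3986),
    // so it ends unambiguously where the path begins.
    String userInfo = uri.getRawUserInfo();
    if (keepUserInfo && userInfo != null) {
      key.append('@').append(userInfo);
    }

    String path = uri.getRawPath();
    if (path != null) {
      key.append(path);
    }
    if (uri.getRawQuery() != null) {
      key.append('?').append(uri.getRawQuery());
    }
    return key.toString();
  }
}
```

With keepUserInfo set to false, the sketch drops the credentials, which matches the behaviour described in the issue; with it set to true, two URLs that differ only in their userinfo map to different keys. For the "tricky" URL quoted above, java.net.URI reports 199.239.136.200 as the host and everything before the "@" as userinfo, which shows why stripping it belongs in a normalizer that runs before the key is built.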