[jira] [Commented] (NUTCH-2554) parserchecker can't fetch some URLs
[ https://issues.apache.org/jira/browse/NUTCH-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433822#comment-16433822 ]

Sebastian Nagel commented on NUTCH-2554:

Thanks for the review! PR is merged, see NUTCH-2012.

> parserchecker can't fetch some URLs
> ---
>
> Key: NUTCH-2554
> URL: https://issues.apache.org/jira/browse/NUTCH-2554
> Project: Nutch
> Issue Type: Bug
> Reporter: Gerard Bouchar
> Priority: Major
> Fix For: 1.15
>
>
> The parserchecker (org.apache.nutch.parse.ParserChecker) calls
> _URLUtil.toASCII_ on the URL it is given, re-encoding already percent-encoded
> URLs.
> For instance, let's say we want to query http://example.com/, passing a GET
> parameter with name 'q' and value '/'. '/' is a special character, and thus
> has to be encoded before being sent.
> If we pass 'http://example.com/?q=/' to the parserchecker, then it doesn't
> encode the '/', and tries to fetch the URL as is, which is invalid.
> If we try to encode the parameter beforehand, and call the parsechecker with
> 'http://example.com/?q=%2F', then it encodes the '%' sign to '%25', and thus
> fetches 'http://example.com/?q=%252F'.
> This actually makes it impossible to fetch the correct URL
> (http://example.com/?q=%2F) from the parserchecker.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
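The double encoding described in the report can be reproduced with the JDK alone. The sketch below is not Nutch code: it assumes, without the thread confirming it, that the re-encoding comes from rebuilding the URL with a multi-argument java.net.URI constructor, which always quotes '%'. The class and method names (DoubleEncodingDemo, rebuildFromParts, parseAsIs) are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class, not part of Nutch.
class DoubleEncodingDemo {

  // Splits the URL into components and rebuilds it with the seven-argument
  // URI constructor. That constructor always quotes '%', so an existing
  // escape such as %2F becomes %252F. Legal characters like '/' are kept.
  static String rebuildFromParts(String url) {
    URI parsed = URI.create(url);
    try {
      URI rebuilt = new URI(parsed.getScheme(), parsed.getUserInfo(),
          parsed.getHost(), parsed.getPort(), parsed.getRawPath(),
          parsed.getRawQuery(), parsed.getRawFragment());
      return rebuilt.toString();
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
  }

  // Parses the string as a whole; existing escapes are left untouched.
  static String parseAsIs(String url) {
    return URI.create(url).toString();
  }

  public static void main(String[] args) {
    // '/' is legal in a query, so it is passed through unencoded.
    System.out.println(rebuildFromParts("http://example.com/?q=/"));
    // prints http://example.com/?q=/

    // The already-encoded value is encoded a second time.
    System.out.println(rebuildFromParts("http://example.com/?q=%2F"));
    // prints http://example.com/?q=%252F

    // Parsing the whole string keeps the escape as given.
    System.out.println(parseAsIs("http://example.com/?q=%2F"));
    // prints http://example.com/?q=%2F
  }
}
```

Under that assumption, neither input form reaches the server as q=%2F, which matches the behaviour the reporter describes.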
[jira] [Commented] (NUTCH-2554) parserchecker can't fetch some URLs
[ https://issues.apache.org/jira/browse/NUTCH-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430542#comment-16430542 ]

Gerard Bouchar commented on NUTCH-2554:
---

Yes, it seems that the PR fixed the issue, thank you!
[jira] [Commented] (NUTCH-2554) parserchecker can't fetch some URLs
[ https://issues.apache.org/jira/browse/NUTCH-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430436#comment-16430436 ]

Sebastian Nagel commented on NUTCH-2554:

Please have a look at [PR #310|https://github.com/apache/nutch/pull/310] which includes a fix for the problem. Note that URLs are now read from stdin:
{noformat}
% echo 'http://example.com/?q=%2F' \
  | nutch parsechecker -Dplugin.includes='protocol-http|parse-html|urlnormalizer-basic' -normalize -followRedirects
fetching: http://example.com/?q=%2F
...
{noformat}
[jira] [Commented] (NUTCH-2554) parserchecker can't fetch some URLs
[ https://issues.apache.org/jira/browse/NUTCH-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430315#comment-16430315 ]

Sebastian Nagel commented on NUTCH-2554:

Thanks, [~gbouchar], that's annoying. It's already reported in NUTCH-2145 but should be fixed by NUTCH-2012: the checker classes share a lot of code, and ParserChecker is the last one that still needs to be changed to inherit from AbstractChecker. I'll try to prepare a pull request to fix this.