[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"
    [ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847300#action_12847300 ]

Andrzej Bialecki commented on NUTCH-797:
----------------------------------------

If there are no further comments I'm going to commit the current patch with a
TODO to revisit this code if/when it's refactored to an external dependency.

> parse-tika is not properly constructing URLs when the target begins with a "?"
> ------------------------------------------------------------------------------
>
>                 Key: NUTCH-797
>                 URL: https://issues.apache.org/jira/browse/NUTCH-797
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.1
>         Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
>                      Also reproduces on RHEL and Java 1.4.2
>            Reporter: Robert Hohman
>            Priority: Minor
>         Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch
>
>
> This is my first bug and patch on Nutch, so apologies if I have not provided
> enough detail.
> In crawling the page at
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are
> links in the page that look like this:
> <a href="?co=0&sk=0&p=3&pi=1">3</a>
> In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as
> getOutlinks looks for links, it comes across this link and constructs a new
> URL from a base URL object built from
> "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0" and a
> target of "?co=0&sk=0&p=2&pi=1".
> The URL class, per RFC 3986 at
> http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines
> how to merge these two, and per the RFC, the URL class merges these to:
> http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
> because the RFC explicitly states that the rightmost URL segment (the
> Search.aspx in this case) should be ripped off before combining.
> While this is compliant with the RFC, it means the URLs which are created for
> the next round of fetching are incorrect. Modern browsers seem to handle
> this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure
> exception or handling of what is a poorly formed URL on Accenture's part.
> I have fixed this by modifying DOMContentUtils to look for the case where a ?
> begins the target, and then pulling the rightmost component out of the base
> and inserting it into the target before the ?, so the target in this example
> becomes:
> Search.aspx?co=0&sk=0&p=2&pi=1
> The URL class then properly constructs the new URL as:
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
> If it is agreed that this solution works, I believe the other HTML parsers in
> Nutch would need to be modified in a similar way.
> Can I get feedback on this proposed solution? Specifically I'm worried about
> unforeseen side effects.
> Much thanks
> Here is the patch info:
> Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
> ===================================================================
> --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
> +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
> @@ -299,6 +299,50 @@
>      return false;
>    }
>
> +  private URL fixURL(URL base, String target) throws MalformedURLException
> +  {
> +    // handle params that are embedded into the base url - move them to target
> +    // so URL class constructs the new url class properly
> +    if (base.toString().indexOf(';') > 0)
> +      return fixEmbeddedParams(base, target);
> +
> +    // handle the case that there is a target that is a pure query.
> +    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
> +    // URLs but I've seen this in numerous places, for example at
> +    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
> +    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
> +    // URL constructs the base+target combo as
> +    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
> +    // dropping the Search.aspx target
> +    //
> +    // Browsers handle these just fine, they must have an exception similar to this
> +    if (target.startsWith("?"))
> +    {
> +      return fixPureQueryTargets(base, target);
> +    }
> +
> +    return new URL(base, target);
> +  }
> +
> +  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
> +  {
> +    if (!target.startsWith("?"))
> +      return new URL(base, target);
> +
> +    String basePath = base.getPath();
> +    String baseRightMost="";
> +    int baseRightMostIdx = basePath.lastIndexOf("/");
> +    if (baseRightMostIdx != -1)
> +    {
> +      baseRightM
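The quoted patch is cut off in this archive, so the following is only a minimal sketch of the workaround as the description words it: when the target starts with "?", take the rightmost path component of the base and put it in front of the target before handing both to java.net.URL. The class name PureQueryUrls and the methods resolve and resolveToString are hypothetical and are not taken from the Nutch patch. As far as I can tell, RFC 3986 section 5.2.2 keeps the base path when a reference has an empty path, which matches what the browsers do; dropping the last segment is the behaviour of the older RFC 2396 algorithm that the patch comment cites.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not the code from pureQueryUrl.patch.
final class PureQueryUrls {

    // Resolve target against base, keeping the last path segment of the
    // base when the target is a pure query such as "?co=0&sk=0&p=2&pi=1".
    @SuppressWarnings("deprecation") // URL(URL, String) is deprecated on newer JDKs
    static URL resolve(URL base, String target) throws MalformedURLException {
        if (!target.startsWith("?")) {
            // Anything else is left to the stock resolution logic.
            return new URL(base, target);
        }
        String basePath = base.getPath();
        int lastSlash = basePath.lastIndexOf('/');
        // "/Careers/ASPX/Search.aspx" yields "Search.aspx"; "/" or "" yields "".
        String rightMost = lastSlash >= 0 ? basePath.substring(lastSlash + 1) : basePath;
        return new URL(base, rightMost + target);
    }

    // Convenience wrapper for callers that work with strings only.
    static String resolveToString(String base, String target) {
        try {
            return resolve(new URL(base), target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveToString(
            "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0",
            "?co=0&sk=0&p=2&pi=1"));
        // prints http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
    }
}
```

The sketch leaves out the embedded-params (";") case that fixURL delegates to fixEmbeddedParams, and it does not guard against a last segment that contains ":" and could be mistaken for a scheme.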
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"
    [ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846923#action_12846923 ]

Andrzej Bialecki commented on NUTCH-797:
----------------------------------------

That's one option, at least until the crawler-commons produces any artifacts ...
Eventually I think that this code and other related code (e.g. deciding which
URL is canonical in presence of redirects, url normalization and filtering)
should end up in the crawler-commons.
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"
    [ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846865#action_12846865 ]

Jukka Zitting commented on NUTCH-797:
-------------------------------------

I guess we need to apply the same logic also to other Tika parsers that may
deal with relative URLs.
Since we in any case need this functionality in Tika, would it be useful for
Nutch if it was made available as a public utility class or method in
tika-core? It would be great if we could avoid duplicating the code in
different projects.
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"
    [ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846527#action_12846527 ]

Andrzej Bialecki commented on NUTCH-797:
----------------------------------------

A few issues with this:
* does this mean that the fixes would be applied to links found in other
  content types as well, not just html (the fixup code in TIKA-287 is located
  in HtmlParser)?
* we need this also in other places, e.g. in the redirection handling code
  (meta-refresh, javascript location.href and protocol-level redirect)
* for a while we still need this in the parse-html plugin that does not use
  Tika.
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"
    [ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846521#action_12846521 ]

Jukka Zitting commented on NUTCH-797:
-------------------------------------

Wouldn't it be easier for Nutch to pass the base URL as the CONTENT_LOCATION
metadata to the Tika parser? Then Tika would automatically apply these fixes,
as discussed in TIKA-287.
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"
    [ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846481#action_12846481 ]

Robert Hohman commented on NUTCH-797:
-------------------------------------

Makes sense, thanks for looking at this, guys.
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"
    [ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846459#action_12846459 ]

Ken Krugler commented on NUTCH-797:
-----------------------------------

Agreed re crawler-commons... feels like there's a beefy chunk of URL handling
code that should go there.
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846437#action_12846437 ]

Andrzej Bialecki commented on NUTCH-797:

Unfortunately the way your fix was applied there is not reusable (private method in HtmlParser... ugh :( ). So for the time being I think we'll go with our utility class ... which we should really move to the crawler-commons anyway!
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846424#action_12846424 ]

Ken Krugler commented on NUTCH-797:

I thought this same issue (relative URL with leading '?') had been fixed in Tika. Or at least I reported it, and I thought Jukka rolled in code that would handle it. See [TIKA-287], and the comment about "Note that special care must be taken to work around a known bug in the Java URL() class, when the relative URL is a query string and the base URL doesn't end with a '/'."

Or is this the case of Nutch needing to implement similar link extraction support?
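
For the case Ken describes (a relative URL that is only a query string), one possible workaround is to skip the URL class's merge step and replace the query on the base directly. The sketch below is an assumption made for illustration, not Tika's actual code (see TIKA-287 for that); the class and method names are hypothetical.

```java
/** Hypothetical string-level workaround for query-only links; not Tika's implementation. */
final class QueryOnlyLinkSketch {

    /** Returns base with its query and fragment replaced by the "?..." reference. */
    static String replaceQuery(String base, String queryRef) {
        if (!queryRef.startsWith("?")) {
            throw new IllegalArgumentException("not a query-only reference: " + queryRef);
        }
        // Cut the base at its first '?' or '#', whichever comes first, so that
        // scheme, authority and the full path (including the last segment) survive.
        int cut = base.length();
        int q = base.indexOf('?');
        int h = base.indexOf('#');
        if (q >= 0) cut = Math.min(cut, q);
        if (h >= 0) cut = Math.min(cut, h);
        return base.substring(0, cut) + queryRef;
    }

    public static void main(String[] args) {
        System.out.println(replaceQuery(
            "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0",
            "?co=0&sk=0&p=2&pi=1"));
    }
}
```

Because the path is never touched, it makes no difference whether the base ends with a '/' or with a file name such as Search.aspx.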
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846418#action_12846418 ]

Andrzej Bialecki commented on NUTCH-797:

Hm, actually the picture is more complicated than I thought - if we apply both methods (fixEmbeddedParams and fixPureQueryTargets) then some of the test cases from the RFC fail. However, all tests succeed if we only apply fixPureQueryTargets!

Looking at the origin of the fixEmbeddedParams method (NUTCH-436), something must have been fixed in java.net.URL, because the test case mentioned in that issue now passes if we apply only fixPureQueryTargets. The same is true of the test cases in a near-duplicate issue, NUTCH-566.

Consequently I'm going to remove fixEmbeddedParams. I added all tests from RFC3986 section 5.4.1, and they all pass now. I'll attach an updated patch shortly.
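
For reference, the resolution rules that the RFC 3986 section 5.4.1 examples exercise can be written down directly from section 5.2 of the RFC. The sketch below is a standalone illustration and is neither Nutch's utility class nor the updated patch; the class name is hypothetical, and it does no validation beyond the component-splitting regex from Appendix B of the RFC.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical, minimal RFC 3986 section 5.2 resolver for experimenting with test cases. */
final class Rfc3986ResolverSketch {
    // Component-splitting regex from RFC 3986, Appendix B.
    // Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
    private static final Pattern PARTS =
        Pattern.compile("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    static String resolve(String base, String ref) {
        Matcher b = PARTS.matcher(base);
        Matcher r = PARTS.matcher(ref);
        if (!b.matches() || !r.matches()) throw new IllegalArgumentException("unparsable input");
        String scheme, authority, path, query;
        if (r.group(2) != null) {                    // reference carries its own scheme
            scheme = r.group(2); authority = r.group(4);
            path = removeDotSegments(r.group(5)); query = r.group(7);
        } else {
            scheme = b.group(2);
            if (r.group(4) != null) {                // network-path reference ("//host/...")
                authority = r.group(4);
                path = removeDotSegments(r.group(5)); query = r.group(7);
            } else {
                authority = b.group(4);
                if (r.group(5).isEmpty()) {          // empty path ("?y", "#s", ""): keep base path
                    path = b.group(5);
                    query = r.group(7) != null ? r.group(7) : b.group(7);
                } else {
                    String p = r.group(5).startsWith("/") ? r.group(5) : merge(b, r.group(5));
                    path = removeDotSegments(p); query = r.group(7);
                }
            }
        }
        StringBuilder out = new StringBuilder();
        if (scheme != null) out.append(scheme).append(':');
        if (authority != null) out.append("//").append(authority);
        out.append(path);
        if (query != null) out.append('?').append(query);
        if (r.group(9) != null) out.append('#').append(r.group(9));
        return out.toString();
    }

    // Section 5.2.3: base path up to and including its last '/', plus the reference path.
    private static String merge(Matcher b, String refPath) {
        String basePath = b.group(5);
        if (b.group(4) != null && basePath.isEmpty()) return "/" + refPath;
        return basePath.substring(0, basePath.lastIndexOf('/') + 1) + refPath;
    }

    // Section 5.2.4: remove "." and ".." segments.
    private static String removeDotSegments(String in) {
        StringBuilder out = new StringBuilder();
        while (!in.isEmpty()) {
            if (in.startsWith("../")) in = in.substring(3);
            else if (in.startsWith("./")) in = in.substring(2);
            else if (in.startsWith("/./")) in = in.substring(2);
            else if (in.equals("/.")) in = "/";
            else if (in.startsWith("/../")) { in = in.substring(3); dropLast(out); }
            else if (in.equals("/..")) { in = "/"; dropLast(out); }
            else if (in.equals(".") || in.equals("..")) in = "";
            else {                                   // move the first segment to the output
                int next = in.indexOf('/', 1);
                if (next < 0) next = in.length();
                out.append(in, 0, next);
                in = in.substring(next);
            }
        }
        return out.toString();
    }

    private static void dropLast(StringBuilder out) {
        int i = out.lastIndexOf("/");
        out.setLength(i < 0 ? 0 : i);
    }

    public static void main(String[] args) {
        String base = "http://a/b/c/d;p?q";          // base URI used in RFC 3986 section 5.4
        for (String ref : new String[] {"?y", "g?y", "../g", "#s"}) {
            System.out.println(ref + " -> " + resolve(base, ref));
        }
    }
}
```

The branch that matters for this issue is the empty-path one: a reference such as "?y" keeps the whole base path and only swaps the query, which is why Search.aspx is preserved.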
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846402#action_12846402 ]

Andrzej Bialecki commented on NUTCH-797:

Thanks for reporting this, and providing a patch. An updated revision of the standard, RFC 3986 (section 5.4.1, example 7), follows the same reasoning. I'll fix this shortly.