That's not good. Thank you for sharing this with us:

https://issues.apache.org/jira/browse/TIKA-4178

On Fri, Dec 22, 2023 at 11:18 AM João Domingues <[email protected]>
wrote:

> Dear Tika Team,
>
> I am writing to report an issue encountered while using Apache Tika's
> HttpFetcher functionality, specifically when handling URLs containing
> spaces or other characters that require percent-encoding as per RFC 2396.
>
> Issue Description:
> The HttpFetcher encounters a java.net.URISyntaxException when processing
> URLs that contain characters requiring percent-encoding, such as spaces.
> This issue occurs *even when the URLs are correctly formatted and encoded
> as per standard URI encoding practices.*
>
> Error Log:
> Here is a snippet of the error log indicating the issue:
>
> Caused by: java.net.URISyntaxException: Illegal character in path at
> index 81: [URL]
>     at java.net.URI$Parser.fail(URI.java:2976)
>     at java.net.URI$Parser.checkChars(URI.java:3147)
>     at java.net.URI$Parser.parseHierarchical(URI.java:3229)
>     at java.net.URI$Parser.parse(URI.java:3177)
>     at java.net.URI.<init>(URI.java:623)
>     at java.net.URI.create(URI.java:904)
>     ...
> The error points to an illegal character in the path at index 81 of the
> URL. In this instance, the character is a space, which, *although
> percent-encoded in the original URL*, seems to be causing issues during
> processing by HttpFetcher.
>
> Possible Cause:
> It appears that HttpFetcher or an underlying component may not be handling
> percent-encoded URLs correctly, leading to a URISyntaxException when it
> encounters encoded spaces (%20) or other encoded characters.
>
> This happens on Tika Server whether the URL is sent in the headers or as
> a query parameter; the result is the same. Although the URL is sent
> encoded, the error message always shows a decoded version of the URL.
>
> Thank you for your attention to this matter. Please let me know if you
> require any further information or clarification.
>
>
> João Domingues
>
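
For anyone following the thread, here is a minimal Java sketch of the symptom described above. It is not Tika's code: the class name UriEncodingSketch, the helper names, and the example.com URL are all made up for illustration. It only shows that java.net.URI rejects a raw space but accepts %20, and one possible way to re-encode a URL that was percent-decoded somewhere along the way. Whether that is what actually happens inside HttpFetcher is an assumption.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical illustration only; not part of Apache Tika.
public class UriEncodingSketch {

    /** Returns true if java.net.URI accepts the string exactly as given. */
    public static boolean isParsable(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /**
     * Re-encodes a URL that has already been percent-decoded.
     * The multi-argument URI constructor quotes illegal characters such as
     * spaces. It also always quotes '%', so this must NOT be applied to a
     * URL that is still encoded (a %20 would be escaped a second time).
     */
    public static String reencode(String decodedUrl) {
        try {
            // java.net.URL is lenient and tolerates raw spaces.
            // Note: this constructor is deprecated since Java 20.
            URL u = new URL(decodedUrl);
            URI uri = new URI(u.getProtocol(), u.getUserInfo(), u.getHost(),
                    u.getPort(), u.getPath(), u.getQuery(), u.getRef());
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot re-encode: " + decodedUrl, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "http://example.com/docs/my%20file.pdf";
        String decoded = "http://example.com/docs/my file.pdf";

        System.out.println(isParsable(encoded)); // true
        System.out.println(isParsable(decoded)); // false: illegal character in path
        System.out.println(reencode(decoded));   // http://example.com/docs/my%20file.pdf
    }
}
```

If the URL is being decoded by the server before it reaches the fetcher, something along the lines of reencode() could serve as a workaround, with the caveat noted in the comment about URLs that are still encoded.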
