Dear Tika Team,
I am writing to report an issue encountered while using Apache Tika's
HttpFetcher functionality, specifically when handling URLs containing spaces or
other characters that require percent-encoding as per RFC 2396.
Issue Description:
The HttpFetcher throws a java.net.URISyntaxException when processing URLs
that contain characters requiring percent-encoding, such as spaces. The
exception occurs even when the URL is sent to Tika correctly percent-encoded
(for example, with spaces written as %20).
Error Log:
Here is a snippet of the error log indicating the issue:
    Caused by: java.net.URISyntaxException: Illegal character in path at index 81: [URL]
        at java.net.URI$Parser.fail(URI.java:2976)
        at java.net.URI$Parser.checkChars(URI.java:3147)
        at java.net.URI$Parser.parseHierarchical(URI.java:3229)
        at java.net.URI$Parser.parse(URI.java:3177)
        at java.net.URI.<init>(URI.java:623)
        at java.net.URI.create(URI.java:904)
        ...
The error points to an illegal character in the path at index 81 of the URL. In
this instance, the character is a space, which, although percent-encoded in the
original URL, seems to be causing issues during processing by HttpFetcher.
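The behaviour described above can be reproduced outside Tika with plain
java.net.URI. The following is only a minimal sketch: the class name, the
helper method, and the example.com URLs are made up for illustration, and it
does not use any Tika code. It shows that URI.create accepts the encoded form
and rejects the decoded one, which is consistent with the URL having been
decoded somewhere before it reaches URI.create.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical reproduction class; not part of Apache Tika.
class UriSpaceRepro {

    // Returns true if URI.create accepts the string, false if it fails
    // with a URISyntaxException (wrapped in IllegalArgumentException).
    static boolean parses(String url) {
        try {
            URI.create(url);
            return true;
        } catch (IllegalArgumentException e) {
            if (e.getCause() instanceof URISyntaxException) {
                return false;
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        // Placeholder URLs, assumed for the example only.
        String encoded = "http://example.com/my%20file.pdf";
        String decoded = "http://example.com/my file.pdf";

        System.out.println("encoded parses: " + parses(encoded));
        System.out.println("decoded parses: " + parses(decoded));
    }
}
```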
Possible Cause:
It appears that HttpFetcher or an underlying component may not be handling
percent-encoded URLs correctly, leading to URISyntaxException when it
encounters encoded spaces (%20) or other encoded characters.
This happens on Tika Server whether the URL is sent in the headers or as a
query parameter; the result is the same. Although the URL is sent encoded, the
error message always shows a decoded version of the URL.
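If the URL is indeed being decoded before it is parsed, one possible way to
make the decoded string usable again is to rebuild it with the multi-argument
java.net.URI constructor, which quotes illegal characters in each component.
This is only a sketch under that assumption and not a description of how
HttpFetcher works; the class and method names are hypothetical. Note that this
constructor always quotes the percent character, so it is only appropriate for
input that has already been decoded (an input containing %20 would come out as
%2520).

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper; not part of Apache Tika.
class DecodedUrlEncoder {

    // Re-encodes a URL string that is assumed to be fully decoded.
    static String encodeDecodedUrl(String decodedUrl)
            throws MalformedURLException, URISyntaxException {
        // java.net.URL splits the string into components without
        // rejecting spaces. (This constructor is deprecated in recent
        // JDKs but still available.)
        URL url = new URL(decodedUrl);

        // The seven-argument URI constructor quotes characters that are
        // illegal in each component, including spaces in the path.
        URI uri = new URI(
                url.getProtocol(),
                url.getUserInfo(),
                url.getHost(),
                url.getPort(),   // -1 when no port is given
                url.getPath(),
                url.getQuery(),
                url.getRef());

        return uri.toASCIIString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL, assumed for the example only.
        System.out.println(encodeDecodedUrl("http://example.com/my file.pdf"));
    }
}
```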
Thank you for your attention to this matter. Please let me know if you require
any further information or clarification.
João Domingues