That's not good. Thank you for sharing this with us: https://issues.apache.org/jira/browse/TIKA-4178
On Fri, Dec 22, 2023 at 11:18 AM João Domingues <[email protected]> wrote:

> Dear Tika Team,
>
> I am writing to report an issue encountered while using Apache Tika's
> HttpFetcher functionality, specifically when handling URLs containing
> spaces or other characters that require percent-encoding as per RFC 2396.
>
> Issue Description:
> The HttpFetcher encounters a java.net.URISyntaxException when processing
> URLs that contain unencoded characters, such as spaces. This issue occurs
> *even when the URLs are correctly formatted and encoded as per standard URI
> encoding practices.*
>
> Error Log:
> Here is a snippet of the error log indicating the issue:
>
> ´´Caused by: java.net.URISyntaxException: Illegal character in path at
> index 81: [URL]
>     at java.net.URI$Parser.fail(URI.java:2976)
>     at java.net.URI$Parser.checkChars(URI.java:3147)
>     at java.net.URI$Parser.parseHierarchical(URI.java:3229)
>     at java.net.URI$Parser.parse(URI.java:3177)
>     at java.net.URI.<init>(URI.java:623)
>     at java.net.URI.create(URI.java:904)
>     ...´´
>
> The error points to an illegal character in the path at index 81 of the
> URL. In this instance, the character is a space, which, *although
> percent-encoded in the original URL*, seems to be causing issues during
> processing by HttpFetcher.
>
> Possible Cause:
> It appears that HttpFetcher or an underlying component may not be handling
> percent-encoded URLs correctly, leading to URISyntaxException when it
> encounters encoded spaces (%20) or other encoded characters.
>
> This happens on Tika Server when sending the URL both in the Headers and
> as Query Params the result is the same. Although the URL is sent encoded
> the error message always shows a decoded version of the URL.
>
> Thank you for your attention to this matter. Please let me know if you
> require any further information or clarification.
>
>
> João Domingues
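
For anyone trying to reproduce the symptom outside Tika, here is a minimal sketch using only the JDK. It assumes, based on the stack trace and on the decoded URL appearing in the error message, that the URL is percent-decoded somewhere before it reaches java.net.URI.create; it is not taken from the HttpFetcher source. The class name UriEncodingDemo, its helper methods, and the example.com URL are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

class UriEncodingDemo {
    // Hypothetical URL, used only for illustration.
    static final String ENCODED = "http://example.com/docs/my%20file.pdf";

    /** Returns true if java.net.URI.create accepts the string. */
    static boolean parses(String url) {
        try {
            URI.create(url);
            return true;
        } catch (IllegalArgumentException e) {
            // URI.create wraps the URISyntaxException in an IllegalArgumentException.
            return false;
        }
    }

    /** Percent-decodes the string, mimicking the suspected decode step. */
    static String decode(String url) {
        try {
            return URLDecoder.decode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Rebuilds a URI from its parts; the multi-argument constructor quotes illegal characters. */
    static String reencode(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String decoded = decode(ENCODED);
        System.out.println("encoded parses: " + parses(ENCODED));
        System.out.println("decoded parses: " + parses(decoded));
        System.out.println("re-encoded:     " + reencode("http", "example.com", "/docs/my file.pdf"));
    }
}
```

If the assumption holds, the encoded form parses and the decoded form fails with the same "Illegal character in path" error, which would match the report that the error message always shows a decoded URL.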
