Hi,

Apologies, I managed to send the mail too soon. Please see my reply below.

On Sun, Oct 20, 2013 at 4:17 PM, <[email protected]> wrote:

> Re: URL Encoding Issues in Apache Any23
> 110 by: S.L
>
> That is correct, that is the only discrepancy that I have noticed so far.
>

OK, so at least we are on the same page regarding the actual problem.

> I think what's happening here is that any23 is encoding an already encoded
> URL. I have not found a way to avoid that in Java, i.e. avoid encoding an
> already encoded URL.
>

Possibly, yes; this seems to be what is happening. My hunch is that the
question we need to ask (and address) is whether this problem comes in via
the TikaEncodingDetector [0], and is hence directly attributable to Tika, or
whether it is something within Any23.

> Is there a way to do so? Does any23 consider the possibility of the URL
> being already encoded?
>

Again, it looks to me like this may be a Tika question. You can try
debugging your code as it executes. I would suggest that you look around
line 561 of the SingleDocumentExtractor [1], as this seems to be where the
magic is happening. I would focus on this class for now, until you can
pinpoint exactly where the URL encoding is happening.

hth

[0] http://s.apache.org/ILT
[1] http://s.apache.org/DhK
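
P.S. To make the suspected double encoding concrete, here is a minimal sketch using only the JDK's java.net.URLEncoder and java.net.URLDecoder. It is not taken from Any23 or Tika; the class DoubleEncodingDemo and the helper encodeOnce are hypothetical names I made up for illustration, and the "already encoded?" check is only a heuristic (for example, a raw value that legitimately contains '+' or a valid %XX sequence will be mistaken for encoded input).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; not part of Any23 or Tika.
class DoubleEncodingDemo {

    // Form-style percent-encoding of a single value (not of a whole URL).
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Heuristic guard (an assumption, not a library feature):
    // if decoding changes the value, treat it as already encoded and leave it alone.
    static String encodeOnce(String value) {
        try {
            String decoded = URLDecoder.decode(value, "UTF-8");
            if (!decoded.equals(value)) {
                return value; // looks already encoded
            }
        } catch (IllegalArgumentException e) {
            // Malformed escape such as a bare '%': treat the value as raw text.
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
        return encode(value);
    }

    public static void main(String[] args) {
        String raw = "a b&c";
        String once = encode(raw);   // a+b%26c
        String twice = encode(once); // a%2Bb%2526c  (the '%' itself became %25)

        System.out.println(once);
        System.out.println(twice);
        System.out.println(encodeOnce(once)); // a+b%26c, left unchanged
    }
}
```

If the URLs coming out of Any23 contain sequences like %25 where you expected a plain %, that would be consistent with the value having been encoded twice somewhere along the way.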
