Re: X-TIKA:content question

Tim Allison Thu, 19 Jan 2023 11:50:59 -0800

The leading \n characters are newlines injected by the XHTMLContentHandler
while writing the headers.  The headers themselves don't make it into the
downstream ToTextContentHandler, but the \n characters do.


https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java

Search on that page for newline().

If you use a BodyContentHandler, the output is limited to the portion
of the document inside the body element, so you'll avoid at least the
leading \n from the headers.
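
To make the body-only idea concrete, here is a rough sketch using only
the JDK's SAX classes.  It is a simplified stand-in, not Tika's actual
BodyContentHandler: the class name BodyOnlyTextHandler and the event
sequence in demo() are made up for illustration, and the handler accepts
newlines through either characters() or ignorableWhitespace().

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in: collects text only while inside <body>.
public class BodyOnlyTextHandler extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // > 0 while we are inside the body element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(localName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Replays a hand-written (assumed) event sequence with newlines in the head.
    public static String demo() {
        BodyOnlyTextHandler h = new BodyOnlyTextHandler();
        Attributes none = new AttributesImpl();
        char[] nl = {'\n'};
        char[] body = "The quick brown fox".toCharArray();

        h.startElement(XHTML, "html", "html", none);
        h.startElement(XHTML, "head", "head", none);
        h.ignorableWhitespace(nl, 0, 1);          // header newline, dropped
        h.startElement(XHTML, "meta", "meta", none);
        h.endElement(XHTML, "meta", "meta");
        h.ignorableWhitespace(nl, 0, 1);          // header newline, dropped
        h.endElement(XHTML, "head", "head");
        h.ignorableWhitespace(nl, 0, 1);          // between head and body, dropped
        h.startElement(XHTML, "body", "body", none);
        h.startElement(XHTML, "p", "p", none);
        h.characters(body, 0, body.length);
        h.endElement(XHTML, "p", "p");
        h.ignorableWhitespace(nl, 0, 1);          // inside body, kept
        h.endElement(XHTML, "body", "body");
        h.endElement(XHTML, "html", "html");
        return h.toString();
    }
}
```

In this sketch the newlines written around the head are discarded, while
the one written inside the body is kept, so a trailing \n can still show
up.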

I've been annoyed by the leading \n for a long, long time.  Not sure
if there's an easy fix, but we can look.
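
In the meantime, one possible downstream workaround is to strip the
edge newlines from the X-TIKA:content value yourself.  This is only a
sketch: NewlineTrim and stripEdgeNewlines are hypothetical names, not
part of Tika, and the method cannot tell injected newlines from ones
that were in the original file, so it removes both.

```java
// Hypothetical helper, not part of Tika.
public final class NewlineTrim {
    private NewlineTrim() {
    }

    // Removes every '\n' at the start and end of the string; interior
    // newlines and all other whitespace are left untouched.
    public static String stripEdgeNewlines(String content) {
        int start = 0;
        int end = content.length();
        while (start < end && content.charAt(start) == '\n') {
            start++;
        }
        while (end > start && content.charAt(end - 1) == '\n') {
            end--;
        }
        return content.substring(start, end);
    }
}
```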

On Thu, Jan 19, 2023 at 2:19 PM Josh Burchard wrote:
>
> Is it on purpose that many newline characters are prepended to, and at least 
> one appended to all content?
>
> This file is a one-liner and contains no newlines.
>
>
>
> And yet....
>
> [{"X-TIKA:Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],"X-TIKA:Parsed-By-Full-Set":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],"X-TIKA:content_handler":"ToTextContentHandler","Content-Encoding":"ISO-8859-1","X-TIKA:parse_time_millis":"0","X-TIKA:embedded_depth":"0","X-TIKA:content":"\n\n\n\n\n\n\n\n\n\nThe
>  quick brown fox jumps over the lazy dog. 
> PlainAvocado.\n","resourceName":"whatever.txt","Content-Length":"58","Content-Type":"text/plain;
>  charset=ISO-8859-1"}]
>
> A seemingly random number of '\n' chars get put before the content, and one 
> gets stuck on the end.   I've noticed this with all the file types that I've 
> tested.  It's a bit of bloat for files that contain many embeddeds and, 
> therefore, many X-TIKA:content values.
>
> Is this on purpose?  Is there any way to know if there were actually  '\n' 
> characters at the beginning and/or end in the original content (and how many 
> were original)?
>
