Re: X-TIKA:content question

Josh Burchard Thu, 19 Jan 2023 12:36:39 -0800

Thanks Tim.  I'll check out the source that you linked to.  I'm messing 
around in our code that translates the returned Java strings to another 
character set, which means allocating another hefty buffer and examining 
every character.  Hence, I noticed all those extra '\n's when dealing with 
a file that had many embedded files. ;-)
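
A minimal sketch of how the extra '\n' characters could be dropped before a translation step like the one described above. `NewlineTrim` and `Trimmed` are hypothetical names, not part of Tika and not the code referred to in this message. The helper cannot tell injected newlines from ones that were in the original file; it strips and counts all of them at both ends.

```java
// Hypothetical helper (not part of Tika): strips leading and trailing '\n'
// from an X-TIKA:content value and reports how many were removed.
class NewlineTrim {

    /** The trimmed text plus the number of '\n' dropped at each end. */
    static final class Trimmed {
        final String text;
        final int leading;
        final int trailing;

        Trimmed(String text, int leading, int trailing) {
            this.text = text;
            this.leading = leading;
            this.trailing = trailing;
        }
    }

    static Trimmed trimNewlines(String content) {
        int start = 0;
        int end = content.length();
        // Skip '\n' at the front.
        while (start < end && content.charAt(start) == '\n') {
            start++;
        }
        // Skip '\n' at the back, without crossing the front index.
        while (end > start && content.charAt(end - 1) == '\n') {
            end--;
        }
        return new Trimmed(content.substring(start, end), start, content.length() - end);
    }

    public static void main(String[] args) {
        Trimmed t = trimNewlines("\n\n\nThe quick brown fox\n");
        System.out.println(t.leading + " leading, " + t.trailing + " trailing: [" + t.text + "]");
    }
}
```

Doing this before allocating the translation buffer means the buffer only has to hold the trimmed text. Only '\n' is touched here; other whitespace is left alone on purpose.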

From:   "Tim Allison" <[email protected]>
To:     [email protected]
Date:   01/19/2023 02:51 PM
Subject:        Re: X-TIKA:content question

The leading \n are newlines injected by the XHTMLContentHandler while
writing headers.  Obv, the headers don't make it into the downstream
ToTextContentHandler, but the \n do.

https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java

Search on that page for newline().

If you use a BodyContentHandler, that limits the output to only the
portions of the document inside the body element, so you'll
avoid at least the leading \n from the headers.
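
The mechanism described above can be illustrated with plain JDK SAX classes. This is a simplified stand-in, not Tika's actual code: `TextCollector`, `BodyOnlyFilter`, and `HeaderNewlineDemo` are hypothetical names, and the replayed event stream (three newlines around the header elements) is made up for the demo. It only shows why a text handler that keeps every characters() event ends up with leading '\n', and why forwarding only the events inside body avoids them.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Stand-in for a text handler: keeps every characters() event, drops all tags.
class TextCollector extends DefaultHandler {
    final StringBuilder out = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }
}

// Hypothetical filter: forwards characters() only while inside <body>.
class BodyOnlyFilter extends DefaultHandler {
    private final DefaultHandler delegate;
    private int bodyDepth = 0;

    BodyOnlyFilter(DefaultHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (bodyDepth > 0) {
            delegate.characters(ch, start, length);
        }
    }
}

class HeaderNewlineDemo {
    // Replays a made-up event stream: newlines written around header
    // elements, followed by the real text inside <body>.
    static void replay(DefaultHandler h) throws SAXException {
        Attributes none = new AttributesImpl();
        char[] nl = {'\n'};
        char[] text = "The quick brown fox".toCharArray();
        h.startDocument();
        h.startElement("", "html", "html", none);
        h.startElement("", "head", "head", none);
        h.characters(nl, 0, 1);
        h.startElement("", "meta", "meta", none);
        h.endElement("", "meta", "meta");
        h.characters(nl, 0, 1);
        h.endElement("", "head", "head");
        h.characters(nl, 0, 1);
        h.startElement("", "body", "body", none);
        h.characters(text, 0, text.length);
        h.endElement("", "body", "body");
        h.endElement("", "html", "html");
        h.endDocument();
    }

    // Text as seen by a handler that receives every event.
    static String plain() {
        TextCollector c = new TextCollector();
        run(c);
        return c.out.toString();
    }

    // Text as seen when only events inside <body> are forwarded.
    static String bodyOnly() {
        TextCollector c = new TextCollector();
        run(new BodyOnlyFilter(c));
        return c.out.toString();
    }

    private static void run(DefaultHandler h) {
        try {
            replay(h);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("plain starts with newline: " + plain().startsWith("\n"));
        System.out.println("body only: [" + bodyOnly() + "]");
    }
}
```

With Tika itself, the equivalent step is to pass a BodyContentHandler to the parser rather than writing a filter by hand.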

I've been annoyed by the leading \n for a long, long time.  Not sure
if there's an easy fix, but we can look.

On Thu, Jan 19, 2023 at 2:19 PM Josh Burchard <[email protected]> wrote:
>
> Is it on purpose that many newline characters are prepended to, and at
> least one appended to, all content?
>
> This file is a one-liner and contains no newlines.
>
>
>
> And yet....
>
> [{"X-TIKA:Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],"X-TIKA:Parsed-By-Full-Set":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],"X-TIKA:content_handler":"ToTextContentHandler","Content-Encoding":"ISO-8859-1","X-TIKA:parse_time_millis":"0","X-TIKA:embedded_depth":"0","X-TIKA:content":"\n\n\n\n\n\n\n\n\n\nThe quick brown fox jumps over the lazy dog. PlainAvocado.\n","resourceName":"whatever.txt","Content-Length":"58","Content-Type":"text/plain; charset=ISO-8859-1"}]
>
> A seemingly random number of '\n' chars get put before the content, and
> one gets stuck on the end.  I've noticed this with all the file types
> that I've tested.  It's a bit of bloat for files that contain many
> embedded files and, therefore, many X-TIKA:content values.
>
> Is this on purpose?  Is there any way to know if there were actually
> '\n' characters at the beginning and/or end in the original content (and
> how many were original)?
>
