Thanks Tim. I'll check out the source that you linked to. I'm messing around in our code that translates the returned Java strings to another character set, which means allocating another hefty buffer and examining every character. Hence, I noticed all those extra '\n's when dealing with a file that had many embedded files. ;-)
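A minimal sketch of how the surrounding '\n' runs could be counted and stripped in plain Java, before any further per-character work on the string. The class and method names (NewlineTrim, countLeadingNewlines, countTrailingNewlines, stripOuterNewlines) are hypothetical and not part of Tika. It also assumes it is acceptable to drop every leading and trailing '\n', since the extracted text alone does not reveal which of them were in the original file.

```java
public final class NewlineTrim {
    private NewlineTrim() {}

    /** Counts the '\n' characters at the very start of s. */
    public static int countLeadingNewlines(String s) {
        int i = 0;
        while (i < s.length() && s.charAt(i) == '\n') {
            i++;
        }
        return i;
    }

    /** Counts the '\n' characters at the very end of s. */
    public static int countTrailingNewlines(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '\n') {
            end--;
        }
        return s.length() - end;
    }

    /** Returns s without its leading and trailing '\n' runs; interior newlines are kept. */
    public static String stripOuterNewlines(String s) {
        int start = countLeadingNewlines(s);
        if (start == s.length()) {
            return ""; // empty, or nothing but newlines
        }
        int end = s.length() - countTrailingNewlines(s);
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // Shortened stand-in for an X-TIKA:content value (assumption: not real Tika output).
        String content = "\n\n\nThe quick brown fox.\n";
        System.out.println(countLeadingNewlines(content));  // 3
        System.out.println(countTrailingNewlines(content)); // 1
        System.out.println(stripOuterNewlines(content));    // The quick brown fox.
    }
}
```

Because only the outer runs are scanned, the cost is proportional to the number of stripped characters rather than the length of the whole string, and the single substring call avoids allocating a second full-size buffer until the charset translation itself.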
From: "Tim Allison" <[email protected]>
To: [email protected]
Date: 01/19/2023 02:51 PM
Subject: Re: X-TIKA:content question

The leading \n are newlines injected by the XHTMLContentHandler while writing headers. Obv, the headers don't make it into the downstream ToTextContentHandler, but the \n do.

https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java

Search on that page for newline().

If you use a BodyContentHandler, that limits the output to only the portions of the document after the body element is called, so you'll avoid at least the leading \n from the headers.

I've been annoyed by the leading \n for a long, long time. Not sure if there's an easy fix, but we can look.

On Thu, Jan 19, 2023 at 2:19 PM Josh Burchard <[email protected]> wrote:
> Is it on purpose that many newline characters are prepended to, and at least one appended to all content?
>
> This file is a one-liner and contains no newlines.
>
> And yet....
>
> [{"X-TIKA:Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],"X-TIKA:Parsed-By-Full-Set":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],"X-TIKA:content_handler":"ToTextContentHandler","Content-Encoding":"ISO-8859-1","X-TIKA:parse_time_millis":"0","X-TIKA:embedded_depth":"0","X-TIKA:content":"\n\n\n\n\n\n\n\n\n\nThe quick brown fox jumps over the lazy dog. PlainAvocado.\n","resourceName":"whatever.txt","Content-Length":"58","Content-Type":"text/plain; charset=ISO-8859-1"}]
>
> A seemingly random number of '\n' chars get put before the content, and one gets stuck on the end. I've noticed this with all the file types that I've tested. It's a bit of bloat for files that contain many embeddeds and, therefore, many X-TIKA:content values.
>
> Is this on purpose? Is there any way to know if there were actually '\n' characters at the beginning and/or end in the original content (and how many were original)?
