Depending on your metadata needs, you can cut the output down fairly dramatically by using a MetadataFilter or a MetadataWriteFilter to select only the metadata elements that you want. If you're just grabbing text, this obviously doesn't apply.
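
For the padding '\n's themselves, one option is to strip them on your side before the character-set translation, so the second buffer never has to hold them. Below is a minimal sketch in plain Java. ContentTrimmer and its method names are hypothetical, not part of Tika, and it assumes you already have the X-TIKA:content value as a String. It cannot tell injected newlines from ones that were in the original file; it just counts and removes whatever is at the two ends.

```java
// Hypothetical helper, NOT part of Tika: counts and strips the '\n'
// padding at the start and end of an extracted content string.
final class ContentTrimmer {

    private ContentTrimmer() {}

    /** Number of consecutive '\n' characters at the start of s. */
    static int leadingNewlines(String s) {
        int i = 0;
        while (i < s.length() && s.charAt(i) == '\n') {
            i++;
        }
        return i;
    }

    /** Number of consecutive '\n' characters at the end of s. */
    static int trailingNewlines(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '\n') {
            end--;
        }
        return s.length() - end;
    }

    /** Returns s without its leading and trailing '\n' characters. */
    static String stripNewlines(String s) {
        int start = leadingNewlines(s);
        if (start == s.length()) {
            return ""; // empty, or nothing but newlines
        }
        int end = s.length() - trailingNewlines(s);
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // Made-up sample value standing in for an X-TIKA:content string.
        String content = "\n\n\nfirst line\nsecond line\n";
        System.out.println("leading:  " + leadingNewlines(content));
        System.out.println("trailing: " + trailingNewlines(content));
        System.out.println("[" + stripNewlines(content) + "]");
    }
}
```

Only '\n' is touched, so interior newlines and other whitespace are left alone. If you also want to drop spaces or '\r', String.strip() would be the simpler route.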
On Thu, Jan 19, 2023 at 3:36 PM Josh Burchard <[email protected]> wrote:
>
> Thanks Tim. I'll check out the source that you linked to. I'm messing
> around in our code that translates the returned Java strings to another
> character set, which means allocating another hefty buffer and examining
> every character. Hence, I noticed all those extra '\n's when dealing with a
> file that had many embedded files. ;-)
>
> From: "Tim Allison" <[email protected]>
> To: [email protected]
> Date: 01/19/2023 02:51 PM
> Subject: Re: X-TIKA:content question
> ________________________________
>
> The leading \n are newlines injected by the XHTMLContentHandler while
> writing headers. Obv, the headers don't make it into the downstream
> ToTextContentHandler, but the \n do.
>
> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
>
> Search on that page for newline().
>
> If you use a BodyContentHandler, that limits the output to only the
> portions of the document after the body element is called, so you'll
> avoid at least the leading \n from the headers.
>
> I've been annoyed by the leading \n for a long, long time. Not sure
> if there's an easy fix, but we can look.
>
> On Thu, Jan 19, 2023 at 2:19 PM Josh Burchard <[email protected]> wrote:
> >
> > Is it on purpose that many newline characters are prepended to, and at
> > least one appended to all content?
> >
> > This file is a one-liner and contains no newlines.
> >
> > And yet....
> >
> > [{"X-TIKA:Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],"X-TIKA:Parsed-By-Full-Set":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],"X-TIKA:content_handler":"ToTextContentHandler","Content-Encoding":"ISO-8859-1","X-TIKA:parse_time_millis":"0","X-TIKA:embedded_depth":"0","X-TIKA:content":"\n\n\n\n\n\n\n\n\n\nThe quick brown fox jumps over the lazy dog. PlainAvocado.\n","resourceName":"whatever.txt","Content-Length":"58","Content-Type":"text/plain; charset=ISO-8859-1"}]
> >
> > A seemingly random number of '\n' chars get put before the content, and one
> > gets stuck on the end. I've noticed this with all the file types that
> > I've tested. It's a bit of bloat for files that contain many embeddeds
> > and, therefore, many X-TIKA:content values.
> >
> > Is this on purpose? Is there any way to know if there were actually '\n'
> > characters at the beginning and/or end in the original content (and how
> > many were original)?
> >
