Well, this is excellent! I'm definitely bookmarking that page.

Question: For the MetadataWriteFilters, there's no mention of whether the maxFieldSize XML tag or the writeLimit header takes precedence. Could it be that if both are present, you just use the smaller value? I took a brief look in the Tika Git repository, but it wasn't readily apparent to me how this is handled.

From: "Tim Allison" <[email protected]>
To: [email protected]
Date: 01/19/2023 03:49 PM
Subject: Re: X-TIKA:content question

Depending on your metadata needs, you can cut down fairly dramatically by using a MetadataFilter or a MetadataWriteFilter to select only the metadata elements that you want. If you're just grabbing text, this obv doesn't apply.

On Thu, Jan 19, 2023 at 3:36 PM Josh Burchard <[email protected]> wrote:
>
> Thanks Tim. I'll check out the source that you linked to. I'm messing around in our code that translates the returned Java strings to another character set, which means allocating another hefty buffer and examining every character. Hence, I noticed all those extra '\n's when dealing with a file that had many embedded files. ;-)
>
> From: "Tim Allison" <[email protected]>
> To: [email protected]
> Date: 01/19/2023 02:51 PM
> Subject: Re: X-TIKA:content question
> ________________________________
>
> The leading \n are newlines injected by the XHTMLContentHandler while
> writing headers. Obv, the headers don't make it into the downstream
> ToTextContentHandler, but the \n do.
>
> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
>
> Search on that page for newline().
>
> If you use a BodyContentHandler, that limits the output to only the
> portions of the document after the body element is called, so you'll
> avoid at least the leading \n from the headers.
>
> I've been annoyed by the leading \n for a long, long time. Not sure
> if there's an easy fix, but we can look.
>
> On Thu, Jan 19, 2023 at 2:19 PM Josh Burchard <[email protected]> wrote:
> >
> > Is it on purpose that many newline characters are prepended to, and at least one appended to all content?
> >
> > This file is a one-liner and contains no newlines.
> >
> > And yet....
> >
> > [{"X-TIKA:Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],"X-TIKA:Parsed-By-Full-Set":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],"X-TIKA:content_handler":"ToTextContentHandler","Content-Encoding":"ISO-8859-1","X-TIKA:parse_time_millis":"0","X-TIKA:embedded_depth":"0","X-TIKA:content":"\n\n\n\n\n\n\n\n\n\nThe quick brown fox jumps over the lazy dog. PlainAvocado.\n","resourceName":"whatever.txt","Content-Length":"58","Content-Type":"text/plain; charset=ISO-8859-1"}]
> >
> > A seemingly random number of '\n' chars get put before the content, and one gets stuck on the end. I've noticed this with all the file types that I've tested. It's a bit of bloat for files that contain many embeddeds and, therefore, many X-TIKA:content values.
> >
> > Is this on purpose? Is there any way to know if there were actually '\n' characters at the beginning and/or end in the original content (and how many were original)?
> >
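
For anyone landing on this thread with the same bloat concern: a minimal client-side sketch, in plain Java with no Tika dependency, of counting and stripping the '\n' padding from an X-TIKA:content value after it comes back. The class and method names (ContentTrim, leadingNewlines, trailingNewlines, stripNewlines) are made up for illustration and are not part of Tika. Note the limitation raised in the question above: this cannot tell injected newlines from ones that were really in the original file, so it only suits cases where leading and trailing newlines don't matter.

```java
public final class ContentTrim {
    private ContentTrim() {}

    /** Counts the '\n' characters at the very start of s. */
    static int leadingNewlines(String s) {
        int i = 0;
        while (i < s.length() && s.charAt(i) == '\n') {
            i++;
        }
        return i;
    }

    /** Counts the '\n' characters at the very end of s. */
    static int trailingNewlines(String s) {
        int i = s.length();
        while (i > 0 && s.charAt(i - 1) == '\n') {
            i--;
        }
        return s.length() - i;
    }

    /** Returns s without its leading and trailing '\n' characters (hypothetical helper, not a Tika API). */
    static String stripNewlines(String s) {
        int start = leadingNewlines(s);
        int end = s.length() - trailingNewlines(s);
        // A string made up only of newlines would give start > end.
        return start >= end ? "" : s.substring(start, end);
    }

    public static void main(String[] args) {
        String content = "\n\n\nThe quick brown fox jumps over the lazy dog.\n";
        System.out.println("leading:  " + leadingNewlines(content));
        System.out.println("trailing: " + trailingNewlines(content));
        System.out.println("stripped: [" + stripNewlines(content) + "]");
    }
}
```

Since substring copies only the kept range, this could be folded into the same pass that does the character-set translation instead of allocating a separate buffer first.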
