Re: X-TIKA:content question

Tim Allison Thu, 19 Jan 2023 12:49:33 -0800

Depending on your metadata needs, you can cut the size of the output
down fairly dramatically by using a MetadataFilter or a
MetadataWriteFilter to select only the metadata elements that you want.
If you're just grabbing text, this obviously doesn't apply.
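For the text-only case, the leading and trailing '\n' discussed further
down the thread can be dropped after the fact on the client side.  Here
is a minimal sketch in plain Java.  It does not use any Tika classes, and
the names ContentNewlines, Result and strip are made up for illustration.
It strips only '\n' (not other whitespace) from both ends of an
X-TIKA:content value and reports how many were removed.

```java
// Hypothetical helper, not part of Tika: strips '\n' from both ends of a
// content string and records how many were removed at each end.
final class ContentNewlines {

    /** Stripped text plus the number of '\n' removed at each end. */
    static final class Result {
        final String text;
        final int leading;
        final int trailing;

        Result(String text, int leading, int trailing) {
            this.text = text;
            this.leading = leading;
            this.trailing = trailing;
        }
    }

    public static Result strip(String content) {
        int start = 0;
        int end = content.length();
        // Skip '\n' at the front.
        while (start < end && content.charAt(start) == '\n') {
            start++;
        }
        // Skip '\n' at the back, without crossing the front index.
        while (end > start && content.charAt(end - 1) == '\n') {
            end--;
        }
        return new Result(content.substring(start, end),
                start, content.length() - end);
    }

    public static void main(String[] args) {
        // Assumed sample value, shaped like the X-TIKA:content in this thread.
        Result r = strip("\n\n\nThe quick brown fox jumps over the lazy dog.\n");
        System.out.println(r.leading + " leading, " + r.trailing
                + " trailing, text=[" + r.text + "]");
    }
}
```

Note that this cannot tell newlines that were in the original file from
ones added during parsing; it removes both.  If the original leading or
trailing newlines matter, this approach loses them.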


On Thu, Jan 19, 2023 at 3:36 PM Josh Burchard <[email protected]> wrote:
>
> Thanks Tim.  I'll check out the source that you linked to.  I'm messing 
> around in our code that translates the returned Java strings to another 
> character set, which means allocating another hefty buffer and examining 
> every character.  Hence, I noticed all those extra '\n's when dealing with a 
> file that had many embedded files. ;-)
>
>
>
>
> From:        "Tim Allison" <[email protected]>
> To:        [email protected]
> Date:        01/19/2023 02:51 PM
> Subject:        Re: X-TIKA:content question
> ________________________________
>
>
>
> The leading \n are newlines injected by the XHTMLContentHandler while
> writing headers.  Obviously, the headers don't make it into the
> downstream ToTextContentHandler, but the \n do.
>
> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
>
> Search on that page for newline().
>
> If you use a BodyContentHandler, that limits the output to only the
> portions of the document that come after the start of the body
> element, so you'll avoid at least the leading \n from the headers.
>
> I've been annoyed by the leading \n for a long, long time.  Not sure
> if there's an easy fix, but we can look.
>
> On Thu, Jan 19, 2023 at 2:19 PM Josh Burchard <[email protected]> wrote:
> >
> > Is it on purpose that many newline characters are prepended to, and at 
> > least one appended to all content?
> >
> > This file is a one-liner and contains no newlines.
> >
> >
> >
> > And yet....
> >
> > [{
> >   "X-TIKA:Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],
> >   "X-TIKA:Parsed-By-Full-Set":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],
> >   "X-TIKA:content_handler":"ToTextContentHandler",
> >   "Content-Encoding":"ISO-8859-1",
> >   "X-TIKA:parse_time_millis":"0",
> >   "X-TIKA:embedded_depth":"0",
> >   "X-TIKA:content":"\n\n\n\n\n\n\n\n\n\nThe quick brown fox jumps over the lazy dog. PlainAvocado.\n",
> >   "resourceName":"whatever.txt",
> >   "Content-Length":"58",
> >   "Content-Type":"text/plain; charset=ISO-8859-1"
> > }]
> >
> > A seemingly random number of '\n' chars get put before the content, and one 
> > gets stuck on the end.   I've noticed this with all the file types that 
> > I've tested.  It's a bit of bloat for files that contain many embedded 
> > files and, therefore, many X-TIKA:content values.
> >
> > Is this on purpose?  Is there any way to know if there were actually  '\n' 
> > characters at the beginning and/or end in the original content (and how 
> > many were original)?
> >
>
>
