Re: X-TIKA:content question

Josh Burchard Fri, 20 Jan 2023 10:27:34 -0800

Well this is excellent!  I'm definitely bookmarking that page. 

Question: For the MetadataWriteFilters there's no mention of whether 
the maxFieldSize XML tag or the writeLimit header takes precedence. Could 
it be that, if both are present, you just use the smaller value? I 
took a brief look in the Tika git repo, but it wasn't readily apparent 
to me how this is handled.





From:   "Tim Allison" <[email protected]>
To:     [email protected]
Date:   01/19/2023 03:49 PM
Subject:        Re: X-TIKA:content question



Depending on your metadata needs, you can cut down fairly dramatically
by using a MetadataFilter or a MetadataWriteFilter to select only the
metadata elements that you want.  If you're just grabbing text, this
obv doesn't apply.
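
For illustration only, the idea of keeping just the metadata fields you
want can be sketched in plain Java as post-processing on an already
extracted map. This is not Tika's MetadataFilter or MetadataWriteFilter
API; the class name, method name, and chosen field list are hypothetical,
and a real setup would configure the filtering in Tika itself rather than
trimming afterwards.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: plain-Java post-processing, NOT Tika's MetadataFilter API.
final class MetadataFieldSelection {

    // Returns a new map holding only the entries whose key is in wantedFields,
    // preserving the order of the input map.
    static Map<String, List<String>> keepOnly(Map<String, List<String>> metadata,
                                              Set<String> wantedFields) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            if (wantedFields.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("X-TIKA:Parsed-By", List.of("org.apache.tika.parser.DefaultParser"));
        metadata.put("X-TIKA:content", List.of("some extracted text"));
        metadata.put("Content-Type", List.of("text/plain; charset=ISO-8859-1"));

        Map<String, List<String>> kept =
                keepOnly(metadata, Set.of("X-TIKA:content", "Content-Type"));
        System.out.println(kept.keySet()); // prints [X-TIKA:content, Content-Type]
    }
}
```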

On Thu, Jan 19, 2023 at 3:36 PM Josh Burchard <[email protected]> wrote:
>
> Thanks Tim.  I'll check out the source that you linked to.  I'm messing
> around in our code that translates the returned Java strings to another
> character set, which means allocating another hefty buffer and examining
> every character.  Hence, I noticed all those extra '\n's when dealing with
> a file that had many embedded files. ;-)
>
>
>
>
> From:        "Tim Allison" <[email protected]>
> To:        [email protected]
> Date:        01/19/2023 02:51 PM
> Subject:        Re: X-TIKA:content question
> ________________________________
>
>
>
> The leading \n are newlines injected by the XHTMLContentHandler while
> writing headers.  Obv, the headers don't make it into the downstream
> ToTextContentHandler, but the \n do.
>
> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
>
> Search on that page for newline().
>
> If you use a BodyContentHandler, that limits the output to only the
> portions of the document inside the body element, so you'll
> avoid at least the leading \n from the headers.
>
> I've been annoyed by the leading \n for a long, long time.  Not sure
> if there's an easy fix, but we can look.
>
> On Thu, Jan 19, 2023 at 2:19 PM Josh Burchard <[email protected]> wrote:
> >
> > Is it on purpose that many newline characters are prepended to, and at
> > least one appended to, all content?
> >
> > This file is a one-liner and contains no newlines.
> >
> >
> >
> > And yet....
> >
> > [{"X-TIKA:Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],"X-TIKA:Parsed-By-Full-Set":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],"X-TIKA:content_handler":"ToTextContentHandler","Content-Encoding":"ISO-8859-1","X-TIKA:parse_time_millis":"0","X-TIKA:embedded_depth":"0","X-TIKA:content":"\n\n\n\n\n\n\n\n\n\nThe quick brown fox jumps over the lazy dog. PlainAvocado.\n","resourceName":"whatever.txt","Content-Length":"58","Content-Type":"text/plain; charset=ISO-8859-1"}]
> >
> > A seemingly random number of '\n' chars get put before the content,
> > and one gets stuck on the end.   I've noticed this with all the file types
> > that I've tested.  It's a bit of bloat for files that contain many
> > embeddeds and, therefore, many X-TIKA:content values.
> >
> > Is this on purpose?  Is there any way to know if there were actually
> > '\n' characters at the beginning and/or end in the original content (and
> > how many were original)?
> >
>
>
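
As a rough downstream workaround for the extra '\n' characters discussed
in this thread, the returned content string can be trimmed before it is
translated to another character set. The sketch below is plain Java with
hypothetical class and method names, not anything from Tika, and it
assumes the sample file's text is joined by single spaces. It cannot tell
injected newlines from ones that were genuinely in the original file, so
it only suits cases where leading and trailing newlines carry no meaning.

```java
// Hypothetical helper: strips newlines from both ends of an extracted string.
final class ContentNewlineTrimmer {

    // Removes leading and trailing '\n' and '\r' characters.
    // Newlines in the interior of the text are left untouched.
    static String stripOuterNewlines(String content) {
        int start = 0;
        int end = content.length();
        while (start < end && isNewline(content.charAt(start))) {
            start++;
        }
        while (end > start && isNewline(content.charAt(end - 1))) {
            end--;
        }
        return content.substring(start, end);
    }

    private static boolean isNewline(char c) {
        return c == '\n' || c == '\r';
    }

    public static void main(String[] args) {
        // Same shape as the X-TIKA:content value in the thread:
        // ten leading newlines and one trailing newline.
        String raw = "\n\n\n\n\n\n\n\n\n\n"
                + "The quick brown fox jumps over the lazy dog. PlainAvocado."
                + "\n";
        String trimmed = stripOuterNewlines(raw);
        System.out.println(raw.length());     // prints 69
        System.out.println(trimmed.length()); // prints 58, the reported Content-Length
    }
}
```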
