X-TIKA:content question

Josh Burchard Thu, 19 Jan 2023 11:19:45 -0800

Is it on purpose that many newline characters are prepended to all 
content, and at least one is appended?


This file is a one-liner and contains no newlines.



And yet....

[{"X-TIKA:Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],
  "X-TIKA:Parsed-By-Full-Set":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.csv.TextAndCSVParser"],
  "X-TIKA:content_handler":"ToTextContentHandler",
  "Content-Encoding":"ISO-8859-1",
  "X-TIKA:parse_time_millis":"0",
  "X-TIKA:embedded_depth":"0",
  "X-TIKA:content":"\n\n\n\n\n\n\n\n\n\nThe quick brown fox jumps over the lazy dog. PlainAvocado.\n",
  "resourceName":"whatever.txt",
  "Content-Length":"58",
  "Content-Type":"text/plain; charset=ISO-8859-1"}]

A seemingly random number of '\n' chars get put before the content, and 
one gets stuck on the end. I've noticed this with every file type 
that I've tested. It's a bit of bloat for files that contain many 
embedded documents and, therefore, many X-TIKA:content values.
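
For what it's worth, the bloat can be trimmed after the fact. Below is a minimal Python sketch, assuming the JSON output has already been captured as a string. The helper name trim_content and the abbreviated sample input are hypothetical, not part of Tika.

```python
import json

CONTENT_KEY = "X-TIKA:content"

def trim_content(metadata_list):
    """Return copies of the metadata dicts with newlines stripped
    from both ends of each X-TIKA:content value."""
    trimmed = []
    for entry in metadata_list:
        entry = dict(entry)  # shallow copy so the caller's data is untouched
        text = entry.get(CONTENT_KEY)
        if isinstance(text, str):
            entry[CONTENT_KEY] = text.strip("\r\n")
        trimmed.append(entry)
    return trimmed

# Abbreviated, hypothetical sample modeled on the output shown above.
raw = json.dumps([{
    "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\nThe quick brown fox jumps over the lazy dog. PlainAvocado.\n",
    "resourceName": "whatever.txt",
    "Content-Length": "58",
}])

cleaned = trim_content(json.loads(raw))
print(repr(cleaned[0][CONTENT_KEY]))
```

The catch is that this also throws away any newlines that really were at the start or end of the original content, which is exactly what I can't tell apart right now.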

Is this on purpose? Is there any way to know whether there were actually 
'\n' characters at the beginning and/or end of the original content (and 
how many were original)?

The quick brown fox jumps over the lazy dog. PlainAvocado.
