Re: Parser removes file content and treats it as Metadata

Tim Allison Thu, 25 Jan 2024 12:02:51 -0800

I'm sorry for not looking into this and responding sooner. That's the way
the RFC822 parser works. It attempts to read the headers and put those into
the metadata fields appropriately, and it tries to put the content in the
body. SUBJECT: XYZ is slipping through into the body because of the
newlines.
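
To illustrate the header/body split described above, here is a minimal Java
sketch. It is not the RFC822Parser (or mime4j) code. It only assumes the usual
message convention that the header block ends at the first blank line. The
class name HeaderBodySplitSketch, its split method, and the sample lines are
all made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration only: NOT Tika's or mime4j's actual implementation.
class HeaderBodySplitSketch {

    /** Holds the two halves of a message-like text. */
    static final class Split {
        final Map<String, String> headers;
        final String body;

        Split(Map<String, String> headers, String body) {
            this.headers = headers;
            this.body = body;
        }
    }

    /** Lines before the first blank line are headers; everything after is body. */
    static Split split(List<String> lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        StringBuilder body = new StringBuilder();
        boolean inBody = false;
        for (String line : lines) {
            if (inBody) {
                // Once the header block has ended, "Name: value" lines are plain text.
                body.append(line).append('\n');
            } else if (line.trim().isEmpty()) {
                // The first blank line ends the header block.
                inBody = true;
            } else {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    headers.put(line.substring(0, colon).trim(),
                                line.substring(colon + 1).trim());
                }
            }
        }
        return new Split(headers, body.toString());
    }

    public static void main(String[] args) {
        // Made-up input, loosely shaped like the fields mentioned in this thread.
        Split s = split(Arrays.asList(
                "FROM: sender",
                "TO: recipient",
                "",
                "SUBJECT: example subject",
                "",
                "Lorem ipsum dolor sit amet"));
        System.out.println("Headers: " + s.headers.keySet());
        System.out.println("Body:\n" + s.body);
    }
}
```

With that input, FROM and TO end up in the header map, while SUBJECT, which
follows a blank line, stays in the body.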

So, you can process the metadata for the from/to, etc. Or, along the lines
of what Ken pointed out, you can turn off the RFC822Parser, and the
TextAndCSVParser should give you all of the text. As you point out, though,
the TextAndCSVParser will not parse embedded files.
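
For the first workaround (processing the metadata), one rough sketch is to
prepend the header values back onto the extracted body. Assumptions: the plain
Map stands in for Tika's Metadata object, the class and method names are
hypothetical, and the keys are simply the ones that appear in the tika-app
output later in this message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: a plain Map stands in for Tika's Metadata object.
class MetadataPrependSketch {

    /** Prepends "key: value" lines for the given keys, then a blank line, then the body. */
    static String rebuild(Map<String, String> metadata, String body, String... keys) {
        StringBuilder out = new StringBuilder();
        for (String key : keys) {
            String value = metadata.get(key);
            if (value != null) {
                out.append(key).append(": ").append(value).append('\n');
            }
        }
        if (out.length() > 0) {
            out.append('\n'); // separate the recovered header lines from the body
        }
        return out.append(body).toString();
    }

    public static void main(String[] args) {
        // Keys and values copied from the tika-app output in this message.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Message-From", "XYZ EMPL. OPPORUNITY");
        metadata.put("Message-To", "DFG. OF ABC");
        metadata.put("Message:Raw-Header:LOCATION", "A.B.C Dist");

        String body = "SUBJECT: XYZ EMPL. OPPORUNITY\n\nLorem ipsum dolor sit amet";
        System.out.println(rebuild(metadata, body,
                "Message-From", "Message-To", "Message:Raw-Header:LOCATION"));
    }
}
```

This recovers the values but probably not the exact original lines: the keys
are the parser's metadata names rather than the labels in the file, and the
original ordering and spacing are not preserved.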

The only option I see is to open an issue on our JIRA and hope that a dev
has time and the inclination to add a "writeMetadataToHandler" parameter or
similar on the RFC822Parser. Or, potentially, fork our RFC822Parser and add
the capability?

I'm sorry I can't be of more use.

Best,

          Tim

This is what I get with tika-app 2.9.0 with the "-J -t" options:

[{"X-TIKA:Parsed-By-Full-Set":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.mail.RFC822Parser"],"X-TIKA:content_handler":"ToTextContentHandler","dc:creator":"XYZ
EMPL.
OPPORUNITY","resourceName":"SampleFile_M_001(1).txt","Message-To":"DFG. OF
ABC","X-TIKA:Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.mail.RFC822Parser"],"Message:Raw-Header:LOCATION":"A.B.C
Dist","Message:From-Name":"XYZ EMPL.
OPPORUNITY","X-TIKA:parse_time_millis":"149","X-TIKA:embedded_depth":"0","X-TIKA:content":"\n\n\n\n\n\n\n\n\n\n\n\n\n\nSUBJECT:
XYZ EMPL. OPPORUNITY\n\nLorem ipsum dolor sit amet, consectetur adipiscing
elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut
enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut
aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint
occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit
anim id est laborum.\n\n\n","Content-Length":"553","Message-From":"XYZ
EMPL. OPPORUNITY","Content-Type":"message/rfc822"}]

On Thu, Jan 25, 2024 at 2:11 PM Gerardo Hernandez <[email protected]>
wrote:

> Hi Ken,
>
> Unfortunately, forcing Tika to use TXTParser does not solve our problem.
> It would work for very simple emails, but we also want to be
> able to parse emails with embedded resources like images, which is only
> done by the MIME parser, i.e. RFC822Parser.
>
> Thanks anyway,
> Gerardo
> ------------------------------
> *From:* Ken Krugler <[email protected]>
> *Sent:* Wednesday, January 24, 2024 02:40 PM
> *To:* [email protected] <[email protected]>
> *Cc:* Tim Allison <[email protected]>
> *Subject:* Re: Parser removes file content and treats it as Metadata
>
> Hi Gerardo,
>
> What happens if you set the filename in the metadata, before calling
> parse()?
>
> E.g.
>
>    metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
>
> I don’t recall whether the Resource Name detector will be called first,
> before the Mime Magic detector (Tim?). If it is, then having an xxx.txt
> filename _should_ trigger Tika to use the generic text parser, versus the
> email parser.
>
> — Ken
>
> On Jan 23, 2024, at 11:26 AM, Gerardo Hernandez <[email protected]>
> wrote:
>
> Sure, I have attached a simplified version of the code we use. Please let me
> know if there is any way to configure the behavior of the parser so that
> the initial lines are also included in the handler contents.
>
> Regards,
> Gerardo
> ------------------------------
> *From:* Ken Krugler <[email protected]>
> *Sent:* Saturday, January 20, 2024 11:54 AM
> *To:* [email protected] <[email protected]>
> *Cc:* Mikhail Gushinets <[email protected]>
> *Subject:* Re: Parser removes file content and treats it as Metadata
>
> I assume you are getting the initial lines as metadata because Tika is
> identifying the file as email.
>
> If you include details on your code (how you are calling the parser) and
> version, I’m confident someone can suggest reasonable work-arounds.
>
> Regards,
>
> — Ken
>
>
> On Jan 18, 2024, at 8:44 PM, Gerardo Hernandez <[email protected]>
> wrote:
>
> This is the input file; I think it was not uploaded correctly.
>
> Best regards,
> Gerardo
> ------------------------------
>
> *From:* Gerardo Hernandez
> *Sent:* Thursday, January 18, 2024 10:39 PM
> *To:* [email protected] <[email protected]>
> *Cc:* Mikhail Gushinets <[email protected]>
> *Subject:* Parser removes file content and treats it as Metadata
>
> Hi,
>
> We are using the Tika parser to obtain files' contents, and then we do some
> post-processing on them. Unfortunately, we recently got some unexpected
> results from the AutoDetectParser using the attached text file
> SampleFile_M_001.txt
> <https://aparavi-my.sharepoint.com/:t:/p/g_hernandez/EUPjfGMN1k1Pii3e4h6tzNoBuVrxR7pAsRugZf-Y59Cmjg>.
> Basically, what we expect as a result is the whole text in the file, but we
> only get (received by the Handler):
>
> SUBJECT: XYZ EMPL. OPPORUNITY
>
> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
> tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
> veniam... (Till the end of the file).
>
> and the initial text of the file (FROM, TO, DATE, LOCATION) is not
> included but registered as metadata:
>
> <image.png>
>
> I would like to know if there is any way to prevent this from happening
> using AutoDetectParser so that all the text is included in the data sent
> to the Handler.
> <SampleFile_M_001.txt>
>
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> Custom big data solutions
> Flink & Pinot
>
