Hi Ken,

Unfortunately, forcing Tika to use TXTParser does not solve our problem. It 
would work for very simple emails, but we also want to be able to parse emails 
with embedded resources such as images, which is only done by the MIME parser, 
i.e. RFC822Parser.
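
One possible middle ground, offered only as a sketch: decide per file before 
handing it to Tika. The check below is plain Java with no Tika calls. It assumes 
that emails with embedded resources carry a MIME-Version or Content-Type header 
field, which a header-like text file normally would not. The class name, method 
name, and sample strings are hypothetical.

```java
import java.util.Locale;

class MimeHeuristic {

    // Returns true if the header block (everything before the first blank
    // line) contains a MIME-Version or Content-Type field.
    // Assumption: real MIME emails carry at least one of these fields.
    static boolean looksLikeMimeMessage(String head) {
        for (String line : head.split("\r?\n")) {
            if (line.trim().isEmpty()) {
                break; // a blank line ends the header block
            }
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("mime-version:") || lower.startsWith("content-type:")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up samples, not the real SampleFile_M_001.txt contents.
        String memo = "FROM: A\nTO: B\nDATE: today\nLOCATION: here\n\nSUBJECT: XYZ\n\nLorem ipsum";
        String mail = "From: a@example.com\nMIME-Version: 1.0\n"
                + "Content-Type: multipart/related; boundary=x\n\n--x";
        System.out.println(looksLikeMimeMessage(memo)); // false
        System.out.println(looksLikeMimeMessage(mail)); // true
    }
}
```

When the check returns false, the file could be sent down the plain-text route 
(TXTParser); otherwise it could go through normal detection, so RFC822Parser 
still handles MIME emails. Whether this heuristic fits our real files would 
need to be verified.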

Thanks anyway,
Gerardo
________________________________
From: Ken Krugler <[email protected]>
Sent: Wednesday, January 24, 2024 02:40 PM
To: [email protected] <[email protected]>
Cc: Tim Allison <[email protected]>
Subject: Re: Parser removes file content and treats it as Metadata

Hi Gerardo,

What happens if you set the filename in the metadata, before calling parse()?

E.g.

   metadata.set(Metadata.RESOURCE_NAME_KEY, filename);

I don’t recall whether the Resource Name detector is called before the Mime 
Magic detector (Tim?). If it is, then having an xxx.txt filename _should_ 
trigger Tika to use the generic text parser rather than the email parser.

— Ken

On Jan 23, 2024, at 11:26 AM, Gerardo Hernandez <[email protected]> wrote:

Sure, I have attached a simplified version of the code we use. Please let me 
know if there is any way to configure the behavior of the parser so that the 
initial lines are also included in the handler contents.

Regards,
Gerardo
________________________________
From: Ken Krugler <[email protected]>
Sent: Saturday, January 20, 2024 11:54 AM
To: [email protected] <[email protected]>
Cc: Mikhail Gushinets <[email protected]>
Subject: Re: Parser removes file content and treats it as Metadata

I assume you are getting the initial lines as metadata because Tika is 
identifying the file as email.

If you include details on your code (how you are calling the parser) and 
version, I’m confident someone can suggest reasonable work-arounds.

Regards,

— Ken


On Jan 18, 2024, at 8:44 PM, Gerardo Hernandez <[email protected]> wrote:

This is the input file; I think it was not uploaded correctly.

Best regards,
Gerardo
________________________________

From: Gerardo Hernandez
Sent: Thursday, January 18, 2024 10:39 PM
To: [email protected] <[email protected]>
Cc: Mikhail Gushinets <[email protected]>
Subject: Parser removes file content and treats it as Metadata

Hi,

We are using the Tika parser to obtain files' contents, and then we do some 
post-processing on them. Unfortunately, we recently got some unexpected results 
from the AutoDetectParser using the attached text file 
SampleFile_M_001.txt<https://aparavi-my.sharepoint.com/:t:/p/g_hernandez/EUPjfGMN1k1Pii3e4h6tzNoBuVrxR7pAsRugZf-Y59Cmjg>.
Basically, what we expect as a result is the whole text in the file, but we only 
get (received by the Handler):

SUBJECT: XYZ EMPL. OPPORUNITY

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor 
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam... (Till 
the end of the file).

and the initial text of the file (FROM, TO, DATE, LOCATION) is not included but 
registered as metadata:

<image.png>

I would like to know if there is any way to prevent this from happening using 
AutoDetectParser so that all the text is included in the data sent to the 
Handler.
<SampleFile_M_001.txt>

--------------------------
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink & Pinot