Re: [EXTERNAL] Re: Tika parser not parsing email content

Kashif Khan Mon, 06 Nov 2023 07:30:40 -0800

Could you please share the web link for downloading Tika 2.9.1?

Thanks,
Kashif


On Mon, Nov 6, 2023, 7:02 PM Tim Allison <[email protected]> wrote:

> The fix should be in the 2.9.1 release. Please let us know if that isn't
> working for you.
>
> On Sat, Nov 4, 2023 at 2:19 PM Kashif Khan <[email protected]>
> wrote:
>
>> Hi Tim,
>>
>> It would be very helpful if you could let me know the path forward now that
>> the ticket https://issues.apache.org/jira/browse/TIKA-4153 has been closed.
>> How do you think I should proceed with parsing the document? Is there a
>> newer version I can download, and where exactly can I find it?
>>
>> Thanks,
>> Kashif
>>
>>
>> On Wed, Oct 11, 2023 at 1:46 AM Tim Allison <[email protected]> wrote:
>>
>>> I opened: https://issues.apache.org/jira/browse/TIKA-4153
>>>
>>> RFC822 detection has been a game of whack-a-mole, especially with
>>> malformed files.  We should continue to refine the detection and fix this
>>> issue.
>>>
>>>
>>> On Tue, Oct 10, 2023 at 2:07 PM Josh Burchard <[email protected]>
>>> wrote:
>>>
>>>> Reading this surprised me.  It's too bad the default behavior isn't to
>>>> treat any non-.eml files as plain text and require a configuration setting
>>>> to turn on the detection magic. I personally wouldn't have expected the
>>>> noted behavior and it's likely our company's customers are encountering
>>>> this loss of fidelity when we index their file attachments. Is there a Jira
>>>> item where I can read about the reason behind its current implementation?
>>>>  -Josh/HCL
>>>>
>>>>
>>>>
>>>>
>>>> From:        "Tim Allison" <[email protected]>
>>>> To:        [email protected]
>>>> Date:        10/10/2023 12:47 PM
>>>> Subject:        [EXTERNAL] Re: Tika parser not parsing email content
>>>> ------------------------------
>>>>
>>>>
>>>>
>>>> I can confirm this is still happening in our main/3.x branch. As you
>>>> probably guessed, the issue is that the file is identified as an email and
>>>> then parsed as if it were one.  If you know that all you have are plain
>>>> text files, you might consider using the TextAndCSVParser or just the
>>>> TXTParser.
>>>>
>>>> One fix for this (and this is for the devs on the list) would be to
>>>> modify our minShouldMatch so that we have at least one of the field
>>>> patterns at offset 0 and then one of the other field patterns at 0:1024. We
>>>> currently require only two of the fields anywhere within the first 1024
>>>> characters.
>>>>
>>>> On Tue, Oct 10, 2023 at 7:12 AM Kashif Khan <[email protected]> wrote:
>>>>
>>>> Hi team,
>>>> I have been using the Tika parser to parse a few text files, and it worked
>>>> fine until I ran into an issue: it cannot parse a text file that contains
>>>> email/message contents.
>>>> That is, if the text file contains any of the terms 'From: ', 'To: ', or
>>>> 'Sent: ', Tika fails to parse the text correctly.
>>>> In my case, the parser drops most of the lines of the text file, and only a
>>>> single line remains out of 40.
>>>>
>>>> I am sharing a snippet of the text file for an example:
>>>>
>>>>
>>>> Some text here 1.
>>>> Some text here 2.
>>>> Some text here 3.
>>>> Original Message-----
>>>> From: [email protected] <[email protected]>
>>>> Sent: Thursday, October 31, 2019 9:52 AM
>>>> To: Some person, (The XYZ group)
>>>> Subject: RE: Mr. Random person phone call: MESSAGE
>>>> Hi, I am available now to receive the call.
>>>> Some text here 4.
>>>> Some text here 5.
>>>> Some text here 6.
>>>>
>>>> The Tika parser is reducing the above text to only one line as below:
>>>> Subject: RE: Mr. Random person phone call: MESSAGE
>>>>
>>>> Note that this happens in versions later than Tika 1.19; version 1.19
>>>> parses the contents perfectly fine.
>>>>
>>>> Could you please help me understand the issue, or suggest a path forward?
>>>> That would be very helpful.
>>>>
>>>> Thanks in advance.
>>>> -Kashif
>>>>
>>>>
>>>>
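
For readers following the detection discussion quoted above, here is a minimal Java sketch contrasting the two rules Tim describes: the current one (two header fields anywhere in the first 1024 characters) and the proposed one (one field at offset 0 plus a different field within the first 1024 characters). This is not Tika's detector code. The class name Rfc822Heuristic, the field list, and the choice to match fields only at the start of a line are all assumptions made for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration only; this is not how Tika implements detection. */
final class Rfc822Heuristic {
    // Assumed header field names, chosen for the example.
    private static final List<String> FIELDS =
            List.of("From", "To", "Sent", "Subject", "Date", "Cc");
    // Size of the prefix that is inspected, as mentioned in the thread.
    private static final int WINDOW = 1024;

    private Rfc822Heuristic() {}

    // Returns at most the first WINDOW characters of the text.
    private static String window(String text) {
        return text.length() > WINDOW ? text.substring(0, WINDOW) : text;
    }

    // True if "Field:" occurs at the start of any line in head
    // (the line-start restriction is an assumption of this sketch).
    private static boolean atLineStart(String head, String field) {
        return Pattern.compile("(?m)^" + Pattern.quote(field) + ":")
                .matcher(head).find();
    }

    /** Loose rule: at least two distinct fields anywhere in the window. */
    static boolean looseMatch(String text) {
        String head = window(text);
        long hits = FIELDS.stream().filter(f -> atLineStart(head, f)).count();
        return hits >= 2;
    }

    /** Stricter rule: one field at offset 0 plus a different field in the window. */
    static boolean strictMatch(String text) {
        String head = window(text);
        for (String first : FIELDS) {
            if (!head.startsWith(first + ":")) {
                continue;
            }
            for (String other : FIELDS) {
                if (!other.equals(first) && atLineStart(head, other)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For a text laid out like Kashif's sample, where the header-like lines each sit on their own line but only after several lines of ordinary text, looseMatch returns true while strictMatch returns false. A text that begins with "From:" and has a "To:" line shortly after satisfies both rules.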
