Could you please share the web link to download Tika 2.9.1?

Thanks,
Kashif

On Mon, Nov 6, 2023, 7:02 PM Tim Allison <[email protected]> wrote:

> The fix should be in the 2.9.1 release. Please let us know if that isn't
> working for you.
>
> On Sat, Nov 4, 2023 at 2:19 PM Kashif Khan <[email protected]>
> wrote:
>
>> Hi Tim,
>>
>> It would be very helpful if you could let me know the path forward after
>> the closure of the ticket https://issues.apache.org/jira/browse/TIKA-4153.
>> How do you think I can proceed with parsing the document? Is there a
>> newer version I can download? Where exactly can I find this version to
>> download?
>>
>> Thanks,
>> Kashif
>>
>>
>> On Wed, Oct 11, 2023 at 1:46 AM Tim Allison <[email protected]> wrote:
>>
>>> I opened: https://issues.apache.org/jira/browse/TIKA-4153
>>>
>>> RFC822 detection has been a game of whack-a-mole, especially with
>>> malformed files. We should continue to refine the detection/fix this
>>> issue.
>>>
>>>
>>> On Tue, Oct 10, 2023 at 2:07 PM Josh Burchard <[email protected]>
>>> wrote:
>>>
>>>> Reading this surprised me. It's too bad the default behavior isn't to
>>>> treat any non-.eml files as plain text and require a configuration
>>>> setting to turn on the detection magic. I personally wouldn't have
>>>> expected the noted behavior, and it's likely our company's customers
>>>> are encountering this loss of fidelity when we index their file
>>>> attachments. Is there a Jira item where I can read about the reason
>>>> behind its current implementation?
>>>> -Josh/HCL
>>>>
>>>>
>>>> From: "Tim Allison" <[email protected]>
>>>> To: [email protected]
>>>> Date: 10/10/2023 12:47 PM
>>>> Subject: [EXTERNAL] Re: Tika parser not parsing email content
>>>> ------------------------------
>>>>
>>>> I can confirm this is still happening in our main/3.x branch. As you
>>>> probably guessed, the issue is that the file is identified as an email
>>>> and then parsed as if it were one. If you know that all you have are
>>>> plain text files, you might consider using the TextAndCSVParser or just
>>>> the TXTParser.
>>>>
>>>> One fix for this (and this is for the devs on the list) would be to
>>>> modify our minShouldMatch so that we have at least one of the field
>>>> patterns at offset 0 and then one of the other field patterns at
>>>> 0:1024. We currently require only two of the fields anywhere within the
>>>> first 1024 characters.
>>>>
>>>> On Tue, Oct 10, 2023 at 7:12 AM Kashif Khan <[email protected]> wrote:
>>>>
>>>> Hi team,
>>>> I have been using the Tika parser to parse a few text files, and it
>>>> had been working fine until I ran into an issue where it is not able to
>>>> parse a text file that contains 'email/message contents'.
>>>> This means that if the text file contains any of the terms 'From: ',
>>>> 'To: ', or 'Sent: ', it fails to parse the text correctly.
>>>> In my case, the parser is deleting lines of the text file, and only a
>>>> single line remains out of 40 lines.
>>>>
>>>> I am sharing a snippet of the text file as an example:
>>>>
>>>> Some text here 1.
>>>> Some text here 2.
>>>> Some text here 3.
>>>> Original Message-----
>>>> From: [email protected] <[email protected]>
>>>> Sent: Thursday, October 31, 2019 9:52 AM
>>>> To: Some person, (The XYZ group)
>>>> Subject: RE: Mr. Random person phone call: MESSAGE
>>>> Hi, I am available now to receive the call.
>>>> Some text here 4.
>>>> Some text here 5.
>>>> Some text here 6.
>>>>
>>>> The Tika parser is reducing the above text to only one line, as below:
>>>> Subject: RE: Mr. Random person phone call: MESSAGE
>>>>
>>>> Note that this happens in versions later than Tika 1.19; 1.19 parses
>>>> the contents perfectly fine.
>>>>
>>>> Could you please help me understand the issue, or suggest a path
>>>> forward?
>>>> This would be very helpful.
>>>>
>>>> Thanks in advance.
>>>> -Kashif
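
For anyone following the thread: the detection change Tim describes for the devs (one field pattern at offset 0 plus one of the other field patterns within the first 1024 characters, instead of any two fields anywhere in that window) can be illustrated with a small standalone sketch. This is not Tika's code. The class name, the list of header fields, and the regex are all assumptions made up for illustration, and the sample addresses are placeholders.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not Tika's actual RFC822 detection.
final class Rfc822HeuristicSketch {
    // Assumed header-field pattern: a known field name at the start of a line.
    private static final Pattern FIELD = Pattern.compile(
            "^(From|To|Cc|Sent|Date|Subject|Message-ID|Received):[ \\t]",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
    private static final int WINDOW = 1024;

    // Only the first 1024 characters are inspected.
    private static String head(String text) {
        return text.substring(0, Math.min(text.length(), WINDOW));
    }

    // Distinct header-field names found anywhere in the window.
    private static Set<String> distinctFields(String head) {
        Set<String> names = new HashSet<>();
        Matcher m = FIELD.matcher(head);
        while (m.find()) {
            names.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return names;
    }

    // Rule as described for current behavior: two fields anywhere in the window.
    static boolean looseMatch(String text) {
        return distinctFields(head(text)).size() >= 2;
    }

    // Proposed rule: a field at offset 0, plus a different field in the window.
    static boolean strictMatch(String text) {
        String head = head(text);
        boolean fieldAtStart = FIELD.matcher(head).lookingAt();
        return fieldAtStart && distinctFields(head).size() >= 2;
    }

    public static void main(String[] args) {
        String embedded = "Some text here 1.\nSome text here 2.\n"
                + "Original Message-----\n"
                + "From: sender@example.com\n"
                + "Sent: Thursday, October 31, 2019 9:52 AM\n"
                + "To: Some person, (The XYZ group)\n"
                + "Subject: RE: phone call: MESSAGE\n"
                + "Some text here 4.\n";
        String mail = "From: a@example.com\nTo: b@example.com\nSubject: hi\n\nbody\n";
        System.out.println("embedded loose=" + looseMatch(embedded)
                + " strict=" + strictMatch(embedded));
        System.out.println("mail     loose=" + looseMatch(mail)
                + " strict=" + strictMatch(mail));
    }
}
```

Under these assumptions, a plain text file that merely quotes a message (like the snippet in the original report) satisfies the loose rule but not the strict one, while a file that starts with a header field satisfies both. Whether the change that shipped for TIKA-4153 works this way is not stated in the thread.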
