Thanks a bunch for the suggested workaround. Also, I have checked, and the bug still exists in the latest 1.4 nightly build.
-Vjeran

On Tue, Jul 26, 2016 at 2:22 AM, Luís Filipe Nassif <[email protected]> wrote:
> Hi,
>
> Based on https://en.wikipedia.org/wiki/Mbox, you can add the following entry
> in org/apache/tika/mime/custom-mimetypes.xml:
>
> <mime-type type="application/mbox">
>   <magic priority="70">
>     <match value="From " type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.mbox"/>
> </mime-type>
>
> The priority must be greater than message/rfc822. It sometimes returns false
> positives, but detects mbox files without extension, which are very very
> common.
>
> Luis
>
> 2016-07-25 16:36 GMT-03:00 Allison, Timothy B. <[email protected]>:
>>
>> <repositories>
>>   <repository>
>>     <id>apache.snapshots</id>
>>     <name>Apache Development Snapshot Repository</name>
>>     <url>https://repository.apache.org/content/repositories/snapshots/</url>
>>     <releases>
>>       <enabled>false</enabled>
>>     </releases>
>>     <snapshots>
>>       <enabled>true</enabled>
>>     </snapshots>
>>   </repository>
>> </repositories>
>>
>> -----Original Message-----
>> From: Vjeran Marcinko [mailto:[email protected]]
>> Sent: Monday, July 25, 2016 3:25 PM
>> To: [email protected]
>> Subject: Re: Problem with detection of .mbox file
>>
>> Thanks guys, I can do it in some clumsy way, but before I try it, is there
>> some maven repo for such nightly builds that I can include and specify these
>> 1.4-SNAPSHOT deps?
>>
>> On Mon, Jul 25, 2016 at 9:14 PM, Allison, Timothy B. <[email protected]>
>> wrote:
>> >> Can you try with a recent Tika nightly build?
>> > e.g.
>> > https://builds.apache.org/job/Tika-trunk/lastBuild/org.apache.tika$tika-app/
>> >
>> > -----Original Message-----
>> > From: Nick Burch [mailto:[email protected]]
>> > Sent: Monday, July 25, 2016 3:03 PM
>> > To: [email protected]
>> > Subject: Re: Problem with detection of .mbox file
>> >
>> > On Mon, 25 Jul 2016, Vjeran Marcinko wrote:
>> >> I first noticed that my .mbox file doesn't get parsed by MBoxParser,
>> >> and later, after debugging Tika source code, I found what the problem
>> >> is - default detector doesn't even recognize it as "application/mbox"
>> >> MIME type, and although file extension is .mbox, it ignores this hint
>> >> because its "magic" way of detecting file type based on some amount
>> >> of initial bytes detects it is "text/html"
>> >
>> > Can you try with a recent Tika nightly build? Only there have been
>> > some tweaks done around that sort of thing recently
>> >
>> > If a nightly build / build from Git still shows the issue, please open a
>> > bug in Jira and attach a problematic file, then we can take a look!
>> >
>> > Nick
>
>
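
For anyone reading this thread later: the magic rule suggested above (the string "From " at offset 0) can be illustrated with a small stand-alone Java sketch. This is not Tika's detector code; the class name MboxMagicSketch and the method looksLikeMbox are hypothetical names made up for the illustration, and the check only mimics what the <match> element describes. It also shows why the rule can give false positives, as Luis noted.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of the "From " magic rule; not part of Tika.
public class MboxMagicSketch {

    // The five bytes the <match value="From " offset="0"/> rule looks for.
    private static final byte[] MAGIC = "From ".getBytes(StandardCharsets.US_ASCII);

    // Returns true when the given leading bytes start with "From ".
    static boolean looksLikeMbox(byte[] prefix) {
        if (prefix.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(prefix, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) {
        byte[] mbox = "From alice@example.com Mon Jul 25 15:03:00 2016\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] html = "<html><body>hello</body></html>"
                .getBytes(StandardCharsets.US_ASCII);
        // Plain text that happens to start with "From " also matches,
        // which is the false-positive case mentioned in the thread.
        byte[] prose = "From the very beginning, this was a text file."
                .getBytes(StandardCharsets.US_ASCII);

        System.out.println("mbox:  " + looksLikeMbox(mbox));
        System.out.println("html:  " + looksLikeMbox(html));
        System.out.println("prose: " + looksLikeMbox(prose));
    }
}
```

In real use the entry in custom-mimetypes.xml is all that is needed; the sketch is only meant to make the matching behaviour concrete.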
