Hi,

Based on https://en.wikipedia.org/wiki/Mbox, you can add the following
entry in org/apache/tika/mime/custom-mimetypes.xml:

<mime-type type="application/mbox">
    <magic priority="70">
        <match value="From " type="string" offset="0"/>
    </magic>
    <glob pattern="*.mbox"/>
</mime-type>

The priority must be higher than that of message/rfc822. This rule sometimes
returns false positives, but it detects mbox files without an extension,
which are very common.
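
To show why false positives can happen, here is a minimal sketch of what a
"From " match at offset 0 amounts to. It is plain Java, not Tika's actual
detector code; the class name MboxMagicSketch and the helper
startsWithFromLine are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

class MboxMagicSketch {
    // The five bytes the magic rule looks for at offset 0.
    private static final byte[] MAGIC = "From ".getBytes(StandardCharsets.US_ASCII);

    // Hypothetical helper (not a Tika API): true if the data begins with "From ".
    static boolean startsWithFromLine(byte[] data) {
        if (data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A typical mbox separator line followed by a header.
        byte[] mbox = "From luis@example.org Mon Jul 25 16:36:00 2016\nSubject: hi\n"
                .getBytes(StandardCharsets.US_ASCII);
        // Ordinary text that happens to start with the same five bytes.
        byte[] plainText = "From now on, the meeting starts at nine.\n"
                .getBytes(StandardCharsets.US_ASCII);
        // A single RFC 822 message that starts with a header instead.
        byte[] eml = "Received: from mail.example.org\nSubject: hi\n"
                .getBytes(StandardCharsets.US_ASCII);

        System.out.println(startsWithFromLine(mbox));      // true
        System.out.println(startsWithFromLine(plainText)); // true (false positive)
        System.out.println(startsWithFromLine(eml));       // false
    }
}
```

Any file whose first five bytes are "From " matches, whether or not it is
really an mbox, which is the trade-off mentioned above.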

Luis

2016-07-25 16:36 GMT-03:00 Allison, Timothy B. <[email protected]>:

>     <repositories>
>         <repository>
>             <id>apache.snapshots</id>
>             <name>Apache Development Snapshot Repository</name>
>             <url>https://repository.apache.org/content/repositories/snapshots/</url>
>             <releases>
>                 <enabled>false</enabled>
>             </releases>
>             <snapshots>
>                 <enabled>true</enabled>
>             </snapshots>
>         </repository>
>     </repositories>
>
> -----Original Message-----
> From: Vjeran Marcinko [mailto:[email protected]]
> Sent: Monday, July 25, 2016 3:25 PM
> To: [email protected]
> Subject: Re: Problem with detection of .mbox file
>
> Thanks guys, I can do it in some clumsy way, but before I try it, is there
> some Maven repo for such nightly builds that I can include and specify
> these 1.4-SNAPSHOT deps?
>
> On Mon, Jul 25, 2016 at 9:14 PM, Allison, Timothy B. <[email protected]>
> wrote:
> >> Can you try with a recent Tika nightly build?
> > e.g.
> > https://builds.apache.org/job/Tika-trunk/lastBuild/org.apache.tika$tika-app/
> >
> > -----Original Message-----
> > From: Nick Burch [mailto:[email protected]]
> > Sent: Monday, July 25, 2016 3:03 PM
> > To: [email protected]
> > Subject: Re: Problem with detection of .mbox file
> >
> > On Mon, 25 Jul 2016, Vjeran Marcinko wrote:
> >> I first noticed that my .mbox file doesn't get parsed by MBoxParser,
> >> and later, after debugging Tika source code, I found what the problem
> >> is - default detector doesn't even recognize it as "application/mbox"
> >> MIME type, and although file extension is .mbox, it ignores this hint
> >> because its "magic" way of detecting file type based on some amount
> >> of initial bytes detects it is "text/html"
> >
> > Can you try with a recent Tika nightly build? Only there have been
> > some tweaks done around that sort of thing recently
> >
> > If a nightly build / build from Git still shows the issue, please open a
> > bug in Jira and attach a problematic file, then we can take a look!
> >
> > Nick
>
