Re: File Content Type Detection

Jukka Zitting Thu, 26 Jan 2012 03:11:12 -0800

Hi,

On Thu, Jan 26, 2012 at 1:31 AM, Public Network Services
<[email protected]> wrote:
> is = new BufferedInputStream(new FileInputStream(new File(name)));
>
> Metadata metadata = new Metadata();
> MediaType type = new DefaultDetector().detect(is, metadata);


I'd recommend using new Tika().detect(new File(name)) if all you're
interested in is the detected type.

If you need or want to access the Detector instance directly, it's
better to use TikaInputStream.get(new File(name)) instead of
wrapping a FileInputStream in a BufferedInputStream.
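
For illustration, the two approaches above could look roughly like the
sketch below. The class and method names (DetectExample,
detectWithFacade, detectWithDetector) are made up for this example, and
it assumes a reasonably recent Tika release on the classpath where the
TikaInputStream.get(Path) overload is available; with older releases
the File overload named above would be used instead.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;

import org.apache.tika.Tika;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

// Hypothetical helper class, not part of Tika.
public class DetectExample {

    // Facade approach: hand the File to Tika and get the type name back.
    static String detectWithFacade(File file) throws IOException {
        return new Tika().detect(file);
    }

    // Lower-level approach: call a Detector directly, but pass it a
    // TikaInputStream so the underlying file stays reachable.
    static MediaType detectWithDetector(Path path) throws IOException {
        Detector detector = new DefaultDetector();
        Metadata metadata = new Metadata();
        try (TikaInputStream stream = TikaInputStream.get(path)) {
            return detector.detect(stream, metadata);
        }
    }

    public static void main(String[] args) throws IOException {
        File file = new File(args[0]);
        System.out.println(detectWithFacade(file));
        System.out.println(detectWithDetector(file.toPath()));
    }
}
```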

Tika's advanced type detection for container formats like MS Office
depends on being able to access the actual underlying file or at least
a temporary file copy of an incoming stream on the local file system.
This is because most container formats rely on random access and thus
can't be processed efficiently as a stream.
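
The difference between random access and streaming can be sketched
with the JDK alone, using ZIP as a stand-in container format (the
OOXML Office formats are ZIP-based). This is not Tika's detection code;
the class and method names are made up for this example. With a file on
disk, ZipFile can look an entry up by name through the archive's
central directory, whereas ZipInputStream has to walk the entries in
order until it reaches the one it wants.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of Tika.
public class ContainerAccessDemo {

    // Random access: needs a real file, looks the entry up by name.
    static boolean hasEntryRandomAccess(Path zip, String name) throws IOException {
        try (ZipFile file = new ZipFile(zip.toFile())) {
            return file.getEntry(name) != null;
        }
    }

    // Streaming: visits entries in order. Returns how many entries were
    // read before the match (inclusive), or -1 if it was never found.
    static int entriesScannedInStream(Path zip, String name) throws IOException {
        try (InputStream in = Files.newInputStream(zip);
             ZipInputStream zin = new ZipInputStream(in)) {
            int scanned = 0;
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                scanned++;
                if (entry.getName().equals(name)) {
                    return scanned;
                }
            }
            return -1;
        }
    }

    // Writes a small temporary ZIP with the given entry names, in order.
    static Path writeSampleZip(String... names) throws IOException {
        Path zip = Files.createTempFile("container-demo", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            for (String n : names) {
                out.putNextEntry(new ZipEntry(n));
                out.write(n.getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return zip;
    }

    public static void main(String[] args) throws IOException {
        Path zip = writeSampleZip("a.xml", "b.xml", "word/document.xml");
        System.out.println(hasEntryRandomAccess(zip, "word/document.xml"));
        System.out.println(entriesScannedInStream(zip, "word/document.xml"));
        Files.delete(zip);
    }
}
```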

The TikaInputStream class was designed to make the underlying file (or
a temporary copy) available to such code whenever possible. If you do
not pass a TikaInputStream to the detector, the detection code assumes
that the actual file is not available on the local file system, and
the container detection mechanism is skipped for performance reasons.

Using new Tika().detect(new File(name)) takes care of all these
details for you, which is why it's the recommended way to do type
detection unless you explicitly need direct access to the lower-level
functionality in Tika.

BR,

Jukka Zitting
