I cannot use the file reference directly, because the file is handled outside
my detection code, which only has access to an InputStream or a Reader.

Luckily, TikaInputStream.get() seems to work with the external InputStream.

More specifically, for the basic MS Office formats (OOXML and legacy), I am
now getting:

   - For *docx*:
     application/vnd.openxmlformats-officedocument.wordprocessingml.document
   - For *xlsx*:
     application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
   - For *pptx*:
     application/vnd.openxmlformats-officedocument.presentationml.presentation
   - For *doc*: application/msword
   - For *xls*: application/vnd.ms-excel
   - For *ppt*: application/vnd.ms-powerpoint

Are these the only possible results, or could another type be returned for,
say, a Word document?

Where do I find all the Tika type declarations and names?

Many thanks!


On Thu, Jan 26, 2012 at 1:10 PM, Jukka Zitting <[email protected]> wrote:

> Hi,
>
> On Thu, Jan 26, 2012 at 1:31 AM, Public Network Services
> <[email protected]> wrote:
> > is = new BufferedInputStream(new FileInputStream(new File(name)));
> >
> > Metadata metadata = new Metadata();
> > MediaType type = new DefaultDetector().detect(is, metadata);
>
> I'd recommend using new Tika().detect(new File(name)) if all you're
> interested in is the detected type.
>
> If you need or want to access the Detector instance directly, it's
> better if you use TikaInputStream.get(new File(name)) instead of
> wrapping a FileInputStream to a BufferedInputStream.
>
> Tika's advanced type detection for container formats like MS Office
> depends on being able to access the actual underlying file or at least
> a temporary file copy of an incoming stream on the local file system.
> This is because most container formats rely on random-access and thus
> can't efficiently be processed in a stream format.
>
> The TikaInputStream class was designed to make the underlying file (or
> a temporary copy) available to such code when available. If you do not
> pass in a TikaInputStream to the detector, the detection code assumes
> that the actual file is not available on the local file system and
> thus for performance reasons the container detection mechanism is
> skipped.
>
> Using new Tika().detect(new File(name)) takes care of all these
> details for you, which is why it's the recommended way to do type
> detection unless you explicitly need direct access to the lower-level
> functionality in Tika.
>
> BR,
>
> Jukka Zitting
>
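
To illustrate the point above about container formats, here is a minimal
sketch that uses only the JDK (no Tika). It shows that two ZIP-based
containers start with the same magic bytes, so a detector that only peeks at
the head of a plain stream cannot tell them apart and has to look inside the
archive. The class name ContainerSniffDemo, its helper methods, and the
"docx-like"/"xlsx-like" labels are hypothetical names made up for this
example; the entry names word/document.xml and xl/workbook.xml are the
conventional main parts of docx and xlsx packages. This is not how Tika
implements its container detection, only a demonstration of why leading
bytes are not enough.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ContainerSniffDemo {

    // Builds a tiny in-memory ZIP archive holding one entry with the given name.
    static byte[] zipWithEntry(String entryName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Magic-byte check: true if the data starts with the ZIP local file
    // header signature 'P' 'K' 0x03 0x04.
    static boolean looksLikeZip(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K'
                && data[2] == 3 && data[3] == 4;
    }

    // Hypothetical classifier: walks the archive entries and guesses the
    // document family from the top-level folder name.
    static String guessByEntries(byte[] data) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("word/")) return "docx-like";
                if (name.startsWith("xl/")) return "xlsx-like";
                if (name.startsWith("ppt/")) return "pptx-like";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return "unknown zip";
    }

    public static void main(String[] args) {
        byte[] docx = zipWithEntry("word/document.xml");
        byte[] xlsx = zipWithEntry("xl/workbook.xml");

        // Both archives pass the same magic-byte test.
        System.out.println(looksLikeZip(docx) + " " + looksLikeZip(xlsx));

        // Only the archive contents distinguish them.
        System.out.println(guessByEntries(docx));
        System.out.println(guessByEntries(xlsx));
    }
}
```

The legacy doc, xls and ppt formats raise the same issue with a different
container, which is where having the underlying file (or a temporary copy)
available matters.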
