I cannot use the file reference directly, because it is handled outside my detection code, which only has access to an InputStream or a Reader.
Luckily, TikaInputStream.get() seems to work with the external InputStream.

More specifically, for the 3 basic MS-Office formats, I am now getting:

- For *docx*: application/vnd.openxmlformats-officedocument.wordprocessingml.document
- For *xlsx*: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
- For *pptx*: application/vnd.openxmlformats-officedocument.presentationml.presentation
- For *doc*: application/msword
- For *xls*: application/vnd.ms-excel
- For *ppt*: application/vnd.ms-powerpoint

Is that the only answer possible, or could there be another type returned for, say, Word? Where do I find all the Tika type declarations and names?

Many thanks!

On Thu, Jan 26, 2012 at 1:10 PM, Jukka Zitting <[email protected]> wrote:

> Hi,
>
> On Thu, Jan 26, 2012 at 1:31 AM, Public Network Services
> <[email protected]> wrote:
> > is = new BufferedInputStream(new FileInputStream(new File(name)));
> >
> > Metadata metadata = new Metadata();
> > MediaType type = new DefaultDetector().detect(is, metadata);
>
> I'd recommend using new Tika().detect(new File(name)) if all you're
> interested in is the detected type.
>
> If you need or want to access the Detector instance directly, it's
> better if you use TikaInputStream.get(new File(name)) instead of
> wrapping a FileInputStream to a BufferedInputStream.
>
> Tika's advanced type detection for container formats like MS Office
> depends on being able to access the actual underlying file or at least
> a temporary file copy of an incoming stream on the local file system.
> This is because most container formats rely on random-access and thus
> can't efficiently be processed in a stream format.
>
> The TikaInputStream class was designed to make the underlying file (or
> a temporary copy) available to such code when available. If you do not
> pass in a TikaInputStream to the detector, the detection code assumes
> that the actual file is not available on the local file system and
> thus for performance reasons the container detection mechanism is
> skipped.
>
> Using new Tika().detect(new File(name)) takes care of all these
> details for you, which is why it's the recommended way to do type
> detection unless you explicitly need direct access to the lower-level
> functionality in Tika.
>
> BR,
>
> Jukka Zitting
>
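
To illustrate the point in the quoted reply about container formats, here is a minimal sketch that uses only the JDK, not Tika. It is not Tika's implementation. It assumes the well-known layout of OOXML files (a ZIP package with a [Content_Types].xml part) and shows that the leading bytes alone only reveal "ZIP", while the more specific type only appears once the package is opened. The class name, the method names and the /main.xml part name are hypothetical and exist only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class; NOT part of Tika.
class ContainerSniffSketch {

    // Looks only at the leading magic bytes, as a purely stream-based check would.
    static String detectByMagic(byte[] data) {
        if (data.length >= 4 && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    // Opens the ZIP package and inspects [Content_Types].xml to refine the type.
    // Falls back to the magic-byte result when no known main part is declared.
    static String detectByContainer(byte[] data) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("[Content_Types].xml")) {
                    continue;
                }
                // readAllBytes() stops at the end of the current ZIP entry (Java 9+).
                String xml = new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                if (xml.contains("wordprocessingml.document.main+xml")) {
                    return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
                }
                if (xml.contains("spreadsheetml.sheet.main+xml")) {
                    return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
                }
                if (xml.contains("presentationml.presentation.main+xml")) {
                    return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return detectByMagic(data);
    }

    // Builds a tiny in-memory ZIP that declares one main part with the given content type.
    static byte[] samplePackage(String mainPartContentType) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            String xml = "<Types><Override PartName=\"/main.xml\" ContentType=\""
                    + mainPartContentType + "\"/></Types>";
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] docx = samplePackage(
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml");
        System.out.println(detectByMagic(docx));     // magic bytes only
        System.out.println(detectByContainer(docx)); // after opening the package
    }
}
```

Running main prints the generic ZIP type for the magic-byte check and the wordprocessingml type for the container check. The sketch reads the package sequentially from memory; a real detector working on large files would typically want the file itself, which is the role the quoted reply describes for TikaInputStream.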
