Thanks, Nick. It also supports MHT. Used Curl for Windows to feed some files into the default Solr installation (3.6.1) and it handled them with aplomb.
For those who want to use Curl, there's an EXE version here:
http://curl.haxx.se/dlwiz/?type=bin&os=Win64&flav=-

Add the folder where you put that EXE to your PATH environment variable. Then run something like this from the command prompt, in the folder containing the file you want to index:

curl "http://localhost:8983/solr/update/extract?literal.id=doc3&commit=true" -F "[email protected]"

Sincerely,
Alex

-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: 2 August 2012 5:29 PM
To: [email protected]
Subject: Re: File types supported

On Thu, 2 Aug 2012, Alexander Cougarman wrote:
> Hi. Does the latest version of Tika index text in these file types?
> - Office 2007/2010 file types of DOCX, XLSX, PPTX

Yes (though the few tiny bits of new functionality introduced in 2010 will be skipped over)

> - MHT file (MHTML Document)

Not sure, how close is this to a regular html file?

> This page helped on many of the file formats, but wanted to clarify:
> http://tika.apache.org/1.2/formats.html

Often the best way to check is to grab the tika-app jar, and try a few sample files with it

Nick
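
For anyone who wants to script the Curl step from this thread, here is a rough sketch rather than a tested recipe. It assumes a POSIX shell (for example Git Bash or Cygwin on Windows) and the default Solr URL from the command above. The form field name "myfile", the file name sample.mht, the function names, and the DRY_RUN switch are all made up for illustration and are not part of Solr or Curl.

```shell
#!/bin/sh
# Sketch: send one file to Solr's /update/extract handler with curl.
# Assumptions (illustrative, not from the original message): the form
# field name "myfile", the helper names, and the DRY_RUN switch.

SOLR_URL="http://localhost:8983/solr"

# Build the extract URL for a given document id.
extract_url() {
    printf '%s/update/extract?literal.id=%s&commit=true' "$SOLR_URL" "$1"
}

# Index one file. With DRY_RUN=1, only print the command that would run.
index_file() {
    doc_id="$1"
    file="$2"
    url="$(extract_url "$doc_id")"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'curl "%s" -F "myfile=@%s"\n' "$url" "$file"
    else
        curl "$url" -F "myfile=@$file"
    fi
}
```

Running `DRY_RUN=1 index_file doc3 sample.mht` prints the Curl command without contacting Solr, which is a cheap way to check the URL and file name before indexing anything. Dropping DRY_RUN sends the file for real.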
