RE: Solr fields for Microsoft files, image files, PDF, text files

2017-09-25 Thread Allison, Timothy B.
bq: How do I get a list of all valid field names based on the file type? bq: You don't. At least I've never found any. Plus various document formats will allow custom metadata fields, so there's no definitive list. It would be trivial to add field counts per MIME type to tika-eval. If you're
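
To illustrate the "field counts per MIME type" idea, here is a minimal Python sketch. It is not tika-eval code. It assumes you already have one metadata dict per parsed file (for example, from Tika's JSON output) and that each dict carries a string "Content-Type" key; the function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def field_counts_per_mime(docs):
    """Count how often each metadata field name appears per MIME type.

    `docs` is an iterable of dicts, one per parsed file. Each dict is
    assumed (not guaranteed by Tika) to hold a string "Content-Type".
    """
    counts = defaultdict(Counter)
    for meta in docs:
        # Drop parameters such as "; charset=UTF-8" so variants group together.
        mime = str(meta.get("Content-Type", "unknown")).split(";")[0].strip()
        for name in meta:
            if name != "Content-Type":
                counts[mime][name] += 1
    return {mime: dict(counter) for mime, counter in counts.items()}


# Hypothetical sample metadata, made up for illustration only.
sample = [
    {"Content-Type": "application/pdf", "dc:title": "A", "pdf:PDFVersion": "1.4"},
    {"Content-Type": "application/pdf", "dc:title": "B"},
    {"Content-Type": "text/plain; charset=UTF-8", "Content-Encoding": "UTF-8"},
]

for mime, fields in field_counts_per_mime(sample).items():
    print(mime, fields)
```

Run over a representative set of your own files, a tally like this shows which fields actually occur for each type, which is about as close to a "list of valid fields" as you can get.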

Re: Solr fields for Microsoft files, image files, PDF, text files

2017-09-25 Thread Erik Hatcher
Phillip - You may want to start with the example/files configuration that ships with Solr. It is specifically designed as a configuration (and UI!) for indexing rich files, and it does a bit more than the other examples - it pulls out acronyms, e-mail addresses, and URLs from text, as well as what
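
As a rough illustration of pulling acronyms, e-mail addresses, and URLs out of text, here is a small Python sketch. It is not how example/files does it (that is done inside Solr's own configuration); the regular expressions and the function name are simplified assumptions for demonstration.

```python
import re

# Deliberately simple patterns; real-world matching needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")
ACRONYM_RE = re.compile(r"\b[A-Z]{2,6}\b")  # runs of 2 to 6 capital letters


def extract_entities(text):
    """Return the unique e-mail addresses, URLs, and acronyms found in text."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "urls": sorted(set(URL_RE.findall(text))),
        "acronyms": sorted(set(ACRONYM_RE.findall(text))),
    }


text = "Contact jane@example.com or see http://example.org/docs about PDF and OCR."
print(extract_entities(text))
```

Each extracted list could then be sent to Solr as its own multi-valued field so it can be faceted or searched separately.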

Re: Solr fields for Microsoft files, image files, PDF, text files

2017-09-25 Thread Erick Erickson
bq: How do I get a list of all valid field names based on the file type? You don't. At least I've never found any. Plus various document formats will allow custom metadata fields, so there's no definitive list. bq: Also how do I search the "free form" text for a word/pattern in the Solr search

Solr fields for Microsoft files, image files, PDF, text files

2017-09-24 Thread Phillip Wu
Hi, I'm starting out with Solr on a Windows box. I want to index the following document types: doc, docx; xls, xlsx; ppt; vsd; pdf; txt; gif, jpeg, tiff. I understand that Solr uses Apache Tika to read these file types and return an XML stream back to Solr. For Tika image processing, I've loaded Tesseract.
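
One way to see which fields Tika produces for a given file is to call Solr's extracting request handler with extractOnly=true, which returns the extracted content and metadata without indexing anything. The Python sketch below only builds the request URL. The host, port, and the core name "files" are assumptions, and the handler must be enabled at /update/extract in your configuration.

```python
from urllib.parse import urlencode


def extract_only_url(base_url, core):
    """Build a URL for Solr's /update/extract handler in extract-only mode."""
    params = {
        "extractOnly": "true",  # return Tika's output instead of indexing it
        "wt": "json",           # ask for a JSON response
    }
    return f"{base_url.rstrip('/')}/{core}/update/extract?{urlencode(params)}"


# "files" is a hypothetical core name; substitute your own.
url = extract_only_url("http://localhost:8983/solr", "files")
print(url)
```

POSTing a document to that URL (for example, with curl or any HTTP client) should return the metadata field names Tika found for that one file, which you can compare across your doc, xls, pdf, and image samples.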