Hello,

Try using EmptyParser in your tika-config file for the document types whose content you do not want indexed.
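A rough sketch of what such an entry might look like, assuming PDFs are the type to skip. The exact schema depends on the Tika version bundled with your Jackrabbit release, so please check it against the tika-config.xml you already have rather than taking this as definitive:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Assumption: application/pdf is the type whose content should not be extracted. -->
    <!-- EmptyParser produces no text content for the MIME types mapped to it. -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
```

Any other document types whose content you still want extracted would most likely need their own parser entries in the same file.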
Thanks and regards,
Nilay Parmar

-----Original Message-----
From: Patrick Welfringer [mailto:[email protected]]
Sent: Wednesday, December 18, 2013 3:21 PM
To: [email protected]
Subject: Can Lucene be configured to avoid downloading file contents?

Hi,

*Can anyone familiar with Lucene please share their insight?*

The question is this: *is there any way to configure Lucene to index only certain whitelisted metadata*, or exclude blacklisted metadata?

Indeed, we believe that excluding the “file” metadata could dramatically reduce the time it takes Lucene to download and process the large number of PDF files in our particular setup. We don’t need file contents to be indexed, only other metadata like “creation date”, “keywords” etc.

The “Luke” tool tells us that none of the file contents are indexed. Yet during the hour-long indexing, we see all of the metadata being downloaded and written to disk, including document contents.

If you can help us find a way to prevent Lucene from indexing the entire Jackrabbit repository, you’ll cheer up many mailing list subscribers that have similar issues!

Cheers,
Patrick
