Sorry, you’ll get 0 byte files for an error that caused Tika batch to do a 
restart (hang/OOM); depending on the cause, you may also get an error logged in 
batch-process-error.xml.  If your OS kills the process or something truly 
catastrophic happens, the only trace you have is the 0 byte file.
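
A minimal sketch of that check, assuming plain java.nio and the tika-batch 
output directory passed as the only argument (not part of tika-batch itself):

import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

// Minimal sketch: list the 0 byte files that a catastrophic failure
// (OS kill, OOM-triggered restart, etc.) leaves behind in the output directory.
public class FindZeroByteOutputs {
    public static void main(String[] args) throws IOException {
        Path outputDir = Paths.get(args[0]);
        try (Stream<Path> paths = Files.walk(outputDir)) {
            paths.filter(Files::isRegularFile)
                 .forEach(p -> {
                     try {
                         if (Files.size(p) == 0) {
                             System.out.println("possible catastrophic failure: " + p);
                         }
                     } catch (IOException e) {
                         System.err.println("could not stat " + p + ": " + e);
                     }
                 });
        }
    }
}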


  For regular caught exceptions, you can look in the .json file (key: 
TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX+"runtime")
for the stack trace, or you can look in the logs as described below.
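
As a rough sketch of that check (assuming 
TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX + "runtime" resolves to 
"X-TIKA:EXCEPTION:runtime", and that a simple substring match is good enough; 
a real JSON parse would be more robust):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.stream.Stream;

// Rough sketch: report every .json metadata file under the tika-batch output
// directory that recorded a caught exception ("X-TIKA:EXCEPTION:runtime").
public class FindCaughtExceptions {
    public static void main(String[] args) throws IOException {
        Path outputDir = Paths.get(args[0]);
        try (Stream<Path> paths = Files.walk(outputDir)) {
            paths.filter(p -> p.toString().endsWith(".json"))
                 .forEach(p -> {
                     try {
                         String json = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                         if (json.contains("X-TIKA:EXCEPTION:runtime")) {
                             System.out.println("caught exception recorded in: " + p);
                         }
                     } catch (IOException e) {
                         System.err.println("could not read " + p + ": " + e);
                     }
                 });
        }
    }
}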

From: Allison, Timothy B. [mailto:[email protected]]
Sent: Friday, July 15, 2016 8:11 AM
To: [email protected]
Subject: RE: detect corrupt file and build a list of them before indexing in 
solr

Checking for 0 byte files is one option.  The other option is to configure the 
logs to capture exceptions.  I’ve attached the config files and the shell 
script that I use when running our large-scale regression testing here: 
https://wiki.apache.org/tika/TikaBatchUsage?action=AttachFile&do=view&target=tika-batch-sh.zip

To run those, unzip the folder, put the tika-app.jar in the bin/ directory, 
update the shell script for your <input_dir> and your <output_dir>, and you 
should be good to go.  You may need to create a “logs” directory.  Exceptions 
will be recorded in batch-process-warn.log, and the original file names are 
included along with the stack traces.

From: kostali hassan [mailto:[email protected]]
Sent: Friday, July 15, 2016 5:17 AM
To: [email protected]<mailto:[email protected]>
Subject: detect corrupt file and build a list of them before indexing in solr

I'm looking to index MS Word and PDF files by uploading data with Solr Cell, 
which uses Apache Tika. I am hoping to use Tika to detect corrupt files before 
indexing and get a list of the corrupted files, if that is possible.
When I run java -jar tika-app.jar <input_dir> <output_dir>, I get in 
<output_dir> all the files of <input_dir> in XML format, and all the corrupt 
files have size 0 KB (empty).
