I use tika-app 1.12.

2016-07-15 18:20 GMT+01:00 Allison, Timothy B. <[email protected]>:
> Can you share the shell script/bat file you’re using?
>
> *From:* kostali hassan [mailto:[email protected]]
> *Sent:* Friday, July 15, 2016 1:13 PM
> *To:* [email protected]
> *Subject:* Re: detect corrupt file and build a list of them before
> indexing in solr
>
> When I set inputDIR to d:\test, the log tells me:
> java.lang.RuntimeException: Crawler couldn't find this
> directory:D:\tika_batch_config\test
>
> The same happens if I set inputDIR to d:\Cvs; the log is:
> java.lang.RuntimeException: Crawler couldn't find this
> directory: D:\tika_batch_config\Cvs
>
> 2016-07-15 17:54 GMT+01:00 kostali hassan <[email protected]>:
>
> I added this directory and it is still not working.
>
> 2016-07-15 17:42 GMT+01:00 Allison, Timothy B. <[email protected]>:
>
> Y, the log tells you that the input directory wasn’t specified correctly:
>
> 1375 2016-07-15 17:33:17,354 [Thread-2] INFO
> org.apache.tika.batch.BatchProcessDriverCLI - BatchProcess:
> java.lang.RuntimeException: Crawler couldn't find this
> directory:D:\tika_batch_config\test
>
> *From:* kostali hassan [mailto:[email protected]]
> *Sent:* Friday, July 15, 2016 12:40 PM
> *To:* [email protected]
> *Subject:* Re: detect corrupt file and build a list of them before
> indexing in solr
>
> Only -JXmx1g works, and the inputDIR is empty, and I get these empty
> files in logs:
>
> batch-driver-warn.log
> batch-process-warn.log
> tika-batch-pdfbox.log
>
> and these attached files.
>
> 2016-07-15 16:36 GMT+01:00 Allison, Timothy B. <[email protected]>:
>
> Try changing the max heap to something that will work on your computer:
>
> -JXmx5g
>
> To (say):
>
> -JXmx1g
>
> *From:* kostali hassan [mailto:[email protected]]
> *Sent:* Friday, July 15, 2016 11:27 AM
> *To:* [email protected]
> *Subject:* Re: detect corrupt file and build a list of them before
> indexing in solr
>
> I get these files in the logs, and when I run the script it doesn't
> finish; it restarts all the time.
>
> 2016-07-15 13:19 GMT+01:00 Allison, Timothy B. <[email protected]>:
>
> Sorry, you’ll get 0 byte files for an error that caused Tika batch to do
> a restart (hang/oom); and depending on cause, you may get an error logged
> in batch-process-error.xml. If your OS kills the process or something
> truly catastrophic happens, the only trace you have is the 0 byte file.
>
> For regular caught exceptions, you can look in the .json file (key:
> TikaCoreProperties.*TIKA_META_EXCEPTION_PREFIX*+*"runtime"*)
> for the stack trace, or you can look in the logs as described below.
>
> *From:* Allison, Timothy B. [mailto:[email protected]]
> *Sent:* Friday, July 15, 2016 8:11 AM
> *To:* [email protected]
> *Subject:* RE: detect corrupt file and build a list of them before
> indexing in solr
>
> Checking for 0 byte files is one option. The other option is to configure
> the logs to capture exceptions. I’ve attached the config files and the
> shell script that I use when running our large scale regression testing
> here:
> https://wiki.apache.org/tika/TikaBatchUsage?action=AttachFile&do=view&target=tika-batch-sh.zip
>
> To run those, unzip the folder, put the tika-app.jar in the bin/
> directory, update the shell script for your <input_dir> and your
> <output_dir> and you should be good to go. You may need to create a “logs”
> directory. Exceptions will be recorded in the batch-process-warn.log, and
> original file names are included along with stack traces.
>
> *From:* kostali hassan [mailto:[email protected] <[email protected]>]
> *Sent:* Friday, July 15, 2016 5:17 AM
> *To:* [email protected]
> *Subject:* detect corrupt file and build a list of them before indexing
> in solr
>
> I am looking to index MS Word and PDF files by uploading data with Solr
> Cell using Apache Tika.
>
> I hope to use Tika to detect corrupt files before indexing and get a
> list of the corrupted files, if that is possible.
>
> I tried running java -jar tika-app.jar <input_dir> <output_dir>. In the
> output_dir I get all the files of <input_dir> in XML format, and all the
> corrupt files with size 0 KB (empty).
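
For anyone landing on this thread later, the two checks discussed above (0 byte output files, and the exception key in the .json output) could be scripted roughly as follows. This is only a sketch under assumptions, not something from Tim's zip: the function names are made up for illustration, the literal key "X-TIKA:EXCEPTION:runtime" is my reading of TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX + "runtime", and I assume each .json output holds either one metadata object or a list of them. Please check both against your own output before relying on it.

```python
import json
from pathlib import Path

# Assumption: the string value of
# TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX + "runtime".
EXCEPTION_KEY = "X-TIKA:EXCEPTION:runtime"


def find_empty_outputs(output_dir):
    """Return the sorted paths of all 0 byte files under output_dir.

    Per the thread, a 0 byte output marks a file that made Tika batch
    restart (hang/oom) or that failed catastrophically.
    """
    root = Path(output_dir)
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.stat().st_size == 0
    )


def find_logged_exceptions(output_dir, key=EXCEPTION_KEY):
    """Map each non-empty .json output to the stack traces stored under key."""
    failures = {}
    for path in sorted(Path(output_dir).rglob("*.json")):
        if path.stat().st_size == 0:
            continue  # already reported by find_empty_outputs
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (ValueError, OSError):
            continue  # unreadable or not valid JSON; skipped in this sketch
        # Assumption: one metadata object, or a list of metadata objects.
        records = data if isinstance(data, list) else [data]
        traces = [r[key] for r in records if isinstance(r, dict) and key in r]
        if traces:
            failures[path] = traces
    return failures
```

Calling find_empty_outputs on your <output_dir> gives the list of files that produced no output at all, and find_logged_exceptions gives the files whose parse threw a caught exception. Together they are one way to build the "corrupt files" list before sending anything to Solr.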
appBatchExecutor.sh
Description: Bourne shell script
