We have been using Tika to process a large variety of files, one at a time,
running it in server mode as follows on an Ubuntu 10.10 machine, with Java
1.7.0_b21:
java -jar ~/software/tika/tika-app-1.3.jar -t -s -p 9100
This seems to process all the PDFs we throw at it, occasionally bombs out on
PNGs (that's a separate thread), and otherwise processes JPGs (albeit as
"blank text") and other document types without concern.
However, when we threw a larger set of documents at it yesterday, we
noticed our process hang intermittently, and not always at the same
document after each restart and retry.
The file causing this was a 15MB plain text log file (from our Rails
application) - regrettably this means I can't share it, but if I find
another good example, I will. Processing seemed to spin through several
"chunks" of the file (we are downloading them from AWS) and then pause.
We tried taking AWS out of the equation by downloading the file locally
and running this in Ruby (1.8.7):
require 'socket'
s = TCPSocket.new('localhost', 9100)
File.open("/tmp/big.log", "r") { |f| s.write(f.read) }
s.close_write
puts s.read
s.close
This still hung, and the file failed to process. The same happened when
scanning the file with Tika in "GUI mode".
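One variation on the Ruby snippet above that might help rule out a
client-side stall: the snippet writes the whole file before reading anything
back, so if the server streams text back while it is still reading input,
both ends could in principle block on full socket buffers. The sketch below
reads the response on a separate thread while writing in chunks. Note that
send_and_receive is a hypothetical helper name, the 64KB chunk size is
arbitrary, and the buffer theory is only an assumption; it would not explain
the hang in GUI mode.

```ruby
require 'socket'

# Hypothetical helper (not part of Tika): streams a file to a socket in
# chunks while a second thread drains the response, so that neither side
# has to wait on a full socket buffer.
def send_and_receive(host, port, path, chunk_size = 64 * 1024)
  socket = TCPSocket.new(host, port)
  begin
    # Start draining the response before any data is written.
    reader = Thread.new { socket.read }

    File.open(path, 'rb') do |file|
      while (chunk = file.read(chunk_size))
        socket.write(chunk)
      end
    end

    socket.close_write # signal end of input to the server
    reader.value       # wait for the reader thread and return what it read
  ensure
    socket.close
  end
end
```

With the server from the command above running, this could be called as
puts send_and_receive('localhost', 9100, '/tmp/big.log').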
We have also tried using netcat (both nc and ncat, which are different tools
on Ubuntu), although this doesn't seem to work for ANY file on Ubuntu 10.10.
It does seem to work on Ubuntu 12.04, but the Ruby sample above doesn't,
so that's both a clue and a bit confusing. I've sidelined this as "an
oddity of netcat on Ubuntu 10", but it might be important.
Could there be an underlying OS library / package / behaviour causing Tika
to fail to parse this plain text file? It happily reports back the
metadata when run with the -m switch.
That's the extent of our investigation. Are there any other things we might
look into, or anything else we might be able to provide to assist with
diagnosing the issue?
Regards,
Ben