I've created another text file (1.2MB) that fails to scan, as per my previous post - a copy of it is available here:
https://www.dropbox.com/s/96iw12mrufovmql/gibberish.txt

Regards,
Ben

On 2 May 2013 16:54, Ben Turner <[email protected]> wrote:

> We have been using Tika to process a large variety of files, one at a
> time, running it in server mode as follows on an Ubuntu 10.10 machine, with
> Java 1.7.0_b21:
>
> java -jar ~/software/tika/tika-app-1.3.jar -t -s -p 9100
>
> This seems to process all PDFs we throw at it, occasionally bomb out on
> PNGs (that's a separate thread) and otherwise process JPGs (albeit as
> "blank text") and other document types without concern.
>
> However, when we threw a larger set of documents at it yesterday, we
> noticed our process hang intermittently, and not always at the same
> document after each restart and retry.
>
> The file causing this was a 15MB plain text log file (from our Rails
> application) - regrettably this means I can't share it, but if I find
> another good example, I will. This file seemed to spin through several
> "chunks" of the file (we are downloading them from AWS) and then pause.
>
> We tried taking AWS out of the question, by downloading the file locally,
> and running in Ruby (1.8.7):
>
> require 'socket'; s = TCPSocket.new('localhost', 9100);
> File.open("/tmp/big.log", "r") { |f| s.write(f.read); s.close_write; puts s.read }; s.close
>
> This file still hung, failing to process. This was also the case when
> trying to scan the file with Tika running in "GUI mode".
>
> We have also tried using netcat (both nc and ncat, which are different
> tools on Ubuntu) although this doesn't seem to work for ANY file on Ubuntu
> 10.10 - it does seem to work on Ubuntu 12.04, but the Ruby sample above
> doesn't, so that's both a clue, and a bit confusing. I've sidelined this as
> "an oddity of netcat on Ubuntu 10" but it might be important.
>
> Could there be an underlying OS library / package / behaviour causing Tika
> to fail to parse this plain text file? It happily reports back the
> metadata when run with the -m switch.
>
> That's the extent of our investigation. Are there any other things we
> might look into, or anything else we might be able to provide to assist
> with diagnosing the issue?
>
> Regards,
> Ben
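
For anyone trying to reproduce this: the Ruby sample in the quoted message writes the whole file to the socket before it reads anything back. I have not confirmed that this is related to the hang, but a variant that reads the reply on a separate thread while it writes, and gives up after a timeout, might help narrow down where the stall occurs. The following is only a sketch under those assumptions. `extract_text`, the 64KB chunk size and the 60 second limit are my own choices for illustration, not anything provided by Tika.

```ruby
require 'socket'
require 'timeout'

# Hypothetical helper (not part of Tika): stream a file to a Tika server
# socket while reading the reply concurrently, giving up after `limit` seconds.
def extract_text(path, host = 'localhost', port = 9100, limit = 60)
  Timeout.timeout(limit) do
    sock = TCPSocket.new(host, port)
    begin
      # Read the reply on its own thread so the client keeps draining the
      # socket while it is still sending the file.
      reader = Thread.new { sock.read }

      # Send the file in chunks rather than with one large write.
      File.open(path, 'rb') do |f|
        while (chunk = f.read(64 * 1024))
          sock.write(chunk)
        end
      end

      sock.close_write  # signal end of input, as in the original sample
      reader.value      # wait for the reader thread and return the full reply
    ensure
      sock.close
    end
  end
end
```

With the server started as above, `puts extract_text('/tmp/big.log')` should behave like the original one-liner, except that a stall raises `Timeout::Error` instead of blocking indefinitely.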
