I've created another text file (1.2MB) that fails to scan, as per my
previous post - a copy of it is available here:

https://www.dropbox.com/s/96iw12mrufovmql/gibberish.txt
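
One thing that may be worth ruling out (an assumption, not something confirmed here): the Ruby snippet quoted below writes the whole file before it reads any of the reply. If the server streams extracted text back while it is still reading input, both sides could end up blocked on full socket buffers. Below is a minimal sketch that drains the reply on a second thread while writing in chunks. The helper name send_and_collect, the chunk size and the time limit are illustrative choices. The default host and port are taken from the command line quoted below.

```ruby
require 'socket'

# Hypothetical helper: stream `path` to host:port in chunks while a second
# thread drains the reply, so neither side has to wait on a full buffer.
# Returns the reply as a string, or raises if no complete reply arrives
# within `limit` seconds.
def send_and_collect(path, host = 'localhost', port = 9100,
                     chunk = 64 * 1024, limit = 300)
  s = TCPSocket.new(host, port)

  # Reader thread: keep pulling whatever the server has sent until EOF.
  reader = Thread.new do
    out = String.new
    begin
      loop { out << s.readpartial(chunk) }
    rescue EOFError
      # Server closed its side: the reply is complete.
    end
    out
  end

  # Writer: send the file a chunk at a time, then signal end of input.
  File.open(path, 'rb') do |f|
    while (buf = f.read(chunk))
      s.write(buf)
    end
  end
  s.close_write

  # Wait for the reader, but report a hang instead of blocking forever.
  if reader.join(limit)
    reader.value
  else
    reader.kill
    raise "no complete reply within #{limit}s"
  end
ensure
  s.close if s
end
```

Called as puts send_and_collect('/tmp/big.log'), this either prints the reply or raises after the time limit. If the hang goes away with concurrent reading, that would point at the client side rather than at the parser.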

Regards,
Ben


On 2 May 2013 16:54, Ben Turner <[email protected]> wrote:

> We have been using Tika to process a large variety of files, one at a
> time, running it in server mode as follows on an Ubuntu 10.10 machine with
> Java 1.7.0_b21:
>
> java -jar ~/software/tika/tika-app-1.3.jar -t -s -p 9100
>
> This seems to process all PDFs we throw at it, occasionally bombs out on
> PNGs (that's a separate thread) and otherwise processes JPGs (albeit as
> "blank text") and other document types without concern.
>
> However, when we threw a larger set of documents at it yesterday, we
> noticed our process hanging intermittently, and not always at the same
> document after each restart and retry.
>
> The file causing this was a 15MB plain text log file (from our rails
> application) - regrettably this means I can't share it, but if I find
> another good example, I will. This file seemed to spin through several
> "chunks" of the file (we are downloading them from AWS) and then pause.
>
> We tried taking AWS out of the question, by downloading the file locally,
> and running in Ruby (1.8.7):
>
> require 'socket'
> s = TCPSocket.new('localhost', 9100)
> File.open("/tmp/big.log", "r") { |f| s.write(f.read) }
> s.close_write
> puts s.read
> s.close
>
> The file still hung and was never processed. The same happened when
> scanning the file with Tika running in "GUI mode".
>
> We have also tried using netcat (both nc and ncat, which are different
> tools on Ubuntu), although this doesn't seem to work for ANY file on Ubuntu
> 10.10. It does seem to work on Ubuntu 12.04, but the Ruby sample above
> doesn't, so that's both a clue and a bit confusing. I've sidelined this as
> "an oddity of netcat on Ubuntu 10", but it might be important.
>
> Could there be an underlying OS library / package / behaviour causing Tika
> to fail to parse this plain text file? It happily reports back the
> metadata when run with the -m switch.
>
> That's the extent of our investigation. Are there any other things we
> might look into, or anything else we might be able to provide to assist
> with diagnosing the issue?
>
> Regards,
> Ben
>
