We are trying to use Tika in a Ruby environment, aiming to read data from
Amazon S3 and stream it into ElasticSearch.
* Currently this means running Tika in server mode:
java -jar tika-app-1.3.jar -t --server --port 12345
* We then talk to it via Ruby sockets (for non-Rubyists, this streams a
document from the file system into our local Tika server over a simple
socket):
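#!/usr/bin/env ruby
require 'socket'

TCPSocket.open('127.0.0.1', 12345) do |socket|
  # Open in binary mode so the bytes are sent unmodified.
  File.open('/tmp/test.png', 'rb') do |file|
    # Stream the file in chunks (the 4096-byte size is arbitrary).
    while (chunk = file.read(4096))
      socket.write(chunk)
    end
  end
  socket.close_write  # signal end of input to Tika
  puts socket.read    # print whatever Tika sends back
end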
* This works for PDF and JPEG files, outputting the text content for PDFs
and nothing for JPEGs. However, whenever I stream a PNG to Tika, the Ruby
code fails with a 'broken pipe' error during one of the writes.
I have added logging and can see that a number of chunks do get written, but
the write fails partway through the file. There is no output from the Tika
server when this happens. I have also looked at the packets with Wireshark
and cannot see any obvious "null character" being written that would cause
the problem, but since it is binary data, such a thing may not be obvious
given my limited knowledge at this level.
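For reference, the logging mentioned above looks roughly like the sketch below. It is illustrative only, not the exact code we run: the helper name stream_with_logging, the 4096-byte chunk size, and the returned hash are assumptions made for this example. It copies the input in chunks, logs a running byte count, and reports how far it got if the write raises Errno::EPIPE.

```ruby
require 'stringio'

# Hypothetical helper (not part of Tika or the Ruby standard library):
# copies `input` to `output` in fixed-size chunks, logging a running total,
# and reports how many bytes were written before any broken-pipe error.
def stream_with_logging(input, output, chunk_size: 4096, log: $stderr)
  written = 0
  # IO#read(n) returns nil at end of input, which ends the loop.
  while (chunk = input.read(chunk_size))
    output.write(chunk)
    written += chunk.bytesize
    log.puts "wrote #{chunk.bytesize} bytes (total #{written})"
  end
  { bytes_written: written, error: nil }
rescue Errno::EPIPE => e
  # The peer closed its end of the connection while we were still writing.
  log.puts "broken pipe after #{written} bytes: #{e.message}"
  { bytes_written: written, error: e }
end
```

In the script above, this would be called with the File opened in 'rb' mode as the input and the TCPSocket as the output, so the last logged total shows roughly where in the PNG the write fails.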
In GUI mode, Tika has no problem opening the PNG files in question.
So whilst I accept it could be something on the Ruby side, PNGs fail
consistently when sent in server mode, so I wondered if anyone might know
why, and whether this is a known issue with Tika?
Regards,
Ben