That looks like a bug in TikaUtils.
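For whatever reason, when is.available() returns 0, we then assume that
fileUrl is not null. The check below only rules out the empty string:
"".equals(null) is false, so a null fileUrl passes the check and is handed to
new URL(fileUrl). We need to make sure that fileUrl is not null before trying
to open the file.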
if(is.available() == 0 && !"".equals(fileUrl)){
...
return TikaInputStream.get(new URL(fileUrl), metadata);
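A minimal sketch of the kind of guard I have in mind. The class and method
names (FileUrlGuard, shouldOpenFromUrl) are hypothetical and not part of
TikaUtils; the sketch only shows a check that rejects both null and the empty
string before the URL is opened.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FileUrlGuard {
    // Hypothetical helper, not part of TikaUtils: true only when the stream
    // reports no available bytes AND fileUrl is a non-null, non-empty string.
    static boolean shouldOpenFromUrl(InputStream is, String fileUrl) throws IOException {
        return is.available() == 0 && fileUrl != null && !fileUrl.isEmpty();
    }

    public static void main(String[] args) throws IOException {
        InputStream empty = new ByteArrayInputStream(new byte[0]);
        // The existing check, !"".equals(fileUrl), is true for null.
        System.out.println(!"".equals(null));                                    // true
        System.out.println(shouldOpenFromUrl(empty, null));                      // false
        System.out.println(shouldOpenFromUrl(empty, ""));                        // false
        System.out.println(shouldOpenFromUrl(empty, "file:///tmp/example.tif")); // true
    }
}
```

The file URL in main is a placeholder for illustration only.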
Would you mind opening a ticket on JIRA?
All,
Is there a reason why an InputStream would return 0 for available() but still
be readable?
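One case I can think of, sketched here as an illustration and not as a
diagnosis of this bug: the base java.io.InputStream.available() returns 0
unless a subclass overrides it. The CountingStream class below is hypothetical
and overrides only read(), so it reports 0 available bytes while still
yielding data.

```java
import java.io.IOException;
import java.io.InputStream;

public class AvailableZeroDemo {
    // Hypothetical stream that overrides only read(); it inherits
    // InputStream.available(), which returns 0 by default.
    static class CountingStream extends InputStream {
        private int remaining;

        CountingStream(int n) {
            remaining = n;
        }

        @Override
        public int read() {
            if (remaining <= 0) {
                return -1; // end of stream
            }
            remaining--;
            return 'x';
        }
    }

    // Reads the stream to the end and returns the number of bytes read.
    static int drain(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new CountingStream(3);
        System.out.println(in.available()); // 0
        System.out.println(drain(in));      // 3
    }
}
```

If the stream that tika-server receives behaves like this, available() == 0
would not be a reliable signal that the request body is empty.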
Best,
Tim
From: Malarout, Namrata (398M-Affiliate) [mailto:[email protected]]
Sent: Tuesday, July 14, 2015 1:35 PM
To: [email protected]
Subject: Inconsistent (buggy) behavior when using tika-server
Hi Folks,
I am using Tika trunk (1.10-SNAPSHOT) and posting documents to tika-server. An
example would be the following:
curl -T MOD09GA.A2014010.h30v12.005.2014012183944.vegetation_fraction.tif
http://localhost:9998/meta --header "Accept: application/json"
...
curl -T MOD09GA.A2014010.h30v12.005.2014012183944.vegetation_fraction.tif
http://localhost:9998/meta --header "Accept: application/rdf+xml"
...
curl -T MOD09GA.A2014010.h30v12.005.2014012183944.vegetation_fraction.tif
http://localhost:9998/meta --header "Accept: text/csv"
I am using a Python script to iterate through all the files in a folder. It
works for about 50% to 80% of the files. For the rest it returns an HTTP 500
error. When I individually post a file that previously failed under the Python
script, it sometimes works. When done in an ad hoc manner, it works most of
the time but fails sometimes. At times it is successful for the
application/rdf+xml format but fails for the application/json format. The
behavior is inconsistent.
Here is an example trace from a case where it does not work as expected [0].
A sample of the data being used can be found here [1].
Any help would be appreciated.
[0] https://paste.apache.org/lbAm
[1]
https://drive.google.com/file/d/0B6wmo4_-H0P2eWJjdTdtYS1HRGs/view?usp=sharing
Thanks,
Namrata Malarout