Tika and Python

Philipp Steinkrüger Sun, 22 May 2016 12:13:48 -0700

Dear list,

I am not sure this is the right place to ask, but since I don’t know a better 
place and some of you might use the tika-python package, I might as well give 
it a shot. If you know a better place, please let me know.


First issue/question:
I want four things from each file I send to Tika:

1) The mime-type
2) The language
3) The metadata
4) The content

Now, at the moment, this requires three separate connections to Tika, which is 
quite a bit of overhead. As per the documentation, I am running

for (1) detector.from_file('/path/to/file')
for (2) language.from_file('/path/to/file')
for (3) and (4) parsed = parser.from_file('/path/to/file')
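
For reference, the three calls above can be wrapped in one helper. This is only 
a minimal sketch of the current three-request approach, not a fix for it. The 
name describe_file is made up for illustration, and the 'metadata' and 
'content' keys are what I understand parser.from_file to return.

```python
def describe_file(path):
    """Collect mime type, language, metadata and content for one file.

    Still three separate requests to the Tika server, one per call below.
    """
    # Imported here so the helper can be defined even where tika-python
    # is not installed; the calls need a reachable Tika server.
    from tika import detector, language, parser

    mime_type = detector.from_file(path)   # (1) the mime-type
    lang = language.from_file(path)        # (2) the language
    parsed = parser.from_file(path)        # (3) and (4) in one response

    return {
        "mime_type": mime_type,
        "language": lang,
        # Assumed keys of the dict returned by parser.from_file.
        "metadata": parsed.get("metadata"),
        "content": parsed.get("content"),
    }
```

Calling describe_file('/path/to/file') returns a single dict, but the overhead 
of three connections remains.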

I can see that tika-python makes a request to localhost:9998/rmeta to get (3) 
and (4) and if I feed the file to localhost:9998/rmeta with curl, I can see 
that the result data does not include the detected language. But if I make the 
request to localhost:9998/meta, the data does include the detected language. I 
don’t quite see why language should be included in the /meta, but not in the 
/rmeta endpoint.

So my question is: is there any way to get all of (1)-(4) with a single 
request, preferably with tika-python?
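
To make the question concrete, here is a rough sketch of what a single request 
to the rmeta endpoint yields, using only the Python standard library instead of 
tika-python. It covers (1), (3) and (4) but not (2), since the language is 
missing from that response. The function names are made up, the URL is the 
local one mentioned above, and the key names "Content-Type" and 
"X-TIKA:content" are assumptions based on the curl output, so please check 
them against your own response.

```python
import json
import urllib.request

# Local Tika server address as used in this post.
RMETA_URL = "http://localhost:9998/rmeta"


def fetch_rmeta(path, url=RMETA_URL):
    """PUT one file to the rmeta endpoint and return the decoded JSON list."""
    with open(path, "rb") as handle:
        request = urllib.request.Request(
            url,
            data=handle.read(),
            method="PUT",
            headers={"Accept": "application/json"},
        )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def split_container(records):
    """Pull mime type, metadata and content out of the first rmeta record."""
    container = dict(records[0])  # copy, so the input is left untouched
    # Assumed key under which rmeta stores the extracted text.
    content = container.pop("X-TIKA:content", None)
    return {
        "mime_type": container.get("Content-Type"),
        "metadata": container,
        "content": content,
    }
```

With this, split_container(fetch_rmeta('/path/to/file')) needs one connection, 
and a second one would still be required for the language.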

Second issue/question:
I am also wondering how to deal with email attachments. If I feed an email to 
Tika with tika-python, the resulting content includes neither the attachment 
nor metadata about the attachment (at least as far as I can see). If I feed the 
file to the rmeta endpoint with curl, the resulting JSON contains, for my 
current test email with one attachment, 6 sections:

Section 1 contains the metadata.
Sections 2 and 3 contain the body text: section 2 in windows-1252 encoding, 
parsed with the TXTParser; section 3 in windows-1252 encoding, parsed with the 
HtmlParser.
Section 4 contains metadata about the attachment, a PDF file.
Sections 5 and 6 contain the rest of the email body.

I inspected the JSON result to find any distinction between the sections that 
would allow me to determine what the ‘real’ attachment is (my email programme 
is able to distinguish between them and only treats section 4 as an 
attachment), but I couldn’t find any. If I could find a distinction, I might 
then be able to send the file (again) to Tika, this time to the unpack 
endpoint, so that I can process the attachment.

So my question here is: how can I distinguish between ‘real’ attachments and 
txt vs. html embeddings? And what would be the way with the minimum number of 
connections to Tika to determine the ‘real’ attachment and unpack it?
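
To illustrate what I am comparing, here is a small sketch that lists a few 
fields per section of the rmeta result and then applies one possible 
heuristic: treat every section after the first as an attachment unless its 
content type is text/plain or text/html. This is only a guess, not something 
Tika documents as the way to do it, and it would misclassify a plain-text 
attachment. The function names are made up, and the keys "Content-Type", 
"X-Parsed-By" and "resourceName" are assumptions that may not be present in 
every record.

```python
def first_value(value):
    """Return the first item of a multi-valued field, else the value itself."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def summarize_sections(records):
    """List the fields that might help tell the sections apart."""
    rows = []
    for index, record in enumerate(records, start=1):
        rows.append({
            "section": index,
            "content_type": first_value(record.get("Content-Type")),
            "parsed_by": record.get("X-Parsed-By"),
            "resource_name": record.get("resourceName"),
        })
    return rows


def candidate_attachments(records):
    """Heuristic: skip section 1 and any text/plain or text/html section."""
    body_types = ("text/plain", "text/html")
    picked = []
    for row in summarize_sections(records)[1:]:
        content_type = row["content_type"] or ""
        if not content_type.startswith(body_types):
            picked.append(row["section"])
    return picked
```

For a result shaped like my test email, candidate_attachments would pick out 
only the PDF section, which could then be fetched with one further request to 
the unpack endpoint.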

I appreciate any suggestions you could give me.

Thanks and all best,
Philipp
