Dear list,
I am not sure this is the right place to ask, but since I don’t know a better
place and some of you might use the tika-python package, I might as well give
it a shot. If you know a better place, please let me know.
First issue/question:
I want four things from each file I send to Tika:
1) The mime-type
2) The language
3) The metadata
4) The content
Now, at the moment, this requires three separate connections to Tika, which is
quite a bit of overhead. As per the documentation, I am running
for (1) detector.from_file('/path/to/file')
for (2) language.from_file('/path/to/file')
for (3) and (4) parsed = parser.from_file('/path/to/file')
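Put together, the three calls above look roughly like the sketch below. The helper name collect_all and its optional module arguments are hypothetical and exist only so the sketch can be run without a Tika server; it also assumes that parser.from_file returns a dict with 'metadata' and 'content' keys.

```python
def collect_all(path, detector=None, language=None, parser=None):
    """Gather mime type, language, metadata and content for one file.

    Hypothetical helper: the three from_file calls are made separately,
    so the file is sent to Tika three times.
    """
    # Import lazily so the sketch can be exercised with stand-in objects.
    if detector is None:
        from tika import detector
    if language is None:
        from tika import language
    if parser is None:
        from tika import parser

    mime_type = detector.from_file(path)  # (1) first request
    lang = language.from_file(path)       # (2) second request
    parsed = parser.from_file(path)       # (3) and (4) third request

    return {
        "mime_type": mime_type,
        "language": lang,
        # Assumption: parsed is a dict with these two keys.
        "metadata": parsed.get("metadata"),
        "content": parsed.get("content"),
    }
```

Calling collect_all('/path/to/file') with no extra arguments uses the real tika-python modules.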
I can see that tika-python makes a request to localhost:9998/rmeta to get (3)
and (4) and if I feed the file to localhost:9998/rmeta with curl, I can see
that the result data does not include the detected language. But if I make the
request to localhost:9998/meta, the data does include the detected language. I
don’t quite see why language should be included in the /meta, but not in the
/rmeta endpoint.
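To reproduce the comparison without curl, something like the following sketch could be used. It assumes the server listens on localhost:9998 as above, that both endpoints accept a PUT with the file as the request body, and that both honour an "Accept: application/json" header. The function names build_put and fetch_json are hypothetical.

```python
import json
import urllib.request

# Assumption: the Tika server address used in this mail.
TIKA_URL = "http://localhost:9998"


def build_put(endpoint, data):
    """Build a PUT request for a Tika server endpoint such as 'meta' or 'rmeta'."""
    return urllib.request.Request(
        f"{TIKA_URL}/{endpoint}",
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def fetch_json(endpoint, path):
    """Send the file at `path` to the given endpoint and decode the JSON reply."""
    with open(path, "rb") as handle:
        request = build_put(endpoint, handle.read())
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_json('meta', '/path/to/file') and fetch_json('rmeta', '/path/to/file') and comparing the keys of the two results should show which fields, such as the language, appear in one reply but not in the other.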
So my question is: is there any way to get all of (1)-(4) with a single request,
preferably with tika-python?
Second issue/question:
I am also wondering how to deal with email attachments. If I feed an email to
tika with tika-python, the resulting content includes neither the attachment nor
metadata about the attachment (at least as far as I can see). If I feed the file
to the rmeta endpoint with curl, the resulting JSON contains, for my current
email test file with an attachment, 6 sections:
Section 1 contains the metadata.
Sections 2 and 3 contain the body of the text; section 2 is windows-1252
encoded and parsed with the TXTParser, section 3 is windows-1252 encoded and
parsed with the HtmlParser.
Section 4 contains metadata about the attachment, a PDF file.
Sections 5 and 6 contain the rest of the email body.
I inspected the json result in order to find any distinction between the
several sections that would allow me to determine what the ‘real’ attachment is
(my email programme is able to distinguish between them and only treats section
4 as an attachment), but I couldn’t find any. If I could find a distinction, I
might then be able to send the file to tika again, this time to the unpack
endpoint, so that I can process the attachment.
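For comparing the sections side by side, a small sketch like the one below can pull a few fields out of each entry of the rmeta JSON. The key names in KEYS_OF_INTEREST are assumptions and may differ between Tika versions, so they should be adjusted to whatever the JSON actually contains; the function name summarize_sections is hypothetical. It does not decide which section is an attachment, it only lays the candidate fields out for inspection.

```python
import json

# Assumed key names; adjust them to the keys present in your rmeta output.
KEYS_OF_INTEREST = (
    "Content-Type",
    "resourceName",
    "X-TIKA:embedded_resource_path",
    "X-Parsed-By",
)


def summarize_sections(rmeta_json):
    """Return one small dict per section of an rmeta JSON reply.

    `rmeta_json` is the raw JSON text, expected to be a list of objects.
    Keys missing from a section are reported as None.
    """
    sections = json.loads(rmeta_json)
    rows = []
    for index, section in enumerate(sections, start=1):
        row = {"section": index}
        for key in KEYS_OF_INTEREST:
            row[key] = section.get(key)
        rows.append(row)
    return rows
```

Printing the rows for the test email gives six lines, one per section, which makes it easier to spot a field that is set for section 4 only.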
So my question here is: how can I distinguish between ‘real’ attachments and
txt vs. html embeddings? And which approach needs the fewest connections to
Tika to identify the ‘real’ attachment and unpack it?
I appreciate any suggestions you could give me.
Thanks and all best,
Philipp