Re: Tika and Python

Philipp Steinkrüger Mon, 23 May 2016 02:20:29 -0700

Hi Chris,

thanks for answering so quickly. Some follow-up below:


> On 22 May 2016, at 21:57 , Chris Mattmann <[email protected]> wrote:
> 
> Philipp absolutely the right place to ask. I’ll try
> and answer the below:
> 
> 
> 
> 
> On 5/22/16, 12:12 PM, "Philipp Steinkrüger" <[email protected]> wrote:
> 
>> Dear list,
>> 
>> I am not sure this is the right place to ask, but since I don’t know a better 
>> place and some of you might use the tika-python package, I might as well 
>> give it a shot. If you know a better place, please let me know. 
>> 
>> First issue/question:
>> I want four things from each file I send to Tika:
>> 
>> 1) The mime-type
>> 2) The language
>> 3) The metadata
>> 4) The content
>> 
>> Now, at the moment, this requires three separate connections to Tika, which 
>> is quite a bit of overhead. As per the documentation, I am running
>> 
>> for (1) detector.from_file('/path/to/file')
>> for (2) language.from_file('/path/to/file')
>> for (3) and (4) parsed = parser.from_file('/path/to/file')
> 
> Define “overhead”. There are 4 network connections that return in milliseconds.
> Sorry, are you in an HPC environment, or an environment that will make millions
> of these per minute or per second?

Well, admittedly I only started thinking about this because I currently run the 
script and the server in a virtual machine on an already heavy-breathing 2008 
iMac. One connection takes around a second. Of course, this is nowhere near a 
proper production environment. But still, a file can sometimes be several 
megabytes in size, and if it needs to be sent to the server a number of times, 
that produces considerable traffic. It just doesn’t seem very parsimonious to 
me to send the whole file just to get, say, the language and then send it 
again to get the content. 

I do like the idea of a server, though. I just wish there were an option to get 
everything in one go.
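
To make concrete what I mean by "one go", here is a minimal sketch (Python 3, 
standard library only) that would send the file once to the /rmeta endpoint 
and read the mime type, the metadata and the content from the first entry of 
the reply. The function names are my own and hypothetical, and I am assuming 
that the reply is a JSON list whose first entry describes the container file 
and carries the text under an "X-TIKA:content" key; that should be checked 
against the actual server output. The language is not covered, since /rmeta 
does not return it at the moment.

```python
import json
import urllib.request

# Endpoint mentioned in this thread; adjust host and port to your setup.
RMETA_URL = "http://localhost:9998/rmeta"


def build_rmeta_request(data, url=RMETA_URL):
    """Build (but do not send) a PUT request carrying the raw file bytes."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def summarize(rmeta_entries):
    """Split the first (container) entry into mime type, metadata and content.

    Assumption: the content is stored under "X-TIKA:content" and the mime
    type under "Content-Type"; missing keys simply yield None.
    """
    container = dict(rmeta_entries[0])  # copy, so the input stays untouched
    content = container.pop("X-TIKA:content", None)
    return {
        "mime_type": container.get("Content-Type"),
        "metadata": container,
        "content": content,
    }


def fetch_summary(path):
    """Send the file at `path` to the server once and summarize the reply."""
    with open(path, "rb") as handle:
        request = build_rmeta_request(handle.read())
    with urllib.request.urlopen(request) as response:
        return summarize(json.load(response))
```

With a running server, fetch_summary('/path/to/file') would then need a single 
connection for (1), (3) and (4), plus one more for the language.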

> 
>> 
>> 
>> I can see that tika-python makes a request to localhost:9998/rmeta to get 
>> (3) and (4) and if I feed the file to localhost:9998/rmeta with curl, I can 
>> see that the result data does not include the detected language. But if I 
>> make the request to localhost:9998/meta, the data does include the detected 
>> language. I don’t quite see why language should be included in the /meta, 
>> but not in the /rmeta endpoint.
> 
> We can update /rmeta to output the language by default, no reason not to. 
> Please file a ticket at http://issues.apache.org/jira/browse/TIKA for the
> Tika REST server and one in GH at http://github.com/chrismattmann/tika-python
> and I’ll take a look.

Great, I’ll file the tickets.

> 
>> 
>> So my question is, is there any way to get all of (1)-(4) with a single 
>> request, preferably with tika-python?
> 
> Not at the moment, no.
> 
>> 
>> Second issue/question:
>> I am also wondering how to deal with email attachments. If I feed an email 
>> to tika with tika-python, the result content includes neither the attachment 
>> nor metadata about the attachment (at least as far as I can see). If I feed the 
>> file with curl to the ‘rmeta’ endpoint, the result json contains, for my current 
>> email test file with an attachment, 6 sections:
>> 
>> Section 1 contains the metadata.
>> Section 2 and 3 contain the body of the text; section 2 as windows-1252 
>> encoding, parsed with the TXTParser, section 3 as windows-1252 encoding, 
>> parsed with the HtmlParser.
>> Section 4 contains metadata about the attachment, a PDF file.
>> Section 5 and 6 contain the rest of the email body.
>> 
>> I inspected the json result in order to find any distinction between the 
>> several sections that would allow me to determine what the ‘real’ attachment 
>> is (my email programme is able to distinguish between them and only treats 
>> section 4 as an attachment), but I couldn’t find any. If I could find a 
>> distinction, I might then be able to send the file (again) to tika and this 
>> time to the unpack-endpoint, so that I can process the attachment.
> 
> Check parser.py in the tika folder for tika-python. You’ll see it handles
> this use case, but it aggregates up the metadata storing, for each key, 
> the value for the particular object. So, to get the PDF parser’s metadata
> “Author” field you may need to do, e.g., 
> 
> parsed = parser.from_file('blah.msg')
> parsed["metadata"]["Author"][3]
> 
> Does that make sense?

Yes, at least partly, I think. So for the test file described above, I get, for 
instance:

>>> for x in parsed["metadata"]["X-Parsed-By"]:
...     print x
... 
org.apache.tika.parser.DefaultParser
org.apache.tika.parser.mail.RFC822Parser
[u'org.apache.tika.parser.DefaultParser', 
u'org.apache.tika.parser.txt.TXTParser']
[u'org.apache.tika.parser.DefaultParser', 
u'org.apache.tika.parser.html.HtmlParser']
[u'org.apache.tika.parser.DefaultParser', 
u'org.apache.tika.parser.pdf.PDFParser']
[u'org.apache.tika.parser.DefaultParser', 
u'org.apache.tika.parser.html.HtmlParser']
[u'org.apache.tika.parser.DefaultParser', 
u'org.apache.tika.parser.txt.TXTParser']

So now I know how I can take a multipart email apart.
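
To work with that list, something along these lines might do. It is only a 
sketch in Python 3 with hypothetical helper names, and it assumes that the 
aggregated "X-Parsed-By" value always has the shape shown above: bare strings 
for the container message first, then one nested list per embedded part.

```python
def split_parser_chains(values):
    """Turn the mixed list into one parser chain per part, container first."""
    container = []
    embedded = []
    for value in values:
        if isinstance(value, str):
            container.append(value)       # bare strings: the container message
        else:
            embedded.append(list(value))  # each nested list: one embedded part
    return [container] + embedded


def leaf_parsers(values):
    """Return the short class name of the last parser in each chain."""
    return [
        chain[-1].rsplit(".", 1)[-1]
        for chain in split_parser_chains(values)
        if chain  # skip an empty container chain
    ]


# Same values as in the session above.
sample = [
    "org.apache.tika.parser.DefaultParser",
    "org.apache.tika.parser.mail.RFC822Parser",
    ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.txt.TXTParser"],
    ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.html.HtmlParser"],
    ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"],
    ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.html.HtmlParser"],
    ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.txt.TXTParser"],
]
print(leaf_parsers(sample))
# ['RFC822Parser', 'TXTParser', 'HtmlParser', 'PDFParser', 'HtmlParser', 'TXTParser']
```

Under Python 2, as in my session above, the isinstance check would have to 
test for basestring instead of str.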

>> So my question here is: how can I distinguish between ‘real’ attachments and 
>> txt vs. html embeddings? And what would be the way with the minimum number 
>> of connections to Tika to determine the ‘real’ attachment and unpack it?

This part is still unclear to me, though. I see now how tika-python aggregates 
the reply from tika, but I still don’t see how I can distinguish between ‘real’ 
attachments and embeddings. How, for instance, would I distinguish between an 
email that has a ‘real’ txt-attachment and an email that really has no 
attachment, when tika presents the content as various embeddings of txt and 
html parts? But tika certainly understands this distinction, because if asked 
for the content of the email, it strips away the PDF. What am I overlooking 
here?
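
In case it helps to narrow this down, the inspection of the rmeta result that 
I did by hand could be automated roughly as follows. This is a sketch in 
Python 3 that assumes the rmeta reply has been loaded as a list of 
dictionaries, one per section; the function name is hypothetical. It does not 
decide what a ‘real’ attachment is, it only lists, for each section, the 
metadata keys that are not shared by all sections, which is where a 
distinguishing marker would have to show up.

```python
def distinguishing_keys(entries):
    """For each rmeta entry, list the metadata keys not shared by all entries."""
    key_sets = [set(entry) for entry in entries]
    common = set.intersection(*key_sets) if key_sets else set()
    return [sorted(keys - common) for keys in key_sets]
```

Any key that appears only for section 4 of my test file would be a candidate 
for telling the PDF apart from the txt and html parts.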

> Thanks!
> 
> Cheers,
> Chris


Thanks again, Chris, and all best!
Philipp



> 
>> 
>> I appreciate any suggestions you could give me.
>> 
>> Thanks and all best,
>> Philipp
