Philipp, this is absolutely the right place to ask. I’ll try to answer below:
On 5/22/16, 12:12 PM, "Philipp Steinkrüger" <[email protected]> wrote:

>Dear list,
>
>I am not sure this the right place to ask, but since I don’t know a better
>place and some of you might use the tika-python package, I might as well give
>it a shot. If you know a better place, please let me know.
>
>First issue/question:
>I want four things from each file I send to Tika:
>
>1) The mime-type
>2) The language
>3) The metadata
>4) The content
>
>Now, at the moment, this requires three separate connections to Tika, which is
>quite a bit of overhead. As per the documentation, I am running
>
>for (1) detector.from_file('/path/to/file')
>for (2) language.from_file('/path/to/file')
>for (3) and (4) parsed = parser.from_file('/path/to/file')

Define “overhead”. There are 4 network connections that return in
milliseconds. Sorry, are you in an HPC environment, or an environment that
will make millions of these calls per minute or per second?

>
>
>I can see that tika-python makes a request to localhost:9998/rmeta to get (3)
>and (4) and if I feed the file to localhost:9998/rmeta with curl, I can see
>that the result data does not include the detected language. But if I make the
>request to localhost:9998/meta, the data does include the detected language. I
>don’t quite see why language should be included in the /meta, but not in the
>/rmeta endpoint.

We can update /rmeta to output the language by default; there is no reason
not to. Please file a ticket at http://issues.apache.org/jira/browse/TIKA for
the Tika REST server and one on GitHub at
http://github.com/chrismattmann/tika-python and I’ll take a look.

>
>So my question is, is there anyway to get all of (1)-(4) with a single request
>and preferable with tika-python?

Not at the moment, no.

>
>Second issue/question:
>I am also wondering how to deal with email attachments. If I feed an email to
>tika with tika-python, the result content does not include the attachment nor
>metadata about the attachment (at least as far I can see).
>If I feed the file
>with curl to ‘rmeta’-endpoint, the result json contains, for my current email
>test file with an attachment, 6 sections:
>
>Section 1 contains the metadata.
>Section 2 and 3 contain the body of the text; section 2 as windows-1252
>encoding, parsed with the TXTParser, section 3 as windows-1252 encoding,
>parsed with the HtmlParser.
>Section 4 contains metadata about the attachment, a PDF file.
>Section 5 and 6 contain the rest of the email body.
>
>I inspected the json result in order to find any distinction between the
>several sections that would allow me to determine what the ‘real’ attachment
>is (my email programme is able to distinguish between them and only treats
>section 4 as an attachment), but I couldn’t find any. If I could find a
>distinction, I might then be able to send the file (again) to tika and this
>time to the unpack-endpoint, so that I can process the attachment.

Check parser.py in the tika folder for tika-python. You’ll see it handles
this use case, but it aggregates up the metadata, storing, for each key, the
value for each particular object. So, to get the PDF parser’s metadata
“Author” field you may need to do, e.g.,

    parsed = parser.from_file('blah.msg')
    parsed["metadata"]["Author"][3]

Does that make sense?

>
>
>So my question here is: how can I distinguish between ‘real’ attachments and
>txt vs. html embeddings? And what would be the way with the minimum number of
>connections to Tika to determine the ‘real’ attachment and unpack it?

See above.

Thanks!

Cheers,
Chris

>
>I appreciate any suggestions you could give me.
>
>Thanks and all best,
>Philipp
>
>
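
On the first question: until a single endpoint returns everything, the three
tika-python calls can at least be wrapped behind one function. The sketch
below is only an illustration. describe_file and its parameter names are
hypothetical and not part of tika-python. The callables are passed in so the
helper can be tried without a Tika server, and it still makes one call per
callable, so the number of requests does not go down.

```python
from typing import Any, Callable, Dict


def describe_file(
    path: str,
    detect_mime: Callable[[str], str],
    detect_language: Callable[[str], str],
    parse: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Collect mime type, language, metadata and content for one file.

    Hypothetical helper: it only hides the separate calls behind a single
    function; each callable is still invoked once per file.
    """
    parsed = parse(path)
    return {
        "mime_type": detect_mime(path),
        "language": detect_language(path),
        "metadata": parsed.get("metadata"),
        "content": parsed.get("content"),
    }


# With tika-python installed and a Tika server reachable (not run here):
#   from tika import detector, language, parser
#   info = describe_file("/path/to/file", detector.from_file,
#                        language.from_file, parser.from_file)
```

Passing the functions in, rather than importing tika inside the helper, keeps
the sketch easy to try with stand-in functions first.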

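
On the second question: the indexing idea above can be illustrated with plain
dictionaries. This is a minimal sketch, not the code in parser.py.
aggregate_metadata is a hypothetical helper, the six sections and their
Content-Type values are made up to mirror the email described in the thread,
and padding absent keys with None (so that list positions line up with
section numbers) is an assumption of the sketch, not confirmed library
behaviour.

```python
from typing import Any, Dict, List


def aggregate_metadata(sections: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Merge per-section metadata dicts into one dict of lists.

    Every section contributes one slot per key (None when the key is absent),
    so index i in any list refers to section i. This alignment is a choice
    made for this sketch.
    """
    keys = sorted({key for section in sections for key in section})
    return {key: [section.get(key) for section in sections] for key in keys}


# Made-up stand-in for a six-section /rmeta-style response.
sections = [
    {"Content-Type": "message/rfc822", "Author": "Sender"},
    {"Content-Type": "text/plain; charset=windows-1252"},
    {"Content-Type": "text/html; charset=windows-1252"},
    {"Content-Type": "application/pdf", "Author": "PDF Author"},
    {"Content-Type": "text/plain; charset=windows-1252"},
    {"Content-Type": "text/html; charset=windows-1252"},
]

metadata = aggregate_metadata(sections)

# Locate the PDF section by content type, then read its Author by position.
pdf_index = metadata["Content-Type"].index("application/pdf")
print(pdf_index, metadata["Author"][pdf_index])
```

Looking sections up by Content-Type only works here because the attachment is
a PDF; a plain-text or HTML attachment would look the same as the body parts,
so treat this as a heuristic at best.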