Dear list,

I have started to explore the possibilities of Tika and I have a couple of questions, which I will send to the list in separate emails to keep things tidy.

To begin with, I noticed the following behaviour, which might or might not be a bug. I asked this question on Stack Overflow (https://stackoverflow.com/questions/37226842/tika-metadata-from-email-misses-date), but perhaps this is a better place.

I have two email test files:

(1) A file that has been created by using "Save As" in Mac Mail (this creates a .txt file)
(2) A file that has been created by dragging an email from Mac Mail to the Desktop (this creates an .eml file)

If I feed the files to Tika with

    curl -T filename http://localhost:9998/detect/stream

I get the response "message/rfc822" for both files. If I run

    curl -T filename http://localhost:9998/meta

I get the metadata, but in case (1) the date is not extracted, while in case (2) it is.

I understand, of course, that the .eml file includes the full raw header, while the .txt file only includes a very abbreviated header. However, even the abbreviated header does include a "Date" field, so I think Tika should extract it.

Is this a bug or intentional? If it is intentional, is there anything I could do to get Tika to extract the date in case (1)?

I am running Tika server 1.14.

Any suggestions much appreciated!

All best,
Philipp
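P.S. In case it helps anyone reproduce this, here is a minimal Python sketch of the comparison. It assumes the same server at http://localhost:9998 as in the curl calls, and it asks /meta for JSON via an Accept header. The helper names and the "date"/"created" substring filter are only my own illustration, not anything Tika defines, and the two file names on the command line are placeholders.

```python
import json
import sys
import urllib.request

# Assumption: same Tika server address as in the curl calls.
TIKA_URL = "http://localhost:9998"


def fetch_meta(path, base_url=TIKA_URL):
    """PUT a file to /meta (like `curl -T`) and return the metadata as a dict."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = urllib.request.Request(
        base_url + "/meta",
        data=body,
        method="PUT",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def date_like_keys(meta):
    """Hypothetical helper: keys whose name mentions 'date' or 'created'."""
    return sorted(
        k for k in meta if "date" in k.lower() or "created" in k.lower()
    )


def missing_keys(reference, other):
    """Keys present in `reference` but absent from `other`."""
    return sorted(set(reference) - set(other))


if __name__ == "__main__":
    # Usage: python compare_meta.py message.txt message.eml
    txt_meta = fetch_meta(sys.argv[1])
    eml_meta = fetch_meta(sys.argv[2])
    print("date-like keys in .txt:", date_like_keys(txt_meta))
    print("date-like keys in .eml:", date_like_keys(eml_meta))
    print("keys only in .eml:", missing_keys(eml_meta, txt_meta))
```

Running it against the two files should list which metadata keys the .eml result has that the .txt result lacks.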
