Hey Vjeran, I think this was a bug in an earlier version of Tika. I just tried your example with Tika 1.10 (the latest) and the main file got correctly detected as message/rfc822 with its subparts getting detected as text/plain, text/html, and image/jpeg -- the correct behavior!
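
If you want to double-check the MIME structure independently of Tika, here is a minimal sketch using Python's standard `email` package. It is not Tika and does no type detection; it only walks the multipart tree. `SAMPLE_EML` is an abbreviated stand-in for your message (the same nesting, but with shortened boundaries, placeholder example.com addresses, and a dummy base64 payload), and the helper names `leaf_content_types` and `top_level_type` are made up for this illustration.

```python
from email import message_from_string

# Abbreviated stand-in for the quoted Thunderbird message: same nesting
# (multipart/mixed -> multipart/alternative + image/jpeg), but the
# boundaries, addresses, and base64 payload are placeholders.
SAMPLE_EML = """\
To: someone@example.com
From: Sender <sender@example.com>
Subject: I am serious!
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="outer"

This is a multi-part message in MIME format.
--outer
Content-Type: multipart/alternative; boundary="inner"

--inner
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

This is *VERY IMPORTANT* text.

--inner
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 7bit

<html><body>This is <b>VERY IMPORTANT</b> text.</body></html>

--inner--

--outer
Content-Type: image/jpeg; name="cake.jpg"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="cake.jpg"

/9j/4AAQSkZJRgABAgAAAQABAAD/
--outer--
"""


def top_level_type(raw_message):
    """Return the Content-Type declared in the message's own headers."""
    return message_from_string(raw_message).get_content_type()


def leaf_content_types(raw_message):
    """Return the content types of all non-multipart parts, in document order."""
    msg = message_from_string(raw_message)
    return [part.get_content_type()
            for part in msg.walk()
            if not part.is_multipart()]


if __name__ == "__main__":
    print(top_level_type(SAMPLE_EML))
    for content_type in leaf_content_types(SAMPLE_EML):
        print("  ", content_type)
```

Note that the top-level header declares multipart/mixed; message/rfc822 is the type Tika assigns to the .eml file as a whole, while the leaf parts listed here correspond to the subparts mentioned above.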

Sergey

On Mon, Oct 12, 2015 at 12:53 AM, Vjeran Marcinko <[email protected]> wrote:
> Hi,
>
> I took 2 .eml files from my Thunderbird, one with text/plain content, and
> one with text/html, both with a few picture attachments. The first one got
> parsed via RFC822Parser, whereas the second one didn't get detected as
> message/rfc822 and got parsed by HTMLParser.
>
> Any suggestion how to correct this so HTML mail gets detected as
> message/rfc822?
>
> Here is the message source, and I can see there is multipart/alternative in
> play here since Thunderbird includes both types of content (probably for
> cases where the mail client cannot display HTML):
>
> From - Sun Oct 11 17:23:18 2015
> X-Mozilla-Status: 0001
> X-Mozilla-Status2: 00000000
> X-Mozilla-Keys:
> FCC: mailbox://[email protected]/Sent
> X-Identity-Key: id1
> X-Account-Key: account1
> To: [email protected]
> From: Vjeran Marcinko <[email protected]>
> Subject: I am serious!
> Message-ID: <[email protected]>
> Date: Sun, 11 Oct 2015 17:23:17 +0200
> X-Mozilla-Draft-Info: internal/draft; vcard=0; receipt=0; DSN=0; uuencode=0;
>  attachmentreminder=0
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
>  Thunderbird/38.3.0
> MIME-Version: 1.0
> Content-Type: multipart/mixed;
>  boundary="------------040507030005030300070403"
>
> This is a multi-part message in MIME format.
> --------------040507030005030300070403
> Content-Type: multipart/alternative;
>  boundary="------------040402060408060209060509"
>
>
> --------------040402060408060209060509
> Content-Type: text/plain; charset=utf-8; format=flowed
> Content-Transfer-Encoding: 7bit
>
> This is *VERY IMPORTANT* text.
>
> I attached the pic of birthday cake, so you tell me if you like it.
>
> Also, the pic of my town is there also.
>
> Bye,
> Steve
>
>
> --------------040402060408060209060509
> Content-Type: text/html; charset=utf-8
> Content-Transfer-Encoding: 7bit
>
> <html>
> <head>
>
> <meta http-equiv="content-type" content="text/html; charset=utf-8">
> <title>I am serious!</title>
> </head>
> <body text="#000000" bgcolor="#FFFFFF">
> This is <b>VERY IMPORTANT</b> text.<br>
> <br>
> I attached the pic of birthday cake, so you tell me if you like it.<br>
> <br>
> Also, the pic of my town is there also.<br>
> <br>
> Bye,<br>
> Steve<br>
> <br>
> </body>
> </html>
>
> --------------040402060408060209060509--
>
> --------------040507030005030300070403
> Content-Type: image/jpeg;
>  name="1240378_759900384071242_8612750244479085543_n.jpg"
> Content-Transfer-Encoding: base64
> Content-Disposition: attachment;
>  filename="1240378_759900384071242_8612750244479085543_n.jpg"
>
> /9j/4AAQSkZJRgABAgAAAQABAAD/7QA2UGhvdG9zaG9wIDMuMAA4QklNBAQAAAAAABkcAmcA
> FHlxX1NUTHFZdkZfSEl5X2JQSDlGAP/iAhxJQ0NfUFJPRklMRQABAQAAAgxsY21zAhAAAG1u
> dHJSR0IgWFlaIAfcAAEAGQADACkAOWFjc3BBUFBMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>
