Hello, I first noticed that my .mbox file doesn't get parsed by MboxParser. Later, after debugging the Tika source code, I found the problem: the default detector doesn't even recognize the file as the "application/mbox" MIME type. Although the file extension is .mbox, the detector ignores this hint, because its "magic" detection, which looks at some number of initial bytes, decides the file is "text/html" and returns that. As a consequence, parsing never reaches the correct parser.
Is there some way I could override this magic detection and enforce that detection in this case is based solely on the file extension for these .mbox files?

-Vjeran

#################################################################################

Anyway, here is the beginning of my MBOX file, which I got by exporting my Gmail emails from Google:

>From 1540828415824941917@xxx Mon Jul 25 12:08:06 +0000 2016
X-GM-THRID: 1540828415824941917
X-Gmail-Labels: Inbox,Important,clojure
Delivered-To: [email protected]
Received: by 10.31.56.17 with SMTP id f17csp1614203vka;
        Mon, 25 Jul 2016 05:08:06 -0700 (PDT)
X-Received: by 10.202.95.133 with SMTP id t127mr8226795oib.80.1469448485990;
        Mon, 25 Jul 2016 05:08:05 -0700 (PDT)
Return-Path: <[email protected]>
Received: from o1678940x148.outbound-mail.sendgrid.net (o1678940x148.outbound-mail.sendgrid.net. [167.89.40.148])
        by mx.google.com with ESMTPS id k58si11358370otb.279.2016.07.25.05.08.05
        for <[email protected]>
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Mon, 25 Jul 2016 05:08:05 -0700 (PDT)
Received-SPF: pass (google.com: domain of [email protected] designates 167.89.40.148 as permitted sender) client-ip=167.89.40.148;
Authentication-Results: mx.google.com;
        dkim=pass [email protected];
        dkim=pass [email protected];
        spf=pass (google.com: domain of [email protected] designates 167.89.40.148 as permitted sender) smtp.mailfrom=bounces+2693180-18a0-vmarcinko=gmail....@m.dripemail2.com
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=dripemail2.com;
        h=content-type:from:mime-version:subject:to; s=s1;
        bh=wbY8sP/TelOpmU6q09dgY8v3muI=; b=Vo/m0Lx7f8jNAHU2m0vLO6StuGms/
        XeJeiLBV4CHyhwMNr4UuuBIJmDVGIuv6YGSJPN9REUYVuCqFyaPOAZiBtlie8Awq
        7uB7KxZKnFPDh/7XQRz1Z1kKx0dGiENBOoymZFglCebm9my2i+trZ6EzN4YFOB/+
        ZNpksoRirEVhws=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sendgrid.info;
        h=content-type:from:mime-version:subject:to:x-feedback-id;
        s=smtpapi; bh=wbY8sP/TelOpmU6q09dgY8v3muI=; b=vnSfe24bbcPSeungct
        GphBd1h4S4i96PxeapkjmxCLyzeItTItNETiCtkLFbGnzFTVYVvzDOmcI47BYFHu
        yOM0kILRdMzFt1d7HNVE1EJCB0DHVS83Yk7vaH/jc+IU34jJgZBlG0yR292QYtYk
        7WA4ETOIQnQ+3K3pJ+wUYNGKs=
Received: by filter0448p1mdw1.sendgrid.net with SMTP id filter0448p1mdw1.23984.5796012246
        2016-07-25 12:08:02.669274519 +0000 UTC
Received: from MjY5MzE4MA (ec2-54-210-139-199.compute-1.amazonaws.com [54.210.139.199])
        by ismtpd0002p1iad1.sendgrid.net (SG) with HTTP id zyxIxF_lRFKgFZxIoq9BKA
        for <[email protected]>; Mon, 25 Jul 2016 12:08:02.739 +0000 (UTC)
Content-Type: multipart/alternative; boundary=0082ce9e57fb837e9dfa9ca77bc69f450567ae3138b24a5db1e7237fc121
Date: Mon, 25 Jul 2016 12:08:02 +0000
From: "Eric at PurelyFunctional.tv" <[email protected]>
Mime-Version: 1.0
Subject: Twitter Bot, Atom Editor, and Scraping HTML
To: [email protected]
Message-ID: <[email protected]>
X-SG-EID: pywWA7gL46oOK7j8609IHsuM8bBS72IBx+uWB+d8D/N9t0rE4+TMmdgXQpvC7JIN3ekubbU2qCgHqS
        7W8GJ+aKX8qAKYokC5jzRvyv4CX3KHlasoMaqSUGqYEuHYx1e9vMNhqBIB4+nZN4uZmnKvRrvnYMZy
        NtpRNDKB0S28xjv5CxGmqbRggtf8RLQ7d2s5RIuQwIMIZQ3nLl3OrnmbjtZAP91VtQFkbhRATrKx7i
        o=
X-SG-ID: 6l1ICXxVk1U2NQBE+KPgx+uy7/oBj9jrT6lO2L7BaL4cap+kBh3uUy+RmDmEF7s+mSBwxVfvlgfHyu
        osKIvS9Q==
X-Feedback-ID: 2693180:l1fkQA9YLlZ4PTqywTL3Zu+zLq2XYmkeuiZ1WV+xvFE=:l1fkQA9YLlZ4PTqywTL3Zu+zLq2XYmkeuiZ1WV+xvFE=:SG

--0082ce9e57fb837e9dfa9ca77bc69f450567ae3138b24a5db1e7237fc121
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=UTF-8
Mime-Version: 1.0

Dear Clojurist,

Thanks again for being there. I am so lucky to have you here on my PurelyFunctional.tv email list. A lot of people ask me what it takes to be hirable in Clojure. Of course, the answer is complicated, but the short version is "not very much". I wrote about it. Read What do I have to learn to be hirable in Clojure?
( http://t.dripemail=
2.com/c/eyJhY2NvdW50X2lkIjoiMzY1MTcxNyIsImRlbGl2ZXJ5X2lkIjoiMjE3NTQ4MzEyIiw=
idXJsIjoiaHR0cDovL3d3dy5saXNwY2FzdC5jb20vaGlyYWJsZS1pbi1jbG9qdXJlP19fcz15bj=
R6dm8xcnY5cGhkazR4cG11diJ9 )
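
To make the override I am asking about concrete, here is a minimal sketch of the "extension wins" rule in plain Java. It deliberately uses no Tika classes: `ExtensionFirstDetector`, `FORCED_TYPES` and `detect` are hypothetical names made up for illustration, and the `sniffedType` argument stands in for whatever the magic-based detection returned. In Tika, I assume the same logic would have to live in a custom detector that is consulted before the default one.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: lets a known file extension
// override the type guessed from the file's initial bytes.
public class ExtensionFirstDetector {

    // Assumption: only .mbox needs forcing; more entries could be added.
    private static final Map<String, String> FORCED_TYPES =
            Map.of(".mbox", "application/mbox");

    // Returns the forced type when the file name ends with a listed
    // extension; otherwise falls back to the content-sniffed type.
    public static String detect(String fileName, String sniffedType) {
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : FORCED_TYPES.entrySet()) {
                if (lower.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        return sniffedType;
    }

    public static void main(String[] args) {
        // The sniffer said text/html, but the extension takes precedence.
        System.out.println(detect("gmail-export.mbox", "text/html"));
        // No forced extension, so the sniffed type is kept.
        System.out.println(detect("page.html", "text/html"));
    }
}
```

With this rule the first call yields "application/mbox" even though the sniffed type was "text/html", which is the behaviour I would like for my file.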
