I have been looking at Tabula PDF for this, as well as at approaches such as deep neural nets…
From: Sergey Beryozkin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, July 11, 2019 at 10:25 AM
To: "[email protected]" <[email protected]>
Subject: [EXTERNAL] How to parse PDF more effectively

Hi

I've used Tika to parse this invoice PDF:

https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf

(AutoDetectParser, ToTextContentHandler), see below what is returned. The numbers like (1), (2) are added by myself, this is the preferred order (approximately).

Is it possible to hint somehow to Tika how to report the content ?

Thanks
Sergey

PDF Invoice Example
Invoice
(5)Payment is due within 30 days from date of invoice. Late payment is subject to fees of 5% per month. Thanks for choosing DEMO - Sliced Invoices | [email protected] Page 1/1
(2)From: DEMO - Sliced Invoices Suite 5A-1204 123 Somewhere Street Your City AZ 12345 [email protected]
(1)Invoice Number INV-3337 Order Number 12345 Invoice Date January 25, 2016 Due Date January 31, 2016 Total Due $93.50
(3)To: Test Business 123 Somewhere St Melbourne, VIC 3000 [email protected]
(4) Hrs/Qty Service Rate/Price Adjust Sub Total 1.00 Web Design This is a sample description... $85.00 0.00% $85.00 Sub Total $85.00 Tax $8.50 Total $93.50
(5) ANZ Bank ACC # 1234 1234 BSB # 4321 432
Pa id
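To make the preferred order concrete, here is a minimal Python sketch that splits text at the hand-added (1)...(5) markers and returns the segments in marker order. It is not a Tika feature and does not call Tika. The `SAMPLE` string is a shortened excerpt of the output quoted above, and the names `MARKER` and `reorder_by_marker` are hypothetical, chosen only for this illustration. It also assumes that no "(n)" pattern occurs naturally in the invoice text, which holds for this sample but not in general.

```python
import re

# Shortened excerpt of the quoted Tika output, keeping the hand-added (n) markers.
SAMPLE = (
    "(5)Payment is due within 30 days from date of invoice. "
    "(2)From: DEMO - Sliced Invoices Suite 5A-1204 "
    "(1)Invoice Number INV-3337 Order Number 12345 "
    "(3)To: Test Business 123 Somewhere St "
    "(4) Hrs/Qty Service Rate/Price Adjust Sub Total "
    "(5) ANZ Bank ACC # 1234 1234"
)

# Matches a marker such as "(3)" and captures the number.
MARKER = re.compile(r"\((\d+)\)")


def reorder_by_marker(text):
    """Split text at (n) markers and return (n, segment) pairs sorted by n."""
    # With a capturing group, re.split returns
    # [preamble, n1, segment1, n2, segment2, ...].
    parts = MARKER.split(text)
    pairs = [
        (int(parts[i]), parts[i + 1].strip())
        for i in range(1, len(parts), 2)
    ]
    # list.sort is stable, so segments sharing a marker keep their original order.
    pairs.sort(key=lambda pair: pair[0])
    return pairs


if __name__ == "__main__":
    for number, segment in reorder_by_marker(SAMPLE):
        print(f"({number}) {segment}")
```

The markers here were typed in by hand, so this only shows the target ordering. Producing that order automatically would have to rely on layout information from the parser rather than on markers.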
