Tabula PDF is something I have been looking at for this, as well as
approaches like deep neural nets…

From: Sergey Beryozkin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, July 11, 2019 at 10:25 AM
To: "[email protected]" <[email protected]>
Subject: [EXTERNAL] How to parse PDF more effectively

 

Hi

I've used Tika to parse this invoice PDF:

https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf

(AutoDetectParser, ToTextContentHandler); see below for what is returned.

I added the numbers like (1), (2) myself; they mark the preferred order
(approximately).
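
To make the preferred order concrete, here is a minimal post-processing sketch that rearranges text carrying the hand-added (n) markers. It is an illustration only: the class and method names (MarkerReorder, reorder) are hypothetical and not part of Tika, and it assumes each marker sits at the start of a line, as in the output below. It does not change what Tika itself reports.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: reorders text blocks that carry
// hand-added "(n)" markers at the start of a line.
final class MarkerReorder {

    // Matches a line that opens a block, e.g. "(2)From:" or "(4) Hrs/Qty ...".
    private static final Pattern MARKER = Pattern.compile("^\\((\\d+)\\)");

    private static final class Block {
        final int rank;
        final StringBuilder text = new StringBuilder();

        Block(int rank) {
            this.rank = rank;
        }
    }

    static String reorder(String extracted) {
        List<Block> blocks = new ArrayList<>();
        Block current = new Block(0); // unmarked leading text stays first
        blocks.add(current);

        for (String line : extracted.split("\n")) {
            Matcher m = MARKER.matcher(line);
            if (m.find()) {
                // A new marker starts a new block; following lines belong to it.
                current = new Block(Integer.parseInt(m.group(1)));
                blocks.add(current);
            }
            current.text.append(line).append('\n');
        }

        // List.sort is stable, so blocks sharing a marker keep their original order.
        blocks.sort(Comparator.comparingInt(b -> b.rank));

        StringBuilder out = new StringBuilder();
        for (Block b : blocks) {
            out.append(b.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "Invoice\n"
                + "(5)Payment is due within 30 days\n"
                + "(2)From:\n"
                + "DEMO - Sliced Invoices\n"
                + "(1)Invoice Number INV-3337\n";
        System.out.print(reorder(sample));
    }
}
```

Because the sort is stable, the two blocks marked (5) in the output below would keep their relative order, and the unmarked heading lines would stay at the top.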

 

Is it possible to somehow hint to Tika how to report the content?

Thanks, Sergey

 

PDF Invoice Example
Invoice

(5)Payment is due within 30 days from date of invoice. Late payment is subject 
to fees of 5% per month.

Thanks for choosing DEMO - Sliced Invoices | [email protected]

Page 1/1

(2)From:

DEMO - Sliced Invoices

Suite 5A-1204

123 Somewhere Street

Your City AZ 12345

[email protected]

(1)Invoice Number INV-3337

Order Number 12345

Invoice Date January 25, 2016

Due Date January 31, 2016

Total Due $93.50

(3)To:

Test Business

123 Somewhere St

Melbourne, VIC 3000

[email protected]

(4) Hrs/Qty Service Rate/Price Adjust Sub Total

1.00
Web Design
This is a sample description...

$85.00 0.00% $85.00

Sub Total $85.00

Tax $8.50

Total $93.50

(5) ANZ Bank

ACC # 1234 1234

BSB # 4321 432 Pa
id
