Hi Chris

Interesting. I was wondering whether it would make sense to add a
TikaTabularPDFParser wrapper which, for example, would accumulate a given
table's headers, report them as a single ContentHandler line, etc...
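
A very rough sketch of the header accumulation idea, using only the JDK SAX
classes. Everything here is hypothetical: HeaderJoiningHandler is not an
existing Tika class, it is a ContentHandler decorator rather than a real
parser wrapper, and it assumes the header labels are known up front and that
each label arrives as its own characters() call, which may well not be how
the PDF parser reports them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical decorator (not part of Tika): joins table header cells into one line. */
class HeaderJoiningHandler extends DefaultHandler {
    private final ContentHandler delegate;
    private final Set<String> headerLabels;              // assumption: labels are known up front
    private final List<String> pending = new ArrayList<>();

    HeaderJoiningHandler(ContentHandler delegate, Set<String> headerLabels) {
        this.delegate = delegate;
        this.headerLabels = headerLabels;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        String text = new String(ch, start, length).trim();
        if (headerLabels.contains(text)) {
            pending.add(text);                           // hold header cells back for now
            return;
        }
        flushHeaders();                                  // any other text ends the header run
        delegate.characters(ch, start, length);
    }

    @Override
    public void endDocument() throws SAXException {
        flushHeaders();                                  // headers at the very end are not lost
        delegate.endDocument();
    }

    // Reports all buffered header cells to the delegate as a single line.
    private void flushHeaders() throws SAXException {
        if (pending.isEmpty()) {
            return;
        }
        char[] line = (String.join(" | ", pending) + "\n").toCharArray();
        pending.clear();
        delegate.characters(line, 0, line.length);
    }
}

class HeaderJoiningDemo {
    /** Feeds text chunks through the handler and returns what the delegate received. */
    static String run(String... chunks) {
        StringBuilder out = new StringBuilder();
        ContentHandler sink = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        HeaderJoiningHandler handler = new HeaderJoiningHandler(
                sink, Set.of("Hrs/Qty", "Service", "Rate/Price", "Adjust", "Sub Total"));
        try {
            for (String chunk : chunks) {
                char[] c = chunk.toCharArray();
                handler.characters(c, 0, c.length);
            }
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(run("Hrs/Qty\n", "Service\n", "Rate/Price\n",
                "Adjust\n", "Sub Total\n", "1.00 Web Design\n"));
    }
}
```

Only characters() and endDocument() are forwarded in this sketch; a real
wrapper would also need to pass through the element events and would need
some way to detect the headers instead of being told them.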

Sergey

On Thu, Jul 11, 2019 at 6:26 PM Chris Mattmann <[email protected]> wrote:

> Tabula PDF is something I have been looking at for this, as well as
> approaches like Deep Neural Nets…
>
> *From: *Sergey Beryozkin <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Thursday, July 11, 2019 at 10:25 AM
> *To: *"[email protected]" <[email protected]>
> *Subject: *[EXTERNAL] How to parse PDF more effectively
>
>
>
> Hi
>
>
>
> I've used Tika to parse this invoice PDF:
>
>
>
> https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
>
>
>
> (AutoDetectParser, ToTextContentHandler), see below what is returned.
>
> The numbers like (1), (2) were added by me; they show the preferred order
> (approximately).
>
>
>
> Is it possible to somehow hint to Tika how to report the content?
>
>
>
> Thanks, Sergey
>
>
>
> PDF Invoice Example
> Invoice
>
> (5)Payment is due within 30 days from date of invoice. Late payment is
> subject to fees of 5% per month.
>
> Thanks for choosing DEMO - Sliced Invoices | [email protected]
>
> Page 1/1
>
> (2)From:
>
> DEMO - Sliced Invoices
>
> Suite 5A-1204
>
> 123 Somewhere Street
>
> Your City AZ 12345
>
> [email protected]
>
> (1)Invoice Number INV-3337
>
> Order Number 12345
>
> Invoice Date January 25, 2016
>
> Due Date January 31, 2016
>
> Total Due $93.50
>
> (3)To:
>
> Test Business
>
> 123 Somewhere St
>
> Melbourne, VIC 3000
>
> [email protected]
>
> (4) Hrs/Qty Service Rate/Price Adjust Sub Total
>
> 1.00
> Web Design
> This is a sample description...
>
> $85.00 0.00% $85.00
>
> Sub Total $85.00
>
> Tax $8.50
>
> Total $93.50
>
> (5) ANZ Bank
>
> ACC # 1234 1234
>
> BSB # 4321 432 Pa
> id
>
