Hi Chris,

Interesting. I was wondering whether it would make sense to add a TikaTabularPDFParser wrapper. For example, it could accumulate a given table's headers, report them as a single ContentHandler line, etc.
Sergey

On Thu, Jul 11, 2019 at 6:26 PM Chris Mattmann <[email protected]> wrote:

> Tabula PDF is something I have been looking at for this as well as doing
> like Deep Neural Nets…
>
> *From: *Sergey Beryozkin <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Thursday, July 11, 2019 at 10:25 AM
> *To: *"[email protected]" <[email protected]>
> *Subject: *[EXTERNAL] How to parse PDF more effectively
>
> Hi
>
> I've used Tika to parse this invoice PDF:
>
> https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
>
> (AutoDetectParser, ToTextContentHandler), see below what is returned.
>
> The numbers like (1), (2) are added by myself, this is the preferred order
> (approximately).
>
> Is it possible to hint somehow to Tika how to report the content ?
>
> Thanks Sergey
>
> PDF Invoice Example
> Invoice
>
> (5)Payment is due within 30 days from date of invoice. Late payment is
> subject to fees of 5% per month.
>
> Thanks for choosing DEMO - Sliced Invoices | [email protected]
>
> Page 1/1
>
> (2)From:
>
> DEMO - Sliced Invoices
>
> Suite 5A-1204
>
> 123 Somewhere Street
>
> Your City AZ 12345
>
> [email protected]
>
> (1)Invoice Number INV-3337
>
> Order Number 12345
>
> Invoice Date January 25, 2016
>
> Due Date January 31, 2016
>
> Total Due $93.50
>
> (3)To:
>
> Test Business
>
> 123 Somewhere St
>
> Melbourne, VIC 3000
>
> [email protected]
>
> (4) Hrs/Qty Service Rate/Price Adjust Sub Total
>
> 1.00
> Web Design
> This is a sample description...
>
> $85.00 0.00% $85.00
>
> Sub Total $85.00
>
> Tax $8.50
>
> Total $93.50
>
> (5) ANZ Bank
>
> ACC # 1234 1234
>
> BSB # 4321 432 Pa
> id
>
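
To make the wrapper idea a bit more concrete, here is a rough sketch of only the "accumulate the headers and report them as a single ContentHandler line" part. It uses only the JDK's SAX classes. HeaderRowHandler and headerLine are hypothetical names, not existing Tika classes. The sketch also assumes that some table-aware extractor (for example one built on Tabula) emits XHTML-style tr/th events. That is an assumption, and the stock PDF parser is not known to do this for the invoice above.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical decorator (not part of Tika): buffers <th> cells and forwards
// each header row to the wrapped handler as one characters() call.
// Simplification: only character data is forwarded, other events are dropped.
class HeaderRowHandler extends DefaultHandler {
    private final ContentHandler delegate;
    private final List<String> cells = new ArrayList<>();
    private StringBuilder cell; // non-null only while inside a <th>

    HeaderRowHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumes namespace-aware events, i.e. localName is populated.
        if ("th".equals(localName)) {
            cell = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (cell != null) {
            cell.append(ch, start, length);          // inside a header cell: buffer it
        } else {
            delegate.characters(ch, start, length);  // everything else passes through
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("th".equals(localName) && cell != null) {
            cells.add(cell.toString().trim());
            cell = null;
        } else if ("tr".equals(localName) && !cells.isEmpty()) {
            // End of a header row: emit all buffered cells as a single line.
            char[] line = (String.join(" | ", cells) + "\n").toCharArray();
            delegate.characters(line, 0, line.length);
            cells.clear();
        }
    }

    // Demo driver: feeds synthetic header cells through the handler and
    // returns the text that the wrapped handler received.
    static String headerLine(String... headers) {
        StringBuilder out = new StringBuilder();
        ContentHandler sink = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        HeaderRowHandler handler = new HeaderRowHandler(sink);
        Attributes none = new AttributesImpl();
        try {
            handler.startElement("", "tr", "tr", none);
            for (String h : headers) {
                handler.startElement("", "th", "th", none);
                handler.characters(h.toCharArray(), 0, h.length());
                handler.endElement("", "th", "th");
            }
            handler.endElement("", "tr", "tr");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Feeding it the five column names from block (4) of the invoice, e.g. HeaderRowHandler.headerLine("Hrs/Qty", "Service", "Rate/Price", "Adjust", "Sub Total"), yields one line with the cells joined by " | ". In a real setup the delegate would be something like the ToTextContentHandler mentioned above, and the ordering of the remaining blocks would still need separate handling.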
