Hi Tim

Thanks, I'm going to experiment with a range of sufficiently complex PDFs
to figure out how to enhance the Quarkus Tika extension, what to make
customizable, etc. (I'll link to it in a follow-up email).
Your output looks better :-). Which ContentHandler did you use?

Sergey

On Thu, Jul 11, 2019 at 7:23 PM Tim Allison <[email protected]> wrote:

> Might not need to break out the neural nets just yet...try turning on
> sortByPosition via the PDFParserConfig and/or tika_config.xml.
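
For reference, a minimal sketch of the programmatic route, assuming Tika 1.x on the classpath. The class name SortByPositionExample and the args[0] path are hypothetical, and ToXMLContentHandler is only a guess based on the XHTML-looking output quoted below; the thread does not say which handler was actually used.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;

public class SortByPositionExample {
    public static void main(String[] args) throws Exception {
        // Ask the PDF parser to order text by its position on the page
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setSortByPosition(true);

        // Pass the PDF-specific settings along via the ParseContext
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);

        // Assumption: XHTML output; ToTextContentHandler could be swapped in
        ToXMLContentHandler handler = new ToXMLContentHandler();

        // args[0]: hypothetical path to a local copy of the invoice PDF
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(in, handler, new Metadata(), context);
        }

        // The handler buffers the output in memory; toString() returns it
        System.out.println(handler);
    }
}
```

Setting the option through the ParseContext keeps it local to a single parse call, which may be easier to experiment with than a config file.
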
>
> This is what you get:
>
>
>
> <title>PDF Invoice Example</title>
> </head>
> <body><div class="page"><p />
> <p>Invoice
> </p>
> <p>From: Invoice Number INV-3337
> </p>
> <p>DEMO - Sliced Invoices Order Number 12345
> Suite 5A-1204 Invoice Date January 25, 2016
> 123 Somewhere Street Due Date January 31, 2016
> Your City AZ 12345
> [email protected] Total Due $93.50
> </p>
> <p>To:
> Test Business
> 123 Somewhere St
> Melbourne, VIC 3000
> [email protected]
> </p>
> <p>Hrs/Qty Service Rate/Price Adjust Sub Total
> </p>
> <p>1.00 Web DesignThis is a sample description... $85.00 0.00% $85.00
> </p>
> <p>Pa
> idSub Total $85.00
> </p>
> <p>Tax $8.50
> Total $93.50
> </p>
> <p>ANZ Bank
> ACC # 1234 1234
> BSB # 4321 432
> </p>
> <p>Payment is due within 30 days from date of invoice. Late payment is
> subject to fees of 5% per month.
> Thanks for choosing DEMO - Sliced Invoices | [email protected]
> Page 1/1</p>
> <p />
> <div class="annotation"><a
> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo
> </a></div>
> <div class="annotation"><a
> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo
> </a></div>
> <div class="annotation"><a
> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo
> </a></div>
> <div class="annotation"><a
> href="mailto:[email protected]">mailto:[email protected]
> </a></div>
> </div>
> </body></html>
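
To make sense of why the sorted output above puts "From:" and "Invoice Number INV-3337" on the same line, here is a toy sketch of position-based ordering. It is not the PDFBox/Tika implementation: the Chunk type, the coordinates, and the exact-y line grouping are all made up for illustration. Chunks arrive in file order (left column first, then right column) and are re-ordered top-to-bottom, then left-to-right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PositionSortSketch {
    // Hypothetical text chunk: x grows to the right, y grows down the page
    static final class Chunk {
        final double x;
        final double y;
        final String text;

        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Sort chunks by y, then x, and join chunks that share a y into one line
    static List<String> linesByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));

        List<String> lines = new ArrayList<>();
        double lastY = Double.NaN;
        for (Chunk c : sorted) {
            if (!lines.isEmpty() && c.y == lastY) {
                // Same baseline as the previous chunk: append to that line
                int last = lines.size() - 1;
                lines.set(last, lines.get(last) + " " + c.text);
            } else {
                lines.add(c.text);
            }
            lastY = c.y;
        }
        return lines;
    }

    public static void main(String[] args) {
        // File order: left column first, then right column (made-up coordinates)
        List<Chunk> fileOrder = Arrays.asList(
                new Chunk(50, 100, "From:"),
                new Chunk(50, 120, "DEMO - Sliced Invoices"),
                new Chunk(300, 100, "Invoice Number INV-3337"),
                new Chunk(300, 120, "Order Number 12345"));
        linesByPosition(fileOrder).forEach(System.out::println);
    }
}
```

With these made-up coordinates the two columns get interleaved line by line, which matches the shape of the sorted output: better overall reading order, but side-by-side blocks are merged.
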
>
> On Thu, Jul 11, 2019 at 1:25 PM Sergey Beryozkin <[email protected]>
> wrote:
> >
> > Hi
> >
> > I've used Tika to parse this invoice PDF:
> >
> > https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
> >
> > (AutoDetectParser, ToTextContentHandler), see below what is returned.
> > The numbers like (1), (2) were added by me; they show the preferred
> > order (approximately).
> >
> > Is it possible to hint to Tika somehow how to order the reported content?
> >
> > Thanks Sergey
> >
> > PDF Invoice Example
> > Invoice
> >
> > (5)Payment is due within 30 days from date of invoice. Late payment is
> > subject to fees of 5% per month.
> >
> > Thanks for choosing DEMO - Sliced Invoices | [email protected]
> >
> > Page 1/1
> >
> > (2)From:
> >
> > DEMO - Sliced Invoices
> >
> > Suite 5A-1204
> >
> > 123 Somewhere Street
> >
> > Your City AZ 12345
> >
> > [email protected]
> >
> > (1)Invoice Number INV-3337
> >
> > Order Number 12345
> >
> > Invoice Date January 25, 2016
> >
> > Due Date January 31, 2016
> >
> > Total Due $93.50
> >
> > (3)To:
> >
> > Test Business
> >
> > 123 Somewhere St
> >
> > Melbourne, VIC 3000
> >
> > [email protected]
> >
> > (4) Hrs/Qty Service Rate/Price Adjust Sub Total
> >
> > 1.00
> > Web Design
> > This is a sample description...
> >
> > $85.00 0.00% $85.00
> >
> > Sub Total $85.00
> >
> > Tax $8.50
> >
> > Total $93.50
> >
> > (5) ANZ Bank
> >
> > ACC # 1234 1234
> >
> > BSB # 4321 432 Pa
> > id
>
