You might not need to break out the neural nets just yet... try turning on sortByPosition via the PDFParserConfig and/or tika_config.xml.
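
For the config-file route, here is a minimal sketch, assuming Tika 1.x and its documented parser-parameter syntax. The file excludes the PDFParser from the default parser set and re-declares it with sortByPosition turned on. I have not checked it against every Tika version, so treat it as a starting point rather than the definitive config:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep all default parsers, except the PDF parser configured below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Re-declare the PDF parser with position-based text sorting enabled -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

If you are calling Tika from Java, loading the file into a TikaConfig and passing that to the AutoDetectParser constructor should pick it up. Alternatively, call setSortByPosition(true) on a PDFParserConfig and put it in the ParseContext.
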
This is what you get:

<title>PDF Invoice Example</title>
</head>
<body><div class="page"><p />
<p>Invoice </p>
<p>From: Invoice Number INV-3337 </p>
<p>DEMO - Sliced Invoices Order Number 12345 Suite 5A-1204 Invoice Date January 25, 2016 123 Somewhere Street Due Date January 31, 2016 Your City AZ 12345 [email protected] Total Due $93.50 </p>
<p>To: Test Business 123 Somewhere St Melbourne, VIC 3000 [email protected] </p>
<p>Hrs/Qty Service Rate/Price Adjust Sub Total </p>
<p>1.00 Web DesignThis is a sample description... $85.00 0.00% $85.00 </p>
<p>Pa idSub Total $85.00 </p>
<p>Tax $8.50 Total $93.50 </p>
<p>ANZ Bank ACC # 1234 1234 BSB # 4321 432 </p>
<p>Payment is due within 30 days from date of invoice. Late payment is subject to fees of 5% per month. Thanks for choosing DEMO - Sliced Invoices | [email protected] Page 1/1</p>
<p />
<div class="annotation"><a href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
<div class="annotation"><a href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
<div class="annotation"><a href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
<div class="annotation"><a href="mailto:[email protected]">mailto:[email protected]</a></div>
</div>
</body></html>

On Thu, Jul 11, 2019 at 1:25 PM Sergey Beryozkin <[email protected]> wrote:
>
> Hi
>
> I've used Tika to parse this invoice PDF:
>
> https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
>
> (AutoDetectParser, ToTextContentHandler), see below what is returned.
> The numbers like (1), (2) are added by myself, this is the preferred order
> (approximately).
>
> Is it possible to hint somehow to Tika how to report the content ?
>
> Thanks Sergey
>
> PDF Invoice Example
> Invoice
>
> (5)Payment is due within 30 days from date of invoice. Late payment is
> subject to fees of 5% per month.
>
> Thanks for choosing DEMO - Sliced Invoices | [email protected]
>
> Page 1/1
>
> (2)From:
>
> DEMO - Sliced Invoices
>
> Suite 5A-1204
>
> 123 Somewhere Street
>
> Your City AZ 12345
>
> [email protected]
>
> (1)Invoice Number INV-3337
>
> Order Number 12345
>
> Invoice Date January 25, 2016
>
> Due Date January 31, 2016
>
> Total Due $93.50
>
> (3)To:
>
> Test Business
>
> 123 Somewhere St
>
> Melbourne, VIC 3000
>
> [email protected]
>
> (4) Hrs/Qty Service Rate/Price Adjust Sub Total
>
> 1.00
> Web Design
> This is a sample description...
>
> $85.00 0.00% $85.00
>
> Sub Total $85.00
>
> Tax $8.50
>
> Total $93.50
>
> (5) ANZ Bank
>
> ACC # 1234 1234
>
> BSB # 4321 432 Pa
> id
