Hi Sergey, Sorry, I thought I hit send on this yesterday...
In reverse order, I used the ToXMLContentHandler:
https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaTest.java#L208

For configuring via tika-config.xml, see, e.g.:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/org/apache/tika/parser/pdf/tika-config.xml

The trick, though, is to exclude the PDFParser from the default parser and then add the custom-configured one back in:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066

Let me know if you have any surprises.

Best,
Tim

On Wed, Jul 17, 2019 at 12:20 PM Sergey Beryozkin <[email protected]> wrote:
>
> Hi Tim,
>
> How does one configure PDFParserConfig in tika-config.xml? Maybe as one of
> the PDFParser properties?
> PDFParser.setSortByPosition (and other simple setters) are deprecated, so
> setting 'sortByPosition' as one of the PDFParser properties goes via the
> deprecated call path (probably not a big deal though :-))
> I also looked at the source and I'm still not sure which ContentHandler
> you used to get the HTML tags added.
> (I may experiment with a custom one sitting on top of it, maybe adding the
> table tags...)
> Sergey
>
> On Thu, Jul 11, 2019 at 9:52 PM Sergey Beryozkin <[email protected]> wrote:
>>
>> Hi Tim
>>
>> Thanks, I'm going to experiment with different, sufficiently complex PDFs in
>> order to figure out how to enhance the Quarkus Tika extension, what to make
>> customizable, etc. (I'll link to it in a follow-up email).
>> Your output looks better :-), and which ContentHandler did you use?
>>
>> Sergey
>>
>> On Thu, Jul 11, 2019 at 7:23 PM Tim Allison <[email protected]> wrote:
>>>
>>> Might not need to break out the neural nets just yet...try turning on
>>> sortByPosition via the PDFParserConfig and/or tika-config.xml.
>>>
>>> This is what you get:
>>>
>>>
>>>
>>> <title>PDF Invoice Example</title>
>>> </head>
>>> <body><div class="page"><p />
>>> <p>Invoice
>>> </p>
>>> <p>From: Invoice Number INV-3337
>>> </p>
>>> <p>DEMO - Sliced Invoices Order Number 12345
>>> Suite 5A-1204 Invoice Date January 25, 2016
>>> 123 Somewhere Street Due Date January 31, 2016
>>> Your City AZ 12345
>>> [email protected] Total Due $93.50
>>> </p>
>>> <p>To:
>>> Test Business
>>> 123 Somewhere St
>>> Melbourne, VIC 3000
>>> [email protected]
>>> </p>
>>> <p>Hrs/Qty Service Rate/Price Adjust Sub Total
>>> </p>
>>> <p>1.00 Web DesignThis is a sample description... $85.00 0.00% $85.00
>>> </p>
>>> <p>Pa
>>> idSub Total $85.00
>>> </p>
>>> <p>Tax $8.50
>>> Total $93.50
>>> </p>
>>> <p>ANZ Bank
>>> ACC # 1234 1234
>>> BSB # 4321 432
>>> </p>
>>> <p>Payment is due within 30 days from date of invoice. Late payment is
>>> subject to fees of 5% per month.
>>> Thanks for choosing DEMO - Sliced Invoices | [email protected]
>>> Page 1/1</p>
>>> <p />
>>> <div class="annotation"><a
>>> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
>>> <div class="annotation"><a
>>> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
>>> <div class="annotation"><a
>>> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
>>> <div class="annotation"><a
>>> href="mailto:[email protected]">mailto:[email protected]</a></div>
>>> </div>
>>> </body></html>
>>>
>>> On Thu, Jul 11, 2019 at 1:25 PM Sergey Beryozkin <[email protected]>
>>> wrote:
>>> >
>>> > Hi
>>> >
>>> > I've used Tika to parse this invoice PDF:
>>> >
>>> > https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
>>> >
>>> > (AutoDetectParser, ToTextContentHandler), see below what is returned.
>>> > The numbers like (1), (2) are added by myself, this is the preferred
>>> > order (approximately).
>>> >
>>> > Is it possible to hint somehow to Tika how to report the content ?
>>> >
>>> > Thanks Sergey
>>> >
>>> > PDF Invoice Example
>>> > Invoice
>>> >
>>> > (5)Payment is due within 30 days from date of invoice. Late payment is
>>> > subject to fees of 5% per month.
>>> >
>>> > Thanks for choosing DEMO - Sliced Invoices | [email protected]
>>> >
>>> > Page 1/1
>>> >
>>> > (2)From:
>>> >
>>> > DEMO - Sliced Invoices
>>> >
>>> > Suite 5A-1204
>>> >
>>> > 123 Somewhere Street
>>> >
>>> > Your City AZ 12345
>>> >
>>> > [email protected]
>>> >
>>> > (1)Invoice Number INV-3337
>>> >
>>> > Order Number 12345
>>> >
>>> > Invoice Date January 25, 2016
>>> >
>>> > Due Date January 31, 2016
>>> >
>>> > Total Due $93.50
>>> >
>>> > (3)To:
>>> >
>>> > Test Business
>>> >
>>> > 123 Somewhere St
>>> >
>>> > Melbourne, VIC 3000
>>> >
>>> > [email protected]
>>> >
>>> > (4) Hrs/Qty Service Rate/Price Adjust Sub Total
>>> >
>>> > 1.00
>>> > Web Design
>>> > This is a sample description...
>>> >
>>> > $85.00 0.00% $85.00
>>> >
>>> > Sub Total $85.00
>>> >
>>> > Tax $8.50
>>> >
>>> > Total $93.50
>>> >
>>> > (5) ANZ Bank
>>> >
>>> > ACC # 1234 1234
>>> >
>>> > BSB # 4321 432 Pa
>>> > id
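
The two outputs quoted in this thread differ in ordering: the first one follows the order in which text was written into the PDF, while the one produced with sortByPosition places "From:" and "Invoice Number INV-3337" on the same line. A rough way to picture that difference is the sketch below, which orders text fragments top to bottom and then left to right. It is only a conceptual illustration using plain Java: the Fragment class, the coordinates, and the line tolerance are all hypothetical, and this is not the code Tika or PDFBox actually runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortByPositionSketch {

    // Hypothetical text fragment: a piece of text plus the position of its
    // top-left corner (y grows downwards in this sketch).
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Fragments in the order a PDF might store them: the left column first,
    // then the right column. The coordinates are made up for illustration.
    static List<Fragment> sampleFragments() {
        return Arrays.asList(
                new Fragment("From:", 50, 100),
                new Fragment("DEMO - Sliced Invoices", 50, 120),
                new Fragment("Invoice Number INV-3337", 350, 100),
                new Fragment("Order Number 12345", 350, 120));
    }

    // Emits fragments exactly as stored, one per line.
    static String inStreamOrder(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        for (Fragment f : fragments) {
            if (out.length() > 0) {
                out.append('\n');
            }
            out.append(f.text);
        }
        return out.toString();
    }

    // Sorts by vertical position, groups fragments whose y values lie within
    // lineTolerance of the first fragment of a line, then orders each line
    // left to right.
    static String inReadingOrder(List<Fragment> fragments, double lineTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        double lineY = 0;
        for (Fragment f : sorted) {
            if (lines.isEmpty() || f.y - lineY > lineTolerance) {
                lines.add(new ArrayList<>());
                lineY = f.y;
            }
            lines.get(lines.size() - 1).add(f);
        }

        StringBuilder out = new StringBuilder();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            if (out.length() > 0) {
                out.append('\n');
            }
            for (int i = 0; i < line.size(); i++) {
                if (i > 0) {
                    out.append(' ');
                }
                out.append(line.get(i).text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> fragments = sampleFragments();
        System.out.println("Stream order:");
        System.out.println(inStreamOrder(fragments));
        System.out.println();
        System.out.println("Sorted by position:");
        System.out.println(inReadingOrder(fragments, 2.0));
    }
}
```

With the made-up coordinates above, the position-sorted variant joins the left and right columns line by line, which is the same shape as the "From: Invoice Number INV-3337" line in the sortByPosition output quoted earlier. It also hints at why table-like content still comes out as plain runs of text: ordering by position does not recover cell boundaries.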
