Hi Tim

This will help for sure, will try after my PTO.

Thanks
Sergey

On Thu 18 Jul 2019, 13:14 Tim Allison, <[email protected]> wrote:
> Hi Sergey,
>
> Sorry, I thought I hit send on this yesterday...
>
> In reverse order, I used the ToXMLContentHandler:
> https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaTest.java#L208
>
> For configuring via tika-config.xml, see, e.g.:
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/org/apache/tika/parser/pdf/tika-config.xml
>
> The trick, though, is to exclude the PDFParser from the default
> parser and then add the custom configured one back in:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066
>
> Let me know if you have any surprises.
>
> Best,
>
> Tim
>
> On Wed, Jul 17, 2019 at 12:20 PM Sergey Beryozkin <[email protected]> wrote:
> >
> > Hi Tim,
> >
> > How does one configure PDFParserConfig in tika-config.xml? Maybe as one of the PDFParser properties?
> > PDFParser.setSortByPosition (and other simple setters) are deprecated, so setting 'sortByPosition' as one of the PDFParser properties goes via the deprecated call path (probably not a big deal though :-))
> > I also looked at the source and I'm still not sure which ContentHandler you used to get the HTML tags added.
> > (I may experiment with a custom one sitting on top of it, maybe adding the table tags...)
> > Sergey
> >
> > On Thu, Jul 11, 2019 at 9:52 PM Sergey Beryozkin <[email protected]> wrote:
> >>
> >> Hi Tim
> >>
> >> Thanks, I'm going to experiment with different, sufficiently complex PDFs in order to figure out how to enhance the Quarkus Tika extension, what to let users customize, etc. (I'll link to it in a follow-up email).
> >> Your output looks better :-), and which ContentHandler did you use?
> >>
> >> Sergey
> >>
> >> On Thu, Jul 11, 2019 at 7:23 PM Tim Allison <[email protected]> wrote:
> >>>
> >>> Might not need to break out the neural nets just yet...try turning on
> >>> sortByPosition via the PDFParserConfig and/or tika-config.xml.
> >>>
> >>> This is what you get:
> >>>
> >>> <title>PDF Invoice Example</title>
> >>> </head>
> >>> <body><div class="page"><p />
> >>> <p>Invoice
> >>> </p>
> >>> <p>From: Invoice Number INV-3337
> >>> </p>
> >>> <p>DEMO - Sliced Invoices Order Number 12345
> >>> Suite 5A-1204 Invoice Date January 25, 2016
> >>> 123 Somewhere Street Due Date January 31, 2016
> >>> Your City AZ 12345
> >>> [email protected] Total Due $93.50
> >>> </p>
> >>> <p>To:
> >>> Test Business
> >>> 123 Somewhere St
> >>> Melbourne, VIC 3000
> >>> [email protected]
> >>> </p>
> >>> <p>Hrs/Qty Service Rate/Price Adjust Sub Total
> >>> </p>
> >>> <p>1.00 Web DesignThis is a sample description... $85.00 0.00% $85.00
> >>> </p>
> >>> <p>Pa
> >>> idSub Total $85.00
> >>> </p>
> >>> <p>Tax $8.50
> >>> Total $93.50
> >>> </p>
> >>> <p>ANZ Bank
> >>> ACC # 1234 1234
> >>> BSB # 4321 432
> >>> </p>
> >>> <p>Payment is due within 30 days from date of invoice. Late payment is
> >>> subject to fees of 5% per month.
> >>> Thanks for choosing DEMO - Sliced Invoices | [email protected]
> >>> Page 1/1</p>
> >>> <p />
> >>> <div class="annotation"><a href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
> >>> <div class="annotation"><a href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
> >>> <div class="annotation"><a href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
> >>> <div class="annotation"><a href="mailto:[email protected]">mailto:[email protected]</a></div>
> >>> </div>
> >>> </body></html>
> >>>
> >>> On Thu, Jul 11, 2019 at 1:25 PM Sergey Beryozkin <[email protected]> wrote:
> >>> >
> >>> > Hi
> >>> >
> >>> > I've used Tika to parse this invoice PDF:
> >>> >
> >>> > https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
> >>> >
> >>> > (AutoDetectParser, ToTextContentHandler), see below what is returned.
> >>> > The numbers like (1), (2) were added by me; they show the preferred order (approximately).
> >>> >
> >>> > Is it possible to hint somehow to Tika how to report the content?
> >>> >
> >>> > Thanks Sergey
> >>> >
> >>> > PDF Invoice Example
> >>> > Invoice
> >>> >
> >>> > (5)Payment is due within 30 days from date of invoice. Late payment is subject to fees of 5% per month.
> >>> >
> >>> > Thanks for choosing DEMO - Sliced Invoices | [email protected]
> >>> >
> >>> > Page 1/1
> >>> >
> >>> > (2)From:
> >>> >
> >>> > DEMO - Sliced Invoices
> >>> >
> >>> > Suite 5A-1204
> >>> >
> >>> > 123 Somewhere Street
> >>> >
> >>> > Your City AZ 12345
> >>> >
> >>> > [email protected]
> >>> >
> >>> > (1)Invoice Number INV-3337
> >>> >
> >>> > Order Number 12345
> >>> >
> >>> > Invoice Date January 25, 2016
> >>> >
> >>> > Due Date January 31, 2016
> >>> >
> >>> > Total Due $93.50
> >>> >
> >>> > (3)To:
> >>> >
> >>> > Test Business
> >>> >
> >>> > 123 Somewhere St
> >>> >
> >>> > Melbourne, VIC 3000
> >>> >
> >>> > [email protected]
> >>> >
> >>> > (4) Hrs/Qty Service Rate/Price Adjust Sub Total
> >>> >
> >>> > 1.00
> >>> > Web Design
> >>> > This is a sample description...
> >>> >
> >>> > $85.00 0.00% $85.00
> >>> >
> >>> > Sub Total $85.00
> >>> >
> >>> > Tax $8.50
> >>> >
> >>> > Total $93.50
> >>> >
> >>> > (5) ANZ Bank
> >>> >
> >>> > ACC # 1234 1234
> >>> >
> >>> > BSB # 4321 432 Pa
> >>> > id
>
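
For reference, the approach Tim describes in the thread (exclude the PDFParser from the default parser, then add a separately configured one back in) might look roughly like the fragment below. This is only a sketch, assuming the Tika 1.x tika-config.xml format; the `sortByPosition` parameter name and its `bool` type are assumptions based on the setter discussed in the thread, so check them against the linked test config before relying on them.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default parser with the stock PDFParser excluded -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- PDFParser added back in with its own settings (parameter name assumed) -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

A file like this would typically be loaded into a TikaConfig and passed to the AutoDetectParser constructor; without the exclusion, the default parser would presumably still supply its own unconfigured PDFParser.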
