Hi Tim
This will help for sure, will try after my PTO
Thanks Sergey

On Thu 18 Jul 2019, 13:14 Tim Allison, <[email protected]> wrote:

> Hi Sergey,
>
>   Sorry, I thought I hit send on this yesterday...
>
>   In reverse order, I used the ToXMLContentHandler:
>
> https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaTest.java#L208
>
>   For configuring via tika-config.xml, see, e.g.:
>
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/org/apache/tika/parser/pdf/tika-config.xml
>
>   The trick, though, is to exclude the PDFParser from the default
> parser and then add the custom configured one back in:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066
>
>   Let me know if you have any surprises.
>
>           Best,
>
>                     Tim
>
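On the ContentHandler point above: a text-only handler and a tag-emitting handler receive the same SAX events and differ only in what they write out. The sketch below illustrates that with plain JDK SAX classes rather than Tika's own handlers. TextOnlyHandler and TagEmittingHandler are hypothetical names made up for this illustration; the real ToTextContentHandler and ToXMLContentHandler deal with escaping, attributes and namespaces, which this sketch ignores.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerSketch {

    /** Hypothetical handler that keeps only character data. */
    static class TextOnlyHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Hypothetical handler that also writes element names back out as tags
     *  (no attributes, no escaping; illustration only). */
    static class TagEmittingHandler extends TextOnlyHandler {
        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }
    }

    /** Feeds the same SAX event stream to whichever handler is passed in. */
    static String render(String xml, TextOnlyHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<body><p>Invoice</p><p>Total Due $93.50</p></body>";
        System.out.println(render(xml, new TextOnlyHandler()));
        System.out.println(render(xml, new TagEmittingHandler()));
    }
}
```

With the first handler the paragraph boundaries disappear from the output; with the second they survive as tags, which is the kind of hook a custom handler sitting on top could use.
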
> On Wed, Jul 17, 2019 at 12:20 PM Sergey Beryozkin <[email protected]>
> wrote:
> >
> > Hi Tim,
> >
> > How does one configure PDFParserConfig in tika-config.xml? Maybe as
> > one of the PDFParser properties?
> > PDFParser.setSortByPosition (and other simple setters) are deprecated, so
> > setting 'sortByPosition' as one of the PDFParser properties goes via the
> > deprecated call path (probably not a big deal though :-))
> > I also looked at the source and I'm still not sure which ContentHandler
> > you used to get the HTML tags added.
> > (I may experiment with a custom one sitting on top of it, maybe adding
> > the table tags...)
> > Sergey
> >
> > On Thu, Jul 11, 2019 at 9:52 PM Sergey Beryozkin <[email protected]>
> wrote:
> >>
> >> Hi Tim
> >>
> >> Thanks, I'm going to experiment with different, sufficiently complex
> >> PDFs in order to figure out how to enhance the Quarkus Tika extension, what
> >> to let users customize, etc. (I'll link to it in a follow-up email).
> >> Your output looks better :-) Which ContentHandler did you use?
> >>
> >> Sergey
> >>
> >> On Thu, Jul 11, 2019 at 7:23 PM Tim Allison <[email protected]>
> wrote:
> >>>
> >>> Might not need to break out the neural nets just yet...try turning on
> >>> sortByPosition via the PDFParserConfig and/or tika-config.xml.
> >>>
> >>> This is what you get:
> >>>
> >>>
> >>>
> >>> <title>PDF Invoice Example</title>
> >>> </head>
> >>> <body><div class="page"><p />
> >>> <p>Invoice
> >>> </p>
> >>> <p>From: Invoice Number INV-3337
> >>> </p>
> >>> <p>DEMO - Sliced Invoices Order Number 12345
> >>> Suite 5A-1204 Invoice Date January 25, 2016
> >>> 123 Somewhere Street Due Date January 31, 2016
> >>> Your City AZ 12345
> >>> [email protected] Total Due $93.50
> >>> </p>
> >>> <p>To:
> >>> Test Business
> >>> 123 Somewhere St
> >>> Melbourne, VIC 3000
> >>> [email protected]
> >>> </p>
> >>> <p>Hrs/Qty Service Rate/Price Adjust Sub Total
> >>> </p>
> >>> <p>1.00 Web DesignThis is a sample description... $85.00 0.00% $85.00
> >>> </p>
> >>> <p>Pa
> >>> idSub Total $85.00
> >>> </p>
> >>> <p>Tax $8.50
> >>> Total $93.50
> >>> </p>
> >>> <p>ANZ Bank
> >>> ACC # 1234 1234
> >>> BSB # 4321 432
> >>> </p>
> >>> <p>Payment is due within 30 days from date of invoice. Late payment is
> >>> subject to fees of 5% per month.
> >>> Thanks for choosing DEMO - Sliced Invoices | [email protected]
> >>> Page 1/1</p>
> >>> <p />
> >>> <div class="annotation"><a href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
> >>> <div class="annotation"><a href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
> >>> <div class="annotation"><a href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
> >>> <div class="annotation"><a href="mailto:[email protected]">mailto:[email protected]</a></div>
> >>> </div>
> >>> </body></html>
> >>>
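The reordering in the output above follows from the idea behind sortByPosition: a PDF may store its text runs in any order, so an extractor can sort them by where they sit on the page before emitting them. A minimal JDK-only sketch of that idea follows. It is not the actual PDFBox or Tika algorithm; the Run class, the coordinates and the exact comparison rule are assumptions made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    /** Hypothetical text run: its text plus a position, with y growing downwards. */
    static final class Run {
        final String text;
        final double x;
        final double y;

        Run(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns a copy ordered top-to-bottom, then left-to-right within the same y. */
    static List<Run> sortByPosition(List<Run> runs) {
        List<Run> sorted = new ArrayList<>(runs);
        sorted.sort(Comparator.comparingDouble((Run r) -> r.y)
                .thenComparingDouble(r -> r.x));
        return sorted;
    }

    /** Joins runs that share a y coordinate into one line of text. */
    static String toLines(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        Double lastY = null;
        for (Run r : runs) {
            if (lastY != null) {
                sb.append(r.y == lastY ? ' ' : '\n');
            }
            sb.append(r.text);
            lastY = r.y;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up storage order: footer first, the two header columns apart.
        List<Run> stored = Arrays.asList(
                new Run("Payment is due within 30 days", 50, 700),
                new Run("From:", 50, 100),
                new Run("Invoice Number INV-3337", 300, 100));
        System.out.println(toLines(sortByPosition(stored)));
    }
}
```

Sorting this way is also why side-by-side columns end up merged into single lines such as "From: Invoice Number INV-3337": runs at the same height are emitted together regardless of which column they belong to.
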
> >>> On Thu, Jul 11, 2019 at 1:25 PM Sergey Beryozkin <[email protected]>
> wrote:
> >>> >
> >>> > Hi
> >>> >
> >>> > I've used Tika to parse this invoice PDF:
> >>> >
> >>> > https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
> >>> >
> >>> > (AutoDetectParser, ToTextContentHandler), see below what is returned.
> >>> > The numbers like (1), (2) were added by me; they show the preferred
> >>> > order (approximately).
> >>> >
> >>> > Is it possible to somehow hint to Tika how to report the content?
> >>> >
> >>> > Thanks Sergey
> >>> >
> >>> > PDF Invoice Example
> >>> > Invoice
> >>> >
> >>> > (5)Payment is due within 30 days from date of invoice. Late payment
> >>> > is subject to fees of 5% per month.
> >>> >
> >>> > Thanks for choosing DEMO - Sliced Invoices |
> >>> > [email protected]
> >>> >
> >>> > Page 1/1
> >>> >
> >>> > (2)From:
> >>> >
> >>> > DEMO - Sliced Invoices
> >>> >
> >>> > Suite 5A-1204
> >>> >
> >>> > 123 Somewhere Street
> >>> >
> >>> > Your City AZ 12345
> >>> >
> >>> > [email protected]
> >>> >
> >>> > (1)Invoice Number INV-3337
> >>> >
> >>> > Order Number 12345
> >>> >
> >>> > Invoice Date January 25, 2016
> >>> >
> >>> > Due Date January 31, 2016
> >>> >
> >>> > Total Due $93.50
> >>> >
> >>> > (3)To:
> >>> >
> >>> > Test Business
> >>> >
> >>> > 123 Somewhere St
> >>> >
> >>> > Melbourne, VIC 3000
> >>> >
> >>> > [email protected]
> >>> >
> >>> > (4) Hrs/Qty Service Rate/Price Adjust Sub Total
> >>> >
> >>> > 1.00
> >>> > Web Design
> >>> > This is a sample description...
> >>> >
> >>> > $85.00 0.00% $85.00
> >>> >
> >>> > Sub Total $85.00
> >>> >
> >>> > Tax $8.50
> >>> >
> >>> > Total $93.50
> >>> >
> >>> > (5) ANZ Bank
> >>> >
> >>> > ACC # 1234 1234
> >>> >
> >>> > BSB # 4321 432 Pa
> >>> > id
>
