Hi Sergey,

  Sorry, I thought I hit send on this yesterday...

  Taking your questions in reverse order: I used the ToXMLContentHandler:
https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaTest.java#L208

  For configuring via tika-config.xml, see, e.g.:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/org/apache/tika/parser/pdf/tika-config.xml

  The trick, though, is to exclude the PDFParser from the DefaultParser
and then add the custom-configured one back in:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066
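
  A rough, untested sketch of that exclude-and-re-add pattern, in case
it saves a click. The sortByPosition param name and its "bool" type are
assumptions based on the PDFParserConfig setter and the test config
linked above, so please check them against your Tika version:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep every default parser except the stock PDFParser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Add the PDFParser back in with custom settings -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- assumed param name/type; see the linked test config -->
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

  Without the exclude, the DefaultParser would still carry its own
unconfigured PDFParser alongside the custom one.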

  Let me know if you run into any surprises.

          Best,

                    Tim

On Wed, Jul 17, 2019 at 12:20 PM Sergey Beryozkin <[email protected]> wrote:
>
> Hi Tim,
>
> How does one configure PDFParserConfig in tika-config.xml? Maybe as one of
> the PDFParser properties?
> PDFParser.setSortByPosition (and the other simple setters) are deprecated, so
> setting 'sortByPosition' as one of the PDFParser properties goes via the
> deprecated call path (probably not a big deal though :-))
> I also looked at the source and I'm still not sure which ContentHandler
> you used to get the HTML tags added.
> (I may experiment with a custom one sitting on top of it, maybe adding the
> table tags...)
> Sergey
>
> On Thu, Jul 11, 2019 at 9:52 PM Sergey Beryozkin <[email protected]> wrote:
>>
>> Hi Tim
>>
>> Thanks, I'm going to experiment with various sufficiently complex PDFs in
>> order to figure out how to enhance the Quarkus Tika extension, what to make
>> customizable, etc. (I'll link to it in a follow-up email).
>> Your output looks better :-). Which ContentHandler did you use?
>>
>> Sergey
>>
>> On Thu, Jul 11, 2019 at 7:23 PM Tim Allison <[email protected]> wrote:
>>>
>>> Might not need to break out the neural nets just yet...try turning on
>>> sortByPosition via the PDFParserConfig and/or tika-config.xml.
>>>
>>> This is what you get:
>>>
>>>
>>>
>>> <title>PDF Invoice Example</title>
>>> </head>
>>> <body><div class="page"><p />
>>> <p>Invoice
>>> </p>
>>> <p>From: Invoice Number INV-3337
>>> </p>
>>> <p>DEMO - Sliced Invoices Order Number 12345
>>> Suite 5A-1204 Invoice Date January 25, 2016
>>> 123 Somewhere Street Due Date January 31, 2016
>>> Your City AZ 12345
>>> [email protected] Total Due $93.50
>>> </p>
>>> <p>To:
>>> Test Business
>>> 123 Somewhere St
>>> Melbourne, VIC 3000
>>> [email protected]
>>> </p>
>>> <p>Hrs/Qty Service Rate/Price Adjust Sub Total
>>> </p>
>>> <p>1.00 Web DesignThis is a sample description... $85.00 0.00% $85.00
>>> </p>
>>> <p>Pa
>>> idSub Total $85.00
>>> </p>
>>> <p>Tax $8.50
>>> Total $93.50
>>> </p>
>>> <p>ANZ Bank
>>> ACC # 1234 1234
>>> BSB # 4321 432
>>> </p>
>>> <p>Payment is due within 30 days from date of invoice. Late payment is
>>> subject to fees of 5% per month.
>>> Thanks for choosing DEMO - Sliced Invoices | [email protected]
>>> Page 1/1</p>
>>> <p />
>>> <div class="annotation"><a
>>> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
>>> <div class="annotation"><a
>>> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
>>> <div class="annotation"><a
>>> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
>>> <div class="annotation"><a
>>> href="mailto:[email protected]">mailto:[email protected]</a></div>
>>> </div>
>>> </body></html>
>>>
>>> On Thu, Jul 11, 2019 at 1:25 PM Sergey Beryozkin <[email protected]> wrote:
>>> >
>>> > Hi
>>> >
>>> > I've used Tika to parse this invoice PDF:
>>> >
>>> > https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
>>> >
>>> > (AutoDetectParser, ToTextContentHandler); see below for what is returned.
>>> > I added the numbers like (1), (2) myself; they show the preferred
>>> > order (approximately).
>>> >
>>> > Is it possible to somehow hint to Tika how to order the content it reports?
>>> >
>>> > Thanks Sergey
>>> >
>>> > PDF Invoice Example
>>> > Invoice
>>> >
>>> > (5)Payment is due within 30 days from date of invoice. Late payment is 
>>> > subject to fees of 5% per month.
>>> >
>>> > Thanks for choosing DEMO - Sliced Invoices | [email protected]
>>> >
>>> > Page 1/1
>>> >
>>> > (2)From:
>>> >
>>> > DEMO - Sliced Invoices
>>> >
>>> > Suite 5A-1204
>>> >
>>> > 123 Somewhere Street
>>> >
>>> > Your City AZ 12345
>>> >
>>> > [email protected]
>>> >
>>> > (1)Invoice Number INV-3337
>>> >
>>> > Order Number 12345
>>> >
>>> > Invoice Date January 25, 2016
>>> >
>>> > Due Date January 31, 2016
>>> >
>>> > Total Due $93.50
>>> >
>>> > (3)To:
>>> >
>>> > Test Business
>>> >
>>> > 123 Somewhere St
>>> >
>>> > Melbourne, VIC 3000
>>> >
>>> > [email protected]
>>> >
>>> > (4) Hrs/Qty Service Rate/Price Adjust Sub Total
>>> >
>>> > 1.00
>>> > Web Design
>>> > This is a sample description...
>>> >
>>> > $85.00 0.00% $85.00
>>> >
>>> > Sub Total $85.00
>>> >
>>> > Tax $8.50
>>> >
>>> > Total $93.50
>>> >
>>> > (5) ANZ Bank
>>> >
>>> > ACC # 1234 1234
>>> >
>>> > BSB # 4321 432 Pa
>>> > id
