Re: Extract link annotations (hyperlinks) with tika app?

Tim Allison Thu, 28 Feb 2019 09:21:45 -0800

Got it.  Thank you for following up.  Please do let us know if you have any
other surprises.


On Thu, Feb 28, 2019 at 9:25 AM Svensson, Kristian <
[email protected]> wrote:

> Ok! Looking at some other PDF files, it seems to be extracting the links.
> The one that did not work looks like it has a clickable link in PDF-XChange
> Viewer, but it's probably the viewer itself that interprets the URL from the
> free text and makes it clickable. All is well, my mistake!
>
>
>
> Thank you for your answer!
>
>
>
> *From:* Tim Allison [mailto:[email protected]]
> *Sent:* Thu February 2019 15:14
> *To:* [email protected]
> *Subject:* Re: Extract link annotations (hyperlinks) with tika app?
>
>
>
> Hmmmm....we should be extracting links. Tilman's code on SO is slightly
> different from ours at this point, but ours should be working, with the
> caveat that we aren't capturing the anchor text as Tilman's code does:
> we're just repeating the link as the anchor text, and we're dumping the
> hrefs at the end of the page rather than trying to integrate them
> where they actually belong in the text.
>
>
>
> We have one unit test for this:
>
>
>
>     @Test
>     public void testLinks() throws Exception {
>         final XMLResult result = getXML("testPDFVarious.pdf");
>         assertContains("<div class=\"annotation\"><a href=\"http://tika.apache.org/\">" +
>                 "http://tika.apache.org/</a></div>", result.xml);
>     }
>
>
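For anyone who wants to pull the hrefs back out of the XHTML output, here is a minimal sketch that uses only JDK classes. It assumes the markup has the shape asserted in the unit test above (a `<div class="annotation">` wrapping an `<a href="...">`). The class name `AnnotationLinks` and the method `extractHrefs` are made up for illustration and are not part of Tika. A regex is used for brevity; it does not unescape XML entities such as `&amp;` in the href, and a real XML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: collects hrefs from annotation divs.
final class AnnotationLinks {

    // Assumed markup shape, taken from the unit test in this thread:
    // <div class="annotation"><a href="...">...</a></div>
    private static final Pattern ANNOTATION_LINK = Pattern.compile(
            "<div class=\"annotation\">\\s*<a href=\"([^\"]*)\"");

    /** Returns the raw href values of all annotation links, in document order. */
    static List<String> extractHrefs(String xhtml) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = ANNOTATION_LINK.matcher(xhtml);
        while (m.find()) {
            hrefs.add(m.group(1)); // group 1 is the text between the href quotes
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Sample input mimicking one page of XHTML output with a trailing annotation div.
        String sample = "<div class=\"page\"><p>some text</p>"
                + "<div class=\"annotation\"><a href=\"http://tika.apache.org/\">"
                + "http://tika.apache.org/</a></div></div>";
        System.out.println(extractHrefs(sample)); // prints [http://tika.apache.org/]
    }
}
```

The input would be whatever XHTML the app produces for the PDF (for example via its `-x` option); only links stored as real link annotations will show up, not URLs that a viewer merely detects in the text.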
>
> Is there any chance that you have extractAnnotationText set to false?  The
> default is true and is required to be true to extract hrefs.
>
>
>
> This could be a bug, though...let us know...
>
>
>
>
>
> On Thu, Feb 28, 2019 at 4:56 AM Svensson, Kristian <
> [email protected]> wrote:
>
> Using the tika app (tika-app-1.20.jar), is it possible to extract link
> annotations (hyperlinks)? Ideally I would like to get an href in the XHTML
> output. I couldn't find any documentation on this.
>
> I found out that pdfbox can extract link annotations:
>
> https://stackoverflow.com/questions/38587567/how-to-extract-hyperlink-information-pdfbox
> But I'm not sure how to use this with the tika app.
>
> I think the tika app is using pdfbox for pdf content extraction, but I
> might be wrong😊
>
> Any help greatly appreciated!
>
> Best Regards,
>
> Kristian
