Got it. Thank you for following up. Please do let us know if you have any other surprises.
On Thu, Feb 28, 2019 at 9:25 AM Svensson, Kristian <[email protected]> wrote:

> Ok! Looking at some other pdf-files it seems to be extracting the links.
> The one that did not work looks like a clickable link in PDF-Xchange
> Viewer, but it's probably the viewer itself that interprets the url from the
> free-text and makes that clickable. All is well, my mistake!
>
> Thank you for your answer!
>
> *From:* Tim Allison [mailto:[email protected]]
> *Sent:* Thu, February 2019 15:14
> *To:* [email protected]
> *Subject:* Re: Extract link annotations (hyperlinks) with tika app?
>
> Hmmmm....we should be extracting links. Tilman's code on SO is slightly
> different from ours at this point, but ours should be working with the
> caveat that we aren't capturing the anchor text as Tilman's code does ---
> we're just repeating the link as the anchor text, and we're dumping the
> hrefs at the end of the page, we're not currently trying to integrate hrefs
> where they actually belong in the text.
>
> We have one unit test for this:
>
>     @Test
>     public void testLinks() throws Exception {
>         final XMLResult result = getXML("testPDFVarious.pdf");
>         assertContains("<div class=\"annotation\"><a href=\"http://tika.apache.org/\">" +
>                 "http://tika.apache.org/</a></div>", result.xml);
>     }
>
> Is there any chance that you have extractAnnotationText set to false? The
> default is true and is required to be true to extract hrefs.
>
> This could be a bug, though...let us know...
>
> On Thu, Feb 28, 2019 at 4:56 AM Svensson, Kristian <[email protected]> wrote:
>
> Using the tika app (tika-app-1.20.jar), is it possible to extract link
> annotations (hyperlinks)? Ideally I would like to get a href in the xhtml
> output. I failed to find any documentation regarding this.
>
> I found out that pdfbox can extract link annotations:
>
> https://stackoverflow.com/questions/38587567/how-to-extract-hyperlink-information-pdfbox
>
> But I'm not sure how to use this with the tika app.
>
> I think the tika app is using pdfbox for pdf content extraction, but I
> might be wrong😊
>
> Any help greatly appreciated!
>
> Best Regards,
>
> Kristian
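
For anyone landing on this thread later: since the hrefs are emitted as `<div class="annotation"><a href="...">` blocks in the XHTML (the shape asserted in the unit test quoted above), one way to collect them from tika-app output is a small post-processing step. The following is only a rough sketch under that assumption. The class name `AnnotationLinks` and the method `extractHrefs` are made up for illustration and are not part of Tika, and the regex does not decode XML entities such as `&amp;` inside the href.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: pulls link-annotation hrefs out of XHTML text.
class AnnotationLinks {

    // Assumes the markup shape from the unit test quoted in this thread:
    // <div class="annotation"><a href="...">
    private static final Pattern ANNOTATION_HREF = Pattern.compile(
            "<div\\s+class=\"annotation\">\\s*<a\\s+href=\"([^\"]*)\"");

    // Returns the href values in document order; entities are left undecoded.
    static List<String> extractHrefs(String xhtml) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = ANNOTATION_HREF.matcher(xhtml);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    // Reads XHTML from standard input and prints one href per line.
    public static void main(String[] args) throws Exception {
        String xhtml = new String(System.in.readAllBytes(), StandardCharsets.UTF_8);
        for (String href : extractHrefs(xhtml)) {
            System.out.println(href);
        }
    }
}
```

With Java 11 or later this could be used roughly as `java -jar tika-app-1.20.jar --xml file.pdf | java AnnotationLinks.java`, where `file.pdf` is a placeholder. As noted earlier in the thread, URLs that a viewer merely detects in free text are not link annotations, so they would not show up here.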
