Re: Extract URLs from a document

2020-11-12 Thread nensick
Hello, I noticed that the behavior differs depending on the file format. It works quite well with PDF, but we have issues with DOCX (Microsoft Word OOXML documents): Tika doesn't extract any links from DOCX. Is this a known issue? Thanks a lot.
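As a way to cross-check what a DOCX file actually contains, here is a minimal sketch that reads hyperlink targets straight from the OOXML package, using only the JDK. It is not Tika code and not something proposed in this thread. The class name DocxLinks and the method extract are hypothetical. It assumes the links are stored as hyperlink relationships in the package's .rels parts, so links written as HYPERLINK field codes or internal bookmarks will not show up.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: lists hyperlink relationship targets in a DOCX.
final class DocxLinks {

    /** Returns the Target of every hyperlink relationship found in the package's .rels parts. */
    static List<String> extract(InputStream docx) throws Exception {
        List<String> links = new ArrayList<>();

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the input may be an untrusted file.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        // A DOCX file is a ZIP archive; relationships live in parts ending in ".rels".
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().endsWith(".rels")) {
                    continue;
                }
                // Buffer the entry so the XML parser cannot close the ZIP stream.
                byte[] xml = zip.readAllBytes();
                NodeList rels = builder.parse(new ByteArrayInputStream(xml))
                        .getElementsByTagNameNS("*", "Relationship");
                for (int i = 0; i < rels.getLength(); i++) {
                    Element rel = (Element) rels.item(i);
                    // Matching on the suffix covers both the transitional and strict type URIs.
                    if (rel.getAttribute("Type").endsWith("/relationships/hyperlink")) {
                        links.add(rel.getAttribute("Target"));
                    }
                }
            }
        }
        return links;
    }
}
```

Calling DocxLinks.extract(new java.io.FileInputStream("sample.docx")) on a file of your own (the file name is only a placeholder) should return the link targets. If the list is non-empty while Tika reports nothing, the links are in the file and the gap is on the extraction side.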

RE: Extract URLs from a document

2020-11-12 Thread Markus Jelsma
://tika.apache.org/1.19/api/org/apache/tika/sax/LinkContentHandler.html -Original message- > From:Nick Burch > Sent: Thursday 12th November 2020 12:55 > To: nensick > Cc: user@tika.apache.org > Subject: Re: Extract URLs from a document > > On Wed, 11 Nov 2020, nensick wrote:

Re: Extract URLs from a document

2020-11-12 Thread Nick Burch
On Wed, 11 Nov 2020, nensick wrote: I am exploring the available features and I also managed to extract Office macros, but I still can't find a way to get the links. Imagine having a PDF or a DOCX in which you have a "click here" text as a link pointing to a website (let's say example[.]com). Ho

Extract URLs from a document

2020-11-11 Thread nensick
Hello, I am a new Tika user and it is amazing, so thanks for your effort. I am exploring the available features and I also managed to extract Office macros, but I still can't find a way to get the links. Imagine having a PDF or a DOCX in which you have a "click here" text as a link pointing to a w