Hi,

According to TIKA-2069, extracting macros from Office documents will be enabled in Tika 1.14. Are there any plans to release that soon? If not, is there a snapshot repository I can reference in my Maven project so that I can use the latest and greatest?

Also, what about POI? It seems to have no maintainer these days. Is anyone here actively patching it for bugs, or will I need to do that myself as I find them?
Related to this, I see that if I embed a link into a Word document and then parse it, I am given a startElement event for the 'a' element, but looking at the attributes supplied to the call, I don't see the target for the link, though the next characters event does give me the text of the link. How can I extract the target? What I get is one attribute:

    Attr(0).localName = name
    Attr(0).QName = name
    Attr(0).Type = CDATA
    Attr(0).URI =
    Attr(0).Value = _GoBack

I can't quite see why the value for "name" would be "_GoBack" unless there is something I am missing about the format of links in (in this case) a Word .docm document. I believe this has to do with supplying display text for the link rather than just putting in a direct http reference (where the display text is the link itself). When I add just the href target without any display text, I do get the target value in the href attribute, which is missing when the link has display text.

Any thoughts from anyone? I can quite believe that I am abusing the API in some way, as I have only just started using it.

Finally, when dealing with composite documents and archives such as zip, should I handle these with a ParsingEmbeddedDocumentExtractor and create a new handler for each embedded document? I don't need to extract the embedded docs as in the example for this, just parse them as new documents and build my own information structure, which I will track myself. This seems to be the way to do it.

I can experiment and work all this out, but if anyone has any pointers or links to examples other than the ones that come with the source, I would be grateful.

Thanks for all the free software!

Regards,
Jim
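
P.S. For reference, here is a minimal sketch of a SAX handler that prints attributes in the form shown above. It is an illustration only, not the exact code used: the class name AttributeDumpHandler and its getLines method are made up for this example, it uses only the JDK's org.xml.sax classes, and the 'a' event is simulated by hand rather than produced by a Tika parser. The XHTML namespace passed to startElement is likewise an assumption about what the parser would supply.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

import java.util.ArrayList;
import java.util.List;

// Hypothetical handler: records every attribute seen on an 'a' element.
public class AttributeDumpHandler extends DefaultHandler {

    // Collected lines, kept so the dump can be inspected as well as printed.
    private final List<String> lines = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"a".equals(localName)) {
            return; // only link elements are of interest here
        }
        for (int i = 0; i < atts.getLength(); i++) {
            lines.add("Attr(" + i + ").localName = " + atts.getLocalName(i));
            lines.add("Attr(" + i + ").QName = " + atts.getQName(i));
            lines.add("Attr(" + i + ").Type = " + atts.getType(i));
            lines.add("Attr(" + i + ").URI = " + atts.getURI(i));
            lines.add("Attr(" + i + ").Value = " + atts.getValue(i));
        }
    }

    public List<String> getLines() {
        return lines;
    }

    public static void main(String[] args) {
        // Simulate the event described above: an 'a' element carrying only
        // a "name" attribute and no "href".
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "name", "name", "CDATA", "_GoBack");

        AttributeDumpHandler handler = new AttributeDumpHandler();
        handler.startElement("http://www.w3.org/1999/xhtml", "a", "a", atts);
        handler.getLines().forEach(System.out::println);

        // Attributes.getValue(qName) returns null when the attribute is absent,
        // which is how the missing link target shows up.
        System.out.println("href present: " + (atts.getValue("href") != null));
    }
}
```

In real use the same handler would be passed as the ContentHandler to a parse call instead of being driven by hand. A null result from getValue("href") is then a quick way to spot the 'a' elements that carry only a name attribute.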