Hi,

According to TIKA-2069, extracting macros from Office documents will be 
enabled in Tika 1.14. Are there any plans to release that soon? If not, is 
there a snapshot repository I can reference from my Maven project so that I can 
use the latest build? Also, what about POI? It seems to have no maintainer 
these days. Is anyone here actively patching it for bugs, or will I need to do 
that myself as I find them?

Related to this, I see that if I embed a link into a Word document and then 
parse it, I get a startElement event for the 'a' element, but the attributes 
supplied to the call do not include the target of the link. The next 
characters event does give me the text of the link. How can I extract the 
target?

What I get is one attribute:

Attr(0).localName = name
Attr(0).QName = name
Attr(0).Type = CDATA
Attr(0).URI =
Attr(0).Value = _GoBack

I can't quite see why the value for "name" would be "_GoBack", unless I am 
missing something about the format of links in (in this case) a Word .docm 
document. I believe this has to do with supplying display text for the link 
rather than just typing in a direct http reference (where the display text is 
the link itself). When I add just the target without any display text, I do 
get the target value in the href attribute (which is not given when the link 
is covered by text).
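
In case it helps anyone reproduce this, below is a minimal sketch of the kind 
of handler that shows the behaviour. It is not my exact code: the class name 
LinkAttributeDumper and the "name = value" output format are made up for 
illustration, and it uses only the JDK SAX classes, so the main method feeds 
it a hand-built 'a' element instead of a real Tika parse.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: records every attribute of each 'a' element,
// followed by the text seen between its start and end events.
public class LinkAttributeDumper extends DefaultHandler {
    private final List<String> lines = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inLink = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Fall back to qName when the producer is not namespace-aware.
        String element = localName.isEmpty() ? qName : localName;
        if (!"a".equals(element)) {
            return;
        }
        inLink = true;
        text.setLength(0);
        for (int i = 0; i < atts.getLength(); i++) {
            String name = atts.getLocalName(i).isEmpty() ? atts.getQName(i) : atts.getLocalName(i);
            lines.add(name + " = " + atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only collect text that falls inside an 'a' element.
        if (inLink) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String element = localName.isEmpty() ? qName : localName;
        if (inLink && "a".equals(element)) {
            lines.add("text = " + text);
            inLink = false;
        }
    }

    public List<String> getLines() {
        return lines;
    }

    public static void main(String[] args) {
        // Simulate the events described above: an 'a' element carrying
        // only a "name" attribute, followed by the link text.
        LinkAttributeDumper dumper = new LinkAttributeDumper();
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "name", "name", "CDATA", "_GoBack");
        dumper.startElement("", "a", "a", atts);
        char[] chars = "link text".toCharArray();
        dumper.characters(chars, 0, chars.length);
        dumper.endElement("", "a", "a");
        dumper.getLines().forEach(System.out::println);
    }
}
```

In a real run the same handler would be passed to the parser as the 
ContentHandler in place of the hand-built events in main. With the simulated 
events it records the "name" attribute and the link text, but no href, which 
is the situation I am asking about.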

Any thoughts from anyone? I can quite believe that I am abusing the API in some 
way as I have only just started using it.

Finally, when dealing with composite documents and archives such as zip, should 
I handle these with a ParsingEmbeddedDocumentExtractor and create a new handler 
for each embedded document? I don't need to extract the embedded documents to 
disk as the bundled example does, just parse them as new documents and build my 
own information structure, which I will track myself. This seems to be the way 
to do it. I can experiment and work all this out, but if anyone has any 
pointers or links to examples other than the ones that come with the source, I 
would be grateful.

Thanks for all the free software!

Regards,

Jim
