On Tue, 3 Mar 2015, Herm Flink wrote:
> Does anyone here know of plans to support the new Visio 2013 (.vsdx) format for text extraction?
>
> I know this format is not an OLE2 format, so the VisioTextExtractor cannot be used here (it only supports .vsd). Using the ExtractorFactory (from POI 3.11) does not work either; it fails with: Invalid OOXML Package received - expected 1 core document, found 0

Apache Tika is probably a more likely home for this. It would still be based on Apache POI, at least for the OPC handling, but the extraction and mapping onto the XHTML and text elements might fit better there.

> So if anyone knows if/when this format will be supported for text extraction

A few days after the patch hits the bug tracker ;-)

> or can point me to directions on how to do it myself, I would be very thankful.

OOXML/OPC files are zip archives of XML files, so I'd suggest creating a few small sample .vsdx files. Unzip them and look at the XML, searching for the text you want. Assume that the structure will be similar to the .vsd one, so look at the VSD code in POI to get more of an idea. Then, suggest something!

Nick
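
The "unzip and search the XML" step suggested above can be sketched in a few lines. This is only an exploratory helper, not part of POI or Tika: the function name find_text_in_package is made up for illustration, and it assumes nothing about the .vsdx layout beyond the package being a zip whose parts of interest end in .xml. It reports which part and which element directly contain a known string, which is a reasonable starting point for working out where the text lives.

```python
import zipfile
import xml.etree.ElementTree as ET


def find_text_in_package(path, needle):
    """Return (part name, element tag) pairs whose direct text contains needle.

    Hypothetical helper for exploring a sample file; it makes no
    assumption about which parts or elements hold the text.
    """
    hits = []
    with zipfile.ZipFile(path) as package:
        for name in package.namelist():
            # Skip relationship parts, images, and other non-XML entries.
            if not name.endswith(".xml"):
                continue
            root = ET.fromstring(package.read(name))
            for element in root.iter():
                # Direct text of an element: its leading text plus any
                # text that follows each of its child elements.
                pieces = [element.text or ""]
                pieces.extend(child.tail or "" for child in element)
                if needle in "".join(pieces):
                    hits.append((name, element.tag))
    return hits
```

Calling find_text_in_package("sample.vsdx", "some text typed into a shape") on a small test drawing (the file name here is a placeholder) lists the parts worth reading by hand. Element tags come back in ElementTree's "{namespace}local" form when the part declares a namespace.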
