On Tue, 3 Mar 2015, Herm Flink wrote:
> Does anyone here know of plans to support the new Visio 2013 (.vsdx)
> format for text extraction?
> I know this format is not an OLE2 format, so the VisioTextExtractor
> cannot be used here (it only supports .vsd). Also, using the
> ExtractorFactory (from POI 3.11) does not work: Invalid OOXML Package
> received - expected 1 core document, found 0

Apache Tika is probably a more likely home for this. It would still be
based on Apache POI, at least for the OPC stuff, but the
extraction/mapping onto the xhtml+text elements might fit better there.

> So if anyone knows if/when this format will be supported for text
> extraction

A few days after the patch hits the bug tracker ;-)

> or can point me to directions on how to do it myself, I would be very
> thankful.

OOXML/OPC files are a zip of xml files, so I'd suggest creating a few
small sample .vsdx files. Unzip them, and look at the xml, searching for
the text you want. Assume that the structure will be similar to the .vsd
one, so look at the VSD code in POI to get more of an idea. Then, suggest
something!
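
As a rough starting point for that first step, here is a minimal sketch
that uses only the JDK's java.util.zip classes (no POI needed) to list
which XML parts of a .vsdx contain a given string. The class and method
names (VsdxGrep, partsContaining) are made up for illustration, and it
assumes the parts are UTF-8 and that the text you typed appears unescaped
in the raw XML.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of POI or Tika.
public class VsdxGrep {

    /** Returns the names of the .xml parts whose raw content contains needle. */
    static List<String> partsContaining(String path, String needle) throws IOException {
        List<String> hits = new ArrayList<>();
        // A .vsdx is an OPC package, i.e. an ordinary zip file.
        try (ZipFile zip = new ZipFile(path)) {
            for (ZipEntry entry : Collections.list(zip.entries())) {
                if (entry.isDirectory() || !entry.getName().endsWith(".xml")) {
                    continue; // skip folders, .rels files, images, etc.
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    // Assumption: parts are UTF-8 encoded (readAllBytes needs Java 9+).
                    String xml = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    if (xml.contains(needle)) {
                        hits.add(entry.getName());
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java VsdxGrep sample.vsdx "text typed into a shape"
        for (String name : partsContaining(args[0], args[1])) {
            System.out.println(name);
        }
    }
}
```

Running it against a few small sample files, each with a distinctive
string typed into a shape, should narrow down which parts and elements
hold the text before any extractor code gets written.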
Nick