This is one way to access the underlying CTShape that contains the text:
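import java.io.FileInputStream;
import org.apache.poi.xssf.usermodel.*;

// f is the .xlsx file to inspect
XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
XSSFSheet sheet = wb.getSheetAt(0);
// Returns the sheet's existing drawing, or creates one if there is none
XSSFDrawing drawing = sheet.createDrawingPatriarch();
for (XSSFShape shape : drawing.getShapes()) {
    if (shape instanceof XSSFSimpleShape) {
        XSSFSimpleShape simple = (XSSFSimpleShape) shape;
        // CTShape is the XMLBeans object that holds the txBody element
        System.out.println("CT: " + simple.getCTShape());
    }
}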
Hiroshi, if this is a high priority, you could extract the txBody element
from the CTShape with some XMLBeans work. I've opened
https://issues.apache.org/jira/browse/TIKA-1150 for the longer-term fix.
There's some work under way on XSSFTextCell in POI that might make this more
straightforward.
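As a stopgap until TIKA-1150 is resolved, the same txBody content can also be
read without going through POI's object model at all. The sketch below is only
an illustration under a few assumptions: it treats the .xlsx as a plain zip,
assumes the drawing parts are stored under xl/drawings/ (the usual location,
though the package format permits others), and joins the a:t runs of each
paragraph inside every xdr:txBody. The class name XlsxShapeText and the method
extractShapeText are made up for this example and are not part of POI or Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of POI or Tika.
public class XlsxShapeText {
    // Namespaces used by SpreadsheetML drawing parts.
    static final String XDR_NS =
        "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing";
    static final String A_NS =
        "http://schemas.openxmlformats.org/drawingml/2006/main";

    /** Returns one string per non-empty paragraph found in any xdr:txBody. */
    public static List<String> extractShapeText(InputStream xlsx) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

        List<String> paragraphs = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Assumption: drawing parts are stored as xl/drawings/*.xml
                if (!name.startsWith("xl/drawings/") || !name.endsWith(".xml")) {
                    continue;
                }
                // Copy the entry first so the XML parser cannot close the zip stream.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                Document doc = dbf.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buf.toByteArray()));
                collect(doc, paragraphs);
            }
        }
        return paragraphs;
    }

    // Walks txBody -> a:p -> a:t and concatenates the runs of each paragraph.
    private static void collect(Document doc, List<String> out) {
        NodeList bodies = doc.getElementsByTagNameNS(XDR_NS, "txBody");
        for (int i = 0; i < bodies.getLength(); i++) {
            NodeList paras = ((Element) bodies.item(i)).getElementsByTagNameNS(A_NS, "p");
            for (int j = 0; j < paras.getLength(); j++) {
                NodeList runs = ((Element) paras.item(j)).getElementsByTagNameNS(A_NS, "t");
                StringBuilder sb = new StringBuilder();
                for (int k = 0; k < runs.getLength(); k++) {
                    sb.append(runs.item(k).getTextContent());
                }
                if (sb.length() > 0) {
                    out.add(sb.toString());
                }
            }
        }
    }
}
```

Calling XlsxShapeText.extractShapeText(new FileInputStream(f)) on the same
file as above should return the shape text one paragraph per list entry. It
ignores which sheet a drawing belongs to, so treat it as a quick check rather
than a replacement for proper support in POI or Tika.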
-----Original Message-----
From: Allison, Timothy B. [mailto:[email protected]]
Sent: Monday, July 22, 2013 8:50 AM
To: [email protected]
Subject: RE: How to extract autoshape text in Excel 2007+
This looks like an area for a new feature in both Tika and POI. I've only
looked very briefly into the POI libraries, and I may have missed how to
extract text from autoshapes. I'll open an issue in both projects.
-----Original Message-----
From: Hiroshi Tatsumi [mailto:[email protected]]
Sent: Sunday, July 21, 2013 10:16 AM
To: [email protected]
Subject: How to extract autoshape text in Excel 2007+
Hi,
I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text from Excel 2007+ (.xlsx) files, but I can't.
I tried extracting it from several MS Office file types.
The results are below.
Success (I can extract autoshape text):
- Excel 2003 (.xls)
- Word 2003 (.doc)
- Word 2007+ (.docx)
Failed (I cannot extract autoshape text):
- Excel 2007+ (.xlsx)
Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?
Thanks,
Hiro.