Hey, I'm testing Tika with these document formats: html, doc, docx, odt, txt, pdf, odf, odp, xls, xlsx, ppt. Text extraction works fine for all of them except PPT.
If I create a PPT like this:
    private void createPPTDocument(String from, File file) throws IOException {
        SlideShow ppt = new SlideShow();
        Slide slide = ppt.createSlide();
        TextBox shape = new TextBox();
        RichTextRun rt = shape.getTextRun().getRichTextRuns()[0];
        shape.setText(from);
        rt.setFontSize(7);
        slide.addShape(shape);
        shape.setAnchor(new java.awt.Rectangle(50, 50, 500, 300));
        // the same shape is added to the slide a second time here
        slide.addShape(shape);
        FileOutputStream out = new FileOutputStream(file);
        ppt.write(out);
        out.close();
    }
and extract it:
    Tika tika = new Tika();
    tika.setMaxStringLength(new Long(maxCharCount).intValue());
    String text = tika.parseToString(is);
then the text variable (or handler.toString(), if a content handler is used instead) contains all the text content twice.
I'm attaching a sample file. Text extraction outputs:
To enable representatives of Serbian civil society organisations to visit
the EESC and to become acquainted with its activities.
To enable representatives of Serbian civil society organisations to visit
the EESC and to become acquainted with its activities.
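To confirm the duplication in a test rather than by eye, something like the following could be used. It is only a rough sketch: the class and method names (DuplicateTextCheck, normalize, countOccurrences) are made up for illustration and are not part of Tika or POI, and the extracted string is hard-coded here as a stand-in for the value returned by tika.parseToString(is).

```java
class DuplicateTextCheck {

    /** Collapses every run of whitespace (including line breaks) to a single space. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Counts non-overlapping occurrences of needle in haystack. */
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        int index = haystack.indexOf(needle);
        while (index >= 0) {
            count++;
            index = haystack.indexOf(needle, index + needle.length());
        }
        return count;
    }

    public static void main(String[] args) {
        // The text that was written into the slide.
        String expected = "To enable representatives of Serbian civil society organisations "
                + "to visit the EESC and to become acquainted with its activities.";

        // Stand-in for the result of tika.parseToString(is); assumption: the
        // extractor inserts line breaks as in the output shown above.
        String extracted =
                "To enable representatives of Serbian civil society organisations to visit\n"
                + "the EESC and to become acquainted with its activities.\n"
                + "To enable representatives of Serbian civil society organisations to visit\n"
                + "the EESC and to become acquainted with its activities.\n";

        int count = countOccurrences(normalize(extracted), normalize(expected));
        System.out.println("occurrences: " + count);
    }
}
```

A count of 1 would be the expected result for a correct extraction; with the output above it comes out as 2.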
To be honest, I can't figure out what is going wrong in the HSLFExtractor.
en.ppt
Description: MS-Powerpoint presentation
