Hey,

I'm testing Tika with these document formats:
html, doc, docx, odt, txt, pdf, odf, odp, xls, xlsx, ppt.
Text extraction works fine for all of them except PPT.

If I create a PPT like this:

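import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hslf.model.Slide;
import org.apache.poi.hslf.model.TextBox;
import org.apache.poi.hslf.usermodel.RichTextRun;
import org.apache.poi.hslf.usermodel.SlideShow;

private void createPPTDocument(String from, File file) throws IOException {
      SlideShow ppt = new SlideShow();
      Slide slide = ppt.createSlide();

      // one text box holding the whole input string
      TextBox shape = new TextBox();
      RichTextRun rt = shape.getTextRun().getRichTextRuns()[0];
      shape.setText(from);
      rt.setFontSize(7);
      slide.addShape(shape);
      shape.setAnchor(new java.awt.Rectangle(50, 50, 500, 300));
      // the same shape is added to the slide again after anchoring
      slide.addShape(shape);

      FileOutputStream out = new FileOutputStream(file);
      try {
            ppt.write(out);
      } finally {
            out.close();
      }
}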

and then extract its text:

Tika tika = new Tika();
// maxCharCount is a long; setMaxStringLength expects an int
tika.setMaxStringLength((int) maxCharCount);
// is: InputStream opened on the generated .ppt file
String text = tika.parseToString(is);


then the extracted text (handler.toString(), or the text variable above)
contains all of the text content twice.
I'm attaching a sample file. Text extraction outputs:

To enable representatives of Serbian civil society organisations to visit
the EESC and to become acquainted with its activities.
To enable representatives of Serbian civil society organisations to visit
the EESC and to become acquainted with its activities.
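
The duplication can also be confirmed programmatically rather than by eye.
This is only a minimal sketch: DuplicateCheck and countOccurrences are
hypothetical names used for illustration and are not part of Tika or POI,
and the string in main is just the sample output pasted from above.

```java
public class DuplicateCheck {

    /** Counts non-overlapping occurrences of needle in haystack. */
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0; // avoid an endless loop on an empty search string
        }
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by tika.parseToString(is)
        String extracted =
              "To enable representatives of Serbian civil society organisations to visit\n"
            + "the EESC and to become acquainted with its activities.\n"
            + "To enable representatives of Serbian civil society organisations to visit\n"
            + "the EESC and to become acquainted with its activities.\n";

        // A phrase that appears once in the source text
        int hits = countOccurrences(extracted, "Serbian civil society organisations");
        System.out.println(hits); // prints 2 for the sample output above
    }
}
```

A count above 1 for a phrase that occurs once in the input string means
the text really is present twice in the extracted output.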

To be honest, I can't figure out what is going wrong in the HSLFExtractor.

Attachment: en.ppt
Description: MS-Powerpoint presentation
