First, let me extend my appreciation for the work that has been done on
POI. To clarify my earlier comment: we are using POI 3.5 beta
specifically for extracting text from existing documents. For this
purpose, we don't consider it to be very stable. With the newer Office
formats, it sometimes consumes large amounts of memory and throws null
pointer exceptions. We are not sure whether this is due to a bug in the
underlying XML parser, but this is our experience anyway. We are trying
to obtain more information, but it's difficult since our system works
with thousands of documents and we don't have direct access to the
machines it is running on. In the end, for the newer Office documents,
we had to implement our own text extraction utility using SAX (see
attached).
jamie wrote:
Actually, I would hardly describe POI as stable. We get out of memory
errors on many documents, null pointer exceptions, etc.
Yegor Kozlov wrote:
Use the latest beta. It is very stable and many people (including POI
developers) use it in production.
POI-3.5-final will be released in August-September, there are some
features we want to implement before it, and a few more bugs that
need fixing.
Yegor
Hi there,
My project is supposed to upgrade to a newer version of POI pretty
soon. I am just wondering when POI 3.5 will become final, or whether it
is safe to use the beta in the meantime?
Thanks.
Yanling
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
---------------------------------------------------------------------
package com.stimulus.archiva.extraction;
import java.io.*;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import com.stimulus.archiva.exception.ExtractionException;
import com.stimulus.archiva.index.IndexInfo;
public class MS2007Extractor implements TextExtractor, Serializable
{
    private static final long serialVersionUID = 2121152151457122351L;
    protected static final Log logger = LogFactory.getLog(MS2007Extractor.class.getName());

    public Reader getText(InputStream is, Charset charset, IndexInfo indexInfo) throws ExtractionException {
        ZipInputStream zis = null;
        ZipEntry entry = null;
        File extractFile = null;
        OutputStreamWriter writer = null;
        Reader reader = null;
        try {
            // Extracted text is spooled to a temp file so large documents never sit in memory.
            extractFile = File.createTempFile("extract", ".tmp");
            zis = new ZipInputStream(is);
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            writer = new OutputStreamWriter(new FileOutputStream(extractFile), "UTF-8");
            SaxHandler handler = new SaxHandler(writer);
            while ((entry = zis.getNextEntry()) != null) {
                String name = entry.getName();
                // Only parts that carry text: Excel shared strings, the Word body, PowerPoint slides.
                if (name.endsWith("sharedStrings.xml") ||
                    name.endsWith("document.xml") ||
                    name.startsWith("ppt/slides/slide")) {
                    // getSize() is -1 when the size is not recorded in the local header, and the
                    // int cast would overflow past 2 GB; in both cases rely on the zip stream's
                    // own end-of-entry signal instead.
                    long size = entry.getSize();
                    int limit = (size < 0 || size > Integer.MAX_VALUE) ? Integer.MAX_VALUE : (int) size;
                    // The wrapper also keeps the parser from closing the underlying zip stream.
                    InputStream entryis = new FixedLengthInputStream(zis, limit);
                    Reader read = new InputStreamReader(entryis, "UTF-8");
                    InputSource isource = new InputSource(read);
                    isource.setEncoding("UTF-8");
                    parser.parse(isource, handler);
                    writer.write('\n');
                }
            }
            zis.close();
            writer.close();
            reader = new InputStreamReader(new FileInputStream(extractFile), "UTF-8");
        } catch (Exception e) {
            throw new ExtractionException("failed to extract text from microsoft 2007 document:" + e.getMessage(), e, logger);
        } catch (OutOfMemoryError ome) {
            throw new ExtractionException("failed to extract text from microsoft 2007 document:" + ome.getMessage(), ome, logger);
        } finally {
            // Closing twice is harmless; this covers the failure paths.
            try { if (zis != null) zis.close(); } catch (Exception e) { logger.warn("failed to close zip stream", e); }
            try { if (writer != null) writer.close(); } catch (Exception e) { logger.warn("failed to close temp file writer", e); }
            // Hand the temp file and reader to the index info so they are cleaned up later.
            if (extractFile != null) indexInfo.addDeleteFile(extractFile);
            if (reader != null) indexInfo.addReader(reader);
        }
        return reader;
    }
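
    /** Writes all character data to the supplied writer and ignores the markup. */
    protected class SaxHandler extends DefaultHandler {
        private final Writer writer;

        public SaxHandler(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                // Pad each chunk so adjacent text runs do not fuse together. Note that a
                // parser may deliver one text node in several chunks, so a word can
                // occasionally be split by this padding.
                writer.write(' ');
                writer.write(ch, start, length);
                writer.write(' ');
            } catch (IOException e) {
                // Abort the parse rather than silently producing truncated output.
                throw new SAXException("failed to write characters to temp file", e);
            }
        }
    }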

    /**
     * Exposes at most {@code length} bytes of the wrapped stream. close() is
     * deliberately not overridden, so closing this wrapper leaves the wrapped
     * stream open.
     */
    public static class FixedLengthInputStream extends InputStream {
        private final InputStream mIn;
        private final int mLength;
        private int mCount;

        public FixedLengthInputStream(InputStream in, int length) {
            this.mIn = in;
            this.mLength = length;
        }

        @Override
        public int available() throws IOException {
            return Math.max(0, mLength - mCount);
        }

        @Override
        public int read() throws IOException {
            if (mCount >= mLength) {
                return -1;
            }
            int b = mIn.read();
            // Count only bytes that were actually read, not the end-of-stream marker.
            if (b != -1) {
                mCount++;
            }
            return b;
        }

        @Override
        public int read(byte[] b, int offset, int length) throws IOException {
            if (mCount >= mLength) {
                return -1;
            }
            int d = mIn.read(b, offset, Math.min(mLength - mCount, length));
            if (d > 0) {
                mCount += d;
            }
            return d;
        }

        @Override
        public int read(byte[] b) throws IOException {
            return read(b, 0, b.length);
        }

        @Override
        public String toString() {
            return String.format("FixedLengthInputStream(in=%s, length=%d)", mIn.toString(), mLength);
        }
    }
}
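
The attached class depends on Archiva-specific types (TextExtractor, IndexInfo, ExtractionException), so it will not compile on its own. For anyone who wants to try the approach in isolation, here is a rough JDK-only sketch of the same idea: walk the zip entries, pick the parts that hold text, and stream each one through SAX while shielding the zip stream from the parser's close() call. The names OoxmlTextSketch, NonClosingStream, extractText and sampleArchive are made up for this illustration, the result is collected in memory rather than in a temp file, and a space is added at element ends instead of around every chunk. Treat it as a sketch under those assumptions, not as a drop-in replacement.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, JDK-only illustration; not the attached Archiva class.
class OoxmlTextSketch {

    /** Lets the parser "close" its input without closing the zip archive. */
    static final class NonClosingStream extends FilterInputStream {
        NonClosingStream(InputStream in) { super(in); }
        @Override public void close() { /* intentionally left open */ }
    }

    /** Streams each text-bearing part through SAX and collects the character data. */
    static String extractText(InputStream ooxml) throws Exception {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                // One separating space per element boundary, never inside a text node.
                if (out.length() > 0 && out.charAt(out.length() - 1) != ' ') {
                    out.append(' ');
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try (ZipInputStream zis = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.endsWith("sharedStrings.xml")
                        || name.endsWith("document.xml")
                        || name.startsWith("ppt/slides/slide")) {
                    // ZipInputStream reports end-of-stream at the end of each entry,
                    // so no explicit length limit is needed here.
                    parser.parse(new NonClosingStream(zis), handler);
                }
            }
        }
        return out.toString().trim();
    }

    /** Builds a tiny in-memory archive that mimics the layout of a .docx file. */
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("word/document.xml"));
            zos.write("<doc><p><t>Hello</t></p><p><t>world</t></p></doc>"
                    .getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("docProps/app.xml"));
            zos.write("<props>ignored</props>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }
}
```

Calling OoxmlTextSketch.extractText(new ByteArrayInputStream(OoxmlTextSketch.sampleArchive())) should return "Hello world", with the docProps part skipped. Archives written by ZipOutputStream typically leave the entry size out of the local header, so this sample also exercises the case where ZipEntry.getSize() is unknown while streaming.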