Okay, I have more information: this happens when the eml file has attachments. As we know, Tika extracts text from the attachments (which is great, this is what I need), but it seems that it does not close those attachments, although it does delete them.
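To show what "deleted but still open" means outside of Tika, here is a minimal sketch that uses only the JDK. It is not Tika code: the class DeletedButOpenDemo and its methods are made up for illustration. It opens a temp file, calls delete() while the stream is still open, and reports the return value, which I expect to differ between Linux and Windows.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical demo class, not part of Tika.
class DeletedButOpenDemo {

    /** Creates a small temp file that stands in for an extracted attachment. */
    public static File createTempFile() throws IOException {
        File file = File.createTempFile("attachment-", ".tmp");
        try (OutputStream out = new FileOutputStream(file)) {
            out.write(new byte[] {1, 2, 3});
        }
        return file;
    }

    /**
     * Tries to delete the file while an input stream on it is still open.
     * Returns whatever File.delete() reported. The stream is closed by
     * try-with-resources before this method returns.
     */
    public static boolean deleteWhileOpen(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            boolean deleted = file.delete();
            // The open handle is still usable here, whether or not delete() succeeded.
            in.read();
            return deleted;
        }
    }

    /** Deletes the file if it still exists; returns false if cleanup failed. */
    public static boolean cleanUp(File file) {
        return !file.exists() || file.delete();
    }

    public static void main(String[] args) throws IOException {
        File file = createTempFile();
        System.out.println("delete() while open returned: " + deleteWhileOpen(file));
        System.out.println("cleanup after close succeeded: " + cleanUp(file));
    }
}
```

If delete() returns true while the stream is open, the file name is gone but the handle stays alive until close(), which matches the deleted-but-open entries I am seeing. Checking the return value of delete(), as cleanUp does, is the only way this sketch notices a failed delete.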
Mark

On Tue, Aug 30, 2011 at 2:04 PM, Mark Kerzner <[email protected]> wrote:

> Mike,
>
> I've isolated the problem. Here is my code,
>
> import java.io.File;
> import java.io.IOException;
> import org.apache.tika.Tika;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
>
> /**
>  * This class is separate to have all Tika-related stuff in one place.
>  * It may contain more parsing specifics later on.
>  */
> public class DocumentParser {
>     private static DocumentParser instance = new DocumentParser();
>     private Tika tika;
>
>     public static DocumentParser getInstance() {
>         return instance;
>     }
>
>     private DocumentParser() {
>         tika = new Tika();
>         tika.setMaxStringLength(10 * 1024 * 1024);
>     }
>
>     public void parse(String fileName, Metadata metadata) {
>         try {
>             // the given input stream is closed by the parseToString method (see Tika documentation)
>             TikaInputStream tikaInputStream = TikaInputStream.get(new File(fileName));
>             String text = tika.parseToString(tikaInputStream, metadata);
>
>             metadata.set(DocumentMetadataKeys.DOCUMENT_TEXT, text);
>
>             tikaInputStream.close();
>         }
>         catch (Exception e) {
>             e.printStackTrace(System.out);
>         }
>     }
>
>     public static void main(String argv[]) {
>         Metadata metadata = new Metadata();
>         String fileName = "7";
>         getInstance().parse(fileName, metadata);
>         System.out.println(metadata);
>     }
> }
>
> and I am attaching an input file. It is an email out of the public Enron email
> corpus.
>
> Thank you,
> Mark
>
>
> On Tue, Aug 30, 2011 at 1:57 PM, Michael McCandless <
> [email protected]> wrote:
>
>> One thing I noticed is, in TemporaryFiles.dispose, we call
>> file.delete, which returns false if the file could not be deleted.
>>
>> On Windows this will fail (return false) if we still have the file
>> open somewhere, or if it had already been deleted.
>>
>> So I think we should add an assert that the return value is true (file
>> was successfully deleted)? This way when we run tests on Windows
>> we'll see tests fail if a parser didn't close the opened temp files...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Tue, Aug 30, 2011 at 2:01 PM, Michael McCandless
>> <[email protected]> wrote:
>> > I think Tika.parseToString (static sugar method) closes the
>> > InputStream for you, while the Parser.parse method does not? Kinda
>> > confusing!
>> >
>> > Mark, do you have specific docs that show this? Then we can boil this
>> > down to a test case...
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> > On Tue, Aug 30, 2011 at 1:51 PM, Mark Kerzner <[email protected]> wrote:
>> >> I tried TikaInputStream, and I also close it, but I still get the same
>> >> behavior. You can see the deleted but open files in the attached screen
>> >> image.
>> >> Mark
>> >>
>> >> On Tue, Aug 30, 2011 at 12:36 PM, Mark Kerzner <[email protected]>
>> >> wrote:
>> >>>
>> >>> Nick,
>> >>> the documentation specifically says that Tika closes this input stream. I
>> >>> used to close it myself, but having read this documentation page, took this
>> >>> closing out.
>> >>> I will try TikaInputStream, to see if this fixes the problem.
>> >>> Mark
>> >>>
>> >>> On Tue, Aug 30, 2011 at 12:26 PM, Nick Burch <[email protected]>
>> >>> wrote:
>> >>>>
>> >>>> On Tue, 30 Aug 2011, Mark Kerzner wrote:
>> >>>>>
>> >>>>> String text = tika.parseToString(new FileInputStream(new
>> >>>>> File(fileName)),
>> >>>>> metadata);
>> >>>>
>> >>>> Is that in your code or Tika?
>> >>>>
>> >>>> If you open a FileInputStream, then you yourself need to close it too
>> >>>>
>> >>>> Also, if you have a File, you're better off wrapping it in a
>> >>>> TikaInputStream rather than a FileInputStream, as some parsers prefer a File
>> >>>> and Tika can then use that
>> >>>>
>> >>>> Nick
