Mike, I've isolated the problem. Here is my code,
import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
/**
 * This class is separate to keep all Tika-related code in one place.
 * It may contain more parsing specifics later on.
 */
public class DocumentParser {
    private static DocumentParser instance = new DocumentParser();
    private Tika tika;

    public static DocumentParser getInstance() {
        return instance;
    }

    private DocumentParser() {
        tika = new Tika();
        tika.setMaxStringLength(10 * 1024 * 1024);
    }

    public void parse(String fileName, Metadata metadata) {
        try {
            // the given input stream is closed by the parseToString method
            // (see Tika documentation)
            TikaInputStream tikaInputStream = TikaInputStream.get(new File(fileName));
            String text = tika.parseToString(tikaInputStream, metadata);
            metadata.set(DocumentMetadataKeys.DOCUMENT_TEXT, text);
            // closed explicitly as well, to rule out a leak on my side
            tikaInputStream.close();
        }
        catch (Exception e) {
            e.printStackTrace(System.out);
        }
    }

    public static void main(String[] argv) {
        Metadata metadata = new Metadata();
        String fileName = "7";
        getInstance().parse(fileName, metadata);
        System.out.println(metadata);
    }
}
and I am attaching an input file. It is an email from the public Enron email
corpus.
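
One way to settle who closes the stream (the question Mike raises in the quoted thread below) would be to wrap the stream in a small tracker and inspect it after the call. This is only a sketch under assumptions: CloseTrackingInputStream and consumeAndClose are hypothetical names, not part of Tika, and consumeAndClose merely stands in for the tika.parseToString call so that the example runs with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a Tika class): remembers whether close() was called.
class CloseTrackingInputStream extends FilterInputStream {
    private boolean closed = false;

    CloseTrackingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public void close() throws IOException {
        closed = true;
        super.close();
    }

    boolean isClosed() {
        return closed;
    }
}

class CloseCheckDemo {
    // Stand-in for the call under test. In the real check this would be
    // tika.parseToString(stream, metadata); here it just drains and closes.
    static void consumeAndClose(InputStream stream) {
        try {
            try {
                while (stream.read() != -1) {
                    // discard the bytes; only the close behavior matters here
                }
            } finally {
                stream.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CloseTrackingInputStream tracked = new CloseTrackingInputStream(
                new ByteArrayInputStream("test".getBytes(StandardCharsets.UTF_8)));
        consumeAndClose(tracked);
        System.out.println("closed by callee: " + tracked.isClosed());
    }
}
```

In the real check, the tracked stream would be passed to tika.parseToString in place of consumeAndClose, and isClosed() inspected afterwards. This only shows whether the outer stream was closed; it says nothing about temporary files a parser may open internally.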
Thank you,
Mark
On Tue, Aug 30, 2011 at 1:57 PM, Michael McCandless <[email protected]> wrote:
> One thing I noticed is, in TemporaryFiles.dispose, we call
> file.delete, which returns false if the file could not be deleted.
>
> On Windows this will fail (return false) if we still have the file
> open somewhere, or if it had already been deleted.
>
> So I think we should add an assert that the return value is true (file
> was successfully deleted)? This way when we run tests on Windows
> we'll see tests fail if a parser didn't close the opened temp files...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Aug 30, 2011 at 2:01 PM, Michael McCandless
> <[email protected]> wrote:
> > I think Tika.parseToString (static sugar method) closes the
> > InputStream for you, while the Parser.parse method does not? Kinda
> > confusing!
> >
> > Mark, do you have specific docs that show this? Then we can boil this
> > down to a test case...
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Tue, Aug 30, 2011 at 1:51 PM, Mark Kerzner <[email protected]> wrote:
> >> I tried TikaInputStream, and I also close it, but I still get the same
> >> behavior. You can see the deleted but open files in the attached screen
> >> image.
> >> Mark
> >>
> >> On Tue, Aug 30, 2011 at 12:36 PM, Mark Kerzner <[email protected]> wrote:
> >>>
> >>> Nick,
> >>> the documentation specifically says that tika closes this input stream. I
> >>> used to close it myself, but having read this documentation page, took this
> >>> closing out.
> >>> I will try TikaInputStream, to see if this fixes the problem.
> >>> Mark
> >>>
> >>> On Tue, Aug 30, 2011 at 12:26 PM, Nick Burch <[email protected]> wrote:
> >>>>
> >>>> On Tue, 30 Aug 2011, Mark Kerzner wrote:
> >>>>>
> >>>>> String text = tika.parseToString(new FileInputStream(new File(fileName)), metadata);
> >>>>
> >>>> Is that in your code or Tika?
> >>>>
> >>>> If you open a FileInputStream, then you yourself need to close it too
> >>>>
> >>>> Also, if you have a File, you're better off wrapping it in a
> >>>> TikaInputStream rather than a FileInputStream, as some parsers prefer
> >>>> a File and Tika can then use that
> >>>>
> >>>> Nick
> >>>
> >>
> >>
> >
>
7
Description: Binary data
