Mike,

I've isolated the problem. Here is my code:

import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;

/**
 * This class is separate to keep all Tika-related code in one place.
 * It may contain more parsing specifics later on
 */
public class DocumentParser {
    private static DocumentParser instance = new DocumentParser();
    private Tika tika;

    public static DocumentParser getInstance() {
        return instance;
    }

    private DocumentParser() {
        tika = new Tika();
        tika.setMaxStringLength(10 * 1024 * 1024);
    }
    public void parse(String fileName, Metadata metadata) {
        try {
            // the given input stream is closed by the parseToString method
            // (see Tika documentation)
            TikaInputStream tikaInputStream = TikaInputStream.get(new File(fileName));
            String text = tika.parseToString(tikaInputStream, metadata);

            metadata.set(DocumentMetadataKeys.DOCUMENT_TEXT, text);

            tikaInputStream.close();
        }
        catch (Exception e) {
            e.printStackTrace(System.out);
        }
    }
    public static void main(String[] argv) {
        Metadata metadata = new Metadata();
        String fileName = "7";
        getInstance().parse(fileName, metadata);
        System.out.println(metadata);
    }
}

and I am attaching an input file. It is an email from the public Enron email
corpus.
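
One thing I notice in the code above: close() sits inside the try block, so it
is skipped whenever parseToString throws. I do not know whether that explains
the open temp files, but closing in a finally block rules it out. Below is a
minimal JDK-only sketch of that pattern. TrackingStream and failingParse are
hypothetical stand-ins I made up for illustration (failingParse plays the role
of a parser call that fails part-way through); nothing here is Tika's API.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CloseInFinallySketch {
    /** Hypothetical wrapper that records whether close() was called. */
    static class TrackingStream extends FilterInputStream {
        boolean closed = false;

        TrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical stand-in for a parser call that fails part-way through. */
    static String failingParse(InputStream in) throws IOException {
        in.read();
        throw new IOException("simulated parse failure");
    }

    /** Returns true if the stream was closed even though parsing failed. */
    static boolean closedAfterFailure() {
        TrackingStream stream =
                new TrackingStream(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        try {
            failingParse(stream);
        } catch (IOException e) {
            // expected in this sketch; real code would log or rethrow
        } finally {
            try {
                stream.close(); // runs whether or not the parse succeeded
            } catch (IOException ignored) {
                // nothing useful to do if close itself fails
            }
        }
        return stream.closed;
    }

    public static void main(String[] args) {
        System.out.println("closed after failure: " + closedAfterFailure());
    }
}
```

In my DocumentParser the same idea would mean moving tikaInputStream.close()
out of the try body and into a finally block.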

Thank you,
Mark


On Tue, Aug 30, 2011 at 1:57 PM, Michael McCandless <
[email protected]> wrote:

> One thing I noticed is, in TemporaryFiles.dispose, we call
> file.delete, which returns false if the file could not be deleted.
>
> On Windows this will fail (return false) if we still have the file
> open somewhere, or if it had already been deleted.
>
> So I think we should add an assert that the return value is true (file
> was successfully deleted)?  This way when we run tests on Windows
> we'll see tests fail if a parser didn't close the opened temp files...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Aug 30, 2011 at 2:01 PM, Michael McCandless
> <[email protected]> wrote:
> > I think Tika.parseToString (static sugar method) closes the
> > InputStream for you, while the Parser.parse method does not?  Kinda
> > confusing!
> >
> > Mark, do you have specific docs that show this?  Then we can boil this
> > down to a test case...
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Tue, Aug 30, 2011 at 1:51 PM, Mark Kerzner <[email protected]> wrote:
> >> I tried TikaInputStream, and I also close it, but I still get the same
> >> behavior. You can see the deleted but open files in the attached screen
> >> image
> >> Mark
> >>
> >> On Tue, Aug 30, 2011 at 12:36 PM, Mark Kerzner <[email protected]>
> >> wrote:
> >>>
> >>> Nick,
> >>> the documentation specifically says that Tika closes this input stream. I
> >>> used to close it myself, but having read this documentation page, I took
> >>> this closing out.
> >>> I will try TikaInputStream, to see if this fixes the problem.
> >>> Mark
> >>>
> >>> On Tue, Aug 30, 2011 at 12:26 PM, Nick Burch <[email protected]>
> >>> wrote:
> >>>>
> >>>> On Tue, 30 Aug 2011, Mark Kerzner wrote:
> >>>>>
> >>>>> String text = tika.parseToString(new FileInputStream(new File(fileName)), metadata);
> >>>>
> >>>> Is that in your code or Tika?
> >>>>
> >>>> If you open a FileInputStream, then you yourself need to close it too
> >>>>
> >>>> Also, if you have a File, you're better off wrapping it in a
> >>>> TikaInputStream rather than a FileInputStream, as some parsers prefer
> >>>> a File and Tika can then use that
> >>>>
> >>>> Nick
> >>>
> >>
> >>
> >
>
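
On the file.delete point quoted above: a rough sketch of checking the return
value instead of ignoring it might look like the following. This is not the
TemporaryFiles.dispose code, which is not reproduced in this thread;
deleteOrWarn and deleteTwice are hypothetical names, and the sketch only
demonstrates the "already deleted" case, since that one behaves the same on
every platform.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

public class DeleteCheckSketch {
    /** Deletes the file and reports a failure instead of silently ignoring it. */
    static boolean deleteOrWarn(File file) {
        boolean deleted = file.delete();
        if (!deleted) {
            System.err.println("Could not delete " + file);
        }
        return deleted;
    }

    /** Creates a temp file and tries to delete it twice, returning both results. */
    static boolean[] deleteTwice() {
        try {
            File temp = File.createTempFile("sketch-", ".tmp");
            boolean first = deleteOrWarn(temp);  // file exists, so this should succeed
            boolean second = deleteOrWarn(temp); // file is already gone, so delete returns false
            return new boolean[] {first, second};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] results = deleteTwice();
        System.out.println("first delete: " + results[0]);
        System.out.println("second delete: " + results[1]);
    }
}
```

An assert on the return value, as suggested, would turn the warning into a
test failure when assertions are enabled.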

Attachment: 7
Description: Binary data
