Okay, I got more info:

This happens when the eml files have attachments. As we know, Tika extracts
text from the attachments (which is great, this is what I need), but it
seems that it does not close the temporary attachment files, although it
does delete them.
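
For what it's worth, here is a minimal JDK-only sketch of the delete check
Mike suggests in the quoted message below. It does not use Tika at all; the
class and method names are made up for illustration, and it only assumes
the temp files are plain java.io.File objects.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

// Hypothetical illustration, not Tika's TemporaryFiles implementation.
class TempFileDisposeSketch {

    /** Deletes the file and reports a failed delete instead of ignoring it. */
    static void disposeStrict(File file) {
        // File.delete() returns false if the file could not be deleted,
        // for example because it is already gone or (on Windows) still open.
        if (!file.delete()) {
            throw new IllegalStateException("Could not delete temp file: " + file);
        }
    }

    /** Writes a temp file, reads it inside try-with-resources, then deletes it. */
    static boolean closeThenDelete() throws IOException {
        File temp = File.createTempFile("attachment-", ".tmp");
        Files.write(temp.toPath(), new byte[] {1, 2, 3});
        try (InputStream in = new FileInputStream(temp)) {
            while (in.read() != -1) {
                // drain the stream; it is closed when this block exits
            }
        }
        disposeStrict(temp);
        return !temp.exists();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("deleted after close: " + closeThenDelete());
    }
}
```

On Linux the delete succeeds even while a handle is still open, which would
explain why I see deleted-but-open files instead of an error.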

Mark

On Tue, Aug 30, 2011 at 2:04 PM, Mark Kerzner <[email protected]> wrote:

> Mike,
>
> I've isolated the problem. Here is my code,
>
> import java.io.File;
> import java.io.IOException;
> import org.apache.tika.Tika;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
>
> /**
>  * This class is separate to have all Tika-related stuff in one place
>  * It may contain more parsing specifics later on
>  */
> public class DocumentParser {
>     private static DocumentParser instance = new DocumentParser();
>     private Tika tika;
>
>     public static DocumentParser getInstance() {
>         return instance;
>     }
>
>     private DocumentParser() {
>         tika = new Tika();
>         tika.setMaxStringLength(10 * 1024 * 1024);
>     }
>     public void parse(String fileName, Metadata metadata) {
>         try {
>             // the given input stream is closed by the parseToString method
>             // (see Tika documentation)
>             TikaInputStream tikaInputStream = TikaInputStream.get(new File(fileName));
>             String text = tika.parseToString(tikaInputStream, metadata);
>
>             metadata.set(DocumentMetadataKeys.DOCUMENT_TEXT, text);
>
>             tikaInputStream.close();
>         }
>         catch (Exception e) {
>             e.printStackTrace(System.out);
>         }
>     }
>     public static void main(String argv[]) {
>         Metadata metadata = new Metadata();
>         String fileName = "7";
>         getInstance().parse(fileName, metadata);
>         System.out.println(metadata);
>     }
> }
>
> and I am attaching an input file. It is an email from the public Enron
> email corpus.
>
> Thank you,
> Mark
>
>
> On Tue, Aug 30, 2011 at 1:57 PM, Michael McCandless <[email protected]> wrote:
>
>> One thing I noticed is, in TemporaryFiles.dispose, we call
>> file.delete, which returns false if the file could not be deleted.
>>
>> On Windows this will fail (return false) if we still have the file
>> open somewhere, or if it had already been deleted.
>>
>> So I think we should add an assert that the return value is true (file
>> was successfully deleted)?  This way when we run tests on Windows
>> we'll see tests fail if a parser didn't close the opened temp files...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Tue, Aug 30, 2011 at 2:01 PM, Michael McCandless
>> <[email protected]> wrote:
>> > I think Tika.parseToString (static sugar method) closes the
>> > InputStream for you, while the Parser.parse method does not?  Kinda
>> > confusing!
>> >
>> > Mark, do you have specific docs that show this?  Then we can boil this
>> > down to a test case...
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> > On Tue, Aug 30, 2011 at 1:51 PM, Mark Kerzner <[email protected]> wrote:
>> >> I tried TikaInputStream, and I also close it, but I still get the same
>> >> behavior. You can see the deleted but still open files in the attached
>> >> screen image.
>> >> Mark
>> >>
>> >> On Tue, Aug 30, 2011 at 12:36 PM, Mark Kerzner <[email protected]>
>> >> wrote:
>> >>>
>> >>> Nick,
>> >>> the documentation specifically says that Tika closes this input
>> >>> stream. I used to close it myself, but having read this documentation
>> >>> page, I took the close call out.
>> >>> I will try TikaInputStream, to see if this fixes the problem.
>> >>> Mark
>> >>>
>> >>> On Tue, Aug 30, 2011 at 12:26 PM, Nick Burch <[email protected]>
>> >>> wrote:
>> >>>>
>> >>>> On Tue, 30 Aug 2011, Mark Kerzner wrote:
>> >>>>>
>> >>>>> String text = tika.parseToString(new FileInputStream(new File(fileName)), metadata);
>> >>>>
>> >>>> Is that in your code or Tika?
>> >>>>
>> >>>> If you open a FileInputStream, then you yourself need to close it too.
>> >>>>
>> >>>> Also, if you have a File, you're better off wrapping it in a
>> >>>> TikaInputStream rather than a FileInputStream, as some parsers prefer
>> >>>> a File and Tika can then use that.
>> >>>>
>> >>>> Nick
>> >>>
>> >>
>> >>
>> >
>>
>
>
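
On the question in the thread of who should close the stream: a caller can
always close it as well, because the java.io.Closeable contract says that
closing an already closed stream has no effect. The sketch below uses only
the JDK; readAllAndClose is a hypothetical stand-in for a library call that
closes its argument, not Tika's actual parseToString. It does not address
the temp files created for attachments.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of caller-side closing; no Tika classes involved.
class CloseOwnershipSketch {

    /** Counts close() calls so we can see how often the stream was closed. */
    static class CountingStream extends FilterInputStream {
        int closeCalls = 0;

        CountingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closeCalls++;
            super.close();
        }
    }

    /** Stand-in for a library method that reads and then closes its input. */
    static String readAllAndClose(InputStream in) throws IOException {
        try {
            StringBuilder sb = new StringBuilder();
            int b;
            while ((b = in.read()) != -1) {
                sb.append((char) b);
            }
            return sb.toString();
        } finally {
            in.close();
        }
    }

    /** The callee closes once, then try-with-resources closes again. */
    static int closeCallsAfterParse() throws IOException {
        CountingStream stream = new CountingStream(
                new ByteArrayInputStream("hello".getBytes(StandardCharsets.US_ASCII)));
        try (CountingStream s = stream) {
            readAllAndClose(s);
        }
        return stream.closeCalls;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("close() calls: " + closeCallsAfterParse());
    }
}
```

The second close is harmless here, so wrapping the call in
try-with-resources is a safe default whether or not the callee closes.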
