Re: [jira] Commented: (TIKA-35) Extract MsOffice properties

Jukka Zitting Mon, 01 Oct 2007 03:07:57 -0700

Hi,

On 9/28/07, Bertrand Delacretaz <[EMAIL PROTECTED]> wrote:
> On 9/28/07, kbennett <[EMAIL PROTECTED]> wrote:
> > ...It would be nice if there were some implementation of BufferedReader that
> > used disk instead of memory if the readaheadLimit exceeded a threshold.  If
> > not, we may need to write our own....
>
> Agreed, a BufferedReader with "unlimited" storage on disk sounds like
> the way to go.
>
> I don't know of any existing implementation, though.


I've implemented such classes a few times before, based on support
classes (like DeferredFileOutputStream) from commons-io. I can dig up
some of my old code and contribute it to commons-io and/or Tika.
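
As a rough illustration of the idea (not the code referred to above),
here is a minimal sketch that uses only the JDK rather than commons-io.
The class name SpoolingBuffer and its methods are hypothetical. It
copies a stream into memory, spills to a temporary file once a byte
threshold is exceeded, and can then hand out fresh streams for
repeated parsing passes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper: buffers in memory, spills to a temp file past a threshold. */
final class SpoolingBuffer implements AutoCloseable {
    private final int threshold;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private File file;                  // non-null once data has spilled to disk
    private OutputStream out = memory;  // current write target

    private SpoolingBuffer(int threshold) {
        this.threshold = threshold;
    }

    /** Copies the whole input; keeps at most 'threshold' bytes in memory. */
    static SpoolingBuffer copyOf(InputStream in, int threshold) throws IOException {
        SpoolingBuffer buffer = new SpoolingBuffer(threshold);
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        buffer.out.close();
        return buffer;
    }

    private void write(byte[] data, int off, int len) throws IOException {
        if (file == null && memory.size() + len > threshold) {
            // Threshold exceeded: move what we have to disk and continue there.
            file = File.createTempFile("spool", ".tmp");
            file.deleteOnExit();
            out = new FileOutputStream(file);
            memory.writeTo(out);
            memory = null;
        }
        out.write(data, off, len);
    }

    boolean isInMemory() {
        return file == null;
    }

    /** Returns a fresh stream over the buffered data; may be called repeatedly. */
    InputStream newStream() throws IOException {
        if (file == null) {
            return new ByteArrayInputStream(memory.toByteArray());
        }
        return new FileInputStream(file);
    }

    /** Removes the temporary file, if one was created. */
    @Override
    public void close() {
        if (file != null) {
            file.delete();
        }
    }
}
```

A caller would create the buffer once with SpoolingBuffer.copyOf(in,
threshold), call newStream() for each parsing pass, and close() the
buffer when done. A character-based variant could wrap newStream() in
an InputStreamReader.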

There's an interesting question about a potential optimization: If the
stream being processed is based on a File, a URI, or a byte array,
should we still create a temporary copy of the data while parsing or
can we rely on rereading the source of the data? A temporary copy
introduces quite a bit of overhead, but avoids nasty problems with
files/resources/arrays being overwritten between consecutive parsing
passes.

BR,

Jukka Zitting
