All - We have a use case where we need to support documents that may be too large to fit in memory. For other kinds of data sources, we handle this by splitting the input data into chunks, so that only one or two chunks need to be in memory at any given time.
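To make the idea concrete, here is a minimal sketch of what such chunked reading could look like. This is purely illustrative - `ChunkReader`, its constructor, and `nextChunk` are hypothetical names, not part of Tika or any existing API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: read a document in fixed-size chunks so that
// only one chunk needs to be held in memory at a time.
public class ChunkReader {
    private final InputStream in;
    private final int chunkSize;

    public ChunkReader(InputStream in, int chunkSize) {
        this.in = in;
        this.chunkSize = chunkSize;
    }

    /** Returns the next chunk of text, or null at end of stream. */
    public String nextChunk() throws IOException {
        byte[] buf = new byte[chunkSize];
        int n = in.read(buf);
        if (n < 0) {
            return null;  // end of stream, no more chunks
        }
        return new String(buf, 0, n, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        InputStream doc = new ByteArrayInputStream(
                "a document too large to hold at once".getBytes(StandardCharsets.UTF_8));
        ChunkReader reader = new ChunkReader(doc, 8);
        // Process one chunk at a time instead of the whole document.
        for (String chunk; (chunk = reader.nextChunk()) != null; ) {
            System.out.println("chunk: " + chunk);
        }
    }
}
```

A real implementation would of course need to respect character and structural boundaries rather than splitting on raw byte counts, but the consumer-side loop would look much the same.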
I believe we don't yet support that in Tika, right? The Parser abstract class looks like it gives you the entire document's text in one call. What are the plans, if any, to support chunking? I could get involved in that if you like.

I realize that the Parser abstract class would just adapt the chunking of the Parser implementations (e.g. POI) to our unified API, rather than doing the chunking itself. I suppose that in the Parser abstract class and its implementations we'd have to add support for:

* querying chunking capabilities (most fundamentally, can it chunk at all?)
* configuring the chunking mode (e.g. on/off, chunk size)
* reading chunks

Thanks,
Keith

--
View this message in context: http://www.nabble.com/Chunk-Support-in-Tika--tf4438977.html#a12665223
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
