All - We have a use case where we need to support documents that may be too large to fit in memory. For other kinds of data sources, we handle this by splitting the input data into chunks, so that only one or two chunks need to be in memory at any given time.
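To make the idea concrete, here is a minimal sketch of what such chunked reading could look like. This is purely illustrative - `ChunkReader`, its constructor, and `nextChunk` are hypothetical names, not part of Tika or any existing API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: read a document in fixed-size chunks so that
// only one chunk needs to be held in memory at a time.
public class ChunkReader {
    private final InputStream in;
    private final int chunkSize;

    public ChunkReader(InputStream in, int chunkSize) {
        this.in = in;
        this.chunkSize = chunkSize;
    }

    /** Returns the next chunk of text, or null at end of stream. */
    public String nextChunk() throws IOException {
        byte[] buf = new byte[chunkSize];
        int n = in.read(buf);
        if (n < 0) {
            return null;  // end of stream, no more chunks
        }
        return new String(buf, 0, n, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        InputStream doc = new ByteArrayInputStream(
                "a document too large to hold at once".getBytes(StandardCharsets.UTF_8));
        ChunkReader reader = new ChunkReader(doc, 8);
        // Process one chunk at a time instead of the whole document.
        for (String chunk; (chunk = reader.nextChunk()) != null; ) {
            System.out.println("chunk: " + chunk);
        }
    }
}
```

A real implementation would of course need to respect character and structural boundaries rather than splitting on raw byte counts, but the consumer-side loop would look much the same.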
I believe we don't yet support that in Tika, right? The Parser abstract class looks like it gives you the entire document's text in one call. What are the plans, if any, to support chunking? I could get involved in that if you like.

I realize that the Parser abstract class would just adapt the chunking of the Parser implementations (e.g. POI) to our unified API, rather than doing the chunking itself. I suppose that in the Parser abstract class and its implementations we'd have to add support for:

* querying chunking capabilities (most fundamentally, can it chunk at all?)
* configuring the chunking mode (e.g. on/off, chunk size)
* reading chunks

Thanks,
Keith

--
View this message in context: http://www.nabble.com/Chunk-Support-in-Tika--tf4438977.html#a12665223
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
