Jukka and All,

I've been thinking about how our Parser interface takes an InputStream rather than a resource identifier (URL, File, String).
So that the original resource is read only once, we have the RereadableInputStream. However, because it duplicates the data in memory or on disk, it presents the following potential problems:

1) We are implementing the chunking of data using the SAX events. This allows us to break up a document into smaller parts. However, there is no such chunking in the RereadableInputStream; it reads and stores the entire document.

2) Users need to be much more aware of their system's resources the whole time Tika may be in use. This would require anticipating available disk storage, what other processes are running, etc.

3) In some environments, saving to disk is not practical due to performance or security concerns.

4) We introduce the risk of bringing down the JVM if the maximum memory is exceeded, and possibly worse if the disk runs out of free space.

5) The parser implementations themselves may store data and use large amounts of memory, so we may not have as much memory or disk available as we think.

* * *

For casual use, this will probably not be a problem. However, many users will need Tika to be robust and efficient even under high load.

So I raise the question: should we think about supporting multiple reads of a resource, at least as an option? Many users will work only with static resources such as files, and will not be concerned about the data changing between reads. This would require changing the Parser interface, probably to take a URL rather than an InputStream.

Maybe this is not necessary, though. Do we know to what extent parsers need to make multiple passes? And will they ever need the first pass to read more than just a small header? If not, then BufferedInputStream's mark and reset would work fine, and we would not need to store the read bytes ourselves, using RereadableInputStream or otherwise.
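To make the small-header case concrete, here is a minimal sketch using only java.io (the relevant BufferedInputStream methods are mark and reset). The class name HeaderPeek and the method peekHeader are hypothetical and are not part of Tika; the sketch assumes the first pass never needs more than a fixed number of bytes.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: peeks at a stream's header
// and rewinds, so a later full parse starts again at byte 0.
class HeaderPeek {

    /** Reads up to limit bytes, then resets the stream to where it was. */
    static byte[] peekHeader(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        // The mark stays valid as long as no more than limit bytes are read.
        in.mark(limit);
        try {
            byte[] buf = new byte[limit];
            int total = 0;
            while (total < limit) {
                int n = in.read(buf, total, limit - total);
                if (n < 0) {
                    break; // stream is shorter than limit
                }
                total += n;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset(); // rewind to the marked position
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "%PDF-1.4 rest of document".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(doc));

        byte[] header = peekHeader(in, 4);
        System.out.println(new String(header, StandardCharsets.US_ASCII));

        // The stream has been rewound, so this reads the first byte again.
        System.out.println((char) in.read());
    }
}
```

Only the marked bytes are buffered, so the memory cost is bounded by the limit rather than by the document size. If a parser needs to rewind after reading past the limit, this approach no longer applies.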
I have no knowledge of the parser implementations, so I thought RereadableInputStream would cover the worst case. However, I'm now seeing that it presents problems of its own.

- Keith

--
View this message in context: http://www.nabble.com/Parser-Interface%2C-RereadableInputStream-tf4616886.html#a13185507
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
