On Mon, 28 Jun 2010, Jana, Kumar Raja wrote:
> We use Apache Tika in our application before sending the content to Solr
> for Indexing. Some of our documents are pretty large (over 150 MB in
> size with "only text" content over 30 MB).
What file formats are these in?
There are some file formats (e.g. plain text, CSV) where you could fairly
easily get just the first part of the content. However, others (e.g. Word,
PowerPoint) aren't stored linearly, so you have to process the whole file
before you can work out where the text starts...
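
For the easy case (plain text, CSV), a rough sketch of "get just the first
bit" might look like the following. It is plain Python using only the
standard library, not Tika, and the names read_prefix and read_file_prefix
and the 8192-character chunk size are made up for illustration. It assumes
the file really is linear text in a known encoding, so it would not help
with Word or PowerPoint files.

```python
import io


def read_prefix(stream, max_chars):
    """Return at most max_chars characters from the start of a text stream."""
    chunks = []
    remaining = max_chars
    while remaining > 0:
        # Read in bounded chunks so a huge file is never loaded in full.
        chunk = stream.read(min(remaining, 8192))
        if not chunk:
            break  # reached end of stream before hitting the limit
        chunks.append(chunk)
        remaining -= len(chunk)
    return "".join(chunks)


def read_file_prefix(path, max_chars, encoding="utf-8"):
    """Open a text file and return only its first max_chars characters."""
    # errors="replace" is an assumption: undecodable bytes become U+FFFD
    # instead of aborting the read.
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return read_prefix(f, max_chars)


if __name__ == "__main__":
    sample = io.StringIO("id,name\n1,alpha\n2,beta\n")
    print(read_prefix(sample, 7))
```

Only the requested prefix is ever held in memory, which is the point for
files in the 150 MB range.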
Nick