On Mon, 28 Jun 2010, Jana, Kumar Raja wrote:
> We use Apache Tika in our application before sending the content to Solr
> for Indexing. Some of our documents are pretty large (over 150 MB in
> size with "only text" content over 30 MB).

What file formats are these in?

For some file formats (e.g. text, CSV) you could fairly easily get just the first bit. Others (e.g. Word, PowerPoint) aren't stored linearly, though, and you have to process the whole file before you can figure out where the start is...
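
For the easy case, something like the following might do as a rough sketch. It is plain Python rather than anything Tika-specific, the names read_prefix and read_file_prefix are made up for illustration, and it assumes the input is already plain text (or CSV) in a known encoding. It won't help with Word or PowerPoint files, for the reason given above.

```python
import io


def read_prefix(stream, max_chars=10_000):
    """Return at most max_chars characters from the start of a text stream.

    Only the requested prefix is read; the rest of the stream is left
    untouched, so a very large file is never loaded in full.
    """
    return stream.read(max_chars)


def read_file_prefix(path, max_chars=10_000, encoding="utf-8"):
    """Hypothetical helper: open a text file and return its first characters.

    The encoding is an assumption; undecodable bytes are replaced rather
    than raising an error.
    """
    with open(path, encoding=encoding, errors="replace") as handle:
        return read_prefix(handle, max_chars)


if __name__ == "__main__":
    # In-memory stand-in for a large CSV file.
    sample = io.StringIO("id,name\n1,alpha\n2,beta\n")
    print(read_prefix(sample, max_chars=7))
```

The truncated text could then be sent on for indexing in place of the full content, at the cost of anything past the cut-off not being searchable.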

Nick
