Right, a more nuanced comment involves what _type_ of docs you're
storing, and what the ratio of searchable-to-overall size is. Consider
an image. The searchable data may be 0.01% of the file size. Or even
worse, a movie.

As always, "it depends". I guess that personally I'm not a fan of
using Solr as a file store when you have to be prepared to re-index
from scratch sometime _anyway_ (IMO), in which case you often might as
well serve the data from the system-of-record since it's there anyway.
IOW, I need to be convinced the use-case really merits it. And the
particular use-case may very well mean it's a fine solution....

So if the use-case merits it, storing files in Solr is fine. I just
wonder about it when it comes to docs with lots of non-searchable bytes
and relatively few searchable bytes.
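
To make the ratio concrete, here is a rough sketch in Python. The
function names, the 5% cutoff, and the byte counts are all made up for
illustration; none of this is a Solr API, and the right cutoff (if there
is one) depends on the use-case.

```python
# Hypothetical cutoff: below this fraction of searchable bytes, prefer
# serving the file from the system-of-record instead of storing it in Solr.
MIN_RATIO_TO_STORE = 0.05


def searchable_ratio(searchable_bytes: int, total_bytes: int) -> float:
    """Return the fraction of a document's bytes that is searchable."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return searchable_bytes / total_bytes


def should_store_in_index(searchable_bytes: int, total_bytes: int,
                          min_ratio: float = MIN_RATIO_TO_STORE) -> bool:
    """Rule of thumb only: store the file when enough of it is searchable."""
    return searchable_ratio(searchable_bytes, total_bytes) >= min_ratio


# Assumed example: a 5 MB image carrying 500 bytes of searchable metadata,
# i.e. a ratio of 0.0001 (0.01% of the file).
image_ratio = searchable_ratio(500, 5_000_000)
store_image = should_store_in_index(500, 5_000_000)
```

With those assumed numbers the image falls far below the cutoff, so Solr
would hold only the metadata plus an ID pointing at the system-of-record.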

Best,
Erick

On Fri, Nov 14, 2014 at 2:02 PM, Michael Sokolov
<msoko...@safaribooksonline.com> wrote:
>
> On 11/14/2014 01:43 PM, Erick Erickson wrote:
>>
>> Just skimming, so maybe I misinterpreted.
>>
>> ExternalFileField and ExternalFileFieldReloader
>> refer to storing values for each doc in an external file, they have
>> nothing to do with storing _files_.
>>
>> The usual pattern is to have Solr store just enough data to have the
>> system-of-record return the actual file rather than have Solr
>> actually store the file. Solr isn't really built for this and while some
>> people do this it usually is a poor design if for no other reason than
>> as segments merge, the data gets copied again and again and again
>> to no good purpose.
>
> I was worried about this, and spent a bunch of time working on a custom
> codec that would store files externally (to avoid the merge penalty), while
> still living inside the Solr/Lucene ecosystem. It was a lot of complicated
> work, and after a while I thought I'd better do some careful performance
> measurements to make sure it was worthwhile.  What I found was that the
> merge cost was not very high relative to other indexing costs we were paying
> (indexing large full text documents with fairly complex analysis, but
> nothing unusual). So I don't think this particular performance argument
> against storage in Solr/Lucene is telling, at least for many ratios of
> stored doc size to indexed tokens size. It's also worth mentioning that my
> test involved reindexing every document once (basically a query-level
> replication of an existing index), so perhaps the amount of merging was less
> than it might be in other cases.
>
> I can see that there might be other reasons to store documents elsewhere,
> but in my experience, with our use case, it actually works pretty well to
> store them in Lucene indexes.  Consider, for example, that if you are
> highlighting, you are probably already storing the full text of each
> document anyway. In our case we also need to store a marked-up version of
> the full text (so we can highlight an html view of a document as well as
> deliver plain text snippets), so the incremental cost of storing pdfs was
> not crushing.  Of course these could all be stored externally, too. Maybe
> we'll try that and get massive performance increases :)
>
> -Mike
