Right, a more nuanced comment involves what _type_ of docs you're storing, and what the ratio of searchable-to-overall size is. Consider an image. The searchable data may be 0.01% of the file size. Or even worse, a movie.
As always, "it depends". I guess that personally I'm not a fan of
using Solr as a file store when you have to be prepared to re-index
from scratch sometime _anyway_ (IMO), in which case you often might
as well serve the data from the system-of-record since it's there
anyway. IOW, I need to be convinced the use-case really merits it.
And the particular use-case may very well mean it's a fine solution....

So if the use-case merits it, storing files in Solr is fine. I just
wonder about docs with lots of non-searchable bytes and relatively
few searchable bytes.

Best,
Erick

On Fri, Nov 14, 2014 at 2:02 PM, Michael Sokolov
<msoko...@safaribooksonline.com> wrote:
>
> On 11/14/2014 01:43 PM, Erick Erickson wrote:
>>
>> Just skimming, so maybe I misinterpreted.
>>
>> ExternalFileField and ExternalFileFieldReloader
>> refer to storing values for each doc in an external file, they have
>> nothing to do with storing _files_.
>>
>> The usual pattern is to have Solr store just enough data to have the
>> system-of-record return the actual file rather than have Solr
>> actually store the file. Solr isn't really built for this and while some
>> people do this it usually is a poor design if for no other reason than
>> as segments merge, the data gets copied again and again and again
>> to no good purpose.
>
> I was worried about this, and spent a bunch of time working on a custom
> codec that would store files externally (to avoid the merge penalty), while
> still living inside the Solr/Lucene ecosystem. It was a lot of complicated
> work, and after a while I thought I'd better do some careful performance
> measurements to make sure it was worthwhile. What I found was that the
> merge cost was not very high relative to other indexing costs we were paying
> (indexing large full text documents with fairly complex analysis, but
> nothing unusual). So I don't think this particular performance argument
> against storage in Solr/Lucene is telling, at least for many ratios of
> stored doc size to indexed tokens size. It's also worth mentioning that my
> test involved reindexing every document once (basically a query-level
> replication of an existing index), so perhaps the amount of merging was less
> than it might be in other cases.
>
> I can see that there might be other reasons to store documents elsewhere,
> but in my experience, with our use case, it actually works pretty well to
> store them in Lucene indexes. Consider, for example, that if you are
> highlighting, you are probably already storing the full text of each
> document anyway. In our case we also need to store a marked-up version of
> the full text (so we can highlight an html view of a document as well as
> deliver plain text snippets), so the incremental cost of storing pdfs was
> not crushing. Of course these could all be stored externally, too. Maybe
> we'll try that and get massive performance increases :)
>
> -Mike
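
For what it's worth, the "store just enough data" pattern quoted above might look roughly like the sketch below. It is plain Python with no Solr client involved. The field names (`id`, `text`, `record_key`), the helper names, and the dict standing in for the system-of-record are all assumptions made up for illustration, not anything from this thread or from Solr itself.

```python
def build_index_doc(doc_id, extracted_text, record_key):
    """Build what would be sent to the index: searchable text plus a pointer.

    The raw file bytes are deliberately left out, so there is nothing
    large to copy around when segments merge.
    """
    return {
        "id": doc_id,
        "text": extracted_text,    # the (small) searchable part
        "record_key": record_key,  # hypothetical key into the system-of-record
    }


def fetch_original(hit, system_of_record):
    """Resolve a search hit back to the original file bytes."""
    return system_of_record[hit["record_key"]]


def searchable_ratio(extracted_text, original_bytes):
    """Fraction of the original file size that is actually searchable text."""
    return len(extracted_text.encode("utf-8")) / len(original_bytes)


if __name__ == "__main__":
    # A dict stands in for the real system-of-record (filesystem, blob store, ...).
    system_of_record = {"reports/q3.pdf": b"\x00" * 1_000_000}

    hit = build_index_doc("doc-1", "quarterly report", "reports/q3.pdf")
    original = fetch_original(hit, system_of_record)
    print(len(original), searchable_ratio(hit["text"], original))
```

The ratio helper is just a quick way to check which side of the "it depends" a given corpus falls on: the smaller the ratio, the less there is to gain from keeping the bytes in the index.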