Just to add another wrinkle, how clean is your OCR? I've seen it range from very nice (i.e. 99.9% of the words are actually words) to horrible (60%+ of the "words" are nonsense). I once saw an attempt to OCR a family tree: a stylized tree with the data hand-written along the various branches in every orientation. Not a recognizable word in the bunch <G>....
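One rough way to put a number on "how clean" is to check what fraction of the OCR tokens appear in a word list. The following is only a minimal sketch: the function name `ocr_word_ratio`, the tiny `KNOWN_WORDS` set, and the sample strings are all made up for illustration, and a real check would load a full dictionary for the language of the corpus.

```python
import re

def ocr_word_ratio(text, known_words):
    """Return the fraction of tokens in `text` that appear in `known_words`."""
    # \w+ keeps letter/digit runs together, so a misread like "fami1y"
    # counts as one (unknown) token rather than two fragments.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in known_words)
    return hits / len(tokens)

# Hypothetical stand-in vocabulary; replace with a real word list.
KNOWN_WORDS = {"the", "family", "tree", "was", "drawn", "by", "hand"}

clean = "The family tree was drawn by hand"
noisy = "Tbe fami1y trcc was drawn bv hand"  # simulated OCR misreads

clean_ratio = ocr_word_ratio(clean, KNOWN_WORDS)
noisy_ratio = ocr_word_ratio(noisy, KNOWN_WORDS)
print(f"clean: {clean_ratio:.2f}, noisy: {noisy_ratio:.2f}")
```

Running something like this over a sample of documents would show which end of that range the data is on. It matters for sizing because every nonsense "word" is still a distinct term that has to be indexed.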

Best
Erick

On Jan 22, 2008 2:05 PM, Phillip Farber <[EMAIL PROTECTED]> wrote:

> Ryan McKinley wrote:
> >>
> >> We are considering Solr 1.2 to index and search a terabyte-scale
> >> dataset of OCR. Initially our requirements are simple: basic
> >> tokenizing, score sorting only, no faceting. The schema is simple
> >> too. A document consists of a numeric id, stored and indexed and a
> >> large text field, indexed not stored, containing the OCR typically
> >> ~1.4Mb. Some limited faceting or additional metadata fields may be
> >> added later.
> >
> > I have not done anything on this scale... but with:
> > https://issues.apache.org/jira/browse/SOLR-303 it will be possible to
> > split a large index into many smaller indices and return the union of
> > all results. This may or may not be necessary depending on what the
> > data actually looks like (if you text just uses 100 words, your index
> > may not be that big)
> >
> > How many documents are you talking about?
> >
>
> Currently 1M docs @ ~1.4M/doc. Scaling to 7M docs. This is OCR so we
> are talking perhaps 50K words total to index so as you point out the
> index might not be too big. It's the *data* that is big not the
> *index*, right?. So I don't think SOLR-303 (distributed search) is
> required here.
>
> Obviously as the number of documents increase the index size must
> increase to some degree -- I think linearly? But what index size will
> result for 7M documents over 50K words where we're talking just 2 fields
> per doc: 1 id field and one OCR field of ~1.4M? Ballpark?
>
> Regarding single word queries, do you think, say, 0.5 sec/query to
> return 7M score-ranked IDs is possible/reasonable in this scenario?
>
> >>
> >> Should we expect Solr indexing time to slow significantly as we scale
> >> up? What kind of query performance could we expect? Is it totally
> >> naive even to consider Solr at this kind of scale?
> >>
> >
> > You may want to check out the lucene benchmark stuff
> > http://lucene.apache.org/java/docs/benchmarks.html
> >
> > http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/benchmark/byTask/package-summary.html
> >
> > ryan
> >