Just to add another wrinkle, how clean is your OCR? I've seen it
range from very nice (i.e. 99.9% of the words are actually words) to
horrible (60%+ of the "words" are nonsense). I once saw an attempt
to OCR a family tree: a stylized tree with the data
hand-written along the various branches in every orientation. Not a
recognizable word in the bunch <G>....
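
One rough way to put a number on "how clean" is to sample some pages and
count what fraction of the tokens appear in a word list. The sketch below
is only an illustration: word_ratio and the vocabulary set are
hypothetical names, the tokenizer is a crude ASCII-letters regex (not what
Solr's analyzers do), and you would have to supply your own dictionary.

```python
import re


def word_ratio(text, vocabulary):
    """Return the fraction of alphabetic tokens in `text` found in `vocabulary`.

    `vocabulary` is assumed to be a set of lowercase known words that the
    caller supplies; nothing here is tied to Solr or Lucene.
    """
    # Crude tokenizer: runs of ASCII letters, lowercased.
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


if __name__ == "__main__":
    # Tiny made-up sample: three real words and one piece of OCR noise.
    sample_vocabulary = {"the", "cat", "sat"}
    print(word_ratio("The cat xqzt sat", sample_vocabulary))
```

Running something like this over a random sample of documents should
tell you whether you are nearer the 99.9% end or the 60% end, which
matters because nonsense tokens inflate the number of unique terms in
the index.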

Best
Erick

On Jan 22, 2008 2:05 PM, Phillip Farber <[EMAIL PROTECTED]> wrote:

>
>
> Ryan McKinley wrote:
> >>
> >> We are considering Solr 1.2 to index and search a terabyte-scale
> >> dataset of OCR.  Initially our requirements are simple: basic
> >> tokenizing, score sorting only, no faceting.   The schema is simple
> >> too.  A document consists of a numeric id (stored and indexed) and a
> >> large text field (indexed, not stored) containing the OCR, typically
> >> ~1.4MB.  Some limited faceting or additional metadata fields may be
> >> added later.
> >
> > I have not done anything on this scale...  but with:
> > https://issues.apache.org/jira/browse/SOLR-303 it will be possible to
> > split a large index into many smaller indices and return the union of
> > all results.  This may or may not be necessary depending on what the
> > data actually looks like (if your text just uses 100 words, your index
> > may not be that big)
> >
> > How many documents are you talking about?
> >
>
> Currently 1M docs @ ~1.4MB/doc, scaling to 7M docs.  This is OCR, so we
> are talking perhaps 50K distinct words to index, so as you point out the
> index might not be too big.  It's the *data* that is big, not the
> *index*, right?  So I don't think SOLR-303 (distributed search) is
> required here.
>
> Obviously, as the number of documents increases, the index size must
> increase to some degree -- I think linearly?  But what index size will
> result for 7M documents over 50K words where we're talking just 2 fields
> per doc: 1 id field and one OCR field of ~1.4MB?  Ballpark?
>
> Regarding single word queries, do you think, say, 0.5 sec/query to
> return 7M score-ranked IDs is possible/reasonable in this scenario?
>
>
> >>
> >> Should we expect Solr indexing time to slow significantly as we scale
> >> up?  What kind of query performance could we expect?  Is it totally
> >> naive even to consider Solr at this kind of scale?
> >>
> >
> > You may want to check out the lucene benchmark stuff
> > http://lucene.apache.org/java/docs/benchmarks.html
> >
> >
> > http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/benchmark/byTask/package-summary.html
> >
> >
> >
> > ryan
> >
> >
>
