: Now, what happens is a user will upload say a word document to us. We then
: parse it and process it into segments. It very well could be 5000 segments
: or even more in that word document. Each one of those ~5000 segments needs
: to be searched for similar segments in solr. I’m not quite sure how I will
: do the query (whether proximate or something else). The point though, is to
: get back similar results for each segment.

You've described your black box (an index of small textual documents) 
and you've described your input (a large document that will be broken down 
into N=~5000 small textual snippets) but you haven't really clarified what 
your desired output should be...

* N textual documents from your index, where each doc is the 1 'best' 
match to 1 of the N textual input snippets.

* Some fixed number Y of textual documents from your index representing the 
"best of the best" matches against your textual input snippets (i.e., if one 
input snippet is a "really good" match for multiple indexed docs, return 
all of those "really good" matches, but don't return any matches for 
other snippets if their only matches are "poor").

* Some variable number Y of textual documents from your index representing 
the "best of the best" matches against your textual input snippets, based 
on some minimum threshold of matching criteria.

* etc...
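
To make the difference between those three output shapes concrete, here is a 
minimal sketch in Python. Everything in it is hypothetical: the snippet ids, 
the doc ids, the scores, and the function names are made up for illustration, 
and the scores are just numbers where higher means a better match. It says 
nothing about how the matches would be computed, only about which of them 
you would want back.

```python
from typing import Dict, List, Tuple

# Hypothetical input: snippet id -> list of (indexed doc id, match score).
Matches = Dict[str, List[Tuple[str, float]]]
Hit = Tuple[str, str, float]  # (snippet id, doc id, score)


def flatten(matches: Matches) -> List[Hit]:
    """Turn the per-snippet lists into one flat list of hits."""
    return [(s, d, score) for s, hits in matches.items() for d, score in hits]


def best_per_snippet(matches: Matches) -> Dict[str, Tuple[str, float]]:
    """Option 1: the single best indexed doc for each input snippet."""
    return {s: max(hits, key=lambda h: h[1]) for s, hits in matches.items() if hits}


def top_y_overall(matches: Matches, y: int) -> List[Hit]:
    """Option 2: a fixed number Y of the best hits across all snippets."""
    return sorted(flatten(matches), key=lambda h: h[2], reverse=True)[:y]


def above_threshold(matches: Matches, min_score: float) -> List[Hit]:
    """Option 3: every hit that meets a minimum score, however many that is."""
    kept = [h for h in flatten(matches) if h[2] >= min_score]
    return sorted(kept, key=lambda h: h[2], reverse=True)


# Made-up example data: s1 matches two docs well, s2 only matches poorly.
example: Matches = {
    "s1": [("docA", 0.95), ("docB", 0.90)],
    "s2": [("docC", 0.40)],
    "s3": [("docD", 0.75), ("docE", 0.20)],
}
```

With that data, option 1 returns one doc per snippet (including the poor 
docC match for s2), option 2 with Y=2 returns only the two s1 matches, and 
option 3 with a threshold of 0.7 returns docA, docB and docD.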

Forget for a moment that we are talking about Solr at all -- describe 
some hypothetical data, some hypothetical query examples, and some 
hypothetical results you would like to get back (or not get back) 
from each of those query examples (ideally in pseudo-code) and let's see if 
that doesn't help suggest an implementation strategy.


-Hoss
