At 5:06 AM -0500 1/12/07, Erik Hatcher wrote:
What the user-interface needs is a way to ask Solr for terms that begin with a
specified prefix, as the user types. Paging via start/rows is necessary, and
also sorting by frequency given some specified constraints. I like the
start/end term idea also, though I can't think of a scenario in my application
where this would be different than having a prefix parameter. If I want all
the 1860's, prefix=186&field=year, for example.
I also have exactly this requirement: Paging through the terms (and getting the
document count for each term) optionally limited to those matching a supplied
prefix (there can be thousands of terms for a prefix so start/rows is
absolutely necessary even when prefixing). Choosing whether terms were sorted
by index-order or document-count order would be a plus.
I would love to have this be provided by an extension to the Faceting logic, as
suggested by Yonik and Hoss, incorporating the non-query pathway raised by Erik:
- Assemble the list of term/frequency pairs for a field either by tallying
  the term references found in a DocList, or by using the term frequency
  information found in the index (optimization for non-query case)
- Apply a criterion (RegExp based would obviously be most flexible -- no need
  for full Lucene query syntax -- but prefix-only might be an optimization that
  could be applied in the non-query case) to filter the terms, either during
  assembly or post-facto.
- Apply the faceting criteria (e.g. facet.zeros, though facet.mincount would
  have been a more flexible option in all cases)
- Optionally pass through the BoundedTreeSet/PriorityQueue mechanism to sort
  by frequency and in that case optionally keep only the top facet.limit terms
- Cache the results with the query (including a special key for the non-query
  case) so paging could be done without any requerying, retallying, or resorting
- Return in the response a subrange of the list
- Naturally allow the full complement of response encodings
- (Am I missing anything?)
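
To make the steps above concrete, here is a rough sketch in plain Java. It is not Solr code: TermFacetSketch, tally and page are hypothetical names, a list of term lists stands in for the DocList, and a full sort stands in for the BoundedTreeSet/PriorityQueue step. Caching and response encoding are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

public class TermFacetSketch {

    /** Tally document counts per term; each inner list is one document's terms (hypothetical DocList stand-in). */
    public static SortedMap<String, Integer> tally(List<List<String>> docs) {
        SortedMap<String, Integer> counts = new TreeMap<>(); // TreeMap keeps terms in index (lexical) order
        for (List<String> docTerms : docs) {
            // De-duplicate within a document so each document counts once per term.
            for (String term : new TreeSet<>(docTerms)) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Filter by prefix and minimum count, optionally sort by count, then return the start/rows subrange. */
    public static List<Map.Entry<String, Integer>> page(SortedMap<String, Integer> counts,
                                                        String prefix, int minCount,
                                                        boolean sortByCount, int start, int rows) {
        List<Map.Entry<String, Integer>> matches = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getKey().startsWith(prefix) && e.getValue() >= minCount) {
                matches.add(Map.entry(e.getKey(), e.getValue()));
            }
        }
        if (sortByCount) {
            // Highest count first; ties fall back to term order so paging is stable.
            matches.sort((a, b) -> {
                int byCount = Integer.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            });
        }
        int from = Math.min(Math.max(start, 0), matches.size());
        int to = (int) Math.min((long) from + Math.max(rows, 0), matches.size());
        return new ArrayList<>(matches.subList(from, to));
    }
}
```

In a real implementation the filtered and sorted list would be what gets cached with the query, so that a later request with a different start/rows only needs the final subList step.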
While a commendable endeavor, this is a fair bit of work, and it may take a
while before someone (perhaps me even) steps up to the plate, for performance
if not functional considerations. So IMHO it would also be worthwhile to craft
a simpler index-only version.
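
A minimal sketch of what such an index-only version might look like, again in plain Java with hypothetical names (IndexOnlyTermsSketch, termsWithPrefix). A sorted NavigableMap of term to document count stands in for the index's term dictionary, which is an assumption for illustration and not how Lucene exposes it. Because the terms are sorted, the walk can seek straight to the prefix and stop at the first term that no longer matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

public class IndexOnlyTermsSketch {

    /** Return up to `rows` term/count pairs starting with `prefix`, skipping the first `start` matches. */
    public static List<Map.Entry<String, Integer>> termsWithPrefix(
            NavigableMap<String, Integer> termDict, String prefix, int start, int rows) {
        List<Map.Entry<String, Integer>> page = new ArrayList<>();
        int skipped = 0;
        // tailMap seeks to the first term >= prefix; matching terms are contiguous from there.
        for (Map.Entry<String, Integer> e : termDict.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // sorted order: we are past the prefix range
            }
            if (skipped < start) {
                skipped++;
                continue;
            }
            if (page.size() >= rows) {
                break; // page is full, no need to read further terms
            }
            page.add(Map.entry(e.getKey(), e.getValue()));
        }
        return page;
    }
}
```

This only supports index order; sorting by document count would still require visiting every term under the prefix.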
I would be thrilled if this just magically appeared in Solr's codebase before
I have a chance to build it. :)
Well, after my current deadline (next week) passes, this functionality is on my
task list for my next milestone... so I'd be equally elated if I didn't have
to write it myself. :-)
And adding 2 cents to the other topic in this thread...
As for Hoss's suggestion of a Stats handler - I still hold the opinion that
all of the admin JSPs really ought to be first class request handlers that go
through the whole ResponseWriter stuff, so I can get all of that great
capability in Ruby format instead of XML.
Agreed in principle, though I'm an XML-person.
As it is, to build a Ruby API to Solr and provide access to the stats, there
has to be two different parsing mechanisms. I know he meant index stats, not
Solr admin stats, but it reminded me of the XML pain I'm going to feel in
solrb to add Solr stats :)
I am happy to merely be a spectator of the Rubification of SOLR!
Also,
On Jan 11, 2007, at 3:13 PM, Yonik Seeley wrote:
Attempting to enumerate
all of the values for a field could be dangerous
We do it for faceting :-) But we don't drag it all into memory at once...
Not entirely true: The FieldCache pathway of faceting single-valued fields does
just that. In some cases I've set multiValued=true even when it's not accurate
in order to force the cached-filter pathway.
- J.J.