Hi Erick,
I like your idea. FWIW, please also leave room for boosting by a function query
that takes many numeric fields as input but produces a single value. I don't know
if this counts as a really clever function, but here's one that I currently use:
{!boost b=pow(sum(log(sum(product(boosted,9000),product(product(image,stocked),300),product(product(image,taxonomyCategoryTypeId),300),product(product(image,sales),150),product(stocked,2),product(sales,2),views)),1),3)}
Note that image is an int/bool field (1 = has image, 0 = no image), hence all the
product(product(image,...),...) terms above: they zero out those boosts if there
isn't an image!
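
For anyone who wants to check the arithmetic outside Solr, here is a minimal Python sketch of the same expression. It assumes Solr's log() is base 10 and that the summed value is positive; the function name boost and the sample field values are made up for illustration.

```python
import math

def boost(boosted, image, stocked, taxonomyCategoryTypeId, sales, views):
    """Mirror of the b= expression above; log is taken as base 10."""
    inner = (
        boosted * 9000
        + image * stocked * 300                  # zero when image == 0
        + image * taxonomyCategoryTypeId * 300   # zero when image == 0
        + image * sales * 150                    # zero when image == 0
        + stocked * 2
        + sales * 2
        + views
    )
    # math.log10 raises ValueError when inner <= 0; this sketch does not guard it.
    return (math.log10(inner) + 1) ** 3

# Hypothetical documents that differ only in the image flag.
with_image = boost(boosted=0, image=1, stocked=1,
                   taxonomyCategoryTypeId=2, sales=10, views=100)
no_image = boost(boosted=0, image=0, stocked=1,
                 taxonomyCategoryTypeId=2, sales=10, views=100)
print(with_image > no_image)
```

With these sample values the inner sum drops from 2522 to 122 when the image flag is 0, so only the stocked, sales and views terms survive.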
Thanks
Robi
-----Original Message-----
From: Erick Erickson [mailto:[email protected]]
Sent: Tuesday, November 12, 2013 9:01 AM
To: [email protected]
Subject: Sorting memory-efficiently by any numeric field (dates too?)
Before I go and pat myself on the back, what do people think about this trick?
The base problem is "Is there a space-efficient way to return the top N
documents, sorted by a numeric field". The numeric field includes dates.
It come to me in a vision in a flash! (The Pickle Song, Arlo Guthrie). If we
could return the numeric field in question as the score of a document it should
work without allocating the internal arrays for holding all the timestamps.
So what about something like this?
/select?q={!boost b=manufacturedate_dt}text:*
and, for the reverse order,
/select?q={!boost b=div(1,manufacturedate_dt)}text:*
It works on the test data. So let's assume that we're space constrained. It
_seems_ like this would only allocate enough space for the top N documents in
the result set which is insignificant in terms of memory consumption for a
large number of documents in a core. Any obvious problems that people see?
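
The memory argument above can be sketched with a bounded min-heap. This is a minimal illustration only, not Lucene's or Solr's actual collector code; the function name top_n_by_score and the generated scores are hypothetical.

```python
import heapq

def top_n_by_score(scored_docs, n):
    """Return the n highest (score, doc_id) pairs, best first.

    At most n entries are held at any time, no matter how many
    documents are streamed through.
    """
    if n <= 0:
        return []
    heap = []  # min-heap: heap[0] is the weakest of the current top n
    for score, doc_id in scored_docs:
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif (score, doc_id) > heap[0]:
            # New entry beats the weakest kept entry, so swap it in.
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

# Hypothetical stream of 100,000 docs; the score stands in for the field value.
docs = ((float((i * 7919) % 100003), i) for i in range(100_000))
top = top_n_by_score(docs, 10)
print(len(top))
```

The generator is consumed one document at a time, so the only storage that grows with the request is the 10-entry heap.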
I see a couple of shortcomings:
1> You only get one field. Unless you can create a really clever function
that incorporates all the values in multiple fields, this is going to be hard
to use with more than one field.
2> The boost syntax doesn't allow for a *:*, so you have to specify an
existing field. If there happen to be documents that don't have anything in the
field, you'll miss them.
3> I'm not sure what the performance issues are, especially in the case
where _every_ document scores better than the current top-N.
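
To get a feel for the case in point 3, the following Python sketch counts how often a size-N min-heap has to evict an entry, depending on the order in which scores arrive. It is a toy model under the assumption that the top-N is kept in a bounded heap with N >= 1; count_replacements is a made-up helper, not a Solr or Lucene API.

```python
import heapq

def count_replacements(scores, n):
    """Count how often a new score evicts an entry from a size-n min-heap (n >= 1)."""
    heap, replaced = [], 0
    for s in scores:
        if len(heap) < n:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            heapq.heapreplace(heap, s)
            replaced += 1
    return replaced

# Scores arriving best-first: nothing after the first n ever gets in.
best_case = count_replacements(range(1000, 0, -1), 10)
# Scores arriving worst-first: every later document beats the current top n.
worst_case = count_replacements(range(1, 1001), 10)
print(best_case, worst_case)  # 0 990
```

In this model each eviction costs O(log N), so the worst case adds work per document but no extra memory.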
Erick