I might consider what Erick suggested to actually be 'normalization' rather than de-normalization! It's just that in Solr you only get one 'table'.

Here's yet another approach, which will have it's own trade-offs:

Keep the document as it is, representing a donor. But in addition to indexing in the multi-valued 'amount' and 'date' fields like you're doing (which will still be neccesary/useful for some kinds of queries), create a multi-valued "date-amount" field, where you index tokens like:

"200-11/14/2006".

Hmm, I'm actually not sure what that will get you, now that I'm here. Especially with this case that involves dates, which can be especially tricky. But combining two values in a de-normalized way is often an approach with solr for this sort of question, so I'll put it out there anyway in case it gives you or anyone ideas. Might or might not get you further if in that combined fields you put things into buckets, if you really aren't going to need to query down to the day, "200.11-2006", or even an integer representation of "Nov 2006".



Andre Bickford wrote:
Thanks for the response Erick.

I did actually try exactly what you suggested. I flipped the index over so that 
a gift is the document. This solution certainly solves the previous problem, 
but introduces a new issue where the search results show duplicate donors. If a 
donor gave 12 times in a year, and we offer full years as facet ranges, my 
understanding is that you'd see that donor 12 times in the search results, once 
for each gift document. Obviously I could do some client side filtering to list 
only distinct donors, but I was hoping to avoid that.

If I've simply stumbled into the basic tradeoffs of denormalization, I can live 
with client side de-duplication, but if you have any further suggestions I'm 
all eyes.

As for sizing, we have some huge charities as clients. However, right now I'm testing on 
a copy of prod data from a smaller client with ~350,000 donors and ~8,000,000 gift 
records. So, when I "flipped" the index around as you suggested, it went from 
350,000 documents to 8,000,000 documents. No issues with performance at all.

Thanks again,
Andre

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, September 15, 2010 3:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Simple Filter Query (fq) Use Case Question

One strategy is to denormalize all the way. That is, each
Solr "document" is Gift Amount and Gift Date would not be multiValued.
You'd create a different "document" for each gift, so you'd have multiple
documents with the same Id, Name, and Address. Be careful, though,
if you've defined Id as a UniqueKey, you'd only have one record/donor. You
can handle this easily enough by making a composite key of Id+Gift Date
(assuming no donor made more than one gift on exactly the same date).

I know this goes completely against all the reflexes you've built up with
working with DBs, but...

Can you give us a clue how many donations we're talking about here?
You'd have to be working with a really big nonprofit to get enough documents
to have to start worrying about making your index smaller.

HTH
Erick

On Wed, Sep 15, 2010 at 1:41 PM, Andre Bickford <abickf...@softrek.com>wrote:

I'm working on creating a solr index search for a charitable organization.
The solr index stores documents of donors. Each donor document has the
following four fields:

Id
Name
Address
Gift Amount (multiValued)
Gift Date (multiValued)

In our relational database, there is a one-to-many relationship between the
DONOR table and the GIFT table. One donor can of course give many gifts over
time. Consequently, I created the Gift Amount and Gift Date fields to be
mutiValued.

Now, consider the following query filtered for gifts last month between $0
and $100:

q=name:Jones
fq=giftDate:[NOW/MONTH-1 TO NOW/MONTH]
fq=giftAmount:[0 TO 100]

The results show me donors who donated ANY amount in the past month and
donors who had EVER in the past given a gift between $0 and $100. I was
hoping to only see donors who had given a gift between $0 and $100 in the
past month exclusively. I believe the problem is that I neglected to
consider that for two multiValued fields, while the values might align
"index wise", there is really no other association between the two fields,
so the filter query intersection isn't really behaving as I expected.

I think this is a fundamental question of one-to-many denormalization, but
obviously I'm not yet experienced enough with Lucene/Solr to find a
solution. As to why not just keep using a relational database, it's because
I'm trying to provide a faceting solution to "drill down" to donors. The
aforementioned fq parameters would come from faceting. Oh, that and Oracle
Text indexes are a PITA. :-)

Thanks for any help you can provide.

André Bickford
Software Engineering Team Leader
SofTrek Corporation
30 Bryant Woods North  Amherst, NY 14228
716.691.2800 x154  800.442.9211  Fax: 716.691.2828
abickf...@softrek.com  www.softrek.com






Reply via email to