Hi Jonathan,

Thank you very much for your creative suggestions. I also wondered if perhaps 
combining giftDate and giftAmount into one single token was a possible 
solution. I'll definitely explore this further using your ideas. I especially 
like your idea of combing the giftDate and giftAmount  into the format 
'2007-10000000'.

Thanks again,
Andre

On Sep 15, 2010, at 5:55 PM, Jonathan Rochkind wrote:

> I might consider what Erick suggested to actually be 'normalization' rather 
> than de-normalization!  It's just that in Solr you only get one 'table'.
> 
> Here's yet another approach, which will have it's own trade-offs:
> 
> Keep the document as it is, representing a donor. But in addition to indexing 
> in the multi-valued 'amount' and 'date' fields like you're doing (which will 
> still be neccesary/useful for some kinds of queries), create a multi-valued 
> "date-amount" field, where you index tokens like:
> 
> "200-11/14/2006".
> 
> Hmm, I'm actually not sure what that will get you, now that I'm here. 
> Especially with this case that involves dates, which can be especially 
> tricky.  But combining two values in a de-normalized way is often an approach 
> with solr for this sort of question, so I'll put it out there anyway in case 
> it gives you or anyone ideas. Might or might not get you further if in that 
> combined fields you put things into buckets, if you really aren't going to 
> need to query down to the day, "200.11-2006", or even an integer 
> representation of "Nov 2006".
> 
> 
> 
> Andre Bickford wrote:
>> Thanks for the response Erick.
>> 
>> I did actually try exactly what you suggested. I flipped the index over so 
>> that a gift is the document. This solution certainly solves the previous 
>> problem, but introduces a new issue where the search results show duplicate 
>> donors. If a donor gave 12 times in a year, and we offer full years as facet 
>> ranges, my understanding is that you'd see that donor 12 times in the search 
>> results, once for each gift document. Obviously I could do some client side 
>> filtering to list only distinct donors, but I was hoping to avoid that.
>> 
>> If I've simply stumbled into the basic tradeoffs of denormalization, I can 
>> live with client side de-duplication, but if you have any further 
>> suggestions I'm all eyes.
>> 
>> As for sizing, we have some huge charities as clients. However, right now 
>> I'm testing on a copy of prod data from a smaller client with ~350,000 
>> donors and ~8,000,000 gift records. So, when I "flipped" the index around as 
>> you suggested, it went from 350,000 documents to 8,000,000 documents. No 
>> issues with performance at all.
>> 
>> Thanks again,
>> Andre
>> 
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, 
>> September 15, 2010 3:09 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Simple Filter Query (fq) Use Case Question
>> 
>> One strategy is to denormalize all the way. That is, each
>> Solr "document" is Gift Amount and Gift Date would not be multiValued.
>> You'd create a different "document" for each gift, so you'd have multiple
>> documents with the same Id, Name, and Address. Be careful, though,
>> if you've defined Id as a UniqueKey, you'd only have one record/donor. You
>> can handle this easily enough by making a composite key of Id+Gift Date
>> (assuming no donor made more than one gift on exactly the same date).
>> 
>> I know this goes completely against all the reflexes you've built up with
>> working with DBs, but...
>> 
>> Can you give us a clue how many donations we're talking about here?
>> You'd have to be working with a really big nonprofit to get enough documents
>> to have to start worrying about making your index smaller.
>> 
>> HTH
>> Erick
>> 
>> On Wed, Sep 15, 2010 at 1:41 PM, Andre Bickford <abickf...@softrek.com>wrote:
>> 
>>  
>>> I'm working on creating a solr index search for a charitable organization.
>>> The solr index stores documents of donors. Each donor document has the
>>> following four fields:
>>> 
>>> Id
>>> Name
>>> Address
>>> Gift Amount (multiValued)
>>> Gift Date (multiValued)
>>> 
>>> In our relational database, there is a one-to-many relationship between the
>>> DONOR table and the GIFT table. One donor can of course give many gifts over
>>> time. Consequently, I created the Gift Amount and Gift Date fields to be
>>> mutiValued.
>>> 
>>> Now, consider the following query filtered for gifts last month between $0
>>> and $100:
>>> 
>>> q=name:Jones
>>> fq=giftDate:[NOW/MONTH-1 TO NOW/MONTH]
>>> fq=giftAmount:[0 TO 100]
>>> 
>>> The results show me donors who donated ANY amount in the past month and
>>> donors who had EVER in the past given a gift between $0 and $100. I was
>>> hoping to only see donors who had given a gift between $0 and $100 in the
>>> past month exclusively. I believe the problem is that I neglected to
>>> consider that for two multiValued fields, while the values might align
>>> "index wise", there is really no other association between the two fields,
>>> so the filter query intersection isn't really behaving as I expected.
>>> 
>>> I think this is a fundamental question of one-to-many denormalization, but
>>> obviously I'm not yet experienced enough with Lucene/Solr to find a
>>> solution. As to why not just keep using a relational database, it's because
>>> I'm trying to provide a faceting solution to "drill down" to donors. The
>>> aforementioned fq parameters would come from faceting. Oh, that and Oracle
>>> Text indexes are a PITA. :-)
>>> 
>>> Thanks for any help you can provide.
>>> 
>>> André Bickford
>>> Software Engineering Team Leader
>>> SofTrek Corporation
>>> 30 Bryant Woods North  Amherst, NY 14228
>>> 716.691.2800 x154  800.442.9211  Fax: 716.691.2828
>>> abickf...@softrek.com  www.softrek.com
>>> 
>>> 
>>> 
>>>    
>> 
>> 
>> 
>>  

Reply via email to