Hi Jonathan, Thank you very much for your creative suggestions. I also wondered if perhaps combining giftDate and giftAmount into one single token was a possible solution. I'll definitely explore this further using your ideas. I especially like your idea of combing the giftDate and giftAmount into the format '2007-10000000'.
Thanks again, Andre On Sep 15, 2010, at 5:55 PM, Jonathan Rochkind wrote: > I might consider what Erick suggested to actually be 'normalization' rather > than de-normalization! It's just that in Solr you only get one 'table'. > > Here's yet another approach, which will have it's own trade-offs: > > Keep the document as it is, representing a donor. But in addition to indexing > in the multi-valued 'amount' and 'date' fields like you're doing (which will > still be neccesary/useful for some kinds of queries), create a multi-valued > "date-amount" field, where you index tokens like: > > "200-11/14/2006". > > Hmm, I'm actually not sure what that will get you, now that I'm here. > Especially with this case that involves dates, which can be especially > tricky. But combining two values in a de-normalized way is often an approach > with solr for this sort of question, so I'll put it out there anyway in case > it gives you or anyone ideas. Might or might not get you further if in that > combined fields you put things into buckets, if you really aren't going to > need to query down to the day, "200.11-2006", or even an integer > representation of "Nov 2006". > > > > Andre Bickford wrote: >> Thanks for the response Erick. >> >> I did actually try exactly what you suggested. I flipped the index over so >> that a gift is the document. This solution certainly solves the previous >> problem, but introduces a new issue where the search results show duplicate >> donors. If a donor gave 12 times in a year, and we offer full years as facet >> ranges, my understanding is that you'd see that donor 12 times in the search >> results, once for each gift document. Obviously I could do some client side >> filtering to list only distinct donors, but I was hoping to avoid that. >> >> If I've simply stumbled into the basic tradeoffs of denormalization, I can >> live with client side de-duplication, but if you have any further >> suggestions I'm all eyes. >> >> As for sizing, we have some huge charities as clients. However, right now >> I'm testing on a copy of prod data from a smaller client with ~350,000 >> donors and ~8,000,000 gift records. So, when I "flipped" the index around as >> you suggested, it went from 350,000 documents to 8,000,000 documents. No >> issues with performance at all. >> >> Thanks again, >> Andre >> >> -----Original Message----- >> From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, >> September 15, 2010 3:09 PM >> To: solr-user@lucene.apache.org >> Subject: Re: Simple Filter Query (fq) Use Case Question >> >> One strategy is to denormalize all the way. That is, each >> Solr "document" is Gift Amount and Gift Date would not be multiValued. >> You'd create a different "document" for each gift, so you'd have multiple >> documents with the same Id, Name, and Address. Be careful, though, >> if you've defined Id as a UniqueKey, you'd only have one record/donor. You >> can handle this easily enough by making a composite key of Id+Gift Date >> (assuming no donor made more than one gift on exactly the same date). >> >> I know this goes completely against all the reflexes you've built up with >> working with DBs, but... >> >> Can you give us a clue how many donations we're talking about here? >> You'd have to be working with a really big nonprofit to get enough documents >> to have to start worrying about making your index smaller. >> >> HTH >> Erick >> >> On Wed, Sep 15, 2010 at 1:41 PM, Andre Bickford <abickf...@softrek.com>wrote: >> >> >>> I'm working on creating a solr index search for a charitable organization. >>> The solr index stores documents of donors. Each donor document has the >>> following four fields: >>> >>> Id >>> Name >>> Address >>> Gift Amount (multiValued) >>> Gift Date (multiValued) >>> >>> In our relational database, there is a one-to-many relationship between the >>> DONOR table and the GIFT table. One donor can of course give many gifts over >>> time. Consequently, I created the Gift Amount and Gift Date fields to be >>> mutiValued. >>> >>> Now, consider the following query filtered for gifts last month between $0 >>> and $100: >>> >>> q=name:Jones >>> fq=giftDate:[NOW/MONTH-1 TO NOW/MONTH] >>> fq=giftAmount:[0 TO 100] >>> >>> The results show me donors who donated ANY amount in the past month and >>> donors who had EVER in the past given a gift between $0 and $100. I was >>> hoping to only see donors who had given a gift between $0 and $100 in the >>> past month exclusively. I believe the problem is that I neglected to >>> consider that for two multiValued fields, while the values might align >>> "index wise", there is really no other association between the two fields, >>> so the filter query intersection isn't really behaving as I expected. >>> >>> I think this is a fundamental question of one-to-many denormalization, but >>> obviously I'm not yet experienced enough with Lucene/Solr to find a >>> solution. As to why not just keep using a relational database, it's because >>> I'm trying to provide a faceting solution to "drill down" to donors. The >>> aforementioned fq parameters would come from faceting. Oh, that and Oracle >>> Text indexes are a PITA. :-) >>> >>> Thanks for any help you can provide. >>> >>> André Bickford >>> Software Engineering Team Leader >>> SofTrek Corporation >>> 30 Bryant Woods North Amherst, NY 14228 >>> 716.691.2800 x154 800.442.9211 Fax: 716.691.2828 >>> abickf...@softrek.com www.softrek.com >>> >>> >>> >>> >> >> >> >>