Re: Help with denormalizing issues

2009-10-09 Thread Chris Hostetter

: business needs.  Our problem is that we have a catalog schema with 
: products and skus, one to many.  The most relevant content being indexed 
: is at the product level, in the name and description fields.  However we 
: are interested in filtering by sku attributes, and in particular making 
: multiple filters apply to a single sku.  For example, find a product 

the first rule of denormalization is to construct documents based on the 
granularity you want to get back -- because from a user perspective, 
that level of granularity is what's going to make sense for 
faceting/filtering.  

If you want your results to be product based, have one doc per product -- 
if you want your results to be sku based, have one doc per sku, and 
denormalize the product data redundantly into every sku.

If sometimes you want to return product data, and other times you want to 
return sku data, then create both types of documents (either in different 
indexes, or in the same index but with a doctype field that you can filter 
on)
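A minimal sketch of the doctype approach described above, in Solr's XML update format. The field names here (doctype, product_id, color, on_sale) are illustrative, not taken from the original thread:

```xml
<add>
  <!-- One document per product -->
  <doc>
    <field name="doctype">product</field>
    <field name="id">prod-42</field>
    <field name="name">Widget</field>
  </doc>
  <!-- One document per sku, with product data denormalized in -->
  <doc>
    <field name="doctype">sku</field>
    <field name="id">sku-42-blue</field>
    <field name="product_id">prod-42</field>
    <field name="name">Widget</field> <!-- copied redundantly from the product -->
    <field name="color">blue</field>
    <field name="on_sale">true</field>
  </doc>
</add>
```

At query time, a filter query such as fq=doctype:sku restricts a search to one document type, so both kinds of documents can live in the same index.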



-Hoss



RE: Help with denormalizing issues

2009-10-07 Thread Eric Reeves
Hi again, I'm gonna try this again with more focus this time :D

1) Ideally what we would like to do, is plug in an additional mechanism to 
filter the initial result set, because we can't find a way to implement our 
filtering needs as filter queries against a single index.  We would want to do 
this while maintaining support for paging.  Looking through the codebase it 
looks as if this would not be possible without major surgery, due to the paging 
support being implemented deep inside private methods of SolrIndexSearcher.  
Does this sound accurate? 

2) If we pursue the other option of indexing skus and collapsing the results 
based on product id using the field collapsing patch, is there any validity to 
my concerns about indexing the same content multiple times skewing the scoring?

3) Does anyone have experience using the field collapsing patch, and have any 
idea how much additional overhead it incurs?

Thanks,
Eric

-Original Message-
From: Eric Reeves 
Sent: Monday, October 05, 2009 6:19 PM
To: solr-user@lucene.apache.org
Subject: Help with denormalizing issues

Hi there,

I'm evaluating Solr as a replacement for our current search server, and am 
trying to determine what the best strategy would be to implement our business 
needs.  Our problem is that we have a catalog schema with products and skus, 
one to many.  The most relevant content being indexed is at the product level, 
in the name and description fields.  However we are interested in filtering by 
sku attributes, and in particular making multiple filters apply to a single 
sku.  For example, find a product that contains a sku that is both blue and on 
sale.  No approach I've tried at collapsing the sku data into the product 
document works for this.  If we put the data in separate fields, there's no way 
to apply multiple filters to the same sku, and if we concatenate all of the 
relevant sku data into a single multivalued field then as I understand it, this 
is just indexed as one large field with extra whitespace between the individual 
entries, so there's still no way to enforce that an AND filter query applies to 
the same sku.
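The cross-sku false-positive problem described above can be demonstrated with a toy sketch (plain Python, not Solr code; the product data is invented for illustration):

```python
# A hypothetical product with two skus: one blue (not on sale), one red (on sale).
# Sku attributes are flattened into parallel multivalued fields.
product = {
    "name": "Widget",
    "sku_color": ["blue", "red"],
    "sku_on_sale": [False, True],
}

def matches_fielded(doc, color, on_sale):
    """Simulates two independent filter queries on separate multivalued
    fields: each filter matches on its own, losing the per-sku pairing."""
    return color in doc["sku_color"] and on_sale in doc["sku_on_sale"]

def matches_per_sku(doc, color, on_sale):
    """What is actually wanted: both conditions satisfied by the SAME sku."""
    return any(c == color and s == on_sale
               for c, s in zip(doc["sku_color"], doc["sku_on_sale"]))

print(matches_fielded(product, "blue", True))   # True  -- false positive
print(matches_per_sku(product, "blue", True))   # False -- no single sku is blue AND on sale
```

No single sku is both blue and on sale, yet the fielded version matches, because each filter is free to match a different sku.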

One approach I was considering was to create separate indexes for products and 
skus, and store the product IDs in the sku documents.  Then we could apply our 
own filters to the initially generated list, based on unique query parameters.  
I thought creating a component between query and facet would be a good place to 
add such a filter, but further research seems to indicate that this would break 
paging and sorting.  The only other thing I can think of would be to subclass 
QueryComponent itself, which looks rather daunting: the process() method has no 
hooks for this sort of thing, it seems I would have to copy the entire existing 
implementation and add them myself, which looks to be a fair chunk of work and 
brittle to changes in the trunk code.  Ideally it would be nice to be able to 
handle certain fq parameters in a completely different way, perhaps using a 
custom query parser, but I haven't wrapped my head around how those work.  Does 
any of this sound remotely doable?  Any advice?
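To make the paging concern concrete, here is a toy sketch (not Solr code, just an illustration of the ordering problem) of why filtering the result set after the query component has applied start/rows breaks paging:

```python
def paged_then_filtered(hits, allowed, start, rows):
    """What a post-query filter component would effectively do:
    paging is applied first, then disallowed hits are dropped,
    so pages come back short and page boundaries drift."""
    page = hits[start:start + rows]
    return [h for h in page if h in allowed]

def filtered_then_paged(hits, allowed, start, rows):
    """What correct paging requires: filter the full ranked list,
    then page -- which is why the filter has to participate in the
    search itself (e.g. as a real filter query) to stay cheap."""
    kept = [h for h in hits if h in allowed]
    return kept[start:start + rows]

hits = list(range(10))       # ranked doc ids 0..9
allowed = {0, 2, 4, 6, 8}    # docs surviving the external filter

print(paged_then_filtered(hits, allowed, start=0, rows=4))  # [0, 2] -- short page
print(filtered_then_paged(hits, allowed, start=0, rows=4))  # [0, 2, 4, 6]
```

Getting the second behavior from outside the search means re-fetching and re-filtering everything up to the requested page, which defeats the purpose.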

The other suggestion we are looking at was given to us by our current search 
provider, which is to index the skus themselves.  It looks as if we may be able 
to make this work using the field collapsing patch from SOLR-236.  I have some 
concerns about this approach though: 1) It will make for a much larger index 
and longer indexing times (products can have 10 or more skus in our catalog).  
2) Because the indexing will be copying the description and name from the 
product it will be indexing the same content more than once, and the number of 
times per product will vary based on the number of skus.  I'm concerned that 
this may skew the scoring algorithm, in particular the inverse frequency part.  
3) I'm not sure about the performance of the field collapsing patch, I've read 
contradictory reports on the web.
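For reference, the SOLR-236 patch was driven by request parameters; the exact names varied across revisions of the patch, but a typical request looked something like the following (parameter names are from the patch as commonly described and should be treated as illustrative):

```
q=name:widget
&fq=color:blue
&fq=on_sale:true
&collapse.field=product_id
&collapse.max=1
```

The idea is that sku-level documents are searched and filtered normally, then collapsed down to one result per product_id.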

I apologize if this is a bit rambling.  If anyone has any advice for our 
situation it would be very helpful.

Thanks,
Eric


Re: Help with denormalizing issues

2009-10-07 Thread Lance Norskog
The separate sku entries do not become one long text string. They are
separate values in the same field, and the relevance calculation is
completely separate per value.

The performance problem with the field collapsing patch is that it
does the same thing as a facet or sorting operation: it does a sweep
through the index and builds a data structure whose size depends on
the index. Faceting is not cached directly but still works very
quickly the second time. Sorting has its own cache and is very slow (N
log N) the first time and very fast afterwards. The field collapsing
patch does not cache any of its work and is almost as slow the second
time as the first time.
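The value-separation behavior comes from the field type's positionIncrementGap setting in schema.xml: a large position gap between consecutive values keeps phrase and proximity queries from matching across value boundaries. A typical declaration (illustrative, not from the thread):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Note, though, that the gap only affects position-sensitive queries; two independent filter queries on separate multivalued fields can still each match a different value, which is exactly the cross-sku problem Eric describes.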




-- 
Lance Norskog
goks...@gmail.com