Even if you go for the 400,000 documents way, the size of data and number of
unique tokens would remain the same. With your data size, you should think
about sharding and distributed search.
Is the availability of a product a boolean value or the number of items? To
make sure that you don't need to do very frequent updates, it will be better
to use a boolean for availability. Even then, real-time updates in Solr are
not possible and you will have to allow for a reasonable delay for changes
to take effect.

On Tue, Nov 25, 2008 at 4:40 AM, Vimal Jobanputra <[EMAIL PROTECTED]>wrote:

> Hi, and apologies in advance for the lengthy question!
>
> I'm looking to use Solr to power searching & browsing over a large set of
> product data stored in a relational db. I'm wandering what the most
> appropriate schema design strategy to use is. A simplified view of the
> relational data is:
>
> Shop (~1000 rows)
> -Id*
> -Name
>
> Product (~300,000 rows)
> -Id*
> -Name
> -Availability
>
> ProductFormat (~5 rows)
> -Id*
> -Name
>
> Component (part of a product that may be sold separately) (~4,000,000 rows)
> -Id*
> -Name
>
> ProductComponent (~4,000,000 rows)
> -ProductId*
> -ComponentId*
>
> ShopProduct (~6,000,000 rows)
> -ShopId*
> -ProductId*
> -ProductFormatId*
> -AvailableDate
>
> ShopProductPriceList (~15,000,000 rows)
> -ShopId*
> -ProductId*
> -ProductFormatId*
> -Applicability (Component/Product)*
> -Type (Regular/SalePrice)*
> -Amount
>
> * logical primary key
>
> This means:
> -availability of a product differs from shop to shop
> -the price of a product or component is dependent on the format, and also
> differs from shop to shop
>
> Textual searching is required over product & component names, and filtering
> is required over Shops, Product Availability, Formats, & Prices.
>
> The simplest approach would be to flatten out the data completely (1 solr
> document per ShopProduct and ShopProductComponent). This would result in
> ~80million documents, which I'm guessing this would need some form of
> sharding/distribution
>
> An alternate approach would be to construct one document per Product, and
> *nest* the relational data via dynamic fields (and possibly plugins?)
> Eg one document per Product; multi-value fields for ProductComponent &
> Shop;
> dynamic fields for Availability/Format, using ShopId as part of the field
> name.
> This approach would result in far fewer documents (400,000), but more
> complex queries. It would also require extending Solr/Lucene to search over
> ProductComponents and filter by price, which I'm not quite clear on as
> yet...
>
> Any guidance on which of the two general approaches (or others) to explore
> further?
>
> Thanks!
> Vim
>



-- 
Regards,
Shalin Shekhar Mangar.

Reply via email to