Re: RE: Indexing Question for large dataset

2011-04-14 Thread karsten-solr
Hi Joshua,

What is the use case?
Do you need the facets for only one field per query?
Do you need all facet values, or only the first 10 with .sort=index
(FACET_SORT_INDEX / numeric order) or with .sort=count (FACET_SORT_COUNT)?
How many different facet values do you have per field?
Do you need these fields only for faceted search?


Your problem will be that Solr normally keeps an int[searcher.maxDoc()] array
in main memory for each field with facets.
You can avoid this by using .method=enum, which is probably not a good fit in
your case.
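
To get a feeling for the numbers, here is a rough back-of-the-envelope sketch
in Python. It assumes 4 bytes per array entry and takes maxDoc to be about the
150,000 products mentioned in this thread; the function name is made up for
this illustration, and the estimate ignores the term values themselves and any
JVM overhead.

```python
# Rough estimate only: one 4-byte int per document per faceted field,
# ignoring the term values and JVM overhead (assumptions of this sketch).
BYTES_PER_INT = 4


def facet_array_bytes(max_doc: int, faceted_fields: int) -> int:
    """Bytes needed for one int[max_doc] array per faceted field."""
    return max_doc * BYTES_PER_INT * faceted_fields


if __name__ == "__main__":
    max_doc = 150_000  # roughly the number of products mentioned in this thread
    for fields in (1, 100, 2000, 5000):
        mib = facet_array_bytes(max_doc, fields) / (1024 * 1024)
        print(f"{fields:>5} faceted fields: about {mib:,.1f} MiB")
```

Under these assumptions a single field is cheap, and the cost only becomes
serious if many of the price list fields are faceted at the same time.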

Because you do not have multiple tokens per document, your facets will be
computed by SimpleFacets#getFieldCacheCounts. In version 3.1 you will find a
TODO there that fits your needs :-(
In this method you will also see that it indirectly uses a WeakHashMap, so if
you only use 100 fields per hour you should not have a problem :-)
But there will be no warm-up for your application (the first facet search will
take a while).

From my point of view, you should write your own Solr plugin for this
purpose. This is not so hard, I assure you.

Best regards
  Karsten





Indexing Question for large dataset

2011-04-13 Thread Joshua Bouchair
We have a B2C/B2B e-commerce application with a large number of price lists,
currently 2000+ and growing. They want to index price to have facets and
sorting. That seems like a lot of columns to index; example below:

INDEX COLUMN       | INDEX DATA

Name               | Test
Price              | 1.00
PriceList1Price    | .99
PriceList2Price    | .79
PriceList3Price    | .89
...
PriceList500Price  | .85

It seems to me that indexing that amount of data is wrong, but of course I do 
not know. I was wondering if anyone has any suggestions as to what should be 
done or is this not even possible?

Thank you very much,
Josh B.

The recipient of this email should check this email and any attachments for the 
presence of viruses. 
The Wasserstrom Companies accepts no liability for any damage caused by any 
virus transmitted by this email.

This footnote also confirms that this email message has been scanned for the 
presence of computer viruses.

The Wasserstrom Companies


Re: Indexing Question for large dataset

2011-04-13 Thread kenf_nc
Indexing isn't a problem, it's just disk space and space is cheap. But, if
you do facets on all those price columns, that gets put into RAM which isn't
as cheap or plentiful. Your cache buffers may get overloaded a lot and
performance will suffer.

2000 price columns seems like a lot, could the documents be organized
differently? Hard to tell from your example.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Question-for-large-dataset-tp2816344p2816377.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Indexing Question for large dataset

2011-04-13 Thread Joshua Bouchair
I don't know of any other way to organize the documents. We need to have the
specific price that belongs to the user, so I don't think that the facets would
be the issue. The facet query would be changed to use the corresponding price
list field for that user. Let's say the customer belongs to priceList1500: I
would use the price from that column (priceList1500) instead of the priceList1
or even the price column. Let me post the example data in another way.

INDEX FIELD  | INDEX DATA

ID           | 1     (INDEXED | STORED)
NAME         | TEST  (INDEXED | STORED | MULTIVALUED)
PRICE        | 1.00  (INDEXED)
PRICELIST1   | 0.99  (INDEXED)
PRICELIST2   | 0.89  (INDEXED)
PRICELIST500 | 0.85  (INDEXED)

ID           | 2     (INDEXED | STORED)
NAME         | TEST2 (INDEXED | STORED | MULTIVALUED)
PRICE        | 1.10  (INDEXED)
PRICELIST1   | 1.09  (INDEXED)
PRICELIST250 | 1.05  (INDEXED)
PRICELIST600 | 1.03  (INDEXED)

The price lists correspond to customer contracts with the company for
contracted item pricing. Is there a specific limit on the number of index
columns Solr/Lucene can handle? Is there a better way of handling this? Do you
see an issue with RAM from what I am stating here? Also, with the index so
huge, let's say 5000 columns per data set, will that degrade search
performance dramatically (note that the search fields of course would not be
all those columns)?

Example Query:
q=name&fl=NAME,ID&facet=true&facet.field=PRICELIST500
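
As a minimal sketch of how the facet field could be swapped per customer, the
Python snippet below builds the query string from a price list number. The
function name and its arguments are made up for this illustration; only the
parameter names (q, fl, facet, facet.field) and the PRICELIST field naming come
from the example above.

```python
from urllib.parse import urlencode


def facet_query(price_list_id: int, text: str = "name") -> str:
    """Build the query string, faceting on the customer's own price list field."""
    params = {
        "q": text,
        "fl": "NAME,ID",
        "facet": "true",
        # Hypothetical mapping: the customer's price list number selects the field.
        "facet.field": f"PRICELIST{price_list_id}",
    }
    # safe="," keeps the comma in the fl list readable instead of %2C.
    return urlencode(params, safe=",")


if __name__ == "__main__":
    print(facet_query(500))
    print(facet_query(1500))
```

Only one price list field is faceted per request this way, which is the point
of the question: the other columns are indexed but not touched by the query.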

Thanks,
Josh B.



RE: Indexing Question for large dataset

2011-04-13 Thread kenf_nc
Is NAME a product name? Why would it be multivalue? And why would it appear
on more than one document?  Is each 'document' a package of products? And
the pricing tiers are on the package, not individual pieces?

So sounds like you could, potentially, have a PriceListX column for each
user. As your User base grows, the number of columns you need may grow (you
already bumped up from 2000 to 5000 in the space of a couple posts :) ). Is
that right?

How many products (or packages of products) do you have? Could you flip this
on its ear and make a User the document? Then it could have just 3
multivalued fields (beyond any you need to identify the user, like user_id):
product_id
product_name
product_price

Downside is if a new product is introduced you have to re-index all users
that have a price point on that product.  
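
A rough sketch of that flipped layout, in Python: it groups per-user price rows
into one document per user with the three multivalued fields named above. The
input row shape and the function name are assumptions for illustration, and the
sketch relies on the three lists staying aligned by position.

```python
from collections import defaultdict


def user_documents(price_rows):
    """Group (user_id, product_id, product_name, product_price) rows by user."""
    grouped = defaultdict(
        lambda: {"product_id": [], "product_name": [], "product_price": []}
    )
    for user_id, product_id, product_name, product_price in price_rows:
        fields = grouped[user_id]
        # Append to all three lists together so position i describes one product.
        fields["product_id"].append(product_id)
        fields["product_name"].append(product_name)
        fields["product_price"].append(product_price)
    return [{"user_id": user_id, **fields} for user_id, fields in grouped.items()]


if __name__ == "__main__":
    rows = [
        ("u1", 1, "Product A", 0.99),
        ("u1", 2, "Product B", 1.09),
        ("u2", 1, "Product A", 0.85),
    ]
    for doc in user_documents(rows):
        print(doc)
```

The downside mentioned above shows up directly: adding one product means
rebuilding the document of every user that has a price for it.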


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Question-for-large-dataset-tp2816344p2816994.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Indexing Question for large dataset

2011-04-13 Thread Joshua Bouchair
Name equals the product name. 

Each separate product can have 1 to n prices based upon pricelist.

A single document represents that single product.

<doc>
  <field name="id">1</field>
  <field name="name">The product name.</field>
  <field name="price">1.00</field>
  <field name="priceList1Price">0.99</field>
  <field name="priceList2Price">0.98</field>
  <field name="priceList1500Price">0.85</field>
</doc>
<doc>
  <field name="id">2</field>
  <field name="name">The product name.</field>
  <field name="price">1.10</field>
  <field name="priceList1Price">1.09</field>
  <field name="priceList2Price">1.08</field>
  <field name="priceList1500Price">1.05</field>
</doc>
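
For illustration only, documents of this shape could be generated from a
product and its price list prices roughly like this in Python. The function
name and the dictionary input are assumptions of this sketch; the field naming
(priceList<N>Price) follows the example above. For an actual update request the
doc elements would still need to be wrapped in an add element.

```python
import xml.etree.ElementTree as ET


def product_doc(product_id, name, price, list_prices):
    """Build one <doc> element; list_prices maps price list number to price."""
    doc = ET.Element("doc")

    def add_field(field_name, value):
        field = ET.SubElement(doc, "field", name=field_name)
        field.text = str(value)

    add_field("id", product_id)
    add_field("name", name)
    add_field("price", f"{price:.2f}")
    # One field per price list the product actually has a price for.
    for list_no, list_price in sorted(list_prices.items()):
        add_field(f"priceList{list_no}Price", f"{list_price:.2f}")
    return doc


if __name__ == "__main__":
    doc = product_doc(1, "The product name.", 1.00, {1: 0.99, 2: 0.98, 1500: 0.85})
    print(ET.tostring(doc, encoding="unicode"))
```

A product only carries fields for the price lists it belongs to, so the number
of fields per document can stay far below the total number of price lists.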

Yes, the number of price lists could grow from 1000 to 5000 as the user base
grows.

There are currently about 150,000 products.

We do need to index the products, since they change frequently.

Thanks everyone for all your responses so far!
