Re: RE: Indexing Question for large dataset
Hi Joshua,

What is the use case? Do you need the facets for only one field per query? Do you need all facet values, or only the first 10 with .sort=index (FACET_SORT_INDEX / numeric order) or with .sort=count (FACET_SORT_COUNT)? How many different facet values do you have per field? Do you need these fields only for faceted search?

Your problem will be that Solr normally puts an int[searcher.maxDoc()] array in main memory for each field with facets. You can avoid this by using .method=enum, although that probably does not suit your case. Because you do not have multiple tokens per document, your facets will be computed by SimpleFacets#getFieldCacheCounts. In version 3.1 you will find a TODO there that fits your needs :-( In this method you will also see that it indirectly uses a WeakHashMap, so if you only use 100 fields per hour you should not have a problem :-) But there will be no warm-up for your application (the first facet search will take a while).

From my point of view you should write your own Solr plugin for your purpose. This is not so hard, I assure you.

Best regards
Karsten

Joshua

Name equals the product name. Each separate product can have 1 to n prices based upon price list. A single document represents that single product.

<doc>
  <field name="id">1</field>
  <field name="name">The product name.</field>
  <field name="price">1.00</field>
  <field name="priceList1Price">0.99</field>
  <field name="priceList2Price">0.98</field>
  <field name="priceList1500Price">0.85</field>
</doc>
<doc>
  <field name="id">2</field>
  <field name="name">The product name.</field>
  <field name="price">1.10</field>
  <field name="priceList1Price">1.09</field>
  <field name="priceList2Price">1.08</field>
  <field name="priceList1500Price">1.05</field>
</doc>

Yes, the number of price lists could grow from 1000 to 5000 as the user base grows. There are currently about 150,000 products. We do need to index the products, since they change frequently.

Thanks everyone for all your responses so far!
-Original Message-
From: kenf_nc [mailto:ken.fos...@realestate.com]
Sent: Wednesday, April 13, 2011 1:15 PM
To: solr-user@lucene.apache.org
Subject: RE: Indexing Question for large dataset

Is NAME a product name? Why would it be multivalue? And why would it appear on more than one document? Is each 'document' a package of products? And the pricing tiers are on the package, not individual pieces?

So sounds like you could, potentially, have a PriceListX column for each user. As your User base grows, the number of columns you need may grow (you already bumped up from 2000 to 5000 in the space of a couple posts :) ). Is that right? How many products (or packages of products) do you have?

Could you flip this on its ear and make a User the document. Then it could have just 3 multivalue fields (beyond any you need to identify the user like user_id)

product_id
product_name
product_price

Downside is if a new product is introduced you have to re-index all users that have a price point on that product.

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-Question-for-large-dataset-tp2816344p2816994.html
Sent from the Solr - User mailing list archive at Nabble.com.

The recipient of this email should check this email and any attachments for the presence of viruses. The Wasserstrom Companies accepts no liability for any damage caused by any virus transmitted by this email. This footnote also confirms that this email message has been scanned for the presence of computer viruses. The Wasserstrom Companies
Indexing Question for large dataset
We have a B2C/B2B ecommerce application with a large number of price lists, currently 2000+ and growing. They want to index price to have facets and sorting. That seems like it would be a lot of columns to index; example below:

INDEX COLUMN: Name | Price | PriceList1Price | PriceList2Price | PriceList3Price | ... | PriceList500Price
INDEX DATA:   Test | 1.00  | .99             | .79             | .89             | ... | .85

It seems to me that indexing that amount of data is wrong, but of course I do not know. I was wondering if anyone has any suggestions as to what should be done, or is this not even possible?

Thank you very much,
Josh B.
Re: Indexing Question for large dataset
Indexing isn't a problem; it's just disk space, and space is cheap. But if you facet on all those price columns, that data gets put into RAM, which isn't as cheap or plentiful. Your cache buffers may get overloaded a lot and performance will suffer.

2000 price columns seems like a lot; could the documents be organized differently? It's hard to tell from your example.

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-Question-for-large-dataset-tp2816344p2816377.html
Sent from the Solr - User mailing list archive at Nabble.com.
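The RAM concern can be put into rough numbers. The following Python sketch assumes, per Karsten's description elsewhere in this thread, one int[searcher.maxDoc()] array per faceted field (4 bytes per document per field), and uses the figure of about 150,000 products that Joshua gives. The function name facet_cache_bytes is made up for illustration; the result is a lower bound only, since it ignores the storage for the term values themselves, and maxDoc also counts deleted documents that have not been merged away yet.

```python
BYTES_PER_INT = 4  # size of a Java int, as in int[searcher.maxDoc()]


def facet_cache_bytes(max_doc: int, faceted_fields: int) -> int:
    """Lower-bound estimate: one int per document for every faceted field."""
    return max_doc * faceted_fields * BYTES_PER_INT


if __name__ == "__main__":
    products = 150_000  # approximate product count quoted in the thread
    for fields in (1, 100, 1_000, 5_000):
        mib = facet_cache_bytes(products, fields) / 2**20
        print(f"{fields:>5} faceted fields: {mib:,.1f} MiB")
```

Under these assumptions a single price field costs about 600 KB, 100 fields in active use cost about 60 MB, and all 5,000 fields held at once would need roughly 3 GB. Only fields that have actually been faceted on occupy memory, which is why the number of distinct price lists queried in a given period matters more than the total number of columns.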
RE: Indexing Question for large dataset
I don't know of any other way to organize the documents. We need the specific price that belongs to the user, so I don't think the facets would be the issue. The facet query would be modified to use the price list field corresponding to that user. Let's say the customer belongs to priceList1500: I would use the price from that column (priceList1500) instead of priceList1 or even the price column. Let me post the example data in another way.

INDEX FIELD  | INDEX DATA
ID           | 1      (INDEXED | STORED)
NAME         | TEST   (INDEXED | STORED | MULTIVALUED)
PRICE        | 1.00   (INDEXED)
PRICELIST1   | 0.99   (INDEXED)
PRICELIST2   | 0.89   (INDEXED)
PRICELIST500 | 0.85   (INDEXED)

ID           | 2      (INDEXED | STORED)
NAME         | TEST2  (INDEXED | STORED | MULTIVALUED)
PRICE        | 1.10   (INDEXED)
PRICELIST1   | 1.09   (INDEXED)
PRICELIST250 | 1.05   (INDEXED)
PRICELIST600 | 1.03   (INDEXED)

The price lists correspond to customer contracts with the company for contracted item pricing. Is there a specific limit on the number of index columns Solr/Lucene can handle? Is there a better way of handling this? Do you see an issue with RAM from what I am stating here? Also, with the index so large, say 5000 columns per data set, will that degrade search performance dramatically (note that the search fields would of course not include all those columns)?

Example query: q=name&fl=NAME,ID&facet=true&facet.field=PRICELIST500

Thanks,
Josh B.

-Original Message-
From: kenf_nc [mailto:ken.fos...@realestate.com]
Sent: Wednesday, April 13, 2011 10:47 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing Question for large dataset

Indexing isn't a problem, it's just disk space and space is cheap. But, if you do facets on all those price columns, that gets put into RAM which isn't as cheap or plentiful. Your cache buffers may get overloaded a lot and performance will suffer. 2000 price columns seems like a lot, could the documents be organized differently? Hard to tell from your example.
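Josh's idea of pointing the facet at the price list field that belongs to the current customer could be sketched like this in Python. The function name price_facet_params and the idea of passing the price list number as an integer are assumptions made for illustration; the parameters and the PRICELISTn field naming are taken from the example query in his message.

```python
from urllib.parse import urlencode


def price_facet_params(query: str, price_list_id: int) -> str:
    """Build the query string, faceting on the customer's own price list field."""
    facet_field = f"PRICELIST{price_list_id}"  # field naming as in the example data
    params = [
        ("q", query),
        ("fl", "NAME,ID"),
        ("facet", "true"),
        ("facet.field", facet_field),
    ]
    # safe="," keeps the comma in the fl list readable instead of %2C
    return urlencode(params, safe=",")


if __name__ == "__main__":
    print(price_facet_params("name", 500))
```

With this approach each request touches exactly one price field, so the per-request cost does not depend on how many price list columns exist in total.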
RE: Indexing Question for large dataset
Is NAME a product name? Why would it be multivalue? And why would it appear on more than one document? Is each 'document' a package of products? And the pricing tiers are on the package, not individual pieces?

So sounds like you could, potentially, have a PriceListX column for each user. As your User base grows, the number of columns you need may grow (you already bumped up from 2000 to 5000 in the space of a couple posts :) ). Is that right? How many products (or packages of products) do you have?

Could you flip this on its ear and make a User the document. Then it could have just 3 multivalue fields (beyond any you need to identify the user like user_id)

product_id
product_name
product_price

Downside is if a new product is introduced you have to re-index all users that have a price point on that product.

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-Question-for-large-dataset-tp2816344p2816994.html
Sent from the Solr - User mailing list archive at Nabble.com.
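The "flip" suggested here can be illustrated as a plain data reshaping step in Python. This is only a sketch: the input record layout (keys "id", "name", "prices") and the function name flip_to_user_docs are hypothetical, and it treats a price list as the "user". It shows the shape of the resulting documents, not whether faceting or sorting on a per-user price works well in that layout.

```python
from collections import defaultdict


def flip_to_user_docs(products):
    """Turn product-centric records into one document per user (price list).

    Each output document carries three parallel multivalued fields:
    product_id, product_name and product_price.
    """
    by_user = defaultdict(
        lambda: {"product_id": [], "product_name": [], "product_price": []}
    )
    for product in products:
        for user_id, price in product["prices"].items():
            doc = by_user[user_id]
            doc["product_id"].append(product["id"])
            doc["product_name"].append(product["name"])
            doc["product_price"].append(price)
    return [{"user_id": uid, **fields} for uid, fields in sorted(by_user.items())]


# Sample input in a hypothetical layout, with values borrowed from the thread.
PRODUCTS = [
    {"id": 1, "name": "The product name.",
     "prices": {"priceList1": 0.99, "priceList1500": 0.85}},
    {"id": 2, "name": "The product name.",
     "prices": {"priceList1": 1.09, "priceList1500": 1.05}},
]

if __name__ == "__main__":
    for doc in flip_to_user_docs(PRODUCTS):
        print(doc)
```

The loop also makes the stated downside visible: a new product adds an entry to every user document that has a price for it, so all of those documents have to be re-indexed.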
RE: Indexing Question for large dataset
Name equals the product name. Each separate product can have 1 to n prices based upon price list. A single document represents that single product.

<doc>
  <field name="id">1</field>
  <field name="name">The product name.</field>
  <field name="price">1.00</field>
  <field name="priceList1Price">0.99</field>
  <field name="priceList2Price">0.98</field>
  <field name="priceList1500Price">0.85</field>
</doc>
<doc>
  <field name="id">2</field>
  <field name="name">The product name.</field>
  <field name="price">1.10</field>
  <field name="priceList1Price">1.09</field>
  <field name="priceList2Price">1.08</field>
  <field name="priceList1500Price">1.05</field>
</doc>

Yes, the number of price lists could grow from 1000 to 5000 as the user base grows. There are currently about 150,000 products. We do need to index the products, since they change frequently.

Thanks everyone for all your responses so far!
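Documents of this shape, with a varying set of price list fields per product, could be generated rather than written by hand. The Python sketch below uses only the standard library; the helper names product_doc and to_add_xml are invented for illustration, the field names follow the example above, and the enclosing add element is the wrapper used by Solr's XML update format. Prices are passed as strings so that values such as 1.00 keep their trailing zeros.

```python
import xml.etree.ElementTree as ET


def product_doc(product_id, name, price, list_prices):
    """Build one <doc> element; list_prices maps price list number -> price."""
    doc = ET.Element("doc")

    def add_field(field_name, value):
        field = ET.SubElement(doc, "field", name=field_name)
        field.text = str(value)

    add_field("id", product_id)
    add_field("name", name)
    add_field("price", price)
    # Only the price lists that actually apply to this product become fields.
    for list_id, list_price in sorted(list_prices.items()):
        add_field(f"priceList{list_id}Price", list_price)
    return doc


def to_add_xml(docs):
    """Wrap <doc> elements in an <add> element and serialize to a string."""
    root = ET.Element("add")
    root.extend(docs)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    docs = [
        product_doc(1, "The product name.", "1.00",
                    {1: "0.99", 2: "0.98", 1500: "0.85"}),
        product_doc(2, "The product name.", "1.10",
                    {1: "1.09", 2: "1.08", 1500: "1.05"}),
    ]
    print(to_add_xml(docs))
```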