Taxonomy in SOLR
Hi, I'm trying Solr and I have a question. In the schema I set up, there are 10 fields that always contain the same data (hierarchical taxonomies), but with 4 million documents the disk space and indexing time must be large. I need these fields for autocomplete. Is there another way to do this kind of operation? Damien
Re: Taxonomy in SOLR
Hi Damien, can you provide a schema sample plus example data? Since your description is very general, I don't think anyone can give you situation-specific advice. Regards -- View this message in context: http://lucene.472066.n3.nabble.com/Taxonomy-in-SOLR-tp2317955p2318200.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Taxonomy in SOLR
My schema:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<!-- Document -->
<field name="lead" type="string" indexed="true" stored="true"/>
<field name="title" type="string" indexed="true" stored="true" required="true"/>
<field name="text" type="string" indexed="true" stored="true" required="true"/>
<!-- taxo -->
<dynamicField name="*_taxon_label" type="string" indexed="true" stored="true"/>
<dynamicField name="*_taxon_type" type="string" indexed="true" stored="true"/>
<dynamicField name="*_taxon_hierarchy" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="type" type="string" indexed="true" stored="true" required="true"/>

On 24/01/2011 09:56, Em wrote:
> Hi Damien, can you provide a schema sample plus example data? [...]
Re: Taxonomy in SOLR
Hi Damien, why are you storing the taxonomies? When it comes to faceting, only the indexed values matter. If there is a meaningful difference between the indexed and the stored value, I would prefer to use an RDBMS or something like that to reduce redundancy. Does this help? Regards
Re: Taxonomy in SOLR
Yes, I am not obliged to store the taxonomies. My taxonomies look like:

english_taxon_label = Berlin
english_taxon_type = location
english_taxon_hierarchy = 0/world 1/world/europe 2/world/europe/germany 3/world/europe/germany/berlin

I need *_taxon_hierarchy for faceting and the label for autocomplete. With an RDBMS I have at most 100 entries per taxonomy, but with Solr and 4 million documents the redundancy is huge, no? And I have 10 different taxonomies per document. Damien

On 24/01/2011 10:30, Em wrote:
> Hi Damien, why are you storing the taxonomies? When it comes to faceting, only the indexed values matter. [...]
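As an aside, the level-prefixed paths in *_taxon_hierarchy (0/world, 1/world/europe, ...) lend themselves to facet drill-down with Solr's facet.prefix parameter. A sketch of the request parameters, using the field names from Damien's example (the query values are illustrative, not from the thread):

```
# Top level of the hierarchy:
q=*:*&rows=0&facet=true
     &facet.field=english_taxon_hierarchy
     &facet.prefix=0/

# Drill down to the children of world/europe:
q=*:*&rows=0&facet=true
     &facet.field=english_taxon_hierarchy
     &facet.prefix=2/world/europe
```

Each click takes the selected facet value, bumps the level number, and uses the result as the next facet.prefix.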
Re: Taxonomy in SOLR
100 entries per taxon? Well, with Solr you get 100 taxon entries * 4 million docs * 10 taxons. If your indexed taxon versions look okay, you could leave out the DB overhead and do everything in Solr.
Re: Taxonomy in SOLR
Thanks Em. How can I calculate the indexing time, update time and disk space used by one taxonomy?

On 24/01/2011 10:58, Em wrote:
> 100 entries per taxon? Well, with Solr you get 100 taxon entries * 4 million docs * 10 taxons. [...]
Re: Taxonomy in SOLR
Hi Damien, ahm, the formula I wrote was no definitive guide, just some numbers I combined to visualize the amount of data - perhaps not even a complete formula. Well, when you keep your taxonomy indexed-only, you do not double the used disk space when you index two equal documents. Lucene - and also Solr - work with an inverted index: every document is mapped against its indexed terms. So your index size will depend on the number of unique taxonomy terms and the pointers from the documents to those terms. That's it. Usually the disk space used by an index is much smaller than the size of the original data. I hope what I tried to explain was easy to understand. Regards
Re: Taxonomy in SOLR
On 24/01/2011 13:10, Em wrote:
> Well, when you keep your taxonomy indexed-only, you do not double the used disk space when you index two equal documents.

So five documents, or 4 million, with the same taxonomy use the same disk space as one?

> Lucene - and also Solr - work with an inverted index: every document is mapped against its indexed terms. So your index size will depend on the number of unique taxonomy terms and the pointers from the documents to those terms. Usually the disk space used by an index is much smaller than the size of the original data.

Thanks, it's very helpful! Where can I find more explanation of the internal structure of the Lucene index? Damien
Re: Taxonomy in SOLR
Just for illustration, this is your original data:

doc1: hello world
doc2: hello damien
doc3: hello pal

Now, Lucene produces something like this from the input:

hello: id_doc1, id_doc2, id_doc3
world: id_doc1
damien: id_doc2
pal: id_doc3

Well, it's more complex than that, but it's enough for illustration. As you can see, the representation of a document is completely different: a document costs only a few bytes for a Lucene-internal id per word. If a word occurs more than once per document AND you do not store term vectors, Lucene just adds the number of occurrences per word per doc to its index:

hello: id_doc1[1], id_doc2[1], id_doc3[1]
world: id_doc1[1]
damien: id_doc2[1]
pal: id_doc3[1]

Imagine what happens with longer texts where stopwords, or important words, occur more than once. I would suggest starting with the Lucene wiki if you want to learn more about Lucene. Regards, Em
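Em's illustration can be sketched in a few lines of Python. This is a toy model only - Lucene's actual postings format is far more compact and complex - but it shows the key point: each unique term is stored once, with per-document ids and counts hanging off it.

```python
# A toy inverted index mirroring the illustration above (not Lucene's real format).
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: term_frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.split():
            # Same term in many docs costs one dictionary entry plus small postings.
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = {
    "doc1": "hello world",
    "doc2": "hello damien",
    "doc3": "hello pal",
}
index = build_inverted_index(docs)
# "hello" points to all three documents, but the string "hello" is stored once.
print(index["hello"])  # {'doc1': 1, 'doc2': 1, 'doc3': 1}
```

The same effect is what keeps Damien's repeated taxonomy values cheap: 4 million documents sharing one hierarchy string add postings, not copies of the string.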
Re: Taxonomy in SOLR
First, the redundancy is certainly there, but that's what Solr does: it handles large amounts of data. 4 million documents is actually a pretty small corpus by Solr standards, so you may well be able to do exactly what you propose with acceptable performance/size. I'd advise just trying it with, say, 200,000 docs. Why 200K? Because index growth is non-linear, with the first bunch of documents taking up more space than the second. So index 100K, examine your index, then index 100K more. Now use the delta to extrapolate to 4M. You don't need to store the taxonomy in each doc for auto-complete; you can get your auto-completion from a different index. Or you can index your taxonomies in a special document in Solr and query the (unique) field in that document for autocomplete. For faceting, you do need the taxonomies. But remember that the nature of the inverted index is that unique terms are only stored once, and the document ID for each document that the term appears in is recorded. So if you have 3/europe/germany/berlin in 1M documents, your index space is really string length + overhead + space for 1M ids. Best Erick

On Mon, Jan 24, 2011 at 4:53 AM, Damien Fontaine <dfonta...@rosebud.fr> wrote:
> Yes, I am not obliged to store the taxonomies. My taxonomies look like: english_taxon_label = Berlin [...]
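Erick's sizing experiment reduces to simple arithmetic: measure after 100K docs, measure after 200K, and extrapolate using the slope of the second batch (the first batch carries one-off overhead, so its slope is too steep). A sketch with made-up placeholder numbers; substitute the sizes you actually measure:

```python
# Extrapolate total index size from two measurements, per Erick's advice.
def extrapolate_index_size(size_100k_mb, size_200k_mb, target_docs):
    """Estimate index size (MB) for target_docs documents."""
    # Marginal cost per document, taken from the second 100K batch only.
    per_doc_mb = (size_200k_mb - size_100k_mb) / 100_000
    return size_200k_mb + per_doc_mb * (target_docs - 200_000)

# Hypothetical measurements: 900 MB after 100K docs, 1500 MB after 200K.
estimate = extrapolate_index_size(900, 1500, 4_000_000)
print(f"{estimate:.0f} MB")  # 24300 MB
```

The estimate is deliberately conservative: real growth tends to flatten further as the term dictionary saturates, so the measured 4M index will often come in below this line.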
Re: Taxonomy in SOLR
Thanks Em and Erick for your answers. Now I better understand how Solr works. Damien

On 24/01/2011 16:23, Erick Erickson wrote:
> First, the redundancy is certainly there, but that's what Solr does: it handles large amounts of data. [...]
Re: Taxonomy in SOLR
Hi Erick, in some use cases I really think your suggestion of special documents for meta-information is a good approach to solve some issues. However, there is a hurdle for me and maybe you can help me clear it: what is the best way to get such meta-data? I see three possible approaches:

1st: get it in another request
2nd: get it with a RequestHandler
3rd: get it with a SearchComponent

I think the 2nd and 3rd are the cleanest ways, but in deciding between them I run into two problems. RequestHandler: should I extend the StandardRequestHandler to do what I need? If so, I could just query my index for the needed information and add it to the request before passing it on to the SearchComponents. SearchComponent: the problem with the SearchComponent is the distributed case and how to test it. However, if that is the cleanest way to go, one should go that way. What would you do if you wanted to add some meta-information to your request that was not given by the user? Regards, Em

Erick Erickson wrote:
> First, the redundancy is certainly there, but that's what Solr does: it handles large amounts of data. [...]
Re: Taxonomy in SOLR
I wasn't thinking about this for adding information to the *request*. Rather, in this case the autocomplete uses an Ajax call that just uses the TermsComponent to get the autocomplete data and display it. This is just textual, so adding it to the request is client-side magic. If you want your app to have access to the meta-data for other purposes, you'd just query and cache it from the app. You could use that to build up the links you embed in the page for new queries if you chose; no custom handlers necessary. Otherwise, I guess you'd create a custom request handler; that seems like a reasonable place. Best Erick

On Mon, Jan 24, 2011 at 11:03 AM, Em <mailformailingli...@yahoo.de> wrote:
> Hi Erick, in some use cases I really think your suggestion of special documents for meta-information is a good approach. [...]
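The TermsComponent request Erick describes might look like the following, assuming a /terms request handler is wired up to the TermsComponent in solrconfig.xml. terms.fl, terms.prefix and terms.limit are standard TermsComponent parameters; the field name and prefix value are illustrative, taken from Damien's example:

```
/solr/terms?terms=true
           &terms.fl=english_taxon_label
           &terms.prefix=Ber
           &terms.limit=10
```

This returns indexed terms starting with "Ber" (e.g. Berlin) from the label field, which the Ajax autocomplete can render directly. Note that prefix matching on a string field is case-sensitive.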
Re: Taxonomy in SOLR
Thank you for the advice, Erick! I will take a look at extending the StandardRequestHandler for such use cases.

Erick Erickson wrote:
> I wasn't thinking about this for adding information to the *request*. Rather, in this case the autocomplete uses an Ajax call that just uses the TermsComponent to get the autocomplete data and display it. [...]
Re: Taxonomy in SOLR
There aren't any great general-purpose, out-of-the-box ways to handle hierarchical data in Solr. Solr isn't an RDBMS. There may be some particular advice on how to set up a particular Solr index to answer particular questions with regard to hierarchical data. I saw a great point made recently comparing RDBMSes to NoSQL stores, which applies to Solr too, even though Solr is NOT a NoSQL store. In an RDBMS, you set up your schema thinking only about your _data_, modelling your data as flexibly as possible. Once you've done that, you can ask pretty much any well-specified question you want of your data and get a correct and reasonably performant answer. In Solr, on the other hand, we set up our schemas to answer particular questions. You have to first figure out what kinds of questions you will want to ask Solr, what kinds of queries you'll want to make, and then you can figure out how to structure your data to ask those questions. Some questions are actually very hard to set Solr up to answer; in general, Solr is about arranging your data so that whatever question you have can be reduced to asking "is token X in field Y". This can be especially tricky when you want to use a single Solr index to answer multiple questions, where the questions are such that you really need to set up your data _differently_ to get Solr to optimally answer each one. Solr is not a general-purpose store like an RDBMS, where you set up your schema once in terms of your data and can answer nearly any conceivable well-specified question after that. Instead, Solr does things an RDBMS can't do quickly or can't do at all. But you lose some things too.

On 1/24/2011 3:03 AM, Damien Fontaine wrote:
> Hi, I'm trying Solr and I have a question. In the schema I set up, there are 10 fields that always contain the same data (hierarchical taxonomies), but with 4 million documents the disk space and indexing time must be large. I need these fields for autocomplete. Is there another way to do this kind of operation? Damien