Faceting and first letter of fields
Guys, We have a website running Solr that indexes books, and we use a facet to filter books by author. After some time, we found that this facet has grown very large, and we need another feature to help people find the information. Our product team asked for a page that lists all authors by their initial letter, so we can split this query up more easily. Is it feasible to create another field containing only the initial letter of the author? With this approach we would be able to filter the authors using the newly created field. Do you think there will be any performance penalty from creating a couple of fields with the initial letter of these other fields (author, publisher)? I think this approach is much easier than the other solutions we came up with. Am I missing other alternatives? Thanks, Alexandre
Re: Faceting and first letter of fields
I believe that should work fine in Solr 1.4.1. Creating a field with just the first letter of the author is definitely the right (possibly only) way to allow faceting on the first letter of the author's name. I have very voluminous facets (few facet values, many docs in each value) like that in my app too, and they work fine. I get confused over the different faceting methods available in 1.4.1, and exactly when each is called for. If you see initial problems, you could try switching the facet.method and see what happens. You could also try warming the caches with the results of faceting on this field; whether that matters (and how much RAM it takes) may depend on the faceting method chosen. But, possibly with some tweaking of those things, I think your strategy should work fine. Jonathan
Re: Faceting and first letter of fields
On Thu, Oct 14, 2010 at 3:42 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
> I get confused over the different facetting methods available in 1.4.1, and exactly when each is called for. If you see initial problems, you could try switching the facet.method and see what happens.
Right - for faceting on first letter, you should probably use facet.method=enum, since there will only be 26 values (assuming English/Western languages). In the future, I'm hoping we can come up with a smarter way to pick the facet.method if it's not supplied. The new flex API in 4.0-dev should help out here.
-Yonik http://www.lucidimagination.com
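As a rough illustration of the suggestion above, the request parameters for such a facet might be assembled as follows. The base URL and the field name author_first_letter are assumptions made for this sketch and do not come from the thread. The per-field form f.<field>.facet.method is used so that other facets in the same request keep their own method.

```python
from urllib.parse import urlencode

# Hypothetical names: neither the URL nor the field name comes from the thread.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"
FIRST_LETTER_FIELD = "author_first_letter"


def first_letter_facet_params(field=FIRST_LETTER_FIELD):
    """Build parameters that facet on a low-cardinality field using enum."""
    return [
        ("q", "*:*"),
        ("rows", "0"),  # only the facet counts are needed, not the documents
        ("facet", "true"),
        ("facet.field", field),
        # Per-field override, so other facets keep their own method.
        ("f.%s.facet.method" % field, "enum"),
    ]


def first_letter_facet_url(base_url=SOLR_SELECT_URL, field=FIRST_LETTER_FIELD):
    """Return the full request URL with URL-encoded parameters."""
    return base_url + "?" + urlencode(first_letter_facet_params(field))


if __name__ == "__main__":
    print(first_letter_facet_url())
```

Setting facet.method=enum globally instead would apply it to every facet field in the request, which may not be what you want for the large author facet itself.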
Re: Faceting and first letter of fields
Thank you both for the responses. Another question I have is where the extraction of this first letter is best done. I am considering updating my data import handler to run a script that extracts the first letter from the author field. I saw another thread where someone mentioned using a field analyzer to extract the letter with a regex. Which one is the best option? Thanks! Alexandre
On Thu, Oct 14, 2010 at 4:46 PM, Yonik Seeley yo...@lucidimagination.com wrote:
> Right - for faceting on first letter, you should probably use facet.method=enum since there will only be 26 values (assuming english/western languages).
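Whichever place ends up doing the extraction, the logic itself is small. The sketch below shows one way it could look if done in the indexing client before the document is sent to Solr; it is an illustration only, not the data import handler's script API. The "#" bucket for names that do not start with A-Z, and the decision to fold accented letters onto their base letter, are assumptions of this example.

```python
import unicodedata

OTHER_BUCKET = "#"  # hypothetical bucket for names that do not start with A-Z


def first_letter(author_name):
    """Return an upper-case A-Z bucket for a name, or OTHER_BUCKET."""
    # Decompose accented characters so that "É" becomes "E" plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", author_name.strip())
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # skip stray combining marks at the start
        upper = ch.upper()
        # Some characters upper-case to more than one character; treat those as "other".
        if len(upper) == 1 and "A" <= upper <= "Z":
            return upper
        return OTHER_BUCKET
    return OTHER_BUCKET  # empty or whitespace-only name


if __name__ == "__main__":
    for name in ["Émile Zola", "  tolkien", "1984", ""]:
        print(repr(name), "->", first_letter(name))
```

Folding everything outside A-Z into a single bucket keeps the number of distinct facet values small, which is what makes the low-cardinality facet cheap.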
Re: Faceting and first letter of fields
Thanks Yonik. I hadn't actually been using enum on facets with a small number of unique values; the wiki page doesn't give much guidance on when each is called for. Do you have any rule of thumb for how few unique values is few enough to want to use method=enum? Does it matter if the field is single or multi-valued?

Can you in general explain a bit more about the facet methods? The wiki leaves me a bit confused. (I will happily turn anything you say into additional language on the wiki, if the wiki lets me edit it, including any answer to the above.)

So first, the wiki suggests: "The default value is fc (except for BoolField) since it tends to use less memory and is faster when a field has many unique terms in the index." Is this actually true? If facet.method is left unspecified, facet.method=enum will be used on any BoolField, and facet.method=fc will be used on any other type of field? Both methods can work on multi-valued fields in 1.4, right?

Then it's a bit extra confusing because while facet.method=enum always works the same way, it seems like facet.method=fc chooses different algorithms under the hood (depending on single vs multi-valued field, possibly also depending on number of unique terms?), which you can't actually specify; you just need to let the fc stuff choose on its own. That's not mentioned in the wiki, but I think I've seen it mentioned before, and trying to look at the source gives me this impression too. Is this correct?

Thanks for any additional guidance, this stuff continues to confuse me despite trying to read what I can on it. Jonathan
Re: Faceting and first letter of fields
Here's a very recent thread on the matter: http://lucene.472066.n3.nabble.com/facet-method-enum-vs-fc-td1681277.html
Re: Faceting and first letter of fields
Markus Jelsma wrote:
> Here's a very recent thread on the matter: http://lucene.472066.n3.nabble.com/facet-method-enum-vs-fc-td1681277.html
Thanks, that's helpful, but still leaves me with questions. Yonik suggests with only ~25 unique facet values, method=enum is probably the way to go. What about 100? 200? It probably depends on number of documents too: I've got about 3 million.

I know I can just try it and see, but since the penalty for picking wrong is using a lot more memory, rather than worse performance, it is very hard for me, with my limited JVM knowledge, to know if I've picked wrong or not. The only thing I know of that tells me I did it wrong is if I get an OutOfMemory. But maybe I don't get one right away, but get one a couple of weeks later, perhaps under a different usage pattern. Was it caused by the facet.method=enum? Or something else I changed in the interim? Or something else that was always there but which the different usage pattern triggered? It's confusing, you know?

That thread Markus references says: "The enum method creates a bitset for #each# unique facet value. The bit set is (maxdocs / 8) bytes in size (I'm ignoring some overhead here)." Is that maxdocs the number of docs in your index, or the number of docs that are assigned to a given unique facet value? (And in the current result set, or in the index as a whole?) It makes a pretty big difference in overall memory use if you've got, say, 3 million docs, 100 unique facet values, and the documents are relatively evenly distributed among them.

I _think_ from the math that follows, Erick is saying maxdocs in that simple equation is the number of documents assigned to a given unique facet value, in the index as a whole. But that would seem to mean that the amount of memory taken up would be solely a function of the number of documents in your index, not of the number of unique facet values. And that doesn't seem to square with the other advice we get on the subject. So... I am confused.
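To make the two readings of "maxdocs / 8" concrete, here is the arithmetic for the numbers mentioned above (3 million docs, 100 unique values, evenly distributed, single-valued field). This is only a sketch of the simple formula quoted from the other thread; it ignores object overhead and any more compact representation Solr may use for small sets, and the function names are made up for the example. In Lucene, maxDoc normally refers to the index as a whole, which would point to the first reading, but treat that as something to verify rather than as settled by this calculation.

```python
def bitset_bytes(num_bits):
    """Bytes needed for one bit per document, ignoring object overhead."""
    return (num_bits + 7) // 8


def enum_memory_if_index_sized(index_docs, unique_values):
    """Reading 1: every unique value gets a bitset covering the whole index."""
    return unique_values * bitset_bytes(index_docs)


def enum_memory_if_value_sized(docs_per_value):
    """Reading 2: each bitset only covers the documents holding that value."""
    return sum(bitset_bytes(n) for n in docs_per_value)


if __name__ == "__main__":
    index_docs = 3_000_000
    unique_values = 100
    even_split = [index_docs // unique_values] * unique_values

    # Reading 1 grows with the number of unique values.
    print(enum_memory_if_index_sized(index_docs, unique_values))
    # Reading 2 stays near index_docs / 8 no matter how many values there are.
    print(enum_memory_if_value_sized(even_split))
```

Under the first reading the total is 100 bitsets of 375,000 bytes each, about 37.5 MB; under the second it is 100 bitsets of 3,750 bytes each, about 375 KB in total, which is the "solely a function of the number of documents" result described above.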
Re: Faceting and first letter of fields
: Another question I have is where the processing of this first letter is
: more adequate.
: I am considering updating my data import handler to execute a script to
: extract the first letter from the author field.
:
: I saw other thread when someone mentioned using a field analyser to extract
: the letter using a regex.
: Which one is the best option?

"Best" is subjective.

Conceptually, inherent rules/concepts of your data (i.e. what fields it has, what types those fields have, etc.) should live in your schema.xml, while things specific to where your data comes from should live in other configs (i.e. your DIH config, update processors, etc.).

So for something like a first_letter_author_name field that should (by definition) always be the same as the first letter of the author_name field, it should be specified in your schema.xml (two ways I can think of: copyField w/maxChars, or an EdgeNGramTokenizer). That way, no matter how a document gets into your index (DIH, XML push, CSV push, etc.), you can be certain the fields will be internally consistent.

Practically speaking: there are a lot of inherent rules that can't be expressed in the schema.xml, or that may be confusing to people if they are expressed there while other, more complex rules are expressed elsewhere -- so go with whatever makes the most sense to you and is the easiest for you to maintain.

-Hoss