Faceting and first letter of fields

2010-10-14 Thread Alexandre Rocco
Guys,

We have a website running Solr indexing books, and we use a facet to filter
books by author.
After some time, we found that this facet had grown very large, and we need to
create another feature to help users find the information.

Our product team asked us to create a page that lists all authors by their
initial letter, so we can split this query up more easily.
Is it a feasible solution to create another field containing only the
initial letter for the authors? Using this approach we will be able to
filter the authors using this newly created field.
Do you think there will be any performance penalty for creating a couple of
fields with the initial letter of these other fields (author, publisher)?

I guess that this approach is way easier than other solutions we came up
with.
Am I missing other alternatives?

Thanks,
Alexandre


Re: Faceting and first letter of fields

2010-10-14 Thread Jonathan Rochkind
I believe that should work fine in Solr 1.4.1. Creating a field with
just the first letter of the author is definitely the right (possibly only)
way to allow faceting on the first letter of the author's name.

I have very voluminous facets (few facet values, many docs in each
value) like that in my app too, and they work fine.

I get confused over the different faceting methods available in 1.4.1,
and exactly when each is called for. If you see initial problems, you
could try switching the facet.method and see what happens. You could
also try warming the caches with the results of faceting on this field;
whether that matters (and how much RAM it takes) may depend on
the faceting method chosen.

With some tweaking of those things, I think your strategy
should work fine.


Jonathan


Re: Faceting and first letter of fields

2010-10-14 Thread Yonik Seeley
On Thu, Oct 14, 2010 at 3:42 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
> I believe that should work fine in Solr 1.4.1.  Creating a field with just
> first letter of author is definitely the right (possibly only) way to allow
> faceting on first letter of author's name.
>
> I have very voluminous facets (few facet values, many docs in each value)
> like that in my app too, works fine.
>
> I get confused over the different faceting methods available in 1.4.1, and
> exactly when each is called for. If you see initial problems, you could try
> switching the facet.method and see what happens.

Right - for faceting on first letter, you should probably use facet.method=enum
since there will only be 26 values (assuming English/Western languages).

In the future, I'm hoping we can come up with a smarter way to pick
the facet.method if it's not supplied.  The new flex API in 4.0-dev
should help out here.

-Yonik
http://www.lucidimagination.com
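
As a rough illustration of the request described above, here is a minimal
Python sketch that builds query strings for faceting on a first-letter field
with facet.method=enum. The base URL is an assumption, and the field names
first_letter_author_name and author_name are borrowed from the example names
used later in this thread; adjust both to your own setup. The facet.*, fq and
rows parameters are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumed values for illustration only; adjust to your own host and schema.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"
LETTER_FIELD = "first_letter_author_name"
AUTHOR_FIELD = "author_name"


def letter_facet_url(query="*:*"):
    """URL that returns document counts per initial letter (no result rows)."""
    params = [
        ("q", query),
        ("rows", 0),  # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", LETTER_FIELD),
        # Per-field override: use enum only for the small first-letter field.
        ("f.%s.facet.method" % LETTER_FIELD, "enum"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


def authors_for_letter_url(letter, query="*:*"):
    """URL that facets on author, restricted to books under one letter."""
    params = [
        ("q", query),
        ("rows", 0),
        ("fq", "%s:%s" % (LETTER_FIELD, letter)),  # filter by the chosen letter
        ("facet", "true"),
        ("facet.field", AUTHOR_FIELD),
        ("facet.mincount", 1),  # hide authors with no matching books
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(letter_facet_url())
    print(authors_for_letter_url("A"))
```

One caveat with the second request: if the author field is multi-valued, a
book filed under "A" can also bring along co-authors whose names start with
other letters.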


Re: Faceting and first letter of fields

2010-10-14 Thread Alexandre Rocco
Thank you for both responses.

Another question I have is where the extraction of this first letter
should happen.
I am considering updating my data import handler to execute a script that
extracts the first letter from the author field.

I saw another thread where someone mentioned using a field analyzer to extract
the letter with a regex.
Which one is the best option?

Thanks!
Alexandre
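
For the script option, a minimal sketch of what the extraction rule might look
like, written in plain Python rather than as an actual data import handler
transformer. The function name, the "#" bucket for anything outside A-Z, and
the choice to fold accents are all assumptions made for illustration; nothing
here is Solr API.

```python
import string
import unicodedata

ASCII_UPPER = set(string.ascii_uppercase)


def first_letter(name, fallback="#"):
    """Return an uppercase A-Z initial for a name, or `fallback` if none."""
    # Decompose accented characters so "É" becomes "E" plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name or "")
    for ch in decomposed:
        if not ch.isalnum():
            continue  # skip leading spaces, quotes and other punctuation
        upper = ch.upper()
        # Only A-Z get their own bucket; digits and other scripts share one.
        return upper if upper in ASCII_UPPER else fallback
    return fallback
```

With these rules, "Émile Zola" and "emile zola" both land under "E", while
"3M Company" and an empty name land under "#". Whatever rule is chosen, it has
to be applied identically on every path that feeds the index.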


Re: Faceting and first letter of fields

2010-10-14 Thread Jonathan Rochkind
Thanks Yonik.  I hadn't actually been using enum on facets with a 
small number of unique values; the wiki page doesn't give much guidance 
on when each is called for.  Do you have any rule of thumb for how few 
unique values is few enough to want to use method=enum?  Does it 
matter if the field is single or multi-valued?


Can you in general explain a bit more about the facet methods?  The wiki 
leaves me a bit confused. (I will happily turn anything you say into 
additional language on the wiki, if the wiki lets me edit it, including 
any answer to above).


So first, the wiki suggests: "The default value is fc (except for
BoolField) since it tends to use less memory and is faster when a field
has many unique terms in the index."

Is this actually true? If facet.method is left unspecified, will
facet.method=enum be used on any BoolField, and facet.method=fc
on any other type of field?


Both methods can work on multi-valued fields in 1.4, right?

Then it's a bit extra confusing because while facet.method=enum always
works the same way, it seems like facet.method=fc chooses different
algorithms under the hood (depending on single vs multi-valued field,
possibly also depending on number of unique terms?), which you can't
actually specify; you just need to let the fc code choose on its own.
That's not mentioned in the wiki, but I think I've seen it mentioned
before, and trying to look at the source gives me this impression too.
Is this correct?

Thanks for any additional guidance, this stuff continues to confuse me 
despite trying to read what I can on it.


Jonathan


Re: Faceting and first letter of fields

2010-10-14 Thread Markus Jelsma
Here's a very recent thread on the matter:
http://lucene.472066.n3.nabble.com/facet-method-enum-vs-fc-td1681277.html


Re: Faceting and first letter of fields

2010-10-14 Thread Jonathan Rochkind

Markus Jelsma wrote:
> Here's a very recent thread on the matter:
> http://lucene.472066.n3.nabble.com/facet-method-enum-vs-fc-td1681277.html


Thanks, that's helpful, but still leaves me with questions.

Yonik suggests with only ~25 unique facet values, method=enum is 
probably the way to go.


What about 100? 200?  It probably depends on number of documents too: 
I've got about 3 million.


I know I can just try it and see, but since the penalty for picking
wrong is using far too much memory, rather than slower performance, it is
very hard for me, with my limited JVM knowledge, to know whether I've picked
wrong. The only sign I know of that I did it wrong is an
OutOfMemory error. But maybe I don't get one right away, and instead get one a
couple of weeks later, perhaps under a different usage pattern. Was it
caused by the facet.method=enum? Or by something else I changed in
the interim? Or by something that was always there but which the
different usage pattern triggered? It's confusing, you know?


That thread Markus references says:

"The enum method creates a bitset for #each# unique facet value. The bit
set is (maxdocs / 8) bytes in size (I'm ignoring some overhead here)."

Is that maxdocs the number of docs in your index, or the number of docs
that are assigned to a given unique facet value (and in the current
result set, or in the index as a whole)? It makes a pretty big difference
in overall memory use if you've got, say, 3 million docs, 100 unique
facet values, and the documents are relatively evenly distributed among
them. I _think_ from the math that follows, Erick is saying maxdocs
in that simple equation is the number of documents assigned to a given
unique facet value, in the index as a whole. But that would seem to mean
that the amount of memory taken up would be solely a function of the number
of documents in your index, not in fact of the number of unique facet
values. And that doesn't seem to square with the other advice we
get on the subject.


So... I am confused.
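
To make the arithmetic concrete, here is a small Python sketch of the quoted
formula under one reading: it assumes maxdocs is the total number of documents
in the index (one bit per document, which is how a per-document bitset is
normally sized) and that every unique facet value is held as a full bitset.
Both are assumptions, so treat the result as a rough upper bound that ignores
overhead and any more compact representation Solr may use for small sets. The
function names are made up for this example.

```python
def bitset_bytes(max_docs):
    """Size of one bitset with one bit per document, rounded up to bytes."""
    return (max_docs + 7) // 8


def enum_worst_case_bytes(max_docs, unique_values):
    """Upper bound if every unique facet value is held as a full bitset."""
    return bitset_bytes(max_docs) * unique_values


if __name__ == "__main__":
    docs = 3_000_000  # index size mentioned above
    for values in (26, 100, 200):
        total = enum_worst_case_bytes(docs, values)
        print(f"{values:>3} values: {total / 1_000_000:.2f} MB")
```

Under this reading each bitset for a 3 million document index is 375,000
bytes, so 26, 100 and 200 unique values come to roughly 9.75 MB, 37.5 MB and
75 MB. Memory then grows with both the index size and the number of unique
values, which would be consistent with the advice to keep enum for fields with
few values.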


Re: Faceting and first letter of fields

2010-10-14 Thread Chris Hostetter

: Another question I have is where the processing of this first letter is
: more adequate.
: I am considering updating my data import handler to execute a script to
: extract the first letter from the author field.
: 
: I saw other thread when someone mentioned using a field analyser to extract
: the letter using a regex.
: Which one is the best option?

best is subjective.

conceptually, inherent rules/concepts of your data (ie: what fields it
has, what types those fields have, etc...) should live in your schema.xml,
while things specific to where your data comes from should live in other
configs (ie: your DIH config, update processors, etc...)

so for something like a first_letter_author_name field that should (by
definition) always be the same as the first letter of the author_name
field, it should be specified in your schema.xml (two ways i can think of:
copyField w/maxChars, or an EdgeNGramTokenizer) .. that way no matter how
a document gets in your index (DIH, XML Push, CSV Push, etc...) you can be
certain the fields will be internally consistent.

Practically speaking: there are a lot of inherent rules that can't be
expressed in the schema.xml, or may be confusing to people if they are
expressed there while other more complex rules are expressed elsewhere --
so go with whatever makes the most sense to you, and is the easiest for
you to maintain.


-Hoss