RE: Solr and Terracotta

2006-12-08 Thread Fuad Efendi
Cool...
I don't believe in any "real-time" for the enterprise. Everything is measured
by response time, which is very good at 0.5-3 seconds, and
acceptable up to 10 seconds...
BEA offers a 'real-time' WebLogic; JRockit uses 'deterministic garbage
collection'.
Indeed, we have an 'asynchronous' index for our data via Solr.
Why real-time??? Why not 'transactional' ;)
Thanks

-Original Message-
From: Otis Gospodnetic
Sent: Thursday, December 07, 2006 1:58 PM
To: solr-user@lucene.apache.org
Subject: Solr and Terracotta


Now that Terracotta JVM clustering has been open-sourced (and works with
Lucene's RAMDirectory), who is going to be the first to write something to
support HA Solr? :)

http://www.terracotta.org/
http://orionl.blogspot.com/

If I understand the significance of this, it means more or less real-time
Solr replication would be possible, eliminating the need for the
Master/Slave setup, snapshots, pull scripts, etc. Ja?

Otis







Re: Result: numFound inaccuracies

2006-12-08 Thread Yonik Seeley

On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:

Hello, me again.

I have been running some extensive tests of my search engine and have
been seeing inaccuracies with the "numFound" attribute.  It tends to
return 1 more than what is actually shown in the XML.

Is this a bug, or could I be doing something wrong?

I have a specific example in front of me at the moment where my query
found 2 records, yet I get: "


start is 0 based :-)

-Yonik
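Yonik's point, illustrated: `start` is a 0-based offset into the full result set, while `numFound` always reports the total number of matches, regardless of the window returned. A minimal sketch (plain Python simulating the response semantics; the document list is made up):

```python
# Simulate Solr paging semantics: numFound counts all matches,
# while start (0-based) and rows select the window actually returned.
def page(hits, start, rows):
    """Return (numFound, docs) the way a Solr response would."""
    return len(hits), hits[start:start + rows]

hits = ["doc1", "doc2"]                      # the query matched 2 records
num_found, docs = page(hits, start=0, rows=10)
assert num_found == 2 and docs == ["doc1", "doc2"]  # start=0 is the FIRST doc
num_found, docs = page(hits, start=1, rows=10)
assert docs == ["doc2"]                      # start=1 skips one document
```

So with start=1, numFound still says 2 but only one document appears in the XML — exactly the "off by one" Andrew observed.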


Result: numFound inaccuracies

2006-12-08 Thread Andrew Nagy

Hello, me again.

I have been running some extensive tests of my search engine and have 
been seeing inaccuracies with the "numFound" attribute.  It tends to 
return 1 more than what is actually shown in the XML.


Is this a bug, or could I be doing something wrong?

I have a specific example in front of me at the moment where my query 
found 2 records, yet I get: "


Any ideas?

Andrew


Re: Facet Performance

2006-12-08 Thread Andrew Nagy

Erik Hatcher wrote:


On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote:

My data is 492,000 records of book data.  I am faceting on 4  fields: 
author, subject, language, format.
Format and language are fairly simple as there are only a few unique
terms.  Author and subject however are much different in that there
are thousands of unique terms.



When encountering difficult issues, I like to think in terms of the  
user interface.  Surely you're not presenting 400k+ authors to the  
users in one shot.  In Collex, we have put an AJAX drop-down that  
shows the author facet (we call it name on the UI, with various roles  
like author, painter, etc).  You can see this in action here:


In our data, we don't have unique authors for each record ... so let's 
say out of the 500,000 records ... we have 200,000 authors.  What I am 
trying to display is the top 10 authors from the results of a search.  
So I do a search for title:"Gone with the wind" and I would like to see 
the top 10 matching authors from these results.


But no worries, I have written my own facet handler and I am now back to 
under a second with faceting!


Thanks for everyone's help and keep up the good work!

Andrew


Re: Facet Performance

2006-12-08 Thread Chris Hostetter

: Unfortunately which strategy will be chosen is currently undocumented
: and control is a bit oblique:  If the field is tokenized or multivalued
: or Boolean, the FilterQuery method will be used; otherwise the
: FieldCache method.  I expect I or others will improve that shortly.

Bear in mind, what's provided out of the box is "SimpleFacets" ... it's
designed to meet simple faceting needs ... when you start talking about
100s or thousands of constraints per facet, you are getting outside the
scope of what it was intended to serve efficiently.

At a certain point the only practical thing to do is write a custom
request handler that makes the best choices for your data.

For the record: a really simple patch someone could submit would be to
add an optional field-based param indicating which type of faceting
(termenum/fieldcache) should be used to generate the list of terms, and
then make SimpleFacets.getFacetFieldCounts use that and call the
appropriate method instead of calling getTermCounts -- that way you could
force one or the other if you know it's better for your data/queries.



-Hoss



Re: Facet Performance

2006-12-08 Thread Erik Hatcher


On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote:
My data is 492,000 records of book data.  I am faceting on 4  
fields: author, subject, language, format.
Format and language are fairly simple as there are only a few  
unique terms.  Author and subject however are much different in  
that there are thousands of unique terms.


When encountering difficult issues, I like to think in terms of the  
user interface.  Surely you're not presenting 400k+ authors to the  
users in one shot.  In Collex, we have put an AJAX drop-down that  
shows the author facet (we call it name on the UI, with various roles  
like author, painter, etc).  You can see this in action here:


http://www.nines.org/collex

type in "da" into the name for example.  I developed a custom request  
handler in Solr for returning these types of suggest interfaces  
complete with facet counts.  My code is very specific to our fields,  
so its not usable in a general sense, but maybe this gives you some  
ideas on where to go with these large sets of facet values.


Erik



Re: Facet Performance

2006-12-08 Thread Andrew Nagy

J.J. Larrea wrote:


Unfortunately which strategy will be chosen is currently undocumented and 
control is a bit oblique:  If the field is tokenized or multivalued or Boolean, 
the FilterQuery method will be used; otherwise the FieldCache method.  I expect 
I or others will improve that shortly.
 

Good to hear, cause I can't really get away with not having a 
multi-valued field for author.


I'm really excited by Solr and really impressed so far.

Thanks!
Andrew


Re: Facet Performance

2006-12-08 Thread Yonik Seeley

On 12/8/06, J.J. Larrea <[EMAIL PROTECTED]> wrote:

Unfortunately which strategy will be chosen is currently undocumented and 
control is a bit oblique:  If the field is tokenized or multivalued or Boolean, 
the FilterQuery method will be used; otherwise the FieldCache method.


If anyone has time, some of this could be documented here:
http://wiki.apache.org/solr/SimpleFacetParameters
The wiki is open to all.

Or perhaps a new top level FacetedSearching page that references
SimpleFacetParameters

-Yonik


Re: Facet Performance

2006-12-08 Thread J.J. Larrea
Andrew Nagy, ditto on what Yonik said.  Here is some further elaboration:

I am doing much the same thing (faceting on Author etc.). When my Author field 
was defined as a solr.TextField, even using solr.KeywordTokenizerFactory so it 
wasn't actually tokenized, the faceting code chose the QueryFilter approach, 
and faceting on Author for 100k+ documents took about 4 seconds.

When I changed the field to "string" (i.e. solr.StrField), the faceting code 
recognized it as untokenized and used the FieldCache approach.  Times have 
dropped to about 120ms for the first query (when the FieldCache is generated) 
and < 10ms for subsequent queries returning a few thousand results.  Quite a 
difference.

The strategy must be chosen on a field-by-field basis.  While QueryFilter is 
excellent for fields with a small set of enumerated values such as Language or 
Format, it is inappropriate for large value sets such as Author.

Unfortunately which strategy will be chosen is currently undocumented and 
control is a bit oblique:  If the field is tokenized or multivalued or Boolean, 
the FilterQuery method will be used; otherwise the FieldCache method.  I expect 
I or others will improve that shortly.

- J.J.

At 2:58 PM -0500 12/8/06, Yonik Seeley wrote:
>Right, if any of these are tokenized, then you could make them
>non-tokenized (use "string" type).  If they really need to be
>tokenized (author for example), then you could use copyField to make
>another copy to a non-tokenized field that you can use for faceting.
>
>After that, as Hoss suggests, run a single faceting query with all 4
>fields and look at the filterCache statistics.  Take the "lookups"
>number and multiply it by, say, 1.5 to leave some room for future
>growth, and use that as your cache size.  You probably want to bump up
>both initialSize and autowarmCount as well.
>
>The first query will still be slow.  The second should be relatively fast.
>You may hit an OOM error.  Increase the JVM heap size if this happens.
>
>-Yonik



Re: Facet Performance

2006-12-08 Thread Andrew Nagy

Yonik Seeley wrote:


Are they multivalued, and do they need to be?
Anything that is of type "string" and not multivalued will use the
Lucene FieldCache rather than the filterCache.


The author field is multivalued.  Will this be a strong performance issue?

I could make multiple author fields as to not have the multivalued field 
and then only facet on the first author.


Thanks
Andrew




Re: Facet Performance

2006-12-08 Thread Yonik Seeley

On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:

Chris Hostetter wrote:

>: Could you suggest a better configuration based on this?
>
>If that's what your stats look like after a single request, then I would
>guess you would need to make your cache size at least 1.6 million in order
>for it to be of any use in improving your facet speed.
>
>
Would this have any strong impacts on my system?  Should I just set it
to an even 2 million to allow for growth?


Change the following in solrconfig.xml, and you should be fine with a
higher setting.
true
to
false

That will prevent the filterCache from being used for anything but
filters and faceting, so if you set it too high, it won't be utilized
anyway.


>: My data is 492,000 records of book data.  I am faceting on 4 fields:
>: author, subject, language, format.
>: Format and language are fairly simple as there are only a few unique
>: terms.  Author and subject however are much different in that there are
>: thousands of unique terms.
>
>by the looks of it, you have a lot more than a few thousand unique terms
>in those two fields ... are you tokenizing on these fields?  that's
>probably not what you want for fields you're going to facet on.
>
>
All of these fields are set as "string" in my schema


Are they multivalued, and do they need to be?
Anything that is of type "string" and not multivalued will use the
Lucene FieldCache rather than the filterCache.

-Yonik


Re: Facet Performance

2006-12-08 Thread Andrew Nagy

Chris Hostetter wrote:


: Could you suggest a better configuration based on this?

If that's what your stats look like after a single request, then I would
guess you would need to make your cache size at least 1.6 million in order
for it to be of any use in improving your facet speed.
 

Would this have any strong impacts on my system?  Should I just set it 
to an even 2 million to allow for growth?



: My data is 492,000 records of book data.  I am faceting on 4 fields:
: author, subject, language, format.
: Format and language are fairly simple as there are only a few unique
: terms.  Author and subject however are much different in that there are
: thousands of unique terms.

by the looks of it, you have a lot more than a few thousand unique terms
in those two fields ... are you tokenizing on these fields?  that's
probably not what you want for fields you're going to facet on.
 

All of these fields are set as "string" in my schema, so if I understand 
the fields correctly, they are not being tokenized.  I also have an 
author field that is set as "text" for searching.


Thanks
Andrew


Re: Facet Performance

2006-12-08 Thread Yonik Seeley

On 12/8/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:

: My data is 492,000 records of book data.  I am faceting on 4 fields:
: author, subject, language, format.
: Format and language are fairly simple as there are only a few unique
: terms.  Author and subject however are much different in that there are
: thousands of unique terms.

by the looks of it, you have a lot more than a few thousand unique terms
in those two fields ... are you tokenizing on these fields?  that's
probably not what you want for fields you're going to facet on.


Right, if any of these are tokenized, then you could make them
non-tokenized (use "string" type).  If they really need to be
tokenized (author for example), then you could use copyField to make
another copy to a non-tokenized field that you can use for faceting.
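Yonik's copyField suggestion would look roughly like this in schema.xml (a sketch only — the field names, types, and attributes here are illustrative, not taken from Andrew's actual schema):

```xml
<!-- Tokenized field, used for searching -->
<field name="author" type="text" indexed="true" stored="true" multiValued="true"/>

<!-- Untokenized copy for faceting; the "string" type maps to solr.StrField -->
<field name="author_facet" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- At index time, copy the raw author value into the facet field -->
<copyField source="author" dest="author_facet"/>
```

You would then facet on author_facet while continuing to search against author.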

After that, as Hoss suggests, run a single faceting query with all 4
fields and look at the filterCache statistics.  Take the "lookups"
number and multiply it by, say, 1.5 to leave some room for future
growth, and use that as your cache size.  You probably want to bump up
both initialSize and autowarmCount as well.
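The sizing rule of thumb above is simple arithmetic; as a sketch (the lookups figure is the one from Andrew's cache stats, and the 1.5 headroom factor is Yonik's suggestion, not a hard rule):

```python
import math

def suggested_cache_size(lookups, headroom=1.5):
    """Rule of thumb: filterCache size ~= lookups after one faceted
    request, times a headroom factor for future growth."""
    return math.ceil(lookups * headroom)

# lookups reported after a single faceted request in Andrew's stats
print(suggested_cache_size(1530036))  # ~2.3 million entries
```

That lines up with Hoss's estimate of "at least 1.6 million" and Andrew's instinct to round up to 2 million.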

The first query will still be slow.  The second should be relatively fast.
You may hit an OOM error.  Increase the JVM heap size if this happens.

-Yonik


Re: Facet Performance

2006-12-08 Thread Chris Hostetter
: Here are the stats; I'm still a newbie to Solr, so I'm not totally sure
: what this all means:
: lookups : 1530036
: hits : 2
: hitratio : 0.00
: inserts : 1530035
: evictions : 1504435
: size : 25600

those numbers are telling you that your cache is capable of holding 25,600
items.  you have attempted to lookup something in the cache 1,530,036
times, and only 2 of those times did you get a hit.  you have
added 1,530,035 items to the cache, and 1,504,435 items have been removed
from your cache to make room for newer items.

in short: your cache isn't really helping you at all.
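Hoss's reading of those stats can be reproduced mechanically; a small sketch (the values are copied from Andrew's message, the threshold names are mine):

```python
# Cache stats reported by Solr after Andrew's faceted request
stats = {"lookups": 1530036, "hits": 2, "inserts": 1530035,
         "evictions": 1504435, "size": 25600}

# Fraction of lookups answered from the cache -- effectively zero here.
hit_ratio = stats["hits"] / stats["lookups"]

# Fraction of inserted entries later evicted to make room: near 1.0
# means the cache is thrashing and never gets a chance to be reused.
thrash_ratio = stats["evictions"] / stats["inserts"]

assert hit_ratio < 0.0001
assert thrash_ratio > 0.98
```

Two numbers tell the whole story: nearly every entry put in is pushed out again before it can be reused, which is why the cache size needs to cover the full number of lookups.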

: Could you suggest a better configuration based on this?

If that's what your stats look like after a single request, then I would
guess you would need to make your cache size at least 1.6 million in order
for it to be of any use in improving your facet speed.

: My data is 492,000 records of book data.  I am faceting on 4 fields:
: author, subject, language, format.
: Format and language are fairly simple as there are only a few unique
: terms.  Author and subject however are much different in that there are
: thousands of unique terms.

by the looks of it, you have a lot more than a few thousand unique terms
in those two fields ... are you tokenizing on these fields?  that's
probably not what you want for fields you're going to facet on.



-Hoss



Re: Facet Performance

2006-12-08 Thread Andrew Nagy

Yonik Seeley wrote:


On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:


I changed the filterCache to the following:


However a search that normally takes .04s is taking 74 seconds once I
use the facets since I am faceting on 4 fields.



The first time or subsequent times?
Is your filterCache big enough yet?  What do you see for evictions and
hit ratio?


Here are the stats; I'm still a newbie to Solr, so I'm not totally sure 
what this all means:

lookups : 1530036
hits : 2
hitratio : 0.00
inserts : 1530035
evictions : 1504435
size : 25600
cumulative_lookups : 1530036
cumulative_hits : 2
cumulative_hitratio : 0.00
cumulative_inserts : 1530035
cumulative_evictions : 1504435

Could you suggest a better configuration based on this?




Can you suggest a better configuration that would solve this performance
issue, or should I not use faceting?



Faceting isn't something that will always be fast... one often needs
to design things in a way that it can be fast.

Can you give some examples of your faceted queries?
Can you show the field and fieldtype definitions for the fields you
are faceting on?
For each field that you are faceting on, how many different terms are 
in it?


My data is 492,000 records of book data.  I am faceting on 4 fields: 
author, subject, language, format.
Format and language are fairly simple as there are only a few unique 
terms.  Author and subject however are much different in that there are 
thousands of unique terms.


Thanks for your help!
Andrew


Re: Facet Performance

2006-12-08 Thread Yonik Seeley

On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:

I changed the filterCache to the following:


However a search that normally takes .04s is taking 74 seconds once I
use the facets since I am faceting on 4 fields.


The first time or subsequent times?
Is your filterCache big enough yet?  What do you see for evictions and
hit ratio?


Can you suggest a better configuration that would solve this performance
issue, or should I not use faceting?


Faceting isn't something that will always be fast... one often needs
to design things in a way that it can be fast.

Can you give some examples of your faceted queries?
Can you show the field and fieldtype definitions for the fields you
are faceting on?
For each field that you are faceting on, how many different terms are in it?


I figure I could run the query twice, once limited to 20 records and
then again with the limit set to the total number of records and develop
my own facets.  I have in fact done this before with a different back-end 
and my code runs in under 0.01 seconds.

Why is faceting so slow?


It's computationally expensive to get exact facet counts for a large
number of hits, and that is what the current faceting code is designed
to do.  No single method will be appropriate *and* fast for all
scenarios.

Another method that hasn't been implemented is some statistical
faceting based on the top hits, using stored fields or stored term
vectors.
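The statistical approach Yonik mentions (which, as he notes, hasn't been implemented) could be sketched like this: count field values over only the top-N stored hits, then extrapolate to the full result set, trading exactness for speed. Purely illustrative — the function name and sample data are made up:

```python
from collections import Counter

def approx_facet_counts(top_hits, field, num_found, top_k=10):
    """Estimate facet counts from a sample of the top-ranked hits.

    top_hits:  stored docs for the highest-ranked results (the sample)
    num_found: total number of matching docs in the full result set
    Counts are scaled by num_found / len(top_hits), so these are
    estimates, not the exact counts the current faceting code computes.
    """
    counts = Counter()
    for doc in top_hits:
        for value in doc.get(field, []):
            counts[value] += 1
    scale = num_found / len(top_hits)
    return [(v, round(c * scale)) for v, c in counts.most_common(top_k)]

sample = [{"author": ["Mitchell, Margaret"]},
          {"author": ["Mitchell, Margaret"]},
          {"author": ["Smith, John"]}]
est = approx_facet_counts(sample, "author", num_found=300)
print(est)  # extrapolated from a 3-doc sample of 300 total hits
```

The cost becomes proportional to the sample size instead of the total hit count, which is why it could be fast where exact counting is not.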

-Yonik


Re: Facet Performance

2006-12-08 Thread Andrew Nagy

Yonik Seeley wrote:


1) facet on single-valued strings if you can
2) if you can't do (1) then enlarge the fieldcache so that the number
of filters (one per possible term in the field you are filtering on)
can fit.


I changed the filterCache to the following:

However a search that normally takes .04s is taking 74 seconds once I 
use the facets since I am faceting on 4 fields.


Can you suggest a better configuration that would solve this performance 
issue, or should I not use faceting?
I figure I could run the query twice, once limited to 20 records and 
then again with the limit set to the total number of records and develop 
my own facets.  I have in fact done this before with a different back-end 
and my code runs in under 0.01 seconds.


Why is faceting so slow?

Andrew