Re: SolrCloud: Meaning of SYNC state in ZkStateReader?

2014-10-14 Thread Martin Grotzke
Ok, thanks for your response, Mark!

Cheers,
Martin


On Tue, Oct 14, 2014 at 1:59 AM, Mark Miller markrmil...@gmail.com wrote:

 I think it's just cruft I left in and never ended up using anywhere. You
 can ignore it.

 - Mark

  On Oct 13, 2014, at 8:42 PM, Martin Grotzke 
 martin.grot...@googlemail.com wrote:
 
  Hi,
 
  can anybody tell me the meaning of ZkStateReader.SYNC? All other state
  related constants are clear to me, I'm only not sure about the semantics
  of SYNC.
 
  Background: I'm working on an async solr client
  (https://github.com/inoio/solrs) and want to add SolrCloud support - for
  this I'm reusing ZkStateReader.
 
  TIA & cheers,
  Martin
 




-- 
Martin Grotzke
http://twitter.com/martin_grotzke


SolrCloud: Meaning of SYNC state in ZkStateReader?

2014-10-13 Thread Martin Grotzke
Hi,

can anybody tell me the meaning of ZkStateReader.SYNC? All other state
related constants are clear to me, I'm only not sure about the semantics
of SYNC.

Background: I'm working on an async solr client
(https://github.com/inoio/solrs) and want to add SolrCloud support - for
this I'm reusing ZkStateReader.

TIA & cheers,
Martin





LBHttpSolrServer to query a preferred server

2012-04-04 Thread Martin Grotzke
Hi,

we want to use the LBHttpSolrServer (4.0/trunk) and specify a preferred
server. Our use case is that for one user request we make several solr
requests with some heavy caching (using a custom request handler with a
special cache) and want to make sure that the subsequent solr requests
are hitting the same solr server.

A possible solution with LBHttpSolrServer would look like this:
- LBHttpSolrServer provides a method getSolrServer() that returns a
ServerWrapper
- LBHttpSolrServer provides a method
   request(final SolrRequest request, ServerWrapper preferredServer)
  that returns the response (NamedList<Object>).

This method first tries the specified preferredServer and if this fails
queries all others (first alive servers then zombies).
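
Just to illustrate the intended usage, here's a rough sketch from the caller's
perspective - note that this is not an existing API: the
getSolrServer()/request(request, preferredServer) methods are the ones
proposed above and the names are only illustrative:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.util.NamedList;

public class PreferredServerExample {
    public static void main(String[] args) throws Exception {
        LBHttpSolrServer lb = new LBHttpSolrServer(
                "http://solr1:8983/solr", "http://solr2:8983/solr");

        // pin all solr requests belonging to one user request to the same backend
        LBHttpSolrServer.ServerWrapper preferred = lb.getSolrServer();

        QueryRequest request = new QueryRequest(new SolrQuery("some query"));

        // tries 'preferred' first; on failure falls back to the other alive
        // servers and finally to the zombie list
        NamedList<Object> response = lb.request(request, preferred);
        System.out.println(response);
    }
}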

What do you think of this solution? Any other solution preferred?

I'll start implementing this and submit an issue/patch hoping that it
makes it into trunk.

Cheers,
Martin





Re: LBHttpSolrServer to query a preferred server

2012-04-04 Thread Martin Grotzke
Hi,

I just submitted an issue with patch for this:
https://issues.apache.org/jira/browse/SOLR-3318

Cheers,
Martin


On 04/04/2012 03:53 PM, Martin Grotzke wrote:
 Hi,
 
 we want to use the LBHttpSolrServer (4.0/trunk) and specify a preferred
 server. Our use case is that for one user request we make several solr
 requests with some heavy caching (using a custom request handler with a
 special cache) and want to make sure that the subsequent solr requests
 are hitting the same solr server.
 
 A possible solution with LBHttpSolrServer would look like this:
 - LBHttpSolrServer provides a method getSolrServer() that returns a
 ServerWrapper
 - LBHttpSolrServer provides a method
request(final SolrRequest request, ServerWrapper preferredServer)
   that returns the response (NamedList<Object>).
 
 This method first tries the specified preferredServer and if this fails
 queries all others (first alive servers then zombies).
 
 What do you think of this solution? Any other solution preferred?
 
 I'll start implementing this and submit an issue/patch hoping that it
 makes it into trunk.
 
 Cheers,
 Martin
 





How to determine memory consumption per core

2012-04-02 Thread Martin Grotzke
Hi,

is it possible to determine the memory consumption (heap space) per core
in solr trunk (4.0-SNAPSHOT)?

I just unloaded a core and saw the difference in memory usage, but it
would be nice to have a smoother way of getting the information without
core downtime.

It would also be interesting, which caches are the biggest ones, to know
which one should/might be reduced.

Thanx & cheers,
Martin





Re: AW: How to deal with many files using solr external file field

2011-06-09 Thread Martin Grotzke
Hi,

as I'm also involved in this issue (on the side of Sven) I created a
patch that replaces the float array with a map that stores the score by doc,
so it contains only as many entries as the external scoring file contains
lines, but no more.

I created an issue for this: https://issues.apache.org/jira/browse/SOLR-2583

It would be great if someone could have a look at it and comment.
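
To make the idea a bit more concrete, here's a minimal standalone sketch of
the approach (not the actual SOLR-2583 code, class and method names are made
up for illustration): instead of allocating a float[maxDoc] per external file,
only the docs that actually have a line in the file are kept, everything else
falls back to the default value.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only, not the SOLR-2583 patch itself.
public class SparseExternalScores {

    // one entry per line of the external score file, instead of float[maxDoc]
    private final Map<Integer, Float> scoreByDoc = new HashMap<Integer, Float>();
    private final float defaultValue;

    public SparseExternalScores(float defaultValue) {
        this.defaultValue = defaultValue;
    }

    // called once per line of the external score file (key resolved to a docid)
    public void put(int docId, float score) {
        scoreByDoc.put(docId, score);
    }

    // docs without an explicit entry fall back to the default value
    public float score(int docId) {
        Float score = scoreByDoc.get(docId);
        return score != null ? score.floatValue() : defaultValue;
    }
}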

Thanx for your feedback,
cheers,
Martin


On 06/08/2011 12:22 PM, Bohnsack, Sven wrote:
 Hi,
 
 I could not provide a stack trace, and IMHO it wouldn't provide much useful
 information. But we've made good progress in the analysis.
 
 We took a deeper look at what happened, when an external-file-field-Request 
 is sent to SOLR:
 
 * SOLR looks if there is a file for the requested query, e.g. trousers
 * If so, then SOLR loads the trousers-file and generates a HashMap-Entry 
 consisting of a FileFloatSource-Object and a FloatArray with the size of the 
 number of documents in the SOLR-index. Every document matched by the query 
 gains the score-value, which is provided in the external-score-file. For 
 every(!) other document SOLR writes a zero in that FloatArray
 * if SOLR does not find a file for the query-Request, then SOLR still 
 generates a HashMapEntry with score zero for every document
 
 In our case we have about 8.5 Mio. documents in our index and one of those 
 Arrays occupies about 34MB Heap Space. Having e.g. 100 different queries and 
 using external file field for sorting the result, SOLR occupies about 3.4GB 
 of Heap Space.
 
 The problem might be the use of WeakHashMap [1], which prevents the Garbage 
 Collector from cleaning up unused Keys.
 
 
 What do you think could be a possible solution for this whole problem? 
 (except from don't use external file fields ;)
 
 
 Regards
 Sven
 
 
 [1]: A hashtable-based Map implementation with weak keys. An entry in a 
 WeakHashMap will automatically be removed when its key is no longer in 
 ordinary use. More precisely, the presence of a mapping for a given key will 
 not prevent the key from being discarded by the garbage collector, that is, 
 made finalizable, finalized, and then reclaimed. When a key has been 
 discarded its entry is effectively removed from the map, so this class 
 behaves somewhat differently than other Map implementations.
 
 -----Original Message-----
 From: mtnes...@gmail.com [mailto:mtnes...@gmail.com] On behalf of Simon
 Rosenthal
 Sent: Wednesday, June 8, 2011 03:56
 To: solr-user@lucene.apache.org
 Subject: Re: How to deal with many files using solr external file field
 
  Can you provide a stack trace for the OOM exception?
 
 On Tue, Jun 7, 2011 at 4:25 PM, Bohnsack, Sven
  sven.bohns...@shopping24.de wrote:
 
 Hi all,

 we're using solr 1.4 and external file field ([1]) for sorting our
 searchresults. We have about 40.000 Terms, for which we use this sorting
 option.
  Currently we're running into massive OutOfMemory problems and are not
  quite sure what the cause is. It seems that the garbage collector stops
 working or some processes are going wild. However, solr starts to allocate
 more and more RAM until we experience this OutOfMemory-Exception.


 We noticed the following:

 For some terms one could see in the solr log that there appear some
 java.io.FileNotFoundExceptions, when solr tries to load an external file for
  a term for which there is no such file, e.g. solr tries to load the
  external score file for "trousers" but there is none in the
  /solr/data folder.

 Question: is it possible, that those exceptions are responsible for the
 OutOfMemory-Problem or could it be due to the large(?) number of 40k terms
 for which we want to sort the result via external file field?

  I'm looking forward to your answers, suggestions and ideas :)


 Regards
 Sven


 [1]:
 http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html


-- 
Martin Grotzke
http://twitter.com/martin_grotzke





Solrj retry handling - prevent ProtocolException: Unbuffered entity enclosing request can not be repeated

2011-04-12 Thread Martin Grotzke
Hi,

from time to time we're seeing a "ProtocolException: Unbuffered entity
enclosing request can not be repeated." in the logs when sending ~500
docs to solr (the stack trace is at the end of the email).

I'm aware that this was discussed before (e.g. [1]) and our solution was
already to reduce the number of docs that are sent to solr.

However, I think that the issue might be solved in solrj. This
discussion on the httpclient-dev mailing list [2] points out the
solution under option 3) re-instantiate the input stream and retry the
request manually.

AFAICS CommonsHttpSolrServer.request already does some retry stuff when
_maxRetries is set to s.th. > 0 (see [3]), but not around the actual
http method execution (_httpClient.executeMethod(method)). Not sure what
the several tries are implemented for, but I'd say that if the user
sets maxRetries to s.th. > 0, the http method execution should also be retried.

Another thing is the actually seen ProtocolException: AFAICS this is
thrown as httpclient (HttpMethodDirector.executeWithRetry) performs a
retry itself (see [4]) while the actually processed HttpMethod does not
support this.

As HttpMethodDirector.executeWithRetry already checks for a
HttpMethodRetryHandler (under param HttpMethodParams.RETRY_HANDLER,
[5]), it seems as if it would be enough to add such a handler for the
update/POST requests to prevent the ProtocolException.

So in summary I suggest two things (see the sketch below):
1) Retry the http method execution when maxRetries is > 0
2) Prevent HttpClient from doing retries (by adding a HttpMethodRetryHandler)
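
To make this more concrete, here's a minimal sketch of both points against
plain commons-httpclient 3.x (independent of solrj, the URL and the retry
count are just examples): 2) attaches a retry handler that disables
HttpClient's internal retry (which is what currently triggers the
ProtocolException), and 1) re-creates the request entity and retries the
execution manually:

import java.io.IOException;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.methods.StringRequestEntity;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class RetryExample {
    public static void main(String[] args) throws Exception {
        HttpClient httpClient = new HttpClient();
        int maxRetries = 2;

        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            // 1) re-create the method and its request entity for every attempt,
            //    as an unbuffered entity cannot be sent twice
            PostMethod post = new PostMethod("http://localhost:8983/solr/update");
            post.setRequestEntity(new StringRequestEntity("<commit/>", "text/xml", "UTF-8"));

            // 2) tell HttpMethodDirector.executeWithRetry not to retry this
            //    request itself - retries are handled here instead
            post.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                    new DefaultHttpMethodRetryHandler(0, false));
            try {
                int status = httpClient.executeMethod(post);
                System.out.println("status: " + status);
                break;
            } catch (IOException e) {
                if (attempt == maxRetries) {
                    throw e;
                }
                // fall through and retry with a fresh method/entity
            } finally {
                post.releaseConnection();
            }
        }
    }
}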

I first wanted to post it here on the list to see if there are
objections or other solutions. Or if there are plans to replace commons
httpclient (3.x) by s.th. like apache httpclient 4.x or async-http-client.

If there's an agreement that the proposed solution is the way to go ATM
I'd submit an appropriate issue for this.

Any comments?

Cheers,
Martin



[1]
http://lucene.472066.n3.nabble.com/Unbuffered-entity-enclosing-request-can-not-be-repeated-tt788186.html

[2]
http://www.mail-archive.com/commons-httpclient-dev@jakarta.apache.org/msg06723.html

[3]
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/src/solrj/org/apache/solr/client/solrj/impl/CommonsHttpSolrServer.java?view=markup#l281

[4]
http://svn.apache.org/viewvc/httpcomponents/oac.hc3x/trunk/src/java/org/apache/commons/httpclient/HttpMethodDirector.java?view=markup#l366

[5]
http://svn.apache.org/viewvc/httpcomponents/oac.hc3x/trunk/src/java/org/apache/commons/httpclient/HttpMethodDirector.java?view=markup#l426


Stack trace:

Caused by: org.apache.commons.httpclient.ProtocolException: Unbuffered
entity enclosing request can not be repeated.
at
org.apache.commons.httpclient.methods.EntityEnclosingMethod.writeRequestBody(EntityEnclosingMethod.java:487)
at
org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:2110)
at
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1088)
at
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427)


-- 
Martin Grotzke
http://twitter.com/martin_grotzke





Re: Use terracotta bigmemory for solr-caches

2011-01-26 Thread Martin Grotzke
On Tue, Jan 25, 2011 at 4:19 PM, Em mailformailingli...@yahoo.de wrote:


 Hi Martin,

 are you sure that your GC is well tuned?

These are the heap-related JVM configurations for the servers running with
17GB heap size (one with the parallel collector, one with CMS):

-XX:+HeapDumpOnOutOfMemoryError -server -Xmx17G -XX:MaxPermSize=256m
-XX:NewSize=2G -XX:MaxNewSize=2G -XX:SurvivorRatio=6
-XX:+UseConcMarkSweepGC

-XX:+HeapDumpOnOutOfMemoryError -server -Xmx17G -XX:MaxPermSize=256m
-XX:NewSize=2G -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC
-XX:+UseParallelGC

Another heap configuration is running with 8GB max heap, and this search
server also has lower peaks in response times.

To me it seems that it's just too much memory that gets
allocated/collected/compacted. I'm just checking out how far we can reduce
cache sizes (and the max heap) without any reduction of response times (and
disk I/O). Right now it seems that a reduction of the documentCache size
indeed does reduce the hitratio of the cache, but it does not have any
negative impact on response times (neither is I/O increased). Therefore I'd
follow the path of reducing the cache sizes as far as we can as long as
there are no negative impacts and then I'd check again the longest requests
and see if they're still caused by full GC cycles. Even then they should be
much shorter due to the reduced memory that is collected/compacted.

So now I also think, the terracotta bigmemory is not the right solution :-)

Cheers,
Martin



 A request that needs more than a minute isn't the standard, even when I
 consider all the other postings about response-performance...

 Regards
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Use-terracotta-bigmemory-for-solr-caches-tp2328257p2330652.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Martin Grotzke
http://www.javakaffee.de/blog/


Recommendation on RAM-/Cache configuration

2011-01-25 Thread Martin Grotzke
Hi,

recently we're experiencing OOMEs (GC overhead limit exceeded) in our
searches. Therefore I want to get some clarification on heap and cache
configuration.

This is the situation:
- Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit
- JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G
-XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC
-XX:+UseParallelGC
- The machine has 32 GB RAM
- Currently there are 4 processors/cores in the machine, this shall be
changed to 2 cores in the future.
- The index size in the filesystem is ~9.5 GB
- The index contains ~ 5.500.000 documents
- 1.500.000 of those docs are available for searches/queries, the rest are
inactive docs that are excluded from searches (via a flag/field), but
they're still stored in the index as they need to be available by id (solr is the
main document store in this app)
- Caches are configured with a big size (the idea was to prevent filesystem
access / disk i/o as much as possible):
  - filterCache (solr.LRUCache): size=20, initialSize=3,
autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99
  - documentCache (solr.LRUCache): size=20, initialSize=10,
autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74
  - queryResultCache (solr.LRUCache): size=20, initialSize=3,
autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71
- Searches are performed using a catchall text field using standard request
handler, all fields are fetched (no fl specified)
- Normally ~ 5 concurrent requests, peaks up to 30 or 40 (mostly during GC)
- Recently we also added a feature that adds weighted search for special
fields, so that the query might become s.th. like this
  q=(some query) OR name_weighted:(some query)^2.0 OR brand_weighted:(some
query)^4.0 OR longDescription_weighted:(some query)^0.5
  (it seemed as if this was the cause of the OOMEs, but IMHO it only
increased RAM usage so that now GC could not free enough RAM)

The OOMEs that we get are of type GC overhead limit exceeded, one of the
OOMEs was thrown during auto-warming.

I checked two different heapdumps, the first one autogenerated
(by -XX:+HeapDumpOnOutOfMemoryError) the second one generated manually via
jmap.
These show the following distribution of used memory - the autogenerated
dump:
- documentCache: 56% (size ~ 195.000)
- filterCache: 15% (size ~ 60.000)
- queryResultCache: 8% (size ~ 61.000)
- fieldCache: 6% (fieldCache referenced  by WebappClassLoader)
- SolrIndexSearcher: 2%

The manually generated dump:
- documentCache: 48% (size ~ 195.000)
- filterCache: 20% (size ~ 60.000)
- fieldCache: 11% (fieldCache referenced by the WebappClassLoader)
- queryResultCache: 7% (size ~ 61.000)
- fieldValueCache: 3%

We are also running two search engines with a 17GB heap; these don't run into
OOMEs. Though, with these bigger heap sizes the longest requests are even
longer due to longer stop-the-world gc cycles.
Therefore my goal is to run with a smaller heap, IMHO even smaller than 8GB
would be good to reduce the time needed for full gc.

So what's the right path to follow now? What would you recommend to change
on the configuration (solr/jvm)?

Would you say it is ok to reduce the cache sizes? Would this increase disk
i/o, or would the index be held in the OS's disk cache?

Do have other recommendations to follow / questions?

Thanx & cheers,
Martin


Use terracotta bigmemory for solr-caches

2011-01-25 Thread Martin Grotzke
Hi,

as the biggest parts of our jvm heap are used by solr caches I asked myself
if it wouldn't make sense to run solr caches backed by terracotta's
bigmemory (http://www.terracotta.org/bigmemory).
The goal is to reduce the time needed for full / stop-the-world GC cycles,
as with our 8GB heap the longest requests take up to several minutes.

What do you think?

Cheers,
Martin


Re: Recommendation on RAM-/Cache configuration

2011-01-25 Thread Martin Grotzke
On Tue, Jan 25, 2011 at 2:06 PM, Markus Jelsma
markus.jel...@openindex.io wrote:

 On Tuesday 25 January 2011 11:54:55 Martin Grotzke wrote:
  Hi,
 
  recently we're experiencing OOMEs (GC overhead limit exceeded) in our
  searches. Therefore I want to get some clarification on heap and cache
  configuration.
 
  This is the situation:
  - Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit
  - JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G
  -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC
  -XX:+UseParallelGC

 Consider switching to HotSpot JVM, use the -server as the first switch.

The jvm options I mentioned were not all, we're running the jvm with -server
(of course).



  - The machine has 32 GB RAM
  - Currently there are 4 processors/cores in the machine, this shall be
  changed to 2 cores in the future.
  - The index size in the filesystem is ~9.5 GB
  - The index contains ~ 5.500.000 documents
  - 1.500.000 of those docs are available for searches/queries, the rest
 are
  inactive docs that are excluded from searches (via a flag/field), but
  they're still stored in the index as need to be available by id (solr is
  the main document store in this app)

 How do you exclude them? It should use filter queries.

The docs are indexed with a field "findable" on which we do a filter query.


 I also remember (but i
 just cannot find it back so please correct my if i'm wrong) that in 1.4.x
 sorting is done before filtering. It should be an improvement if filtering
 is
 done before sorting.

Hmm, I cannot imagine a case where it makes sense to sort before filtering.
Can't believe that solr does it like this.
Can anyone shed a light on this?


 If you use sorting, it takes up a huge amount of RAM if filtering is not
 done
 first.

  - Caches are configured with a big size (the idea was to prevent
 filesystem
  access / disk i/o as much as possible):

 There is only disk I/O if the kernel can't keep the index (or parts) in its
 page cache.

Yes, I'll keep an eye on disk I/O.



- filterCache (solr.LRUCache): size=20, initialSize=3,
  autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99
- documentCache (solr.LRUCache): size=20, initialSize=10,
  autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74
- queryResultCache (solr.LRUCache): size=20, initialSize=3,
  autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71

 You should decrease the initialSize values. But your hitratio's seem very
 nice.

Does the initialSize have a real impact? According to
http://wiki.apache.org/solr/SolrCaching#initialSize it's the initial size of
the HashMap backing the cache.
What would you say are reasonable values for size/initialSize/autowarmCount?

Cheers,
Martin


Re: Rebuild Spellchecker based on cron expression

2010-12-13 Thread Martin Grotzke
On Mon, Dec 13, 2010 at 4:01 AM, Erick Erickson erickerick...@gmail.com wrote:
 I'm shooting in the dark here, but according to this:
 http://wiki.apache.org/solr/SolrReplication
 after the slave pulls the index down, it issues a commit. So if your
 slave is configured to generate the dictionary on commit, will it
 just happen?

Our slaves' spellcheckers are not configured to buildOnCommit,
therefore it shouldn't just happen.


 But according to this: https://issues.apache.org/jira/browse/SOLR-866
 this is an open issue

Thanx for the pointer! SOLR-866 is even better suited for us - after
reading SOLR-433 again I realized that it targets script-based
replication (which we're going to leave behind).

Cheers,
Martin



 Best
 Erick

 On Sun, Dec 12, 2010 at 8:30 PM, Martin Grotzke 
 martin.grot...@googlemail.com wrote:

 On Mon, Dec 13, 2010 at 2:12 AM, Markus Jelsma
 markus.jel...@openindex.io wrote:
  Maybe you've overlooked the build parameter?
  http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.build
 I'm aware of this, but we don't want to maintain cron-jobs on all
 slaves for all spellcheckers for all cores.
 That's why I'm thinking about a more integrated solution. Or did I
 really overlook s.th.?

 Cheers,
 Martin


 
  Hi,
 
  the spellchecker component already provides a buildOnCommit and
  buildOnOptimize option.
 
  Since we have several spellchecker indices building on each commit is
  not really what we want to do.
  Building on optimize is not possible as index optimization is done on
  the master and the slaves don't even run an optimize but only fetch
  the optimized index.
 
  Therefore I'm thinking about an extension of the spellchecker that
  allows you to rebuild the spellchecker based on a cron-expression
  (e.g. rebuild each night at 1 am).
 
  What do you think about this, is there anybody else interested in this?
 
  Regarding the lifecycle, is there already some executor framework or
  any regularly running process in place, or would I have to pull up my
  own thread? If so, how can I stop my thread when solr/tomcat is
  shutdown (I couldn't see any shutdown or destroy method in
  SearchComponent)?
 
  Thanx for your feedback,
  cheers,
  Martin
 



 --
 Martin Grotzke
 http://twitter.com/martin_grotzke





-- 
Martin Grotzke
http://www.javakaffee.de/blog/


Re: Rebuild Spellchecker based on cron expression

2010-12-13 Thread Martin Grotzke
Hi Erick,

thanx for your advice! I'll check the options with our client and see
how we'll proceed. My spare time right now is already full with other
open source stuff, otherwise it'd be fun contributing s.th. to solr!
:-)

Cheers,
Martin


On Mon, Dec 13, 2010 at 2:46 PM, Erick Erickson erickerick...@gmail.com wrote:
 ***
 Just wondering what's the reason that this patch receives that little
 interest. Anything wrong with it?
 ***

 Nobody got behind it and pushed I suspect. And since it's been a long time
 since it was updated, there's no guarantee that it would apply cleanly any
 more.
 Or that it will perform as intended.

 So, if you're really interested, I'd suggest you ping the dev list and ask
 whether this is valuable or if it's been superseded. If the feedback is that
 this
 would be valuable, you can see what you can do to make it happen.

 Once it's working to your satisfaction and you've submitted a patch, let
 people
 know it's ready and ask them to commit it or critique it. You might have to
 remind
 the committers after a few days that it's ready and get it applied to trunk
 and/or 3.x.

 But I really wouldn't start working with it until I got some feedback from
 the
 people who are actively working on Solr whether it's been superseded by
 other functionality first, sometimes bugs just aren't closed when something
 else makes it obsolete.

 Here's a place to start: http://wiki.apache.org/solr/HowToContribute

 Best
 Erick

 On Mon, Dec 13, 2010 at 2:58 AM, Martin Grotzke 
 martin.grot...@googlemail.com wrote:

 Hi,

 when thinking further about it it's clear that
  https://issues.apache.org/jira/browse/SOLR-433
 would be even better - we could generate the spellechecker indices on
 commit/optimize on the master and replicate them to all slaves.

 Just wondering what's the reason that this patch receives that little
 interest. Anything wrong with it?

 Cheers,
 Martin


 On Mon, Dec 13, 2010 at 2:04 AM, Martin Grotzke
 martin.grot...@googlemail.com wrote:
  Hi,
 
  the spellchecker component already provides a buildOnCommit and
  buildOnOptimize option.
 
  Since we have several spellchecker indices building on each commit is
  not really what we want to do.
  Building on optimize is not possible as index optimization is done on
  the master and the slaves don't even run an optimize but only fetch
  the optimized index.
 
  Therefore I'm thinking about an extension of the spellchecker that
  allows you to rebuild the spellchecker based on a cron-expression
  (e.g. rebuild each night at 1 am).
 
  What do you think about this, is there anybody else interested in this?
 
  Regarding the lifecycle, is there already some executor framework or
  any regularly running process in place, or would I have to pull up my
  own thread? If so, how can I stop my thread when solr/tomcat is
  shutdown (I couldn't see any shutdown or destroy method in
  SearchComponent)?
 
  Thanx for your feedback,
  cheers,
  Martin
 



 --
 Martin Grotzke
 http://www.javakaffee.de/blog/





-- 
Martin Grotzke
http://www.javakaffee.de/blog/


Rebuild Spellchecker based on cron expression

2010-12-12 Thread Martin Grotzke
Hi,

the spellchecker component already provides a buildOnCommit and
buildOnOptimize option.

Since we have several spellchecker indices, building on each commit is
not really what we want to do.
Building on optimize is not possible as index optimization is done on
the master and the slaves don't even run an optimize but only fetch
the optimized index.

Therefore I'm thinking about an extension of the spellchecker that
allows you to rebuild the spellchecker based on a cron-expression
(e.g. rebuild each night at 1 am).

What do you think about this, is there anybody else interested in this?

Regarding the lifecycle, is there already some executor framework or
any regularly running process in place, or would I have to pull up my
own thread? If so, how can I stop my thread when solr/tomcat is
shutdown (I couldn't see any shutdown or destroy method in
SearchComponent)?
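
For reference, the scheduling part alone could look s.th. like the sketch
below (no cron parsing, no Solr API calls - rebuildSpellcheckers() is just a
placeholder for whatever triggers the actual build, and stop() would still
have to be hooked into some shutdown/close callback):

import java.util.Calendar;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

// Rough sketch: run a rebuild task every night at 1 am on a daemon thread
// so that a tomcat shutdown is not blocked by it.
public class NightlySpellcheckerRebuild {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(new ThreadFactory() {
                public Thread newThread(Runnable r) {
                    Thread t = new Thread(r, "spellchecker-rebuild");
                    t.setDaemon(true);
                    return t;
                }
            });

    public void start() {
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                rebuildSpellcheckers();
            }
        }, millisUntilNextOneAm(), TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
    }

    public void stop() {
        // would have to be invoked from some shutdown/close hook
        scheduler.shutdownNow();
    }

    private void rebuildSpellcheckers() {
        // placeholder: trigger the actual spellchecker build here
    }

    private long millisUntilNextOneAm() {
        Calendar next = Calendar.getInstance();
        next.set(Calendar.HOUR_OF_DAY, 1);
        next.set(Calendar.MINUTE, 0);
        next.set(Calendar.SECOND, 0);
        next.set(Calendar.MILLISECOND, 0);
        if (next.getTimeInMillis() <= System.currentTimeMillis()) {
            next.add(Calendar.DAY_OF_MONTH, 1);
        }
        return next.getTimeInMillis() - System.currentTimeMillis();
    }
}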

Thanx for your feedback,
cheers,
Martin


Re: Rebuild Spellchecker based on cron expression

2010-12-12 Thread Martin Grotzke
On Mon, Dec 13, 2010 at 2:12 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Maybe you've overlooked the build parameter?
 http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.build
I'm aware of this, but we don't want to maintain cron-jobs on all
slaves for all spellcheckers for all cores.
That's why I'm thinking about a more integrated solution. Or did I
really overlook s.th.?

Cheers,
Martin



 Hi,

 the spellchecker component already provides a buildOnCommit and
 buildOnOptimize option.

 Since we have several spellchecker indices building on each commit is
 not really what we want to do.
 Building on optimize is not possible as index optimization is done on
 the master and the slaves don't even run an optimize but only fetch
 the optimized index.

 Therefore I'm thinking about an extension of the spellchecker that
 allows you to rebuild the spellchecker based on a cron-expression
 (e.g. rebuild each night at 1 am).

 What do you think about this, is there anybody else interested in this?

 Regarding the lifecycle, is there already some executor framework or
 any regularly running process in place, or would I have to pull up my
 own thread? If so, how can I stop my thread when solr/tomcat is
 shutdown (I couldn't see any shutdown or destroy method in
 SearchComponent)?

 Thanx for your feedback,
 cheers,
 Martin




-- 
Martin Grotzke
http://twitter.com/martin_grotzke


Re: Rebuild Spellchecker based on cron expression

2010-12-12 Thread Martin Grotzke
Hi,

when thinking further about it, it's clear that
  https://issues.apache.org/jira/browse/SOLR-433
would be even better - we could generate the spellchecker indices on
commit/optimize on the master and replicate them to all slaves.

Just wondering why this patch receives so little interest. Is there
anything wrong with it?

Cheers,
Martin


On Mon, Dec 13, 2010 at 2:04 AM, Martin Grotzke
martin.grot...@googlemail.com wrote:
 Hi,

 the spellchecker component already provides a buildOnCommit and
 buildOnOptimize option.

 Since we have several spellchecker indices building on each commit is
 not really what we want to do.
 Building on optimize is not possible as index optimization is done on
 the master and the slaves don't even run an optimize but only fetch
 the optimized index.

 Therefore I'm thinking about an extension of the spellchecker that
 allows you to rebuild the spellchecker based on a cron-expression
 (e.g. rebuild each night at 1 am).

 What do you think about this, is there anybody else interested in this?

 Regarding the lifecycle, is there already some executor framework or
 any regularly running process in place, or would I have to pull up my
 own thread? If so, how can I stop my thread when solr/tomcat is
 shutdown (I couldn't see any shutdown or destroy method in
 SearchComponent)?

 Thanx for your feedback,
 cheers,
 Martin




-- 
Martin Grotzke
http://www.javakaffee.de/blog/


Re: Multicore and Replication (scripts vs. java, spellchecker)

2010-12-11 Thread Martin Grotzke
On Sat, Dec 11, 2010 at 12:38 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : #SOLR-433 MultiCore and SpellChecker replication [1]. Based on the
 : status of this feature request I'd asume that the normal procedure of
 : keeping the spellchecker index up2date would be running a cron job on
 : each node/slave that updates the spellchecker.
 : Is that right?

 i'm not 100% certain, but i suspect a lot of people just build the
  spellcheck dictionaries on the slave machines (redundantly) using
 buildOnCommit

 http://wiki.apache.org/solr/SpellCheckComponent#Building_on_Commits

Ok, also a good option. Though, for us this is not that perfect
because we have 4 different spellcheckers configured so that this
would eat some cpu that we'd prefer to have left for searching.
I think what would be desirable (in our case) is s.th. like rebuilding
the spellchecker based on a cron expression, so that we could recreate
it e.g. every night at 1 am.

When thinking about creating s.th. like this, do you have some advice
on where I could have a look in solr? Is there already some
framework for running regular tasks, or should I pull up my own
Timer/TimerTask etc. and create it from scratch?

Cheers,
Martin








 -Hoss




-- 
Martin Grotzke
http://www.javakaffee.de/blog/


Re: Multicore and Replication (scripts vs. java, spellchecker)

2010-12-10 Thread Martin Grotzke
Hi,

That there's no feedback indicates that our plans/preferences are
fine. Otherwise, now is a good opportunity to give feedback :-)

Cheers,
Martin


On Wed, Dec 8, 2010 at 2:48 PM, Martin Grotzke
martin.grot...@googlemail.com wrote:
 Hi,

 we're just planning to move from our replicated single index setup to
 a replicated setup with multiple cores.
 We're going to start with 2 cores, but the number of cores may
 change/increase over time.

 Our replication is still based on scripts/rsync, and I'm wondering if
 it's worth moving to java based replication.
 AFAICS the main advantage is simplicity, as with scripts based
 replication our operations team would have to maintain rsync daemons /
 cron jobs for each core.
 Therefore my own preference would be to drop scripts and chose the
 java based replication.

 I'd just wanted to ask for experiences with the one or another in a
 multicore setup. What do you say?

 Another question is regarding spellchecker replication. I know there's
 #SOLR-433 MultiCore and SpellChecker replication [1]. Based on the
 status of this feature request I'd asume that the normal procedure of
 keeping the spellchecker index up2date would be running a cron job on
 each node/slave that updates the spellchecker.
 Is that right?

 And a final one: are there other things we should be aware of / keep
 in mind when planning the migration to multiple cores?
 (Ok, I'm risking to get ask specific questions! as an answer, but
 perhaps s.o. has interesting, related stories to tell  :-))

 Thanx in advance,
 cheers,
 Martin


 [1] https://issues.apache.org/jira/browse/SOLR-433




-- 
Martin Grotzke
http://www.javakaffee.de/blog/


Multicore and Replication (scripts vs. java, spellchecker)

2010-12-08 Thread Martin Grotzke
Hi,

we're just planning to move from our replicated single index setup to
a replicated setup with multiple cores.
We're going to start with 2 cores, but the number of cores may
change/increase over time.

Our replication is still based on scripts/rsync, and I'm wondering if
it's worth moving to java based replication.
AFAICS the main advantage is simplicity, as with scripts based
replication our operations team would have to maintain rsync daemons /
cron jobs for each core.
Therefore my own preference would be to drop scripts and choose the
java based replication.

I'd just wanted to ask for experiences with the one or another in a
multicore setup. What do you say?

Another question is regarding spellchecker replication. I know there's
#SOLR-433 MultiCore and SpellChecker replication [1]. Based on the
status of this feature request I'd assume that the normal procedure of
keeping the spellchecker index up2date would be running a cron job on
each node/slave that updates the spellchecker.
Is that right?

And a final one: are there other things we should be aware of / keep
in mind when planning the migration to multiple cores?
(Ok, I'm risking getting "ask specific questions!" as an answer, but
perhaps s.o. has interesting, related stories to tell :-))

Thanx in advance,
cheers,
Martin


[1] https://issues.apache.org/jira/browse/SOLR-433


Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param

2010-12-01 Thread Martin Grotzke
On Tue, Nov 30, 2010 at 7:51 PM, Martin Grotzke
martin.grot...@googlemail.com wrote:
 On Tue, Nov 30, 2010 at 3:09 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Tue, Nov 30, 2010 at 8:24 AM, Martin Grotzke
 martin.grot...@googlemail.com wrote:
 Still I'm wondering, why this issue does not occur with the plain
 example solr setup with 2 indexed docs. Any explanation?

 It's an old option you have in your solrconfig.xml that causes a
 different code path to be followed in Solr:

   <!-- An optimization that attempts to use a filter to satisfy a search.
        If the requested sort does not include score, then the filterCache
        will be checked for a filter matching the query. If found, the filter
        will be used as the source of document ids, and then the sort will be
        applied to that. -->
   <useFilterForSortedQuery>true</useFilterForSortedQuery>

 Most apps would be better off commenting that out or setting it to
 false.  It only makes sense when a high number of queries will be
 duplicated, but with different sorts.

 Great, this sounds really promising, would be a very easy fix. I need
 to check this tomorrow on our test/integration server if changing this
 does the trick for us.
I just verified this fix on our test/integration system and it works - cool!

Thanx a lot for this hint,
cheers,
Martin


Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param

2010-11-30 Thread Martin Grotzke
On Tue, Nov 30, 2010 at 10:29 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 Hmm this is in fact a regression.

 TopFieldCollector expects (but does not verify) that numHits is > 0.

 I guess to fix this we could fix TopFieldCollector.create to return a
 NullCollector when numHits is 0.

Fixing this in lucene/solr sounds good :-)

Still I'm wondering, why this issue does not occur with the plain
example solr setup with 2 indexed docs. Any explanation?


 But: why is your app doing this?  Ie, if numHits (rows) is 0, the only
 useful thing you can get is totalHits?

Actually I don't know this (yet). Normally our search logic should
optimize this and ignore a requested sorting with rows=0, but there
seems to be a case that circumvents this - still figuring out.


 Still I think we should fix it in Lucene -- it's a nuisance to push
 such corner case checks up into the apps.  I'll open an issue...

Just for the record, this is https://issues.apache.org/jira/browse/LUCENE-2785

One question: as leaving out sorting leads to better performance, this
should also be true for rows=0. Or is lucene/solr already that clever
that it makes this optimization (ignoring sort) automatically? Do I
understand it correctly that the solution with the null collector
would make this optimization?

We're just asking ourselves if we should go ahead and analyze and fix
this in our app or wait for a patch for solr/lucene.
What do you think? Is there s.th. like a timeframe when there's an
agreement on the correct solution and a patch available?

Thanx & cheers,
Martin



 Mike

 On Mon, Nov 29, 2010 at 7:14 AM, Martin Grotzke
 martin.grot...@googlemail.com wrote:
 Hi,

 after an upgrade from solr-1.3 to 1.4.1 we're getting an
 ArrayIndexOutOfBoundsException for a query with rows=0 and a sort
 param specified:

 java.lang.ArrayIndexOutOfBoundsException: 0
        at 
 org.apache.lucene.search.FieldComparator$StringOrdValComparator.copy(FieldComparator.java:660)
        at 
 org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.collect(TopFieldCollector.java:84)
        at 
 org.apache.solr.search.SolrIndexSearcher.sortDocSet(SolrIndexSearcher.java:1391)
        at 
 org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:872)
        at 
 org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
        at 
 org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
        at 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)

 The query is e.g.:
  /select/?sort=popularity+desc&rows=0&start=0&q=foo

 When this is changed to rows=1 or when the sort param is removed the
 exception is gone and everything's fine.

 With a clean 1.4.1 installation (unzipped, started example and posted
 two documents as described in the tutorial) this issue is not
 reproducable.

 Does anyone have a clue what might be the reason for this and how we
 could fix this on the solr side?
 Of course - for a quick fix - I'll change our app so that there's no
 sort param specified when rows=0.

 Thanx & cheers,
 Martin

 --
 Martin Grotzke
 http://twitter.com/martin_grotzke





-- 
Martin Grotzke
http://www.javakaffee.de/blog/


Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param

2010-11-30 Thread Martin Grotzke
On Tue, Nov 30, 2010 at 3:09 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Tue, Nov 30, 2010 at 8:24 AM, Martin Grotzke
 martin.grot...@googlemail.com wrote:
 Still I'm wondering, why this issue does not occur with the plain
 example solr setup with 2 indexed docs. Any explanation?

 It's an old option you have in your solrconfig.xml that causes a
 different code path to be followed in Solr:

   <!-- An optimization that attempts to use a filter to satisfy a search.
        If the requested sort does not include score, then the filterCache
        will be checked for a filter matching the query. If found, the filter
        will be used as the source of document ids, and then the sort will be
        applied to that. -->
    <useFilterForSortedQuery>true</useFilterForSortedQuery>

 Most apps would be better off commenting that out or setting it to
 false.  It only makes sense when a high number of queries will be
 duplicated, but with different sorts.

Great, this sounds really promising and would be a very easy fix. I need
to check tomorrow on our test/integration server whether changing this
does the trick for us.

Though, I just enabled useFilterForSortedQuery in the solr 1.4.1
example and tested rows=0 with a sort param, which doesn't fail - a
correct/valid result is returned.
Is there any condition that has to be met additionally to produce the error?


 One question: as leaving out sorting leads to better performance, this
 should also be true for rows=0. Or is lucene/solr already that clever
 that it makes this optimization (ignoring sort) automatically?

 Solr has always special-cased this case and avoided sorting altogether

Great, good to know!

Cheers,
Martin


ArrayIndexOutOfBoundsException for query with rows=0 and sort param

2010-11-29 Thread Martin Grotzke
Hi,

after an upgrade from solr-1.3 to 1.4.1 we're getting an
ArrayIndexOutOfBoundsException for a query with rows=0 and a sort
param specified:

java.lang.ArrayIndexOutOfBoundsException: 0
at 
org.apache.lucene.search.FieldComparator$StringOrdValComparator.copy(FieldComparator.java:660)
at 
org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.collect(TopFieldCollector.java:84)
at 
org.apache.solr.search.SolrIndexSearcher.sortDocSet(SolrIndexSearcher.java:1391)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:872)
at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)

The query is e.g.:
/select/?sort=popularity+desc&rows=0&start=0&q=foo

When this is changed to rows=1 or when the sort param is removed the
exception is gone and everything's fine.

With a clean 1.4.1 installation (unzipped, started example and posted
two documents as described in the tutorial) this issue is not
reproducible.

Does anyone have a clue what might be the reason for this and how we
could fix this on the solr side?
Of course - for a quick fix - I'll change our app so that there's no
sort param specified when rows=0.
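The workaround on the app side would be s.th. like this (a sketch assuming the
query is built via SolrQuery, names are only illustrative):

import org.apache.solr.client.solrj.SolrQuery;

public class DropSortForRowsZero {
    public static void main(String[] args) {
        SolrQuery query = new SolrQuery("foo");
        query.setStart(0);
        query.setRows(0);
        query.set("sort", "popularity desc");

        // workaround: a sort makes no sense when no rows are requested,
        // and it currently triggers the ArrayIndexOutOfBoundsException
        if (query.getRows() != null && query.getRows().intValue() == 0) {
            query.remove("sort");
        }
        System.out.println(query);
    }
}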

Thanx & cheers,
Martin

-- 
Martin Grotzke
http://twitter.com/martin_grotzke


Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

2008-10-20 Thread Martin Grotzke
Thanx for your help so far, I just wanted to post my results here...

In short: Now I use the ShingleFilter to create shingles when copying my
fields into my field spellMultiWords. For query time, I implemented a
MultiWordSpellingQueryConverter that just leaves the query as is, so
that there's only one token that is checked for spelling suggestions.

Here's the detailed configuration:

= schema.xml =
<fieldType name="textSpellMultiWords" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
        outputUnigrams="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<field name="spellMultiWords" type="textSpellMultiWords" indexed="true"
    stored="true" multiValued="true"/>

<copyField source="name" dest="spellMultiWords" />
<copyField source="cat" dest="spellMultiWords" />
... and more ...


= solrconfig.xml =
  
<searchComponent name="spellcheckMultiWords" class="solr.SpellCheckComponent">

  <!-- this is not used at all, can probably be omitted -->
  <str name="queryAnalyzerFieldType">textSpellMultiWords</str>

  <lst name="spellchecker">
    <!-- Optional, it is required when more than one spellchecker is
         configured -->
    <str name="name">default</str>
    <str name="field">spellMultiWords</str>
    <str name="spellcheckIndexDir">./spellcheckerMultiWords1</str>
    <str name="accuracy">0.5</str>
    <str name="buildOnCommit">true</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">jarowinkler</str>
    <str name="field">spellMultiWords</str>
    <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
    <str name="spellcheckIndexDir">./spellcheckerMultiWords2</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<queryConverter name="queryConverter"
    class="my.proj.solr.MultiWordSpellingQueryConverter"/>


= MultiWordSpellingQueryConverter =

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

import org.apache.lucene.analysis.Token;
import org.apache.solr.spelling.QueryConverter;

public class MultiWordSpellingQueryConverter extends QueryConverter {

    /**
     * Converts the original query string to a collection of Lucene Tokens.
     *
     * @param original the original query string
     * @return a Collection of Lucene Tokens
     */
    public Collection<Token> convert( String original ) {
        if ( original == null ) {
            return Collections.emptyList();
        }
        final Token token = new Token( 0, original.length() );
        token.setTermBuffer( original );
        return Arrays.asList( token );
    }

}



There are some issues still to be resolved:
- terms are lowercased in the index, so some case restoration should
happen
- we use stemming for our text field, so the spellchecker might suggest
searches that lead to equal search results (e.g. the german2 stemmer
stems both "hose" and "hosen" to "hos" - "Hose" and "Hosen" give the
same results)
- inconsistent/strange sorting of suggestions (as described in
http://www.nabble.com/spellcheck%3A-issues-td19845539.html).


Cheers,
Martin


On Mon, 2008-10-06 at 22:45 +0200, Martin Grotzke wrote:
 On Mon, 2008-10-06 at 09:00 -0400, Grant Ingersoll wrote: 
  On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote:
  
   Hi Jason,
  
   what about multi-word searches like harry potter? When I do a search
   in our index for harry poter, I get the suggestion harry
   spotter (using spellcheck.collate=true and jarowinkler distance).
   Searching for harry spotter (we're searching AND, not OR) then gives
   no results. I asume that this is because suggestions are done for  
   words
   separately, and this does not require that both/all suggestions are
   contained in the same document.
  
  
  Yeah, the SpellCheckComponent is not phrase aware.  My guess would be  
  that you would somehow need a QueryConverter (see 
  http://wiki.apache.org/solr/SpellCheckComponent) 
 that preserved phrases as a single token.  Likewise, you would need  
  that on your indexing side as well for the spell checker.  In short, I  
  suppose it's possible, but it would be work.  You probably could use  
  the shingle filter (token based n-grams).
 I also thought about s.th. like this, and also stumbled over the
 ShingleFilter :)
 
 So I would change the spell field to use the ShingleFilter?
 
 Did I understand the answer to the posting chaining copyFields
 correctly, that I cannot pipe the title through some shingledTitle
 field and copy it afterwards to the spell field (while other fields
 like brand are copied directly to the spell field)?
 
 Thanx  cheers,
 Martin
 
 
  
  Alternatively, by using extendedResults, you can get back the  
  frequency of each of the words, and then you could decide whether the  
  collation is going to have any results assuming they are all or'd  
  together.  For phrases and AND queries, I'm not sure.  It's doable,  
  I'm sure, but it would be a lot more involved.
  
  
   I wonder what's the standard approach for searches with multiple  
   words

Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

2008-10-06 Thread Martin Grotzke
Hi Jason,

what about multi-word searches like "harry potter"? When I do a search
in our index for "harry poter", I get the suggestion "harry
spotter" (using spellcheck.collate=true and jarowinkler distance).
Searching for "harry spotter" (we're searching AND, not OR) then gives
no results. I assume that this is because suggestions are done for words
separately, and this does not require that both/all suggestions are
contained in the same document.

I wonder what's the standard approach for searches with multiple words.
Are these working ok for you?

Cheers,
Martin

On Fri, 2008-10-03 at 16:21 -0400, Jason Rennie wrote:
 Hi Martin,
 
 I'm a relative newbie to solr, have been playing with the spellcheck
 component and seem to have it working.  I certainly can't explain what all
 is going on, but with any luck, I can help you get the spellchecker
 up-and-running.  Additional replies in-lined below.
 
 On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke [EMAIL PROTECTED]
  wrote:
 
  Now I'm thinking about the source-field in the spellchecker (spell):
  how should fields be analyzed during indexing, and how should the
  queryAnalyzerFieldType be configured.
 
 
 I followed the conventions in the default solrconfig.xml and schema.xml
 files.  So I created a textSpell field type (schema.xml):
 
  <!-- field type for the spell checker which doesn't stem -->
  <fieldtype name="textSpell" class="solr.TextField"
      positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldtype>
 
 and used this for the queryAnalyzerFieldType.  I also created a spellField
 to store the text I want to spell check against and used the same analyzer
 (figuring that the query and indexed data should be analyzed the same way)
 (schema.xml):
 
   <!-- Spell check field -->
   <field name="spellField" type="textSpell" indexed="true" stored="true" />
 
 
 
  If I have brands like e.g. Apple or Ed Hardy I would copy them (the
  field brand) directly to the spell field. The spell field is of
  type string.
 
 
 We're copying description to spellField.  I'd recommend using a type like
 the above textSpell type since The StringField type is not analyzed, but
 indexed/stored verbatim (schema.xml):
 
   <copyField source="description" dest="spellField" />
 
 Other fields like e.g. the product title I would first copy to some
  whitespaceTokinized field (field type with WhitespaceTokenizerFactory)
  and afterwards to the spell field. The product title might be e.g.
  Canon EOS 450D EF-S 18-55 mm.
 
 
 Hmm... I'm not sure if this would work as I don't think the analyzer is
 applied until after the copy is made.  FWIW, I've had trouble copying
  multiple fields to spellField (i.e. adding a second copyField w/
 dest=spellField), so we just index the spellchecker on a single field...
 
 Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a
  StandardTokenizerFactory here?
 
 
 I think if you use the same analyzer for indexing and queries, the
 distinction probably isn't tremendously important.  When I went searching,
 it looked like the StandardTokenizer split on non-letters.  I'd guess the
 rationale for using the StandardTokenizer is that it won't recommend
 non-letter characters.  I was seeing some weirdness earlier (no
 inserts/deletes), but that disappeared now that I'm using the
 StandardTokenizer.
 
 Cheers,
 
 Jason
-- 
Martin Grotzke
http://www.javakaffee.de/blog/




Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

2008-10-06 Thread Martin Grotzke
On Mon, 2008-10-06 at 09:00 -0400, Grant Ingersoll wrote: 
 On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote:
 
  Hi Jason,
 
  what about multi-word searches like harry potter? When I do a search
  in our index for harry poter, I get the suggestion harry
  spotter (using spellcheck.collate=true and jarowinkler distance).
  Searching for harry spotter (we're searching AND, not OR) then gives
  no results. I asume that this is because suggestions are done for  
  words
  separately, and this does not require that both/all suggestions are
  contained in the same document.
 
 
 Yeah, the SpellCheckComponent is not phrase aware.  My guess would be  
 that you would somehow need a QueryConverter (see 
 http://wiki.apache.org/solr/SpellCheckComponent) 
that preserved phrases as a single token.  Likewise, you would need  
 that on your indexing side as well for the spell checker.  In short, I  
 suppose it's possible, but it would be work.  You probably could use  
 the shingle filter (token based n-grams).
I also thought about s.th. like this, and also stumbled over the
ShingleFilter :)

So I would change the spell field to use the ShingleFilter?

Did I understand the answer to the posting chaining copyFields
correctly, that I cannot pipe the title through some shingledTitle
field and copy it afterwards to the spell field (while other fields
like brand are copied directly to the spell field)?

Thanx  cheers,
Martin


 
 Alternatively, by using extendedResults, you can get back the  
 frequency of each of the words, and then you could decide whether the  
 collation is going to have any results assuming they are all or'd  
 together.  For phrases and AND queries, I'm not sure.  It's doable,  
 I'm sure, but it would be a lot more involved.
 
 
  I wonder what's the standard approach for searches with multiple  
  words.
  Are these working ok for you?
 
  Cheers,
  Martin
 
  On Fri, 2008-10-03 at 16:21 -0400, Jason Rennie wrote:
  Hi Martin,
 
  I'm a relative newbie to solr, have been playing with the spellcheck
  component and seem to have it working.  I certainly can't explain  
  what all
  is going on, but with any luck, I can help you get the spellchecker
  up-and-running.  Additional replies in-lined below.
 
  On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke [EMAIL PROTECTED]
  wrote:
 
  Now I'm thinking about the source-field in the spellchecker  
  (spell):
  how should fields be analyzed during indexing, and how should the
  queryAnalyzerFieldType be configured.
 
 
  I followed the conventions in the default solrconfig.xml and  
  schema.xml
  files.  So I created a textSpell field type (schema.xml):
 
  <!-- field type for the spell checker which doesn't stem -->
  <fieldtype name="textSpell" class="solr.TextField"
      positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldtype>
 
  and used this for the queryAnalyzerFieldType.  I also created a  
  spellField
  to store the text I want to spell check against and used the same  
  analyzer
  (figuring that the query and indexed data should be analyzed the  
  same way)
  (schema.xml):
 
   <!-- Spell check field -->
   <field name="spellField" type="textSpell" indexed="true"
     stored="true" />
 
 
 
  If I have brands like e.g. Apple or Ed Hardy I would copy them  
  (the
  field brand) directly to the spell field. The spell field is  
  of
  type string.
 
 
  We're copying description to spellField.  I'd recommend using a  
  type like
  the above textSpell type since The StringField type is not  
  analyzed, but
  indexed/stored verbatim (schema.xml):
 
    <copyField source="description" dest="spellField" />
 
  Other fields like e.g. the product title I would first copy to some
  whitespaceTokinized field (field type with  
  WhitespaceTokenizerFactory)
  and afterwards to the spell field. The product title might be e.g.
  Canon EOS 450D EF-S 18-55 mm.
 
 
  Hmm... I'm not sure if this would work as I don't think the  
  analyzer is
  applied until after the copy is made.  FWIW, I've had trouble copying
  multipe fields to spellField (i.e. adding a second copyField w/
  dest=spellField), so we just index the spellchecker on a single  
  field...
 
  Shouldn't this be a WhitespaceTokenizerFactory, or is it better to  
  use a
  StandardTokenizerFactory here?
 
 
  I think if you use the same analyzer for indexing and queries, the
  distinction probably isn't tremendously important.  When I went  
  searching,
  it looked like the StandardTokenizer split on non-letters.  I'd  
  guess the
  rationale for using the StandardTokenizer is that it won't recommend
  non-letter characters.  I was seeing some weirdness earlier (no
  inserts/deletes), but that disappeared now that I'm using the
  StandardTokenizer.
 
  Cheers,
 
  Jason
  -- 
  Martin Grotzke
  http://www.javakaffee.de

How to tokenize/analyze docs for the spellchecker - at indexing and query time

2008-10-01 Thread Martin Grotzke
Hi,

I'm just starting with the spellchecker component provided by solr - it
is really cool!

Now I'm thinking about the source-field in the spellchecker (spell):
how should fields be analyzed during indexing, and how should the
queryAnalyzerFieldType be configured.

If I have brands like e.g. Apple or Ed Hardy I would copy them (the
field brand) directly to the spell field. The spell field is of
type string.

Other fields like e.g. the product title I would first copy to some
whitespaceTokenized field (field type with WhitespaceTokenizerFactory)
and afterwards to the spell field. The product title might be e.g.
Canon EOS 450D EF-S 18-55 mm.

This is the process I have in mind during indexing (though I'm not sure
if some tokens/terms should be removed, but I'd assume that all terms
might be misspelled by the user).

Now when it comes to searching, the query should be analyzed using the
queryAnalyzerFieldType definition, which has a StandardTokenizerFactory
in the schema.xml of the solr example.

Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a
StandardTokenizerFactory here?

Or should I use a StandardTokenizerFactory for the spell field, so
that fields copied into this field get tokenized/analyzed in the same
way as the query will get tokenized/analyzed?

Do you have any experience with this and/or recommendations regarding
this?

Are there other things to consider?

Thanx for your help,
cheers,
Martin
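
(For reference, a minimal sketch of the matching spellcheck configuration in
solrconfig.xml, roughly as in the Solr 1.3 example config - the field name
"spell" and the index directory are just examples, and the component still has
to be hooked into a request handler via last-components:)

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <str name="queryAnalyzerFieldType">textSpell</str>
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spell</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
      </lst>
    </searchComponent>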




signature.asc
Description: This is a digitally signed message part


Re: prefix-search ignores the lowerCaseFilter

2007-10-29 Thread Martin Grotzke

On Thu, 2007-10-25 at 10:48 -0400, Yonik Seeley wrote:
 On 10/25/07, Max Scheffler [EMAIL PROTECTED] wrote:
  Is it possible that the prefix-processing ignores the filters?
 
 Yes, It's a known limitation that we haven't worked out a fix for yet.
 The issue is that you can't just run the prefix through the filters
 because of things like stop words, stemming, minimum length filters,
 etc.
What about not having only facet.prefix but additionally
facet.filtered.prefix that runs the prefix through the filters?
Would that be possible?

Cheers,
Martin

 
 -Yonik
 



signature.asc
Description: This is a digitally signed message part


Re: prefix-search ignores the lowerCaseFilter

2007-10-29 Thread Martin Grotzke

On Mon, 2007-10-29 at 13:31 -0400, Yonik Seeley wrote:
 On 10/29/07, Martin Grotzke [EMAIL PROTECTED] wrote:
  On Thu, 2007-10-25 at 10:48 -0400, Yonik Seeley wrote:
   On 10/25/07, Max Scheffler [EMAIL PROTECTED] wrote:
Is it possible that the prefix-processing ignores the filters?
  
   Yes, It's a known limitation that we haven't worked out a fix for yet.
   The issue is that you can't just run the prefix through the filters
   because of things like stop words, stemming, minimum length filters,
   etc.
 
  What about not having only facet.prefix but additionally
  facet.filtered.prefix that runs the prefix through the filters?
  Would that be possible?
 
 The underlying issue remains - it's not safe to treat the prefix like
 any other word when running it through the filters.
Yes, definitely the user of this feature should know what it
does - but at least there would be the possibility to run the prefix
through e.g. a LowerCaseFilter. After all, the user knows what filters
he has configured. E.g. if you only want a case-insensitive prefix test,
something like a facet.filtered.prefix would be really valuable.

Cheers,
Martin
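
(A minimal client-side workaround sketch, not part of Solr itself: it applies
the one normalization the field's analyzer performs - lowercasing - to the
prefix before the request is built; the class and method names are made up:)

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;
    import java.util.Locale;

    public class PrefixNormalizer {

        // Lowercase the user's prefix so it matches the lowercased index terms.
        public static String facetPrefixParam(String userPrefix)
                throws UnsupportedEncodingException {
            String normalized = userPrefix.toLowerCase(Locale.GERMAN);
            return "facet.prefix=" + URLEncoder.encode(normalized, "UTF-8");
        }

        public static void main(String[] args) throws UnsupportedEncodingException {
            System.out.println(facetPrefixParam("Fo")); // prints facet.prefix=fo
        }
    }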


 
 -Yonik
 



signature.asc
Description: This is a digitally signed message part


type ahead - suggest words with facet.prefix, but with original case (or another solution?)

2007-10-20 Thread Martin Grotzke
Hello,

I'm just thinking about a solution for a type ahead functionality
that shall suggest terms that the user can search for, and that
displays how many docs are behind that search (like google suggest).

When I use facet.prefix and facet.field=text, where text is my catchall
field (and default field for searching), then only lowercased words are
suggested, not the original ones. And I want it to be independent of
the user's input - it should not matter whether the user enters fo or Fo,
I always want Foo to be suggested if this word exists in my docs.
Is that possible?

AFAICS the limitation of this approach is that it is restricted to single
words. E.g. when the user enters foo ba, he would not get
Foo Bar as a suggestion (assuming that my catchall field contains
tokenized terms).

What do you think of this: Assuming I have my own RequestHandler,
I would split the user's input to get the last word, and use everything
but this last word as query, to limit the resulting docs (my default
operator is AND). Afterwards I search for terms starting with the last
word and do standard faceting stuff (calculate number of docs for each
term).
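
(A sketch of the request this approach would send, using only standard faceting
parameters - the field name "text" and the host are just examples: everything
but the last word becomes q, the last word becomes the prefix:)

    http://localhost:8983/solr/select?q=foo&rows=0
        &facet=true&facet.field=text
        &facet.prefix=ba&facet.limit=10&facet.mincount=1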

Are there other/better approaches/solutions for type ahead functionality
that you would recommend?

Btw: my docs contain products with the main fields name, cat, type,
tags, brand, color - these are used for searching (copied into the text
field).

Thanx in advance,
cheers,
Martin




signature.asc
Description: This is a digitally signed message part


Re: Different search results for (german) singular/plural searches - looking for a solution

2007-10-16 Thread Martin Grotzke
Hi,

now I played around with the snowball porter stemmer and it definitely
feels really good (used German2 as suggested).

For some cases (e.g. product types like top/tops, bermuda/bermudas or
hoody/hoodies) we additionally need synonyms. At first I thought it
would be good to use synonyms only at query time, but the docs in the
wiki recommend expanding synonyms at index time...

What are your experiences? Would you also suggest to use them when
indexing?
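
(A minimal sketch of such a field type - German2 stemming combined with
index-time synonym expansion as the wiki suggests; the type name and the
stopword/synonym file names are just examples:)

    <fieldtype name="textGerman" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>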

On Thu, 2007-10-11 at 17:32 +0200, Thomas Traeger wrote:
 Martin Grotzke schrieb:
  Try the SnowballPorterFilterFactory with German2 as language attribute 
  first and use synonyms for combined words i.e. Herrenhose = Herren, 
  Hose.
  
  so you use a combined approach?

 Yes, we define the relevant parts of compounded words (keywords only) as 
 synonyms and feed them in a special field that is used for searching and 
 for the product index. 
So you don't use a single catchall field text? What is the reason for
this, what is the advantage?

 I hope there will be a filter that can split 
 compounded word sometimes in the future...
There is no standard approach for handling this problem apart from
synonyms?
This is exactly what jwordsplitter does (as posted by Daniel)...


Thanx  cheers,
Martin


  By using stemming you will maybe have some interesting results, but it 
  is much better living with them than having no or much less results ;o)
  
  Do you have an example what interesting results I can expect, just to
  get an idea?

  Find more infos on the Snowball stemming algorithms here:
 
  http://snowball.tartarus.org/
  
  Thanx! I also had a look at this site already, but what is missing is a
  demo where one can see what's happening. I think I'll play a little with
  stemming to get a feeling for this.

 I think the Snowball stemmer is very good so I have no practical example 
 for you. Maybe this is of value to see what happens:
 
 http://snowball.tartarus.org/algorithms/german/diffs.txt
 
 If you have mixed languages in your content, which sometimes happens in 
 product data, you might get into some trouble.
 
  Also have a look at the StopFilterFactory, here is a sample stopwordlist 
  for the german language:
 
  http://snowball.tartarus.org/algorithms/german/stop.txt
  
  Our application handles products, do you think such stopwords are useful
  in this scenario also? I wouldn't expect a user to search for keine
  hose or s.th. like this :)

 I have seen much worse queries, so you never know ;o)
 
 think of a query like this: Hose in blau für Herren
 
 You will definitely want to remove "in" and "für" during searching, and it 
 reduces index size when removed during indexing. Maybe you will even get 
 better scores when only relevant terms are used. You should optimize the 
 stopword list based on your data.
 
 Regards,
 
 Tom
 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/


signature.asc
Description: This is a digitally signed message part


Re: Different search results for (german) singular/plural searches - looking for a solution

2007-10-11 Thread Martin Grotzke
Hi Daniel,

thanx for your suggestions - being able to export a large synonyms.txt
sounds very good!

Thx  cheers,
Martin


On Wed, 2007-10-10 at 23:38 +0200, Daniel Naber wrote:
 On Wednesday 10 October 2007 12:00, Martin Grotzke wrote:
 
  Basically I see two options: stemming and the usage of synonyms. Are
  there others?
 
 A large list of German words and their forms is available from a Windows 
 software called Morphy 
 (http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). You can use 
 it for mapping fullforms to base forms (Häuser - Haus).
 You can also have 
 a look at www.languagetool.org which uses this data in a Java software. 
 LanguageTool also comes with jWordSplitter, which can find a compound's 
 parts (Autowäsche - Auto + Wäsche).
 
 Regards
  Daniel
 



signature.asc
Description: This is a digitally signed message part


Different search results for (german) singular/plural searches - looking for a solution

2007-10-10 Thread Martin Grotzke
Hello,

with our application we have the issue that we get different
results for singular and plural searches (German language).

E.g. for hose we get 1,000 documents back, but for hosen
we get 10,000 docs. The same applies to t-shirt and t-shirts,
or e.g. hut and hüte - lots of cases :)

This is absolutely correct according to the schema.xml, as right
now we do not have any stemming or synonyms included.

Now we want to have similar search results for these singular/plural
searches. I'm thinking of a solution for this, and want to ask, what
are your experiences with this.

Basically I see two options: stemming and the usage of synonyms. Are
there others?

My concern with stemming is that it might produce unexpected results,
so that docs are found that do not match the query from the user's point
of view. I assume that this needs a lot of testing with different data.

The issue with synonyms is that we would have to create a file
containing all synonyms, so we would have to figure out all cases, in
contrast to a solution that is based on an algorithm.
The advantage of this approach is IMHO that it is very predictable
which results will be returned for a certain query.

Some background information:
Our documents contain products (id, name, brand, category, producttype,
description, color etc). The singular/plural issue basically applied to
the fields name, category and producttype, so we would like to restrict
the solution to these fields.

Do you have suggestions how to handle this?

Thanx in advance for sharing your experiences,
cheers,
Martin

-
Extracts of our schema.xml:

  <types>
    <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>

    <fieldType name="trimmedString" class="solr.TextField"
               sortMissingLast="true" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
      </analyzer>
      <!-- we should also configure lcasing for index and query analyzer -->
    </fieldType>
  </types>

  <fields>
    <field name="name" type="text" indexed="true" stored="true"/>
    <field name="cat" type="trimmedString" indexed="true" stored="true"
           multiValued="true" omitNorms="true"/>
    <field name="type" type="trimmedString" indexed="true" stored="true"
           multiValued="false" omitNorms="true"/>
  </fields>

  <defaultSearchField>text</defaultSearchField>

  <copyField source="tag" dest="text"/>
  <copyField source="cat" dest="text"/>
  <copyField source="name" dest="text"/>
  <copyField source="type" dest="text"/>
  <copyField source="brand" dest="text"/>
-




signature.asc
Description: This is a digitally signed message part


Re: How to extract constrained fields from query

2007-08-24 Thread Martin Grotzke
On Thu, 2007-08-23 at 10:44 -0700, Chris Hostetter wrote:
 : Probably I'm also interested in PrefixQueries, as they also provide a
 : Term, e.g. parsing ipod AND brand:apple gives a PrefixQuery for
 : brand:apple.
 
 uh? ... it shoudn't, not unless we're talking about some other
 customization you've already made.
My fault, this is returned for s.th. like brand:appl* - but perhaps
I would also like to facet on such fields then...

 
 
 : I want to do s.th. like dynamic faceting - so that the solr client
 : does not have to request facets via facet.field, but that I can decide
 : in my CustomRequestHandler which facets are returned. But I want to
 : return only facets for fields that are not already constained, e.g.
 : when the query contains s.th. like brand:apple I don't want to return
 : a facet for the field brand.
 
 Hmmm, i see ... well the easiest way to go is not to worry about it when
 parsing the query, when you go to compute facets for all hte fields you
 tink might be useful, you'll see that only one value for brand matches,
 and you can just skip it.
I would think that this is not the best option in terms of performance.

 
 that doesn't really work well for range queries -- but you can't exactly
 use the same logic for picking what your facet contraints will be on a
 field that makes sense to do a rnage query on anyway, so it's tricky
 either way.
 
 the custom QueryParser is still probably your best bet...
 
 : Ok, so I would override getFieldQuery, getPrefixQuery, getRangeQuery and
 : getWildcardQuery(?) and record the field names? And I would use this
 : QueryParser for both parsing of the query (q) and the filter queries
 : (fq)?
 
 yep.
Alright, then I'll choose this door.
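
(A minimal sketch of such a parser - the signatures are those of the classic
Lucene QueryParser Solr used at the time and may differ in other versions; the
class and method names here are made up:)

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class FieldRecordingQueryParser extends QueryParser {

        private final Set<String> constrainedFields = new HashSet<String>();

        public FieldRecordingQueryParser(String defaultField, Analyzer analyzer) {
            super(defaultField, analyzer);
        }

        protected Query getFieldQuery(String field, String queryText) throws ParseException {
            constrainedFields.add(field); // also records the default field for unqualified terms
            return super.getFieldQuery(field, queryText);
        }

        protected Query getPrefixQuery(String field, String termStr) throws ParseException {
            constrainedFields.add(field);
            return super.getPrefixQuery(field, termStr);
        }

        protected Query getRangeQuery(String field, String part1, String part2,
                boolean inclusive) throws ParseException {
            constrainedFields.add(field);
            return super.getRangeQuery(field, part1, part2, inclusive);
        }

        protected Query getWildcardQuery(String field, String termStr) throws ParseException {
            constrainedFields.add(field);
            return super.getWildcardQuery(field, termStr);
        }

        /** Fields seen while parsing; drop the default field if you only want explicit constraints. */
        public Set<String> getConstrainedFields() {
            return constrainedFields;
        }
    }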

 
 (Also Note there is also an extractTerms method on Query that can help in
 some cases, but the impl for ConstantScoreQuery (which is used when the
 SolrQueryParser sees a range query or a prefix query) doesn't really work
 at the moment.)
Yep, I had already tried this, but it always failed with an
UnsupportedOperationException...

Thanx a lot,
cheers,
Martin


 
 -Hoss
 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/


signature.asc
Description: This is a digitally signed message part


How to extract constrained fields from query

2007-08-22 Thread Martin Grotzke
Hello,

in my custom request handler, I want to determine which fields are
constrained by the user.

E.g. the query (q) might be "ipod AND brand:apple" and there might
be a filter query (fq) like "color:white" (or more).

What I want to know is that brand and color are constrained.

AFAICS I could use SolrPluginUtils.parseFilterQueries and test
if the queries are TermQueries and read its Field.
Then should I also test which kind of queries I get when parsing
the query (q) and look for all TermQueries from the parsed query?

Or is there a more elegant way of doing this?

Thanx a lot,
cheers,
Martin




signature.asc
Description: This is a digitally signed message part


RE: How to read values of a field efficiently

2007-08-21 Thread Martin Grotzke
On Tue, 2007-08-21 at 11:52 +0200, Ard Schrijvers wrote:
   you're missing the key piece that Ard alluded to ... there is one
   ordered list of all terms stored in the index ... a TermEnum lets you
   iterate over this ordered list, and the IndexReader.terms(Term) method
   lets you efficiently start at an arbitrary term.  if you are only
   interested in terms for a specific field, once your TermEnum returns a
   different field, you can stop -- you will never get any more terms for
   the field you care about (hence Ard's terms.term().field() == field in his
   loop conditional)
  Ok, I wasn't aware of that - I thought that Ards while loop would be
  wrong, 
 
 I am deeply hurt by your distrust.
 
 :-) 

Shame on me :-$ 
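
(For reference, the loop described above looks roughly like this with the
Lucene API of the time - the field name is just an example:)

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class TermIteration {

        /** Walks all indexed terms of the given field, from lowest to highest. */
        public static void forEachTerm(IndexReader reader, String field) throws IOException {
            TermEnum termEnum = reader.terms(new Term(field, ""));
            try {
                do {
                    Term t = termEnum.term();
                    if (t == null || !field.equals(t.field())) {
                        break; // past the last term of this field
                    }
                    // t.text() is the indexed value, termEnum.docFreq() its document count
                } while (termEnum.next());
            } finally {
                termEnum.close();
            }
        }
    }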


 
 Ard
 
 
 
 



signature.asc
Description: This is a digitally signed message part


Re: How to read values of a field efficiently

2007-07-31 Thread Martin Grotzke
On Mon, 2007-07-30 at 00:30 -0700, Chris Hostetter wrote:
 : Is it possible to get the values from the ValueSource (or from
 : getFieldCacheCounts) sorted by its natural order (from lowest to
 : highest values)?
 
 well, an inverted term index is already a data structure listing terms
 from lowest to highest and the associated documents -- so if you want to
 iterate from low to high between a range and find matching docs you should
 just use the TermEnum -- the whole point of the FieldCache (and
 FieldCacheSource) is to have a reverse inverted index so you can quickly
 fetch the indexed value if you know the docId.
Ok, I will have a look at the TermEnum and try this.

 
 perhaps you should elaborate a little more on what it is you are trying to
 do so we can help you figure out how to do it more efficiently ...
I want to read all values of the price field of the found docs,
and calculate the mean value and the standard deviation.
Based on the min value (mean - deviation, the max value (mean +
deviation) and the number of prices I calculate price ranges.

Then I iterate over the sorted array of prices and count how many
prices go into the current range.

This sorting (Arrays.sort) takes a lot of time, which is why I asked if
it's possible to read the values in sorted order.

But reading this, I think it would also be possible to skip the sorting,
check for each price which bucket it falls into, and increment the
counter for that bucket - this should also be an opportunity for
optimization.
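
(A sketch of that one-pass bucketing - plain Java, nothing Solr-specific; the
bucket bounds would come from the mean/deviation computed before:)

    /**
     * Counts how many prices fall into each of numBuckets equally sized ranges
     * between min and max; values outside are clamped into the first/last
     * bucket. One pass over the data, no sorting.
     */
    static int[] countBuckets(float[] prices, float min, float max, int numBuckets) {
        int[] counts = new int[numBuckets];
        float width = (max - min) / numBuckets;
        for (int i = 0; i < prices.length; i++) {
            int bucket = width > 0 ? (int) ((prices[i] - min) / width) : 0;
            if (bucket < 0) bucket = 0;
            if (bucket >= numBuckets) bucket = numBuckets - 1;
            counts[bucket]++;
        }
        return counts;
    }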

 ... perhaps you shouldn't be iterating over every doc to figure out your
 ranges .. perhaps you can iterate over the terms themselves?
Are you referring to TermEnum with this?

Thanx  cheers,
Martin


 
 
 hang on ... rereading your first message i just noticed something i
 definitely didn't spot before...
 
  Fairly long: getFieldCacheCounts for the cat field takes ~70 ms
  for the second request, while reading prices takes ~600 ms.
 
 ...i clearly missed this, and fixated on your assertion that your reading
 of field values took longer than the stock methods -- but you're not just
 comparing the time needed byu different methods, you're also timing
 different fields.
 
 this actually makes a lot of sense since there are probably a lot fewer
 unique values for the cat field, so there are a lot fewer discrete values
 to deal with when computing counts.
 
 
 
 
 -Hoss
 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/


signature.asc
Description: This is a digitally signed message part


How to read values of a field efficiently

2007-07-21 Thread Martin Grotzke
Hi,

I have a custom Facet implementation that extends SimpleFacets
and overrides getTermCounts( String field ).

For the price field I calculate available ranges; for this I
have to read the values of this field. Right now this looks like
this:

public NamedList getTermCounts( final String field ) throws IOException {
    SchemaField sf = searcher.getSchema().getField( field );
    FieldType ft = sf.getType();
    final DocValues docValues =
        ft.getValueSource( sf ).getValues( searcher.getReader() );
    final DocIterator iter = docs.iterator();
    final TIntArrayList prices = new TIntArrayList( docs.size() );
    while (iter.hasNext()) {
        float value = docValues.floatVal( iter.next() );
        prices.add( (int) value );
    }
    // calculate ranges and return the result
}

This part (reading field values) takes fairly long compared
to the other fields (that use getFacetTermEnumCounts or
getFieldCacheCounts as implemented in SimpleFacets), so that
I assume that there is potential for optimization.

Fairly long: getFieldCacheCounts for the cat field takes ~70 ms
for the second request, while reading prices takes ~600 ms.

Is there a better way (in terms of performance) to determine
the values for the found docs?

Thanx in advance,
cheers,
Martin




signature.asc
Description: This is a digitally signed message part


Indexing question - split word and comma

2007-07-05 Thread Martin Grotzke
Hi all,

I have a document with a name field like this:
<field name='name'>MP3-Player, Apple, &#xBB;iPod nano&#xAB;, silber, 4GB</field>

and want to find "apple". Unfortunately, I only find "apple,"...

Can anybody help me with this?


The schema.xml contains the following field definition:
<field name="name" type="text" indexed="true" stored="true"/>

and this fieldType definition for type "text":
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

The default search field is "text":
<defaultSearchField>text</defaultSearchField>

with the following definition:
<field name="text" type="text" indexed="true" stored="false"
       multiValued="true"/>

and the copy from name to text:
<copyField source="name" dest="text"/>



Thanx in advance,
cheers,
Martin




signature.asc
Description: This is a digitally signed message part


Re: Indexing question - split word and comma

2007-07-05 Thread Martin Grotzke
On Thu, 2007-07-05 at 11:56 -0700, Mike Klaas wrote:
 On 5-Jul-07, at 11:43 AM, Martin Grotzke wrote:
 
  Hi all,
 
  I have a document with a name field like this:
  field name='name'MP3-Player, Apple, #xBB;iPod nano#xAB;, silber,
  4GB/field
 
  and want to find apple. Unfortunately, I only find apple,...
 
  Can anybody help me with this?
 
 Sure: you're using WhitespaceAnalyzer, which only splits on  
 whitespace.  If you want to split words from punctuation, you should  
 use something like StandardAnalyzer or WordDelimiterFilter.
I replaced <tokenizer class="solr.WhitespaceTokenizerFactory"/> with
<tokenizer class="solr.StandardTokenizerFactory"/> in the index analyzer
of the fieldtype definition, and now I find "apple" and "ipod" - really
great!

 
 It is also extremely helpful to look at the analysis page on the solr  
 admin (verbose=true) and see exactly what tokens your analyzer produces.
This is such a cool tool, I didn't know it! It's really great that you
see each step of the filters so that it's possible to understand better
what's going on during indexing, really, really cool!!

Thanx a lot,
cheers,
Martin
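
(For the record, the other option Mike mentions - WordDelimiterFilter - would
look roughly like this in the index analyzer; the attribute values are only a
common starting point, not taken from this thread:)

    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>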


 
 -Mike
 



signature.asc
Description: This is a digitally signed message part


Re: Same record belonging to multiple facets

2007-07-05 Thread Martin Grotzke
On Thu, 2007-07-05 at 12:39 -0700, Thiago Jackiw wrote:
 Is there a way for a record to belong to multiple facets? If so, how
 would one go about implementing it?
 
 What I'd like to accomplish would be something like:
 
 record A:
 name=John Doe
 category_facet=Cars
 category_facet=Electronics
Isn't this the multiValued="true" property in your field definition for
category_facet?

Cheers,
Martin
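
(A sketch of such a field definition - "string" being the usual untokenized
type from the example schema:)

    <field name="category_facet" type="string" indexed="true" stored="true"
           multiValued="true"/>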

 
 And when searching for John Doe his record would appear under both
 Cars and Electronics facet categories.
 
 Thanks.
 
 --
 Thiago Jackiw
 



signature.asc
Description: This is a digitally signed message part


Re: Dynamically calculated range facet

2007-06-27 Thread Martin Grotzke
Chris, thanx for all this info! I'll think about these things again
and then come back to you...

Cheers,
Martin
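
(Regarding the single-pass standard deviation asked about below: yes, there is
such a formula - a sketch in plain Java, nothing Solr-specific:)

    /** One-pass mean and standard deviation (Welford's method). */
    static double[] meanAndStdDev(float[] values) {
        double mean = 0.0;
        double m2 = 0.0;
        int n = 0;
        for (int i = 0; i < values.length; i++) {
            n++;
            double delta = values[i] - mean;
            mean += delta / n;
            m2 += delta * (values[i] - mean);
        }
        double stdDev = n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;
        return new double[] { mean, stdDev };
    }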


On Tue, 2007-06-26 at 23:22 -0700, Chris Hostetter wrote:
 : my documents (products) have a price field, and I want to have
 : a dynamically calculated range facet for that in the response.
 
 FYI: there have been some previous discussions on this topic...
 
 http://www.nabble.com/blahblah-t2387813.html#a6799060
 http://www.nabble.com/faceted-browsing-t1363854.html#a3753053
 
 : AFAICS I do not have the possibility to specify range queries in my
 : application, as I do not have a clue what's the lowest and highest
 : price in the search result and what are good ranges according
 : to the (statistical) distribution of prices in the search result.
 
 as mentioned in one of those threads, it's *really* hard to get the
 statistical sampling to the point where it's both balanced, but also user
 friendly.  writing code specifically for price ranges in dollars lets you
 make some assumptions about things that give you nice ranges (rounding
 to one significant digit less than the max, doing log based ranges, etc..)
 that wouldn't really apply if you were trying to implement a truly
 generic dynamic range generator.
 
 one thing to keep in mind: it's typically not a good idea to have the
 constraint set of a facet change just because some other constraint was
 added to the query -- individual constraints might disappear because
 they no longer apply, but it can be very disconcerting to a user
 when options change on them  if i search on ipod a statistical
 analysis of prices might yield facet ranges of $1-20, $20-60, $60-120,
 $120-$200 ... if i then click on accessories the statistics might skew
 cheaper, so the new ranges are $1-20, $20-30, $30-40, $40-70 ...  and now
 i'm a frustrated user, because i really wanted to use the range $20-60
 (that just happens to be my budget) and you offered it to me and then you
 took it away ... i have to undo my selection of accessories then click
 $20-60, and then click accessories to get what i want ... not very nice.
 
 : So if it would be possible to go over each item in the search result
 : I could check the price field and define my ranges for the specific
 : query on solr side and return the price ranges as a facet.
 
 : Otherwise, what would be a good starting point to plug in such
 : functionality into solr?
 
 if you really want to do statistical distributions, one way to avoid doing
 all of this work on the client side (and needing to pull back all of the
 prices from all of the matches) would be to write a custom request handler
 that subclasses whichever one you currently use and does this computation
 on the server side -- where it has lower level access to the data and
 doesn't need to stream it over the wire.  FieldCache in particular would
 come in handy.
 
 it occurs to me that even though there may not be a way to dynamically
 create facet ranges that can apply usefully on any numeric field, we could
 add generic support to the request handlers for optionally fetching some
 basic statistics about a DocSet for clients that want them (either for
 building ranges, or for any other purpose)
 
 min, max, mean, median, mode, midrange ... those should all be easy to
 compute using the ValueSource from the field type (it would be nice if
 FieldType's had some way of indicating which DocValues function can best
 manage the field type, but we can always assume float or have an option
 for dictating it ... people might want a float mean for an int field
 anyway)
 
 i suppose even stddev could be computed fairly easily ... there's a
 formula for that that works well in a single pass over a bunch of values
 right?
 
 
 
 
 -Hoss
 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/


signature.asc
Description: This is a digitally signed message part


Re: Dynamically calculated range facet

2007-06-27 Thread Martin Grotzke
On Tue, 2007-06-26 at 16:48 -0700, Mike Klaas wrote:
 On 26-Jun-07, at 3:01 PM, Martin Grotzke wrote:
  AFAICS I do not have the possibility to specify range queries in my
  application, as I do not have a clue what's the lowest and highest
  price in the search result and what are good ranges according
  to the (statistical) distribution of prices in the search result.
 
  So if it would be possible to go over each item in the search result
  I could check the price field and define my ranges for the specific
  query on solr side and return the price ranges as a facet.
 
  Has anybody done s.th. like this before, or is there s.th. that I'm
  missing and why this approach does not make sense at all?
 
  Otherwise, what would be a good starting point to plug in such
  functionality into solr?
 
  Easy: facet based on fixed ranges (say, every 10 dollars for x < 100,  
  100 dollars for x < 1000, etc.), and combine them sensibly on the  
  client-side.  Requires no solr-side modification.
But then I have to find x (the highest value of the price field?) on
solr side and also I have to build the fixed ranges on solr side, right?

Cheers,
Martin

 
 A bit harder: define your own request handler that loops over the  
 documents after a search and samples the values of (say) the first 20  
 docs (or more, but be sure to use the FieldCache if so).  Calculate  
 your range queries, facets (code will be almost identical to the code  
 in the builtin request handlers), and return the results.
 
 cheers,
 -Mike
 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/


signature.asc
Description: This is a digitally signed message part


Re: Dynamically calculated range facet

2007-06-27 Thread Martin Grotzke
On Tue, 2007-06-26 at 19:53 -0700, John Wang wrote:
 www.browseengine.com has facet search that handles this.
You are calculating range facets dynamically? Do you have any
code I can have a look at? I had a look at c.b.solr.
BoboRequestHandler, but this does not seem to calculate ranges.

Cheers,
Martin

 
 We are working on a solr plugin.
 
 -John
 
 On 6/26/07, Mike Klaas [EMAIL PROTECTED] wrote:
 
  On 26-Jun-07, at 3:01 PM, Martin Grotzke wrote:
   AFAICS I do not have the possibility to specify range queries in my
   application, as I do not have a clue what's the lowest and highest
   price in the search result and what are good ranges according
   to the (statistical) distribution of prices in the search result.
  
   So if it would be possible to go over each item in the search result
   I could check the price field and define my ranges for the specific
   query on solr side and return the price ranges as a facet.
  
   Has anybody done s.th. like this before, or is there s.th. that I'm
   missing and why this approach does not make sense at all?
  
   Otherwise, what would be a good starting point to plug in such
   functionality into solr?
 
   Easy: facet based on fixed ranges (say, every 10 dollars for x < 100,
   100 dollars for x < 1000, etc.), and combine them sensibly on the
   client-side.  Requires no solr-side modification.
 
  A bit harder: define your own request handler that loops over the
  documents after a search and samples the values of (say) the first 20
  docs (or more, but be sure to use the FieldCache if so).  Calculate
  your range queries, facets (code will be almost identical to the code
  in the builtin request handlers), and return the results.
 
  cheers,
  -Mike
 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/


signature.asc
Description: This is a digitally signed message part


RE: Dynamically calculated range facet

2007-06-27 Thread Martin Grotzke
On Wed, 2007-06-27 at 09:06 -0400, Will Johnson wrote:
 one thing to keep in mind: it's typically not a good idea to have the
 constraint set of a facet change just because some other constraint was
 added to the query -- individual constraints might disappear because
 they no longer apply, but it can be very disconcerting to a user
 when options change on them  if i search on ipod a statistical
 analysis of prices might yield facet ranges of $1-20, $20-60, $60-120,
 $120-$200 ... if i then click on accessories the statistics might skew
 cheaper, so the new ranges are $1-20, $20-30, $30-40, $40-70 ...  and now
 i'm a frustrated user, because i really wanted to use the range $20-60
 (that just happens to be my budget) and you offered it to me and then you
 took it away ... i have to undo my selection of accessories then click
 $20-60, and then click accessories to get what i want ... not very nice.
 
 Many of the other engines I've worked with in the past did this and it was
 one of the most requested/implemented features we had with regard to
 facets.  That doesn't make it 'right' but it did tend to make product
 managers and test users happy.  The use case that often came up was the
 ability to dynamically drill inside ranges.  For instance my first
 search for 'computer' on a large ecommerce site might yield ranges of
 0-500, 500-1000, 1000-2000, 2000+, selecting 500-1000 might then yield
 ranges of 500-600, 600-700 and so on. There are also many different
 algorithms that can be employed: equal frequency per facet count, equal
 sized ranges, rounded ranges, etc.
I just had a conversation with our customer and they also want to
have it like this - adjusting with a new facet constraint...

Cheers,
Martin


 
 - will 
 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/


signature.asc
Description: This is a digitally signed message part


Dynamically calculated range facet

2007-06-26 Thread Martin Grotzke
Hi,

my documents (products) have a price field, and I want to have
a dynamically calculated range facet for that in the response.

E.g. I want to have this in the response
price:[* TO 20]  - 23
price:[20 TO 40] - 42
price:[40 TO *]  - 33
if prices are between 0 and 60
but
price:[* TO 100]   - 23
price:[100 TO 200] - 42
price:[200 TO *]   - 33
if prices are between 0 and 300

AFAICS I do not have the possibility to specify range queries in my
application, as I do not have a clue what's the lowest and highest
price in the search result and what are good ranges according
to the (statistical) distribution of prices in the search result.

So if it would be possible to go over each item in the search result
I could check the price field and define my ranges for the specific
query on solr side and return the price ranges as a facet.

Has anybody done s.th. like this before, or is there s.th. that I'm
missing and why this approach does not make sense at all?

Otherwise, what would be a good starting point to plug in such
functionality into solr?
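
(For completeness: with known bounds, ranges like the ones above can already be
requested with standard facet.query parameters - the hard part is choosing the
bounds dynamically:)

    ...&facet=true
       &facet.query=price:[* TO 20]
       &facet.query=price:[20 TO 40]
       &facet.query=price:[40 TO *]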

Thanx a lot in advance,
cheers,
Martin


signature.asc
Description: This is a digitally signed message part


Re: All facet.fields for a given facet.query?

2007-06-20 Thread Martin Grotzke
On Tue, 2007-06-19 at 11:09 -0700, Chris Hostetter wrote:
 I solve this problem by having metadata stored in my index which tells
 my custom request handler what fields to facet on for each category ...
How do you define this metadata?

Cheers,
Martin


  but i've also got several thousand categories.  If you've got less than
 100 categories, you could easily enumerate them all with default
 facet.field params in your solrconfig using seperate requesthandler
 instances.
 
 : What do the experts think about this?
 
 you may want to read up on the past discussion of this in SOLR-247 ... in
 particular note the link to the mail archive where there was assitional
 discussion about it as well.  Where we left things is that it
  might make sense to support true globbing in both fl and facet.field, so
 you can use naming conventions and say things like facet.field=facet_*
 but that in general trying to do something like facet.field=* would be a
 very bad idea even if it was supported.
 
 http://issues.apache.org/jira/browse/SOLR-247
 
 
 -Hoss
 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/


signature.asc
Description: This is a digitally signed message part


Re: All facet.fields for a given facet.query?

2007-06-20 Thread Martin Grotzke
On Tue, 2007-06-19 at 19:16 +0200, Thomas Traeger wrote:
 Hi,
 
 I'm also just at that point where I think I need a wildcard facet.field 
 parameter (or someone points out another solution for my problem...). 
 Here is my situation:
 
 I have many products of different types with totally different 
 attributes. There are currently more than 300 attributes.
 I use dynamic fields to import the attributes into solr without having 
 to define a specific field for each attribute. Now when I make a query I 
 would like to get back all facet.fields that are relevant for that query.
 
 I think it would be really nice if I didn't have to know which facet 
 fields exist at query time, but could instead just import attributes into 
 dynamic fields, get the relevant facets back and decide in the frontend 
 which to display and how...
Do you really need all facets in the frontend?

Would it be a solution to have a facet ranking in the field definitions,
and then decide at query time which fields to facet on? This would
need an additional query parameter like facet.query.count.

E.g. if you have a query with q=foo+AND+prop1:bar+AND+prop2:baz
and you have fields
prop1 with facet-ranking 100
prop2 with facet-ranking 90
prop3 with facet-ranking 80
prop4 with facet-ranking 70
prop5 with facet-ranking 60

then you might decide not to facet on prop1 and prop2, as you already
have constraints on them, but to facet on prop3 and prop4 if
facet.query.count is 2.

Just thinking about that... :)

Cheers,
Martin


 
 What do the experts think about this?
 
 Tom
 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/


signature.asc
Description: This is a digitally signed message part


Re: All facet.fields for a given facet.query?

2007-06-20 Thread Martin Grotzke
On Wed, 2007-06-20 at 12:59 +0200, Thomas Traeger wrote:
 Martin Grotzke schrieb:
  On Tue, 2007-06-19 at 19:16 +0200, Thomas Traeger wrote:
[...]
  I think it would be really nice, if I don't have to know which facets 
  fields are there at query time, instead just import attributes into 
  dynamic fields, get the relevant facets back and decide in the frontend 
  which to display and how...
  
  Do you really need all facets in the frontend?

 no, only the subset with matches for the current query.
ok, that's somewhat similar to our requirement, but we want to get only
e.g. the first 5 relevant facets back from solr and not handle this
in the frontend.

  Would it be a solution to have a facet ranking in the field definitions,
  and then decide at query time, on which fields to facet on? This would
  need an additional query parameter like facet.query.count.
[...]

 One step after the other ;o), the ranking of the facets will be another 
 problem I have to solve, counts of facets and matching documents will be 
 a starting point. Another idea is to use the score of the documents 
 returned by the query to compute a score for the facet.field...
Yep, this is also different for different applications.

I'm also interested in this problem and would like to help solve
it (though I'm really new to lucene and solr)...

Cheers,
Martin


 
 Tom
 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/


signature.asc
Description: This is a digitally signed message part


Re: All facet.fields for a given facet.query?

2007-06-20 Thread Martin Grotzke
On Wed, 2007-06-20 at 12:49 -0700, Chris Hostetter wrote:
 :  I solve this problem by having metadata stored in my index which tells
 :  my custom request handler what fields to facet on for each category ...
 : How do you define this metadata?
 
 this might be a good place to start, note that this message is almost two
  years old, and predates the open-sourcing of Solr ... the Servlet referred
 to in this thread is Solr.
 
 http://www.nabble.com/Announcement%3A-Lucene-powering-CNET.com-Product-Category-Listings-p748420.html
 
 ...i think i also talked a bit about the metadata documents in my
  apachecon slides from last year ... but i don't really remember, and i
  haven't looked at them in a while...
 
 http://people.apache.org/~hossman/apachecon2006us/

thx, I'll have a look at these resources.

cheers,
martin


 
 
 -Hoss
 



signature.asc
Description: This is a digitally signed message part


Re: Solr 1.2 HTTP Client for Java

2007-06-14 Thread Martin Grotzke
On Thu, 2007-06-14 at 11:32 +0100, Daniel Alheiros wrote:
 Hi
 
 I've been using one Java client I got from a colleague but I don't know
 exactly its version or where to get any update for it. Base package is
 org.apache.solr.client (where there are some common packages) and the client
 main package is org.apache.solr.client.solrj.
 
 Is it available via Maven2 central repository?
Have a look at the issue tracker, there's one with solr clients:
http://issues.apache.org/jira/browse/SOLR-20

I've also used one of them, but to be honest, do not remember which
one ;)

Cheers,
Martin


 
 Regards,
 Daniel
 
 
 http://www.bbc.co.uk/
 This e-mail (and any attachments) is confidential and may contain personal 
 views which are not the views of the BBC unless specifically stated.
 If you have received it in error, please delete it from your system.
 Do not use, copy or disclose the information in any way nor act in reliance 
 on it and notify the sender immediately.
 Please note that the BBC monitors e-mails sent or received.
 Further communication will signify your consent to this.
   
 



signature.asc
Description: This is a digitally signed message part


Re: Interesting Practical Solr Question

2007-05-22 Thread Martin Grotzke
On Tue, 2007-05-22 at 13:06 -0400, Erik Hatcher wrote:
 On May 22, 2007, at 11:31 AM, Martin Grotzke wrote:
  You need to specify the constraints (facet.query or facet.field  
  params)
  Too bad, so we would either have to know the schema in the application
  or provide queries for index metadata / the schema / faceting info.
 
 However, the LukeRequestHandler (currently a work in progress)  
 provides the fields and their types.  You certainly would want to  
 specify which fields you want returned as facets rather than it just  
 assuming you want all fields (consider a full-text field!).
For sure, perhaps the schema field element could be extended by an
attribute isfacet. But then we reach the point where we want to
have facet categories, and depending on the context (query) different
facets (categories) are returned.

But really cool what information the LukeRequestHandler provides!!

Cheers,
Martin


 
   http://wiki.apache.org/solr/LukeRequestHandler
 
 Erik
 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/


signature.asc
Description: This is a digitally signed message part


Re: Interesting Practical Solr Question

2007-05-22 Thread Martin Grotzke
On Tue, 2007-05-22 at 15:10 -0400, Erik Hatcher wrote:
 On May 22, 2007, at 1:36 PM, Martin Grotzke wrote:
  For sure, perhaps the schema field element could be extended by an
  attribute isfacet
 
 There is no effective difference between a facet field and any  
 other indexed field.  What fields are facets is application  
 specific and not really something Solr should be responsible for.
 
 In Solr Flare (an evolving Ruby on Rails plugin), we made the  
 decision to use naming conventions to determine what fields are  
 facets.  *_facet named fields are facets.  Maybe that convention  
 would work in your scenario also?
Yes, this is an option, good idea.

Thanx  cheers,
Martin
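
(A sketch of that convention in schema.xml - one catch-all dynamic field for
facet values, type "string" assumed:)

    <dynamicField name="*_facet" type="string" indexed="true" stored="false"
                  multiValued="true"/>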

 
   Erik
 



signature.asc
Description: This is a digitally signed message part


Re: PriceJunkie.com using solr!

2007-05-17 Thread Martin Grotzke
Very nice and really fast, congrats!

Are you willing to provide the mentioned features to solr users?
I think especially the category-to-facet management (facet groups)
is really useful...
It would be very nice to have this problem solved once and for all... :)

Cheers,
Martin


On Wed, 2007-05-16 at 16:28 -0500, Mike Austin wrote:
 I just wanted to say thanks to everyone for the creation of solr.  I've been
 using it for a while now and I have recently brought one of my side projects
 online.  I have several other projects that will be using solr for it's
 search and facets.
 
 Please check out www.pricejunkie.com and let us know what you think.. You
 can give feedback and/or sign up on the mailing list for future updates.
 The site is very basic right now and many new and useful features plus
 merchants and product categories will be coming soon!  I thought it would be
 a good idea to at least have a few people use it to get some feedback early
 and often.
 
 Some of the nice things behind the scenes that we did with solr:
 - created custom request handlers that have category to facet to attribute
 caching built in
 - category to facet management
   - ability to manage facet groups (attributes within a set facet) and 
 assign
 them to categories
   - ability to create any category structure and share facet groups
 
 - facet inheritance for any category (a facet group can be defined on a
 parent category and pushed down to all children)
 - ability to create sub-categories as facets instead of normal sub
 categories
 - simple xml configuration for the final outputted category configuration
 file
 
 
 I'm sure there are more cool things but that is all for now.  Join the
 mailing list to see more improvements in the future.
 
 Also.. how do I get added to the Using Solr wiki page?
 
 
 Thanks,
 Mike Austin
 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/


signature.asc
Description: This is a digitally signed message part