Re: Starts with Query

2012-06-15 Thread nutchsolruser
Thanks Jack for the valuable response. Actually I am trying to match *any* numeric
pattern at the start of each document. I don't know the documents in the index; I
just want documents whose title starts with any digit.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Starts-with-Query-tp3989627p3989761.html
Sent from the Solr - User mailing list archive at Nabble.com.

IndexWrite in Lucene/Solr 3.5 is slower?

2012-06-15 Thread Ramprakash Ramamoorthy
We are upgrading our search infrastructure from Lucene 2.3.1 to Lucene 3.5.
I am in the process of load testing and I found that Lucene 2.3.1
could index 32,000 docs per second, whereas Lucene 3.5 could index only
around 17,000 docs per second.

Indeed, both of them use the standard analyzer and the default settings. Is
3.5 slower because it indexes more detail, thereby resulting in a faster
search? Ours is a log management product, and indexing speed is highly
important to us.

OK, to cut a long story short: will the slower indexing of 3.5 result in
higher search speed? If not, what else should I fine-tune to improve the
indexing speed?

-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
Engineer Trainee,
Zoho Corporation.
+91 9626975420


RE: Starts with Query

2012-06-15 Thread Afroz Ahmad
If you are not searching for a specific digit and want to match all
documents that start with any digit, you could, as part of the indexing
process, have another field, say startsWithDigit, and set it to true if
the title begins with a digit. All you need to do at query time then
is query for startsWithDigit=true.
Thanks
Afroz
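
A minimal sketch of this approach, assuming a boolean schema field named
startsWithDigit that the indexing client (or an update processor) fills in
whenever the title begins with a digit (names illustrative):

   <!-- schema.xml: hypothetical flag field, not stored, only filtered on -->
   <field name="startsWithDigit" type="boolean" indexed="true" stored="false"/>

At query time the whole check then collapses to a cheap filter, e.g.
fq=startsWithDigit:true.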


From: nutchsolruser
Sent: 6/14/2012 11:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Starts with Query
Thanks Jack for the valuable response. Actually I am trying to match *any* numeric
pattern at the start of each document. I don't know the documents in the index; I
just want documents whose title starts with any digit.

--
View this message in context:
http://lucene.472066.n3.nabble.com/Starts-with-Query-tp3989627p3989761.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: IndexWrite in Lucene/Solr 3.5 is slower?

2012-06-15 Thread pravesh
BTW, Have you changed the MergePolicy & MergeScheduler settings also? Since
Lucene 3.x/3.5 onwards,
there have been new MergePolicy & MergeScheduler implementations available,
like TieredMergePolicy & ConcurrentMergeScheduler.

Regards
Pravesh
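
If the indexing goes through Solr, these implementations can be selected in
solrconfig.xml; a hedged sketch for 3.5 (values illustrative, not a
recommendation from this thread):

   <indexDefaults>
     <ramBufferSizeMB>64</ramBufferSizeMB>
     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"/>
     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
   </indexDefaults>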

--
View this message in context: 
http://lucene.472066.n3.nabble.com/IndexWrite-in-Lucene-Solr-3-5-is-slower-tp3989764p3989768.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Starts with Query

2012-06-15 Thread Michael Kuhlmann
It's not necessary to do this. You can simply be happy about the fact 
that all digits are ordered strictly in unicode, so you can use a range 
query:


(f)q={!frange l=0 u=\: incl=true incu=false}title

This finds all documents where any token from the title field starts 
with a digit, so if you want to only find documents where the whole 
title starts with a digit, you need a second field with a string or 
untokenized text type. Use the copyField directive then, as Jack 
Krupansky already suggested in a previous reply.


Greetings,
Kuli
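
A minimal sketch of that second-field setup, assuming the string copy of the
title is called title_exact (names illustrative):

   <!-- schema.xml -->
   <field name="title_exact" type="string" indexed="true" stored="false"/>
   <copyField source="title" dest="title_exact"/>

and then filter with fq=title_exact:[0 TO \:] - since ':' is the character
immediately after '9', only titles whose first character is a digit (or the
literal string ":") fall inside that range.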


Am 15.06.2012 08:38, schrieb Afroz Ahmad:

If you are not searching for a specific digit and want to match all
documents that start with any digit, you could, as part of the indexing
process, have another field, say startsWithDigit, and set it to true if
the title begins with a digit. All you need to do at query time then
is query for startsWithDigit=true.
Thanks
Afroz


From: nutchsolruser
Sent: 6/14/2012 11:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Starts with Query
Thanks Jack for the valuable response. Actually I am trying to match *any* numeric
pattern at the start of each document. I don't know the documents in the index; I
just want documents whose title starts with any digit.

--
View this message in context:
http://lucene.472066.n3.nabble.com/Starts-with-Query-tp3989627p3989761.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: IndexWrite in Lucene/Solr 3.5 is slower?

2012-06-15 Thread Ramprakash Ramamoorthy
On Fri, Jun 15, 2012 at 12:20 PM, pravesh suyalprav...@yahoo.com wrote:

 BTW, Have you changed the MergePolicy & MergeScheduler settings also? Since
 Lucene 3.x/3.5 onwards,
 there have been new MergePolicy & MergeScheduler implementations available,
 like TieredMergePolicy & ConcurrentMergeScheduler.

 Regards
 Pravesh

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/IndexWrite-in-Lucene-Solr-3-5-is-slower-tp3989764p3989768.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Thanks for the reply Pravesh. Yes, I initially used the default
TieredMergePolicy and later set the merge policy in both versions to
LogByteSizeMergePolicy, in order to maintain congruence. But Lucene 3.5
still lagged behind by approximately 2x.

-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
Engineer Trainee,
Zoho Corporation.
+91 9626975420


Re: DIH idle in transaction forever

2012-06-15 Thread Jasper Floor
Btw, I removed the batchSize but performance is better with
batchSize=1. I haven't done further testing to see what the best
setting is, but the difference between setting it at 1 and not
setting it is almost double the indexing time (~20 minutes vs ~37
minutes)
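
For reference, a hedged sketch of where these knobs sit - they are all
attributes of the DIH dataSource element (attribute names per the
DataImportHandler wiki; values illustrative):

   <dataSource name="df-stream-store-ds" type="JdbcDataSource"
               jndiName="java:ext_solr_datafeeds_dba"
               autoCommit="false" batchSize="1" readOnly="false"/>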

On Thu, Jun 14, 2012 at 4:49 PM, Jasper Floor jasper.fl...@m4n.nl wrote:
 Actually, the readOnly=true makes things worse.
 What it does (among other things) is:
            c.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);

 which leads to:
 Caused by: org.postgresql.util.PSQLException: Cannot change
 transaction isolation level in the middle of a transaction.

 because the connection is idle in transaction.

 I found this issue:
 https://issues.apache.org/jira/browse/SOLR-2045

 Patching DIH with the code they suggest seems to work.

 mvg,
 Jasper

 On Thu, Jun 14, 2012 at 4:36 PM, Dyer, James james.d...@ingrambook.com 
 wrote:
 Try readOnly=true in the dataSource configuration.  This causes several 
 defaults to get set in the JDBC connection, and often will solve problems 
 like this. (see 
 http://wiki.apache.org/solr/DataImportHandler#Configuring_JdbcDataSource)  
 Also, try a batch size of 0 to let your jdbc driver pick what it thinks is 
 optimal.  This might be better than 1.

 There is also an issue in that it doesn't explicitly close the resultset but 
 relies on closing the connection to implicitly close the child objects.  I 
 know when I tried using DIH with Derby a while back this had at the least 
 caused some log warnings, and it wouldn't work at all without 
 readOnly=false.  Not sure about PostgreSQL.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Jasper Floor [mailto:jasper.fl...@m4n.nl]
 Sent: Thursday, June 14, 2012 8:21 AM
 To: solr-user@lucene.apache.org
 Subject: DIH idle in transaction forever

 Hi all,

 It seems that DIH always holds two connections open to the database.
 One of them is almost always 'idle in transaction'. It may sometimes
 seem to do a little work but then it goes idle again.


 datasource definition:
        <dataSource name="df-stream-store-ds"
            jndiName="java:ext_solr_datafeeds_dba" type="JdbcDataSource"
            autoCommit="false" batchSize="1" />

 We have a datasource defined in the jndi:
        <no-tx-datasource>
                <jndi-name>ext_solr_datafeeds_dba</jndi-name>
                <security-domain>ext_solr_datafeeds_dba_realm</security-domain>
                <connection-url>jdbc:postgresql://db1.live.mbuyu.nl/datafeeds</connection-url>
                <min-pool-size>0</min-pool-size>
                <max-pool-size>5</max-pool-size>
                <transaction-isolation>TRANSACTION_READ_COMMITTED</transaction-isolation>
                <driver-class>org.postgresql.Driver</driver-class>
                <blocking-timeout-millis>3</blocking-timeout-millis>
                <idle-timeout-minutes>5</idle-timeout-minutes>
                <new-connection-sql>SELECT 1</new-connection-sql>
                <check-valid-connection-sql>SELECT 1</check-valid-connection-sql>
        </no-tx-datasource>


 If we set autocommit to true then we get an OOM on indexing so that is
 not an option.

 Does anyone have any idea why this happens? I would guess that DIH
 doesn't close the connection, but reading the code I can't be sure of
 this. The ResultSet object should close itself once it reaches the
 end.

 mvg,
 JAsper


FileListEntityProcessor limit at 11 files?

2012-06-15 Thread Roland Ucker
Hello,

I'm using the DIH to index some PDFs.
Everything works fine for the first 11 files.
But after indexing 11 PDFs the process stops, regardless of which PDFs are
being indexed or of the directory structure (recursive=true).
The Lucene index for these 11 documents is valid.

Is there anything like a FileListEntityProcessor limit that can be set?

Regards,
Roland
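
For context, a minimal data-config.xml sketch of the kind of setup being
described (paths, entity and field names are illustrative, not Roland's
actual config):

   <dataConfig>
     <dataSource type="BinFileDataSource" name="bin"/>
     <document>
       <entity name="files" processor="FileListEntityProcessor"
               baseDir="/path/to/pdfs" fileName=".*\.pdf"
               recursive="true" rootEntity="false" dataSource="null">
         <entity name="pdf" processor="TikaEntityProcessor" dataSource="bin"
                 url="${files.fileAbsolutePath}" format="text">
           <field column="text" name="content"/>
         </entity>
       </entity>
     </document>
   </dataConfig>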


Re: FilterCache - maximum size of document set

2012-06-15 Thread Erick Erickson
Test first, of course, but slave on 3.6 and master on 3.5 should be
fine. If you're
getting evictions with the cache settings that high, you really want
to look at why.

Note that in particular, using NOW in your filter queries virtually guarantees
that they won't be re-used as per the link I sent yesterday.

Best
Erick
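
For readers following along, the cache being discussed lives in
solrconfig.xml; a sketch, where the size attribute is the maxSize in
question (class and numbers illustrative):

   <filterCache class="solr.FastLRUCache"
                size="16000" initialSize="4096" autowarmCount="0"/>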

On Fri, Jun 15, 2012 at 1:15 AM, Pawel Rog pawelro...@gmail.com wrote:
 It can be true that the filter cache max size is set to too high a value.
 We looked at evictions and hit rate earlier. Maybe you are right that
 evictions are not always unwanted. Some time ago we ran tests: there is not
 a big difference in hit rate between a filter maxSize of 4000 (hit rate
 about 85%) and 16000 (hit rate about 91%). I think that using an LFU cache
 could also be helpful, but that requires me to migrate to 3.6. Do you think
 it is reasonable to use a slave on version 3.6 and a master on 3.5?

 Once again, Thanks for your help

 --
 Pawel

 On Thu, Jun 14, 2012 at 7:22 PM, Erick Erickson 
 erickerick...@gmail.comwrote:

 Hmmm, your maxSize is pretty high, it may just be that you've set this
 much higher
 than is wise. The maxSize setting governs the number of entries. I'd start
 with
 a much lower number here, and monitor the solr/admin page for both
 hit ratio and evictions. Well, and size too. 16,000 entries puts a
 ceiling of, what,
 48G on it? Ouch! It sounds like what's happening here is you're just
 accumulating
 more and more fqs over the course of the evening and blowing memory.

 Not all FQs will be that big, there's some heuristics in there to just
 store the
 document numbers for sparse filters, maxDocs/8 is pretty much the upper
 bound though.

 Evictions are not necessarily a bad thing, the hit-ratio is important
 here. And
 if you're using a bare NOW in your filter queries, you're probably never
 re-using them anyway, see:

 http://www.lucidimagination.com/blog/2012/02/23/date-math-now-and-filter-queries/

 I really question whether this limit is reasonable, but you know your
 situation best.

 Best
 Erick

 On Wed, Jun 13, 2012 at 5:40 PM, Pawel Rog pawelro...@gmail.com wrote:
  Thanks for your response
  Yes, maybe you are right. I thought that filters can be larger than 3M.
 All
  kinds of filters uses BitSet?
  Moreover maxSize of filterCache is set to 16000 in my case. There are
  evictions during day traffic
  but not during night traffic.
 
  Version of Solr which I use is 3.5
 
  I haven't used Memory Anayzer yet. Could you write more details about it?
 
  --
  Regards,
  Pawel
 
  On Wed, Jun 13, 2012 at 10:55 PM, Erick Erickson 
 erickerick...@gmail.comwrote:
 
  Hmmm, I think you may be looking at the wrong thing here. Generally, a
  filterCache
  entry will be maxDocs/8 (plus some overhead), so in your case they
 really
  shouldn't be all that large, on the order of 3M/filter. That shouldn't
  vary based
  on the number of docs that match the fq, it's just a bitset. To see if
  that makes any
  sense, take a look at the admin page and the number of evictions in
  your filterCache. If
  that is > 0, you're probably using all the memory you're going to in
  the filterCache during
  the day..
 
  But you haven't indicated what version of Solr you're using, I'm going
  from a
  relatively recent 3x knowledge-base.
 
  Have you put a memory analyzer against your Solr instance to see where
  the memory
  is being used?
 
  Best
  Erick
 
  On Wed, Jun 13, 2012 at 1:05 PM, Pawel pawelmis...@gmail.com wrote:
    Hi,
    I have a Solr index with about 25M documents. I optimized the FilterCache
    size to reach the best performance (considering the traffic characteristics
    that my Solr handles). I see that the only way to limit the size of a
    filter cache is to set the number of document sets that Solr can cache;
    there is no way to set a memory limit (eg. 2GB, 4GB or something like
    that). When I process standard traffic (during the day) everything is
    fine, but when Solr handles night traffic (and the characteristics of the
    requests change) some problems appear: a JVM out-of-memory error. I know
    what the reason is. Some filters on some fields are quite poor filters:
    they return 15M documents or even more. You could say 'just put that into
    q'. I tried to put those filters into the query part, but then the
    statistics of request processing time (during the day) become much worse.
    Reducing the Filter Cache maxSize is also not a good solution because
    during the day cached filters are very, very helpful.
    You could be interested in the type of filters that I use. These are range
    filters (I tried standard range filters and frange) - eg. price:[* TO
    1]. Some fq with price can return a few thousand results (eg.
    price:[40 TO 50]), but some (eg. price:[* TO 1]) can return millions of
    documents. I'd also like to avoid a solution which introduces strict
    ranges that the user can choose from.
    Have you any suggestions what can I do? Is there any way to limit 

SolrCloud subdirs in conf boostrap dir

2012-06-15 Thread Markus Jelsma
Hi,

We'd like to create subdirectories for each collection in our conf bootstrap 
directory for cleaner maintenance and not having to include the collection name 
in each configuration file. However, it is not working:

2012-06-15 11:31:08,483 ERROR [solr.core.CoreContainer] - [main] - : 
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode f
or /configs/COLLECTION_NAME/solrconfig.xml

The solrconfig.xml is in boostrap_conf/dirname/solrconfig.xml and solr.xml's 
solrconfig attribute points to the proper file.

A better question might be: how can I nicely maintain multiple collection 
configuration directories in SolrCloud?

Thanks,
Markus


Re: Dedupe and overwriteDupes setting

2012-06-15 Thread Shameema Umer
Hi,
My solrconfig dedupe setting is as follows.

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureField">dupesign</str>
      <str name="fields">title,url</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

Even though overwriteDupes is set to false, search query results show the
contents are overwritten.

Is this because there are duplicate contents in Solr and the query results
are displaying only the latest entry from the duplicates?

I actually need the date field not to be overwritten. Please help.

Thanks
Shameema
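
One way to check whether duplicates really coexist in the index (rather than
being overwritten) is to facet on the signature field; a hedged example:

   q=*:*&rows=0&facet=true&facet.field=dupesign&facet.mincount=2&facet.limit=-1

Any dupesign value that comes back with a count of 2 or more has several
documents sharing the same title/url signature.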


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dedupe-and-overwriteDupes-setting-tp809320p3989807.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: IndexWrite in Lucene/Solr 3.5 is slower?

2012-06-15 Thread Ramprakash Ramamoorthy
On Fri, Jun 15, 2012 at 12:50 PM, Ramprakash Ramamoorthy 
youngestachie...@gmail.com wrote:



 On Fri, Jun 15, 2012 at 12:20 PM, pravesh suyalprav...@yahoo.com wrote:

 BTW, Have you changed the MergePolicy & MergeScheduler settings also?
 Since
 Lucene 3.x/3.5 onwards,
 there have been new MergePolicy & MergeScheduler implementations
 available,
 like TieredMergePolicy & ConcurrentMergeScheduler.

 Regards
 Pravesh

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/IndexWrite-in-Lucene-Solr-3-5-is-slower-tp3989764p3989768.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 Thanks for the reply Pravesh. Yes I initially used the default
  TieredMergePolicy and later set the merge policy in both the versions to
 LogByteSizeMergePolicy, in order to maintain congruence. But still Lucene
 3.5 lagged behind by 2X approx.


 --
 With Thanks and Regards,
 Ramprakash Ramamoorthy,
 Engineer Trainee,
 Zoho Corporation.
 +91 9626975420


Can someone help me with this please?

-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
Engineer Trainee,
Zoho Corporation.
+91 9626975420


Re: Building a heat map from geo data in index

2012-06-15 Thread Jamie Johnson
So I've tried this a bit, but I can't get it to look quite right.
What I was doing up until now was taking the center point of the
geohash cell as location for the value I am getting from the index.
Doing this you end up with what appears to be islands (using
HeatMap.js currently).  I guess what I would like to do is take this
information and generate a static image so I can quickly prototype
some things.  Are there any good Java based heatmap tools?  Also if
anyone has done this before any thoughts on how to do this would
really be appreciated.

On Mon, Jun 11, 2012 at 12:52 PM, Jamie Johnson jej2...@gmail.com wrote:
 Yeah I'll have to play to see how useful it is, I really don't know at
 this point.

 On another note, we are already using some binning like what is described in the
 wiki you sent, specifically http://code.google.com/p/javageomodel/ for
 other purposes.  Not sure if that could be used or not, guess I'd have
 to think on it harder.


 On Mon, Jun 11, 2012 at 12:04 PM, Tanguy Moal tanguy.m...@gmail.com wrote:
 Yes it looks interesting and is not too difficult to do.
 However, the length of the geohashes gives you very little control over the
 size of the regions to colorize. Quoting Wikipedia:
 geohash length | km error
 1              | ±2500
 2              | ±630
 3              | ±78
 4              | ±20
 5              | ±2.4
 6              | ±0.61
 7              | ±0.076
 8              | ±0.019
 This is interesting also : http://wiki.openstreetmap.org/wiki/QuadTiles
 But it does what you're looking for, somehow :)

 --
 Tanguy


 2012/6/11 Jamie Johnson jej2...@gmail.com

 If you look at the Stack response from David he had suggested breaking
 the geohash up into pieces and then using a prefix for refining
 precision.  I hadn't imagined limiting this to a particular area, just
 limiting it based on the prefix (which would be based on users zoom
 level or something) allowing the information to become more precise as
 the user zoomed in.  That seemed a very reasonable approach to the
 problem.
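
A sketch of what that prefix idea could look like on the query side, assuming
the geohash truncated to a zoom-dependent length (say 4 characters) is indexed
into a string field called geohash_4 (names illustrative):

   q=*:*&rows=0&facet=true&facet.field=geohash_4&facet.mincount=1&facet.limit=-1

Each facet count is then the number of documents falling into one geohash
cell, i.e. one bucket of the heat map.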

 On Mon, Jun 11, 2012 at 10:55 AM, Tanguy Moal tanguy.m...@gmail.com
 wrote:
  There is definitely something interesting to do around geohashes.
 
  I'm wondering how one could map the N by N requested tiles to a
 range
  of geohashes. (Where the gap would be a function of N).
  What I mean is that I don't know if a bijective function exists
  between tiles and geohash ranges.
  I don't even know if a contiguous range of geohashes ends up in a square
  box.
 
  Because if you can find such a function, then you could probably solve
 the
  issue by asking Solr for facet ranges on a geohash field.
 
  I don't know if that helps but the topic is very interesting to me...
  Please share your findings, if any :-)
 
  --
  Tanguy
 
  2012/6/11 Dmitry Kan dmitry@gmail.com
 
  so it sounds to me, that the geohash is just a hash representation of
 lat,
  lon coordinates for an easier referencing (see e.g.
  http://en.wikipedia.org/wiki/Geohash).
  I would probably start with something easier, having bbox lat,lon
  coordinate pairs of top left corner (or in some coordinate systems, it
 is
  down left corner), break each bbox into cells of size w/N, h/N (and
  probably, that's equal numbers). Then you can loop over the cells and
  compute your facet counts with bbox of a cell. You could then evolve
 this
  to geohashes, if you want, but at least you would know where to start.
 
  -- Dmitry
 
  On Mon, Jun 11, 2012 at 4:48 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
   That is certainly an option but the collecting of the heat map data is
   really the question.
  
   I saw this
  
  
  
 
 http://stackoverflow.com/questions/8798711/solr-using-facets-to-sum-documents-based-on-variable-precision-geohashes
  
   but don't have a really good understanding of how this would be
   accomplished.  I need to get a more firm understanding of geohashes as
   my understanding is extremely lacking at this point.
  
   On Mon, Jun 11, 2012 at 8:55 AM, Stefan Matheis
   matheis.ste...@googlemail.com wrote:
I'm not entirely sure, that it has to be that complicated .. what
 about
   using for example http://www.patrick-wied.at/static/heatmapjs/ ? You
   could collect all the geo-related data and do the (heat)map stuff on
 the
   client.
   
   
   
On Sunday, June 10, 2012 at 7:49 PM, Jamie Johnson wrote:
   
I had a request from a customer which to this point I have not seen
much similar so I figured I'd pose the question here. I've been
 asked
if it was possible to build a heat map from the results of a
 query. I
can imagine a process to do this through some post processing, but
that sounds very expensive for large/distributed indices so I was
wondering if with all of the new geospatial support that is being
added to lucene/solr there was a way to do geospatial faceting.
 What
I am imagining is bounding box being defined and that box being
 broken
into an N by N matrix, each of which would return counts so a heat
 map
could be constructed. Any other thoughts on this would 

Re: FilterCache - maximum size of document set

2012-06-15 Thread Pawel Rog
Thanks
I don't use NOW in queries. All my filters with timestamp are rounded to
hundreds of
seconds to increase hitrate. The only problem could be in price filters
which can be
varied (users are unpredictable :P). But taking those filters out of fq, or
setting cache=false on them, is also a bad idea ... I checked it :) Load rose
three times :)

--
Pawel

On Fri, Jun 15, 2012 at 1:30 PM, Erick Erickson erickerick...@gmail.comwrote:

 Test first, of course, but slave on 3.6 and master on 3.5 should be
 fine. If you're
 getting evictions with the cache settings that high, you really want
 to look at why.

 Note that in particular, using NOW in your filter queries virtually
 guarantees
 that they won't be re-used as per the link I sent yesterday.

 Best
 Erick

 On Fri, Jun 15, 2012 at 1:15 AM, Pawel Rog pawelro...@gmail.com wrote:
  It can be true that filters cache max size is set to high value. That is
  also true that.
  We looked at evictions and hit rate earlier. Maybe you are right that
  evictions are
  not always unwanted. Some time ago we made tests. There are not so high
  difference in hit rate when filters maxSize is set to 4000 (hit rate
 about
  85%) and
  16000 (hitrate about 91%). I think that also using LFU cache can be
 helpful
  but
  it makes me to migrate to 3.6. Do you think it is reasonable to use
 slave on
  version 3.6 and master on 3.5?
 
  Once again, Thanks for your help
 
  --
  Pawel
 
  On Thu, Jun 14, 2012 at 7:22 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Hmmm, your maxSize is pretty high, it may just be that you've set this
  much higher
  than is wise. The maxSize setting governs the number of entries. I'd
 start
  with
  a much lower number here, and monitor the solr/admin page for both
  hit ratio and evictions. Well, and size too. 16,000 entries puts a
  ceiling of, what,
  48G on it? Ouch! It sounds like what's happening here is you're just
  accumulating
  more and more fqs over the course of the evening and blowing memory.
 
  Not all FQs will be that big, there's some heuristics in there to just
  store the
  document numbers for sparse filters, maxDocs/8 is pretty much the upper
  bound though.
 
  Evictions are not necessarily a bad thing, the hit-ratio is important
  here. And
  if you're using a bare NOW in your filter queries, you're probably never
  re-using them anyway, see:
 
 
 http://www.lucidimagination.com/blog/2012/02/23/date-math-now-and-filter-queries/
 
  I really question whether this limit is reasonable, but you know your
  situation best.
 
  Best
  Erick
 
  On Wed, Jun 13, 2012 at 5:40 PM, Pawel Rog pawelro...@gmail.com
 wrote:
   Thanks for your response
   Yes, maybe you are right. I thought that filters can be larger than
 3M.
  All
   kinds of filters uses BitSet?
   Moreover maxSize of filterCache is set to 16000 in my case. There are
   evictions during day traffic
   but not during night traffic.
  
   Version of Solr which I use is 3.5
  
   I haven't used Memory Anayzer yet. Could you write more details about
 it?
  
   --
   Regards,
   Pawel
  
   On Wed, Jun 13, 2012 at 10:55 PM, Erick Erickson 
  erickerick...@gmail.comwrote:
  
   Hmmm, I think you may be looking at the wrong thing here. Generally,
 a
   filterCache
   entry will be maxDocs/8 (plus some overhead), so in your case they
  really
   shouldn't be all that large, on the order of 3M/filter. That
 shouldn't
   vary based
   on the number of docs that match the fq, it's just a bitset. To see
 if
   that makes any
   sense, take a look at the admin page and the number of evictions in
   your filterCache. If
    that is > 0, you're probably using all the memory you're going to in
   the filterCache during
   the day..
  
   But you haven't indicated what version of Solr you're using, I'm
 going
   from a
   relatively recent 3x knowledge-base.
  
   Have you put a memory analyzer against your Solr instance to see
 where
   the memory
   is being used?
  
   Best
   Erick
  
   On Wed, Jun 13, 2012 at 1:05 PM, Pawel pawelmis...@gmail.com
 wrote:
Hi,
I have solr index with about 25M documents. I optimized FilterCache
  size
   to
reach the best performance (considering traffic characteristic
 that my
   Solr
handles). I see that the only way to limit size of a Filter Cace
 is to
   set
number of document sets that Solr can cache. There is no way to set
   memory
limit (eg. 2GB, 4GB or something like that). When I process a
 standard
trafiic (during day) everything is fine. But when Solr handle night
   traffic
(and the charateristic of requests change) some problems appear.
  There is
JVM out of memory error. I know what is the reason. Some filters on
  some
fields are quite poor filters. They returns 15M of documents or
 even
   more.
You could say 'Just put that into q'. I tried to put that filters
 into
Query part but then, the statistics of request processing time
  (during
day) become much worse. Reduction of Filter Cache maxSize 

SolrCloud and split-brain

2012-06-15 Thread Otis Gospodnetic
Hi,

How exactly does SolrCloud handle split brain situations?

Imagine a cluster of 10 nodes.
Imagine 3 of them being connected to the network by some switch and imagine the 
out port of this switch dies.
When that happens, these 3 nodes will be disconnected from the other 7 nodes 
and we'll have 2 clusters, one with 3 nodes and one with 7 nodes and we'll have 
a split brain situation.  
Imagine we had 3 ZK nodes in the original 10-node cluster, 2 of which are 
connected to the dead switch and are thus aware only of the 3 node cluster now, 
and 1 ZK instance which is on a different switch and is thus aware only of the 
7 node cluster.

At this point how exactly does ZK make SolrCloud immune to split brain?


Does LBHttpSolrServer play a key role here? (I see LBHttpSolrServer mentioned 
only once on http://wiki.apache.org/solr/SolrCloud and with a question mark 
next to it)


Thanks,
Otis

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm


Re: SolrCloud and split-brain

2012-06-15 Thread Yury Kats
On 6/15/2012 12:49 PM, Otis Gospodnetic wrote:
 Hi,
 
 How exactly does SolrCloud handle split brain situations?
 
 Imagine a cluster of 10 nodes.
 Imagine 3 of them being connected to the network by some switch and imagine 
 the out port of this switch dies.
 When that happens, these 3 nodes will be disconnected from the other 7 nodes 
 and we'll have 2 clusters, one with 3 nodes and one with 7 nodes and we'll 
 have a split brain situation.  
 Imagine we had 3 ZK nodes in the original 10-node cluster, 2 of which are 
 connected to the dead switch and are thus aware only of the 3 node cluster 
 now, and 1 ZK instance which is on a different switch and is thus aware only 
 of the 7 node cluster.
 
 At this point how exactly does ZK make SolrCloud immune to split brain?

A quorum of N/2+1 nodes (integer division) is required for ZooKeeper to operate - 
with 3 ZK nodes that means 2 must still see each other - which is also the reason 
you need at least 3 to begin with.


StreamingUpdateSolrServer Connection Timeout Setting

2012-06-15 Thread Kissue Kissue
Hi,

Does anybody know what the default connection timeout setting is for
StreamingUpdateSolrServer? Can I explicitly set one, and how?

Thanks.


Re: SolrCloud and split-brain

2012-06-15 Thread Mark Miller
Zookeeper avoids split brain using Paxos (or something very like it - I can't 
remember if they extended it or modified and/or what they call it).

So you will only ever see one Zookeeper cluster - the smaller partition will be 
down. There is a proof for Paxos if I remember right.

Zookeeper then acts as the system of record for Solr. Solr won't auto form its 
own new little clusters - *the* cluster is modeled in Zookeeper and that's the 
cluster. So Solr does not find it self organizing new mini clusters on 
partition splits.

When we lose our connection to Zookeeper, update requests are no longer 
accepted, because we may have a stale cluster view and not know it for a long 
period of time.


On Jun 15, 2012, at 12:49 PM, Otis Gospodnetic wrote:

 Hi,
 
 How exactly does SolrCloud handle split brain situations?
 
 Imagine a cluster of 10 nodes.
 Imagine 3 of them being connected to the network by some switch and imagine 
 the out port of this switch dies.
 When that happens, these 3 nodes will be disconnected from the other 7 nodes 
 and we'll have 2 clusters, one with 3 nodes and one with 7 nodes and we'll 
 have a split brain situation.  
 Imagine we had 3 ZK nodes in the original 10-node cluster, 2 of which are 
 connected to the dead switch and are thus aware only of the 3 node cluster 
 now, and 1 ZK instance which is on a different switch and is thus aware only 
 of the 7 node cluster.
 
 At this point how exactly does ZK make SolrCloud immune to split brain?
 
 
 Does LBHttpSolrServer play a key role here? (I see LBHttpSolrServer mentioned 
 only once on http://wiki.apache.org/solr/SolrCloud and with a question mark 
 next to it)
 
 
 Thanks,
 Otis
 
 Performance Monitoring for Solr / ElasticSearch / HBase - 
 http://sematext.com/spm

- Mark Miller
lucidimagination.com













Re: SolrCloud and split-brain

2012-06-15 Thread Otis Gospodnetic
Hi,
 
 Zookeeper avoids split brain using Paxos (or something very like it - I 

 can't remember if they extended it or modified and/or what they call it).
 
 So you will only ever see one Zookeeper cluster - the smaller partition will 
 be 
 down. There is a proof for Paxos if I remember right.
 
 Zookeeper then acts as the system of record for Solr. Solr won't auto form 
 its own new little clusters - *the* cluster is modeled in Zookeeper and 
 that's the cluster. So Solr does not find it self organizing new mini 
 clusters on partition splits.
 
 When we lose our connection to Zookeeper, update requests are no longer 
 accepted, because we may have a stale cluster view and not know it for a long 
 period of time.


Does this work even when outside clients (apps for indexing or searching) send 
their requests directly to individual nodes?
Let's use the example from my email where we end up with 2 groups of nodes: 
7-node group with 2 ZK nodes on the same network and 3-node group with 1 ZK 
node on the same network.

If a client sends a request to a node in the 7-node group what happens?
And if a client sends a request to a node in the 3-node group what happens?

Yury wrote:
 A quorum of N/2+1 nodes is required to operate (that's also the reason you 
need at least 3 to begin with)

N=3 (ZK nodes), right?
So in that case we need at least 3/2+1 = 2.5 ZK nodes to operate.  So in my 
example neither the 7-node group nor the 3-node group will operate (does that 
mean request rejection or something else?) because neither sees 2.5 ZK nodes?

Thanks,
Otis

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 




 On Jun 15, 2012, at 12:49 PM, Otis Gospodnetic wrote:
 
  Hi,
 
  How exactly does SolrCloud handle split brain situations?
 
  Imagine a cluster of 10 nodes.
  Imagine 3 of them being connected to the network by some switch and imagine 
 the out port of this switch dies.
  When that happens, these 3 nodes will be disconnected from the other 7 
 nodes and we'll have 2 clusters, one with 3 nodes and one with 7 nodes and 
 we'll have a split brain situation.  
  Imagine we had 3 ZK nodes in the original 10-node cluster, 2 of which are 
 connected to the dead switch and are thus aware only of the 3 node cluster 
 now, 
 and 1 ZK instance which is on a different switch and is thus aware only of 
 the 7 
 node cluster.
 
  At this point how exactly does ZK make SolrCloud immune to split brain?
 
 
  Does LBHttpSolrServer play a key role here? (I see LBHttpSolrServer 
 mentioned only once on http://wiki.apache.org/solr/SolrCloud and with a 
 question 
 mark next to it)
 
 
  Thanks,
  Otis
  
  Performance Monitoring for Solr / ElasticSearch / HBase - 
 http://sematext.com/spm
 
 - Mark Miller
 lucidimagination.com



Re: SolrCloud and split-brain

2012-06-15 Thread Mark Miller

On Jun 15, 2012, at 1:44 PM, Otis Gospodnetic wrote:

 Does this work even when outside clients (apps for indexing or searching) 
 send their requests directly to individual nodes?
 Let's use the example from my email where we end up with 2 groups of nodes: 
 7-node group with 2 ZK nodes on the same network and 3-node group with 1 ZK 
 node on the same network.

The 3-node group with 1 ZK would not have a functioning zk - so it would stop 
accepting updates. If it could serve a complete view of the index, it would 
though, for searches.

The 7-node group would have a working ZK it could talk to, and it would 
continue to accept updates as long as a node for a shard for that hash range is 
up. It would also of course serve searches.

In this case, hitting a box in the 3-node group for searches would start 
becoming stale. A smart client would no longer hit those boxes though.

If you have a 'dumb' client or load balancer, then yes - you would have to 
remove the bad nodes from rotation.

We could improve this or make the behavior configurable. At least initially 
though, we figured it was better if we kept serving searches even when we 
cannot talk to zookeeper.

 
 If a client sends a request to a node in the 7-node group what happens?
 And if a client sends a request to a node in the 3-node group what happens?

- Mark Miller
lucidimagination.com













Re: StreamingUpdateSolrServer Connection Timeout Setting

2012-06-15 Thread Sami Siren
The api doc for version 3.6.0 is available here:
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html

I think the default is coming from your OS if you are not setting it explicitly.

--
 Sami Siren

On Fri, Jun 15, 2012 at 8:22 PM, Kissue Kissue kissue...@gmail.com wrote:
 Hi,

 Does anybody know what the default connection timeout setting is for
 StreamingUpdateSolrServer? Can i explicitly set one and how?

 Thanks.


Re: SolrCloud and split-brain

2012-06-15 Thread Otis Gospodnetic
Ola,

Thanks Mark!
 
  Does this work even when outside clients (apps for indexing or searching) 

 send their requests directly to individual nodes?
  Let's use the example from my email where we end up with 2 groups of 
 nodes: 7-node group with 2 ZK nodes on the same network and 3-node group with 
 1 
 ZK node on the same network.
 
 The 3-node group with 1 ZK would not have a functioning zk - so it would stop 
 accepting updates. If it could serve a complete view of the index, it would 
 though, for searches.


So in this case information in this 1 ZK node would tell the 3 Solr nodes 
whether they have all index data or if some shards are missing (i.e. were only 
on nodes in the other 7-node group)?
And if nodes figure out they don't have all index data they will reject search 
requests?  Or will they accept and perform searches, but return responses that 
tell the client that the searched index was not complete?

 The 7-node group would have a working ZK it could talk to, and it would 
 continue 
 to accept updates as long as a node for a shard for that hash range is up. It 
 would also of course serve searches.


Right, so if the node for the shard where a doc is supposed to go to is in that 
3-node group, then the indexing request will be rejected.  Is this correct?

 In this case, hitting a box in the 3-node group for searches would start 
 becoming stale. A smart client would no longer hit those boxes though.
 
 If you have a 'dumb' client or load balancer, then yes - you would have 
 to remove the bad nodes from rotation.


Aha, yes and yes.

 We could improve this or make the behavior configurable. At least initially 
 though, we figured it was better if we kept serving searches even when we 
 cannot 
 talk to zookeeper.


Makes sense.  Do responses carry something to alert the client that something 
is rotten in the state of cluster?

Thanks,

Otis

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 


Re: SolrCloud and split-brain

2012-06-15 Thread Mark Miller

On Jun 15, 2012, at 2:12 PM, Otis Gospodnetic wrote:

 Makes sense.  Do responses carry something to alert the client that 
 something is rotten in the state of cluster?

No, I don't think so - we should probably add that to the header similar to how 
I assume partial results will work.

Feel free to fire up a JIRA issue for that.

- Mark Miller
lucidimagination.com













Re: SolrCloud and split-brain

2012-06-15 Thread Otis Gospodnetic
Thanks Mark, will open an issue in a bit.

But I think the following is the real meat of the Q about split brain and 
SolrCloud, especially when it comes to how indexing is handled during split 
brain:

  Does this work even when outside clients (apps for indexing or searching) 

 send their requests directly to individual nodes?
  Let's use the example from my email where we end up with 2 groups of 
 nodes: 7-node group with 2 ZK nodes on the same network and 3-node group with 
 1 
 ZK node on the same network.
 
 The 3-node group with 1 ZK would not have a functioning zk - so it would stop 
 accepting updates. If it could serve a complete view of the index, it would 
 though, for searches.

So in this case information in this 1 ZK node would tell the 3 Solr nodes 
whether they have all index data or if some shards are missing (i.e. were only 
on nodes in the other 7-node group)?
And if nodes figure out they don't have all index data they will reject search 
requests?  Or will they accept and perform searches, but return responses that 
tell the client that the searched index was not complete?

 The 7-node group would have a working ZK it could talk to, and it would 
 continue 
 to accept updates as long as a node for a shard for that hash range is up. It 
 would also of course serve searches.

Right, so if the node for the shard where a doc is supposed to go to is in that 
3-node group, then the indexing request will be rejected.  Is this correct? 



Otis 

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 



- Original Message -
 From: Mark Miller markrmil...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Cc: 
 Sent: Friday, June 15, 2012 2:22 PM
 Subject: Re: SolrCloud and split-brain
 
 
 On Jun 15, 2012, at 2:12 PM, Otis Gospodnetic wrote:
 
  Makes sense.  Do responses carry something to alert the client that 
 something is rotten in the state of cluster?
 
 No, I don't think so - we should probably add that to the header similar to 
 how I assume partial results will work.
 
 Feel free to fire up a JIRA issue for that.
 
 - Mark Miller
 lucidimagination.com



WordBreak and default dictionary crash Solr

2012-06-15 Thread Carrie Coy

Is this a configuration problem or a bug?

We use two dictionaries, default (spellcheckerFreq) and
solr.WordBreakSolrSpellChecker.  When a query contains 2 misspellings,
one corrected by the default dictionary and the other corrected by the
wordbreak dictionary (strawberryn shortcake), Solr crashes with the error
below.  It doesn't matter which dictionary is checked first.


java.lang.NullPointerException
at 
org.apache.solr.handler.component.SpellCheckComponent.toNamedList(SpellCheckComponent.java:566)
at 
org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:177)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)

at org.apache.solr.core.SolrCore.execute(SolrCore.java:1555)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)

at java.lang.Thread.run(Thread.java:662)


Multiple errors corrected by the SAME dictionary (either wordbreak or 
default) do not crash Solr.   Here is excerpt from our solrconfig.xml:


<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">spell</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">1</int>
  </lst>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">spellcheckerFreq</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    .
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.count">3</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
  </lst>
</requestHandler>





Re: SolrCloud and split-brain

2012-06-15 Thread Mark Miller

On Jun 15, 2012, at 3:21 PM, Otis Gospodnetic wrote:

 Thanks Mark, will open an issue in a bit.
 
 But I think the following is the real meat of the Q about split brain and 
 SolrCloud, especially when it comes to how indexing is handled during split 
 brain:
 
   Does this work even when outside clients (apps for indexing or searching) 
 
 send their requests directly to individual nodes?
   Let's use the example from my email where we end up with 2 groups of 
 nodes: 7-node group with 2 ZK nodes on the same network and 3-node group 
 with 1 
 ZK node on the same network.
  
 The 3-node group with 1 ZK would not have a functioning zk - so it would 
 stop 
 accepting updates. If it could serve a complete view of the index, it would 
 though, for searches.
 
 So in this case information in this 1 ZK node would tell the 3 Solr nodes 
 whether they have all index data or if some shards are missing (i.e. were 
 only on nodes in the other 7-node group)?
 And if nodes figure out they don't have all index data they will reject 
 search requests?  Or will they accept and perform searches, but return 
 responses that tell the client that the searched index was not complete?

The 1 ZK node will not function, so the 3 Solr nodes will not accept updates.

If there is one replica for each shard available, search will still work. I 
don't think partial results has been committed yet for distrib search. In that 
case, we will put something in the header to indicate a full copy of the index 
was not available. I think we can also add something in the header if we know 
we cannot talk to zookeeper to let the client know it could be seeing stale 
state. SmartClients that talked to zookeeper would see those nodes appear as 
down in zookeeper and stop trying to talk to them.

 
 The 7-node group would have a working ZK it could talk to, and it would 
 continue 
 to accept updates as long as a node for a shard for that hash range is up. 
 It 
 would also of course serve searches.
 
 Right, so if the node for the shard where a doc is supposed to go to is in 
 that 3-node group, then the indexing request will be rejected.  Is this 
 correct? 

it depends on what is available - but you will need at least one replica for 
each shard available - eg your partition needs to have one copy of the index - 
otherwise updates are rejected if there are no nodes hosting a shard of the 
hash range. So if a replica made it into the larger partition, you will be fine 
- it will become the leader.

 
 
 
 Otis 
 
 Performance Monitoring for Solr / ElasticSearch / HBase - 
 http://sematext.com/spm 
 
 
 
 - Original Message -
 From: Mark Miller markrmil...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Cc: 
 Sent: Friday, June 15, 2012 2:22 PM
 Subject: Re: SolrCloud and split-brain
 
 
 On Jun 15, 2012, at 2:12 PM, Otis Gospodnetic wrote:
 
 Makes sense.  Do responses carry something to alert the client that 
 something is rotten in the state of cluster?
 
 No, I don't think so - we should probably add that to the header similar to 
 how I assume partial results will work.
 
 Feel free to fire up a JIRA issue for that.
 
 - Mark Miller
 lucidimagination.com
 

- Mark Miller
lucidimagination.com













RE: WordBreak and default dictionary crash Solr

2012-06-15 Thread Dyer, James
Carrie,

Thank you for trying out new features!  I'm pretty sure you've found a bug 
here.  Could you tell me whether you're using a build from Trunk or Solr_4x ?  
Also, do you know the svn revision or the Jenkins build # (or timestamp) you're 
working from?

Could you try using DirectSolrSpellChecker instead of
IndexBasedSpellChecker for your default dictionary?  (In Trunk and the 4.x
branch, the Solr Example now uses DirectSolrSpellChecker as its default.)  It 
could be this is a problem related to using WordBreakSolrSpellChecker with the 
older IndexBasedSpellChecker.  So if you have better luck with 
DirectSolrSpellChecker, that would be helpful in honing in on the exact problem.
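
For reference, a hedged sketch of a DirectSolrSpellChecker entry that could
stand in for the index-based default dictionary (the field name follows the
config in this thread; the remaining parameters are illustrative):

   <lst name="spellchecker">
     <str name="name">default</str>
     <str name="field">spell</str>
     <str name="classname">solr.DirectSolrSpellChecker</str>
     <str name="distanceMeasure">internal</str>
     <float name="accuracy">0.5</float>
     <int name="maxEdits">2</int>
     <int name="minPrefix">1</int>
   </lst>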

Also, judging from the line that is failing, could it be you're using a build 
based on svn revision pre-r1346489 (Trunk) or pre-r1346499 (Branch_4x) ?  
https://issues.apache.org/jira/browse/SOLR-2993  Shortly after the initial 
commit of this feature, a bug similar to the one you're reporting was later 
fixed with these subsequent revisions.  

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Carrie Coy [mailto:c...@ssww.com] 
Sent: Friday, June 15, 2012 2:46 PM
To: solr-user@lucene.apache.org
Subject: WordBreak and default dictionary crash Solr

Is this a configuration problem or a bug?

We use two dictionaries, default (spellcheckerFreq)  and 
solr.WordBreakSolrSpellChecker.  When a query contains 2 misspellings, 
one corrected by the default dictionary, and the other corrected by the 
wordbreak dictionary (strawberryn shortcake) , Solr crashes with error 
below.   It doesn't matter which dictionary is checked first.

java.lang.NullPointerException
 at 
org.apache.solr.handler.component.SpellCheckComponent.toNamedList(SpellCheckComponent.java:566)
 at 
org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:177)
 at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
 at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1555)
 at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
 at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
 at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
 at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
 at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
 at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
 at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
 at java.lang.Thread.run(Thread.java:662)


Multiple errors corrected by the SAME dictionary (either wordbreak or 
default) do not crash Solr.   Here is excerpt from our solrconfig.xml:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">spell</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">1</int>
  </lst>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">spellcheckerFreq</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    .
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.count">3</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
  </lst>
</requestHandler>





Re: How to boost a field with another field's value?

2012-06-15 Thread smita
Actually I have a title field that I am searching for my query term, and the
documents have a rating field that I want to boost the results by, so the
higher rated items appear before the lower rated documents.

I am also boosting results on another field using bq:

q=summer&df=title&bq=sponsored:true^5.0&qf=rating^2.0&defType=dismax

However, when I use qf to boost the results by rating, Solr tries to
match the query in the rating field. How can I accomplish boosting by rating
using query-time boosting?
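
One common dismax approach (a sketch, not necessarily the answer given in
this thread) is to keep rating out of qf and apply it as an additive function
boost with bf, using the field names from the question:

   q=summer&defType=dismax&qf=title&bq=sponsored:true^5.0&bf=rating^2.0

With edismax, a multiplicative boost function (e.g. boost=rating) is another
option.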





--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-boost-a-field-with-another-field-s-value-tp3989706p3989917.html
Sent from the Solr - User mailing list archive at Nabble.com.


Writing index files that have the right owner

2012-06-15 Thread Mike O'Leary
I have been putting together an application using Quartz to run several 
indexing jobs in sequence using SolrJ and Tomcat on Windows. I would like the 
Quartz job to do the following:

1.   Delete index directories from the cores so each indexing job starts 
fresh with empty indexes to populate.

2.   Start the Tomcat server.

3.   Run the indexing job.

4.   Stop the Tomcat server.

5.   Copy the index directories to an archive.

Steps 2-5 work fine, but I haven't been able to find a way to delete the index 
directories from within Java. I also can't delete them from a Windows command 
shell window: I get an error message that says Access is denied. The reason 
for this is that the index directories and files have the owner 
BUILTIN\Administrators. Although I am an administrator on this machine, the 
fact that these files have a different owner means that I can only delete them 
in a Windows command shell window if I start it with Run as administrator. I 
spent a bunch of time today trying every Java function and Windows shell 
command I could find that would let me change the owner of these files, grant 
my user account the capability to delete the files, etc. Nothing I tried 
worked, likely because along with not having permission to delete the files, I 
also don't have permission to give myself permission to delete the files.

At a certain point I stopped wondering how to change the files' owner or 
permissions and started wondering why the files have BUILTIN\Administrators 
as owner, and the permissions associated with that owner, in the first place. 
Is there somewhere in the Solr or Tomcat configuration files, or in the SolrJ 
code, where I can set who the owner of files written to the index directories 
should be?
Thanks,
Mike


Re: Solr Search Count Variance

2012-06-15 Thread Jack Krupansky
The variance is most likely due to the fact that your text field is 
analyzed differently than the source fields you include in your dismax qf. 
For example, some of them may be string fields with no analysis. So, fewer 
of those fields are matching your query terms when using dismax.


Look at the results of both queries and then try querying on the specific 
fields of a document that is found by the traditional Lucene/Solr query 
parser but not found using dismax.


-- Jack Krupansky

-Original Message- 
From: mechravi25

Sent: Friday, June 15, 2012 1:16 AM
To: solr-user@lucene.apache.org
Subject: Solr Search Count Variance

Hi all,

When we give a search request to Solr, the part of the request URL
containing the search query is as follows:

/select/?qf=name%5e2.3+text+r_name%5e0.3+id%5e0.3+xid%5e0.3&fl=*&f.tFacet.facet.mincount=1&facet.field=tFacet&f.rFacet.facet.mincount=1&facet.field=rFacet&facet=true&hl.fl=*&hl=true&rows=10&start=0&q=test+Log&debugQuery=on

We find the number of documents returned to be 5000 (approx.). Here, it makes
use of the standard handler and we get the parsed query as follows:

<str name="parsedquery">(text:Cxx1 text:test) (text:Dyy3 text:Log)</str>
<str name="parsedquery_toString">(text:Cxx1 text:test) (text:Dyy3
text:Log)</str>

here, text is the default field and this is used by the standard handler and
it is the destination field for all the other fields.

In the same way, when we alter the above URL to fetch the result using the
dismax handler,

/select/?qf=name%5e2.3+text+r_name%5e0.3+id%5e0.3+xid%5e0.3&qt=dismax&fl=*&f.tFacet.facet.mincount=1&facet.field=tFacet&f.rFacet.facet.mincount=1&facet.field=rFacet&facet=true&hl.fl=*&hl=true&rows=10&start=0&q=test+Log&debugQuery=on

We find the number of documents found to be 710 and the parsed query is as
follows

<str name="parsedquery">+((DisjunctionMaxQuery((xid:test^0.3 | id:test^0.3 |
((r_name:Cxx1 r_name:test)^0.3) | (text:Cxx1 text:test) | ((name:Cxx1
name:test)^2.3))) DisjunctionMaxQuery((xid:Log^0.3 | id:Log^0.3 |
((r_name:Dyy3 r_name:Log)^0.3) | (text:Dyy3 text:Log) | ((name:Dyy3
name:Log)^2.3~2) ()</str>
<str name="parsedquery_toString">+(((xid:test^0.3 | id:test^0.3 |
((r_name:Cxx1 r_name:test)^0.3) | (text:Cxx1 text:test) | ((name:Cxx1
name:test)^2.3)) (xid:Log^0.3 | id:Log^0.3 | ((r_name:Dyy3 r_name:Log)^0.3)
| (text:Dyy3 text:Log) | ((name:Dyy3 name:Log)^2.3)))~2) ()</str>

If we give the dismax-style boosts in the q parameter of the standard
handler, it works fine, i.e. the total number of documents fetched is 710.
The query used is as follows:

q:(name:test^2.3 AND name:Log^2.3)OR(text:test AND
text:Log)OR(r_name:test^0.3 AND r_name:Log^0.3)OR(id:test^0.3 AND
id:Log^0.3)OR(xid:test^0.3 AND xid:Log^0.3)

I have two doubts here

1. Why is there a count difference of this extent between the standard and
dismax handler?
2. Does the dismax handler use AND operation in the phrase query (when we
use with/without quotes)?

Can you please explain this?

Thanks in advance

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Search-Count-Variance-tp3989760.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: SolrCloud and split-brain

2012-06-15 Thread Otis Gospodnetic
Thanks Mark.

The reason I asked this is because I saw mentions of SolrCloud being resilient 
to split brain because it uses ZooKeeper.
However, if my half brain understands what split brain is then I think that's 
not a completely true claim because one can get unlucky and get a SolrCloud 
cluster partitioned in a way that one or even all partitions reject indexing 
(and update and deletion) requests if they do not have a complete index.

In my example of a 10-node cluster that gets split into a 7-node and a 3-node 
partition, if neither partition ends up containing the full index (i.e. at 
least one copy of each shard) then neither partition will accept updates.

And here is one more Q.
* Imagine a client is adding documents and, for simplicity, imagine SolrCloud 
routes all these documents to the same shard, call it S.
* Imagine that both the 7-node and the 3-node partition end up with a complete 
index and thus both accept updates.
* This means that both the 7-node and the 3-node partition have at least one 
replica of shard S; let's call them S7 and S3.
* Now imagine if the client sending documents for indexing happened to be 
sending documents to 2 nodes, say in round-robin fashion.
* And imagine that each of these 2 nodes ended up in a different partition.

The client now keeps sending docs to these 2 nodes and both happily take and 
index documents in their own copies of S.
To the client everything looks normal - all documents are getting indexed.
But S7 and S3 are no longer the same - they contain different documents!

Problem, no?
What happens when somebody fixes the cluster and all nodes are back in the same 
10-node cluster?  What happens to S7 and S3?
Wouldn't SolrCloud have to implement bi-directional synchronization to fix 
things and unify S7 and S3?

And if there are updates and deletes involved, things get even messier :(

Otis

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 



- Original Message -
 From: Mark Miller markrmil...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Cc: 
 Sent: Friday, June 15, 2012 5:07 PM
 Subject: Re: SolrCloud and split-brain
 
 
 On Jun 15, 2012, at 3:21 PM, Otis Gospodnetic wrote:
 
  Thanks Mark, will open an issue in a bit.
 
  But I think the following is the real meat of the Q about split brain and 
 SolrCloud, especially when it comes to how indexing is handled during split 
 brain:
 
    Does this work even when outside clients (apps for indexing or 
 searching) 
 
  send their requests directly to individual nodes?
    Let's use the example from my email where we end up with 2 
 groups of 
  nodes: 7-node group with 2 ZK nodes on the same network and 3-node 
 group with 1 
  ZK node on the same network.
   
  The 3-node group with 1 ZK would not have a functioning zk - so it 
 would stop 
  accepting updates. If it could serve a complete view of the index, it 
 would 
  though, for searches.
 
  So in this case information in this 1 ZK node would tell the 3 Solr nodes 
 whether they have all index data or if some shards are missing (i.e. were 
 only 
 on nodes in the other 7-node group)?
  And if nodes figure out they don't have all index data they will reject 
 search requests?  Or will they accept and perform searches, but return 
 responses 
 that tell the client that the searched index was not complete?
 
 The 1 ZK node will not function, so the 3 Solr nodes will not accept updates.
 
 If there is one replica for each shard available, search will still work. I 
 don't think partial results has been committed yet for distrib search. In 
 that case, we will put something in the header to indicate a full copy of the 
 index was not available. I think we can also add something in the header if 
 we 
 know we cannot talk to zookeeper to let the client know it could be seeing 
 stale 
 state. SmartClients that talked to zookeeper would see those nodes appear as 
 down in zookeeper and stop trying to talk to them.
 
 
  The 7-node group would have a working ZK it could talk to, and it would 
 continue 
  to accept updates as long as a node for a shard for that hash range is 
 up. It 
  would also of course serve searches.
 
  Right, so if the node for the shard where a doc is supposed to go to is in 
 that 3-node group, then the indexing request will be rejected.  Is this 
 correct? 
 
 
 it depends on what is available - but you will need at least one replica for 
 each shard available - eg your partition needs to have one copy of the index 
 - 
 otherwise updates are rejected if there are no nodes hosting a shard of the 
 hash 
 range. So if a replica made it into the larger partition, you will be fine - 
 it 
 will become the leader.
 
 
 
 
  Otis 
  
  Performance Monitoring for Solr / ElasticSearch / HBase - 
 http://sematext.com/spm 
 
 
 
  - Original Message -
  From: Mark Miller markrmil...@gmail.com
  To: solr-user solr-user@lucene.apache.org
  Cc: 
  Sent: Friday, June