solr-cloud performance decrease day by day

2013-04-19 Thread qibaoyuan
Hello,
   I am using Solr 4.1.0 and I have used SolrCloud in my product. I have found 
that at first everything seems good, the search time is fast and latency is low, but it 
becomes very slow after a few days. Does anyone know if there are some params or 
optimizations to use with SolrCloud?

Re: solr-cloud performance decrease day by day

2013-04-19 Thread Furkan KAMACI
Could you give more info about your index size and technical details of
your machine? Maybe you are indexing more data day by day and your RAM
capability is not enough anymore?

2013/4/19 qibaoyuan qibaoy...@gmail.com

 Hello,
    I am using Solr 4.1.0 and I have used SolrCloud in my product. I have
 found that at first everything seems good, the search time is fast and latency is
 low, but it becomes very slow after a few days. Does anyone know if there are
 some params or optimizations to use with SolrCloud?


Re: solr-cloud performance decrease day by day

2013-04-19 Thread qibaoyuan
There are 6 shards and they are on one machine, and the JVM heap is very 
big. The physical memory is 16GB, the total #docs is about 150k, and the index size of 
each shard is about 1GB. AND there is indexing while searching; I use auto commit 
every 10 min, and the data comes in at about 100 docs per minute. 


On 2013-4-19, at 3:17 PM, Furkan KAMACI furkankam...@gmail.com wrote:

 Could you give more info about your index size and technical details of
 your machine? Maybe you are indexing more data day by day and your RAM
 capability is not enough anymore?
 
 2013/4/19 qibaoyuan qibaoy...@gmail.com
 
 Hello,
   I am using Solr 4.1.0 and I have used SolrCloud in my product. I have
 found that at first everything seems good, the search time is fast and latency is
 low, but it becomes very slow after a few days. Does anyone know if there are
 some params or optimizations to use with SolrCloud?



Re: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread John Nielsen
Well, to consume 120GB of RAM with a 120GB index, you would have to query
over every single GB of data.

If you only actually query over, say, 500MB of the 120GB data in your dev
environment, you would only use 500MB worth of RAM for caching. Not 120GB


On Fri, Apr 19, 2013 at 7:55 AM, David Parks davidpark...@yahoo.com wrote:

 Wow! That was the most pointed, concise discussion of hardware requirements
 I've seen to date, and it's fabulously helpful, thank you Shawn!  We
 currently have 2 servers that I can dedicate about 12GB of ram to Solr on
 (we're moving to these 2 servers now). I can upgrade further if it's needed
  & justified, and your discussion helps me justify that such an upgrade is
 the right thing to do.

 So... If I move to 3 servers with 50GB of RAM each, using 3 shards, I
 should
 be in the free and clear then right?  This seems reasonable and doable.

 In this more extreme example the failover properties of solr cloud become
 more clear. I couldn't possibly run a replica shard without doubling the
 memory, so really replication isn't reasonable until I have double the
 hardware, then the load balancing scheme makes perfect sense. With 3
 servers, 50GB of RAM and 120GB index I should just backup the index
 directory I think.

 My previous thought to run replication just for failover would have actually
 resulted in LOWER performance because I would have halved the memory
 available to the master & replica. So the previous question is answered as
 well now.

 Question: if I had 1 server with 60GB of memory and 120GB index, would solr
 make full use of the 60GB of memory? Thus trimming disk access in half. Or
 is it an all-or-nothing thing?  In a dev environment, I didn't notice SOLR
 consuming the full 5GB of RAM assigned to it with a 120GB index.

 Dave


 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: Friday, April 19, 2013 11:51 AM
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud loadbalancing, replication, and failover

 On 4/18/2013 8:12 PM, David Parks wrote:
  I think I still don't understand something here.
 
  My concern right now is that query times are very slow for 120GB index
  (14s on avg), I've seen a lot of disk activity when running queries.
 
  I'm hoping that distributing that query across 2 servers is going to
  improve the query time, specifically I'm hoping that we can distribute
  that disk activity because we don't have great disks on there (yet).
 
  So, with disk IO being a factor in mind, running the query on one box,
 vs.
  across 2 *should* be a concern right?
 
  Admittedly, this is the first step in what will probably be many to
  try to work our query times down from 14s to what I want to be around 1s.

 I went through my mailing list archive to see what all you've said about
 your setup.  One thing that I can't seem to find is a mention of how much
 total RAM is in each of your servers.  I apologize if it was actually there
 and I overlooked it.

 In one email thread, you wanted to know whether Solr is CPU-bound or
 IO-bound.  Solr is heavily reliant on the index on disk, and disk I/O is
 the
 slowest piece of the puzzle. The way to get good performance out of Solr is
 to have enough memory that you can take the disk mostly out of the equation
 by having the operating system cache the index in RAM.  If you don't have
 enough RAM for that, then Solr becomes IO-bound, and your CPUs will be busy
 in iowait, unable to do much real work.  If you DO have enough RAM to cache
 all (or most) of your index, then Solr will be CPU-bound.

 With 120GB of total index data on each server, you would want at least
 128GB
 of RAM per server, assuming you are only giving 8-16GB of RAM to Solr, and
 that Solr is the only thing running on the machine.  If you have more
 servers and shards, you can reduce the per-server memory requirement
 because
 the amount of index data on each server would go down.  I am aware of the
 cost associated with this kind of requirement - each of my Solr servers has
 64GB.

 If you are sharing the server with another program, then you want to have
 enough RAM available for Solr's heap, Solr's data, the other program's
 heap,
 and the other program's data.  Some programs (like
 MySQL) completely skip the OS disk cache and instead do that caching
 themselves with heap memory that's actually allocated to the program.
 If you're using a program like that, then you wouldn't need to count its
 data.

 Using SSDs for storage can speed things up dramatically and may reduce the
 total memory requirement to some degree, but even an SSD is slower than
 RAM.
 The transfer speed of RAM is faster, and from what I understand, the
 latency
 is at least an order of magnitude quicker - nanoseconds vs microseconds.

 In another thread, you asked about how Google gets such good response
 times.
 Although Google's software probably works differently than Solr/Lucene,
 when
 it comes right down to it, all search engines do similar 

RE: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread David Parks
Interesting. I'm trying to correlate this new understanding to what I see on
my servers.  I've got one server with 5GB dedicated to solr, solr dashboard
reports a 167GB index actually.

When I do many typical queries I see between 3MB and 9MB of disk reads
(watching iostat).

But solr's dashboard only shows 710MB of memory in use (this box has had
many hundreds of queries put through it, and has been up for 1 week). That
doesn't quite correlate with my understanding that Solr would cache the
index as much as possible. 

Should I be thinking that things aren't configured correctly here?

Dave


-Original Message-
From: John Nielsen [mailto:j...@mcb.dk] 
Sent: Friday, April 19, 2013 2:35 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

Well, to consume 120GB of RAM with a 120GB index, you would have to query
over every single GB of data.

If you only actually query over, say, 500MB of the 120GB data in your dev
environment, you would only use 500MB worth of RAM for caching. Not 120GB


On Fri, Apr 19, 2013 at 7:55 AM, David Parks davidpark...@yahoo.com wrote:

 Wow! That was the most pointed, concise discussion of hardware 
 requirements I've seen to date, and it's fabulously helpful, thank you 
 Shawn!  We currently have 2 servers that I can dedicate about 12GB of 
 ram to Solr on (we're moving to these 2 servers now). I can upgrade 
 further if it's needed & justified, and your discussion helps me 
 justify that such an upgrade is the right thing to do.

 So... If I move to 3 servers with 50GB of RAM each, using 3 shards, I 
 should be in the free and clear then right?  This seems reasonable and 
 doable.

 In this more extreme example the failover properties of solr cloud 
 become more clear. I couldn't possibly run a replica shard without 
 doubling the memory, so really replication isn't reasonable until I 
 have double the hardware, then the load balancing scheme makes perfect 
 sense. With 3 servers, 50GB of RAM and 120GB index I should just 
 backup the index directory I think.

 My previous thought to run replication just for failover would have 
 actually resulted in LOWER performance because I would have halved the 
 memory available to the master & replica. So the previous question is 
 answered as well now.

 Question: if I had 1 server with 60GB of memory and 120GB index, would 
 solr make full use of the 60GB of memory? Thus trimming disk access in 
 half. Or is it an all-or-nothing thing?  In a dev environment, I 
 didn't notice SOLR consuming the full 5GB of RAM assigned to it with a
120GB index.

 Dave


 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: Friday, April 19, 2013 11:51 AM
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud loadbalancing, replication, and failover

 On 4/18/2013 8:12 PM, David Parks wrote:
  I think I still don't understand something here.
 
  My concern right now is that query times are very slow for 120GB 
  index (14s on avg), I've seen a lot of disk activity when running
queries.
 
  I'm hoping that distributing that query across 2 servers is going to 
  improve the query time, specifically I'm hoping that we can 
  distribute that disk activity because we don't have great disks on there
(yet).
 
  So, with disk IO being a factor in mind, running the query on one 
  box,
 vs.
  across 2 *should* be a concern right?
 
  Admittedly, this is the first step in what will probably be many to 
  try to work our query times down from 14s to what I want to be around
1s.

 I went through my mailing list archive to see what all you've said 
 about your setup.  One thing that I can't seem to find is a mention of 
 how much total RAM is in each of your servers.  I apologize if it was 
 actually there and I overlooked it.

 In one email thread, you wanted to know whether Solr is CPU-bound or 
 IO-bound.  Solr is heavily reliant on the index on disk, and disk I/O 
 is the slowest piece of the puzzle. The way to get good performance 
 out of Solr is to have enough memory that you can take the disk mostly 
 out of the equation by having the operating system cache the index in 
 RAM.  If you don't have enough RAM for that, then Solr becomes 
 IO-bound, and your CPUs will be busy in iowait, unable to do much real 
 work.  If you DO have enough RAM to cache all (or most) of your index, 
 then Solr will be CPU-bound.

 With 120GB of total index data on each server, you would want at least 
 128GB of RAM per server, assuming you are only giving 8-16GB of RAM to 
 Solr, and that Solr is the only thing running on the machine.  If you 
 have more servers and shards, you can reduce the per-server memory 
 requirement because the amount of index data on each server would go 
 down.  I am aware of the cost associated with this kind of requirement 
 - each of my Solr servers has 64GB.

 If you are sharing the server with another program, then you want to 
 have enough RAM available for Solr's 

Re: solr-cloud performance decrease day by day

2013-04-19 Thread Manuel Le Normand
Can happen for various reasons.

Can you recreate the situation, meaning restarting the servlet or server
would start with good qTime and decrease from that point? How fast does
this happen?

Start by monitoring the JVM process, with Oracle VisualVM for example.
Monitor for frequent garbage collections or unreasonable memory peaks or
opening threads.
Then monitor your system to see if there's an IO disk latency or disk usage
that increases over time, the writing queue to disk explodes, CPU load
becomes heavier, or network usage exceeds its limit.

If you can recreate the decrease and monitor well, one of the above params
should pop up. Fixing it after defining the problem will be easier.

Good day,
Manu
On Apr 19, 2013 10:26 AM, qibaoyuan qibaoy...@gmail.com wrote:


Re: solr-cloud performance decrease day by day

2013-04-19 Thread qibaoyuan
Thanks Manu, I will check it.
On 2013-4-19, at 4:26 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote:

 Can happen for various reasons.
 
 Can you recreate the situation, meaning restarting the servlet or server
 would start with good qTime and decrease from that point? How fast does
 this happen?
 
 Start by monitoring the JVM process, with Oracle VisualVM for example.
 Monitor for frequent garbage collections or unreasonable memory peaks or
 opening threads.
 Then monitor your system to see if there's an IO disk latency or disk usage
 that increases over time, the writing queue to disk explodes, CPU load
 becomes heavier, or network usage exceeds its limit.
 
 If you can recreate the decrease and monitor well, one of the above params
 should pop up. Fixing it after defining the problem will be easier.
 
 Good day,
 Manu
 On Apr 19, 2013 10:26 AM, qibaoyuan qibaoy...@gmail.com wrote:



Re: shard query return 500 on large data set

2013-04-19 Thread Dmitry Kan
Can you instead use paging mechanism?


On Thu, Apr 18, 2013 at 8:03 PM, Jie Sun jsun5...@yahoo.com wrote:

 Hi -

 when I execute a shard query like:


 [myhost]:8080/solr/mycore/select?q=type:message&rows=14...&qt=standard&wt=standard&explainOther=&hl.fl=&shards=solrserver1:8080/solr/mycore,solrserver2:8080/solr/mycore,solrserver3:8080/solr/mycore

 everything works fine until I query against a large set of data (> 100k
 documents),
 when the number of rows returned exceeds about 50k.

 by the way I am using HttpClient GET method to send the solr shard query
 over.

 In the above scenario, the query fails with a 500 server error as returned
 status code.

 I am using solr 3.5.

 I encountered a 404 before: when one of the shard servers does not have the
 core (404), the whole shard query will return 404 to me; so I expect that if one
 of the servers encounters a timeout (408?), the shard query should return a
 timeout status code?

 I guess I am not sure what will be the shard query results with various
 error scenario... guess i could look into solr code, but if you have any
 input, it will be appreciated. thanks

 Renee






Re: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread Shawn Heisey
On 4/19/2013 1:34 AM, John Nielsen wrote:
 Well, to consume 120GB of RAM with a 120GB index, you would have to query
 over every single GB of data.
 
 If you only actually query over, say, 500MB of the 120GB data in your dev
 environment, you would only use 500MB worth of RAM for caching. Not 120GB

What you are saying is essentially true, although I would not be
surprised to learn that even a single query would read a few gigabytes
from a 120GB index, assuming that you start after a server reboot.  The
next query would re-use a lot of the data cached by the first query and
return much faster.

 On Fri, Apr 19, 2013 at 7:55 AM, David Parks davidpark...@yahoo.com wrote:
 Question: if I had 1 server with 60GB of memory and 120GB index, would solr
 make full use of the 60GB of memory? Thus trimming disk access in half. Or
 is it an all-or-nothing thing?  In a dev environment, I didn't notice SOLR
 consuming the full 5GB of RAM assigned to it with a 120GB index.

Solr would likely cause the OS to use most or all of that memory.  It's
not an all or nothing thing.  The first few queries will load a big
chunk, then each additional query will load a little more.  60GB of RAM
will be significantly better than 12GB.  With only 12GB, it is extremely
likely that a given query will read a section of the index that will
push the data required for the next query out of the disk cache, so it
will have to re-read it from the disk on the next query, and so on in a
never-ending cycle.  That is far less likely if you have enough RAM for
half your index rather than a tenth.  Operating system disk caches are
pretty good at figuring out which data is needed frequently.  If the
cache is big enough, that data can be kept in the cache easily.

An ideal setup would have enough RAM to cache the entire index.
Depending on your schema, you may find that the disk cache in production
only ends up caching somewhere between half and two thirds of your
index.  The 60GB figure you have quoted above *MIGHT* be enough to make
things work really well with a 120GB index, but I always tell people
that if they want top performance, they will buy enough RAM to cache the
whole thing.

You might have a combination of query pattern and data that results in
more of the index needing cache than I have seen on my setup.  You are
likely to add documents continuously.  You may learn that your schema
doesn't cover your needs, so you have to modify it to tokenize more
aggressively, or you may need to copy fields so you can analyze the same
data more than one way.  These things will make your index bigger.  If
your query volume grows or gets more varied, more of your index is
likely to end up in the disk cache.

I would not recommend going into production with an index that has no
redundancy.  If you buy quality hardware with redundancy in storage,
dual power supplies, and ECC memory, catastrophic failures are rare, but
they DO happen.  The motherboard or an entire RAM chip could suddenly
die.  Someone might accidentally hit the power switch on the server and
cause it to shut down.  They might be working in the rack, fall down,
and pull out both power cords in an attempt to catch themselves.  The
latter scenarios are a temporary problem, but your users will probably
notice.

Thanks,
Shawn



Re: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread Shawn Heisey
On 4/19/2013 2:15 AM, David Parks wrote:
 Interesting. I'm trying to correlate this new understanding to what I see on
 my servers.  I've got one server with 5GB dedicated to solr, solr dashboard
 reports a 167GB index actually.
 
 When I do many typical queries I see between 3MB and 9MB of disk reads
 (watching iostat).
 
 But solr's dashboard only shows 710MB of memory in use (this box has had
 many hundreds of queries put through it, and has been up for 1 week). That
 doesn't quite correlate with my understanding that Solr would cache the
 index as much as possible. 

There are two memory sections on the dashboard.  The one at the top
shows the operating system view of physical memory.  That is probably
showing virtually all of it in use.  Most UNIX platforms will show you
the same info with 'top' or 'free'.  Some of them, like Solaris, require
different tools.  I assume you're not using Windows, because you mention
iostat.

The other memory section is for the JVM, and that only covers the memory
used by Solr.  The dark grey section is the amount of Java heap memory
currently utilized by Solr and its servlet container.  The light grey
section represents the memory that the JVM has allocated from system
memory.  If any part of that bar is white, then Java has not yet
requested the maximum configured heap.  Typically a long-running Solr
install will have only dark and light grey, no white.

The operating system is what caches your index, not Solr.  The bulk of
your RAM should be unallocated.  With your index size, the OS will use
all unallocated RAM for the disk cache.  If a program requests some of
that RAM, the OS will instantly give it up.

Thanks,
Shawn



RE: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread David Parks
Ok, I understand better now.

The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has dark grey
allocation of 602MB, and light grey of an additional 108MB, for a JVM total
of 710MB allocated. If I understand correctly, Solr memory utilization is
*not* for caching (unless I configured document caches or some of the other
cache options in Solr, which don't seem to apply in this case, and I haven't
altered from their defaults).

So assuming this box was dedicated to 1 solr instance/shard. What JVM heap
should I set? Does that matter? 24GB JVM heap? Or keep it lower and ensure
the OS cache has plenty of room to operate? (this is an Ubuntu 12.10 server
instance).

Would I be wise to just put the index on a RAM disk and guarantee
performance?  Assuming I installed sufficient RAM?

Dave


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Friday, April 19, 2013 4:19 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On 4/19/2013 2:15 AM, David Parks wrote:
 Interesting. I'm trying to correlate this new understanding to what I 
 see on my servers.  I've got one server with 5GB dedicated to solr, 
 solr dashboard reports a 167GB index actually.
 
 When I do many typical queries I see between 3MB and 9MB of disk reads 
 (watching iostat).
 
 But solr's dashboard only shows 710MB of memory in use (this box has 
 had many hundreds of queries put through it, and has been up for 1 
 week). That doesn't quite correlate with my understanding that Solr 
 would cache the index as much as possible.

There are two memory sections on the dashboard.  The one at the top shows
the operating system view of physical memory.  That is probably showing
virtually all of it in use.  Most UNIX platforms will show you the same info
with 'top' or 'free'.  Some of them, like Solaris, require different tools.
I assume you're not using Windows, because you mention iostat.

The other memory section is for the JVM, and that only covers the memory
used by Solr.  The dark grey section is the amount of Java heap memory
currently utilized by Solr and its servlet container.  The light grey
section represents the memory that the JVM has allocated from system memory.
If any part of that bar is white, then Java has not yet requested the
maximum configured heap.  Typically a long-running Solr install will have
only dark and light grey, no white.

The operating system is what caches your index, not Solr.  The bulk of your
RAM should be unallocated.  With your index size, the OS will use all
unallocated RAM for the disk cache.  If a program requests some of that RAM,
the OS will instantly give it up.

Thanks,
Shawn



in solrcoud, how to assign a schemaConf to a collection ?

2013-04-19 Thread sling
Hi all, help~~~
How do I specify a schema for a collection in SolrCloud?

I have a SolrCloud setup with 3 collections, and each config file is uploaded to zk
like this:
args=-Xmn3000m -Xms5000m -Xmx5000m -XX:MaxPermSize=384m
-Dbootstrap_confdir=/workspace/solr/solrhome/doc/conf
-Dcollection.configName=docconf -DzkHost=zk1:2181,zk2:2181,zk3:2181
-DnumShards=3 -Dname=docCollection

the solr.xml is like this:
  <cores ...>
    <core name="doc" instanceDir="doc/" loadOnStartup="true"
      transient="false" collection="docCollection" />
    <core name="video" instanceDir="video/" loadOnStartup="true"
      transient="false" collection="videoCollection" />
    <core name="pic" instanceDir="pic/" loadOnStartup="true"
      transient="false" collection="picCollection" />
  </cores>

Then, when all nodes start up, I find the schemas of 2 collections (doc and
video) are the same, while the schema of pic is wrong too.

Are there some properties in core which can specify its own schema?

Thanks for any help...









solr-cloud problem about user-specified tags

2013-04-19 Thread qibaoyuan
I have plenty of docs, and each doc may be connected to many user-defined tags. I 
have used SolrCloud, and use join to do this kind of job, and recently I learned that 
join does not support distributed search. So this is a big problem so 
far. And decomposition is quite impossible, because the docs and user-defined 
tags are so huge, and many searches always search on these two fields. Any 
good idea to deal with this problem??



Re: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread Toke Eskildsen
On Fri, 2013-04-19 at 06:51 +0200, Shawn Heisey wrote:
 Using SSDs for storage can speed things up dramatically and may reduce
 the total memory requirement to some degree,

We have been using SSDs for several years in our servers. It is our
clear experience that "to some degree" should be replaced with "very
much" in the above.

Our current SSD-equipped servers each hold a total of 127GB of index
data spread over 3 instances. The machines each have 16GB of RAM, of
which about 7GB are left for disk cache.

We are the State and University Library, Denmark and our search engine
is the primary (and arguably only) way to locate resources for our
users. The average raw search time is 32ms for non-faceted queries and
616ms for heavy faceted (which is much too slow. Dang! I thought I fixed
that).

  but even an SSD is slower than RAM.  The transfer speed of RAM is faster,
 and from what I understand, the latency is at least an order of
 magnitude quicker - nanoseconds vs microseconds.

True, but you might as well argue that everyone should go for the
fastest CPU possible, as it will be, well, faster than the slower ones.

The question is almost never to get the fastest possible, but to get a
good price/performance tradeoff. I would argue that SSDs fit that bill
very well for a great deal of the "My search is too slow" threads that
are spun on this mailing list. Especially for larger indexes.

Regards,
Toke Eskildsen



RE: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread David Parks
Wow, thank you for those benchmarks Toke, that really gives me some firm 
footing to stand on in knowing what to expect and thinking out which path to 
venture down. It's tremendously appreciated!

Dave


-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] 
Sent: Friday, April 19, 2013 5:17 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On Fri, 2013-04-19 at 06:51 +0200, Shawn Heisey wrote:
 Using SSDs for storage can speed things up dramatically and may reduce 
 the total memory requirement to some degree,

We have been using SSDs for several years in our servers. It is our clear 
experience that "to some degree" should be replaced with "very much" in the 
above.

Our current SSD-equipped servers each hold a total of 127GB of index data 
spread over 3 instances. The machines each have 16GB of RAM, of which about 7GB 
are left for disk cache.

We are the State and University Library, Denmark and our search engine is the 
primary (and arguably only) way to locate resources for our users. The average 
raw search time is 32ms for non-faceted queries and 616ms for heavy faceted 
(which is much too slow. Dang! I thought I fixed that).

  but even an SSD is slower than RAM.  The transfer speed of RAM is 
 faster, and from what I understand, the latency is at least an order 
 of magnitude quicker - nanoseconds vs microseconds.

True, but you might as well argue that everyone should go for the fastest CPU 
possible, as it will be, well, faster than the slower ones.

The question is almost never to get the fastest possible, but to get a good 
price/performance tradeoff. I would argue that SSDs fit that bill very well for 
a great deal of the My search is too slow-threads that are spun on this 
mailing list. Especially for larger indexes.

Regards,
Toke Eskildsen



Re: WordDelimiterFactory

2013-04-19 Thread Erick Erickson
Ashok:

You really, _really_ need to dive into the admin/analysis page.
That'll show you exactly what WDFF (and all the other elements of your
chain) do to input tokens. Understanding the index and query-time
implications of all the settings in WDFF takes a while.

But from what you're describing, WDFF may not be what you're looking
for anyway, some of the regex filters could split, for instance, on
all non-alphanum characters.
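
For illustration only, a rough sketch of a pattern tokenizer that splits on all
non-alphanumeric characters (the type name here is invented, not from your schema):

  <fieldType name="text_alnum" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- each run of letters/digits becomes a token; everything else is a delimiter -->
      <tokenizer class="solr.PatternTokenizerFactory" pattern="[^A-Za-z0-9]+"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>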

Best
Erick

On Wed, Apr 17, 2013 at 12:25 AM, Shawn Heisey s...@elyograg.org wrote:
 On 4/16/2013 8:12 PM, Ashok wrote:
 It looks like any 'word' that starts with a digit is treated as a numeric
 string.

 Setting generateNumberParts=1 instead of 0 seems to generate the right
 tokens in this case but need to see if it has any other impacts on the
 finalized token list...

 I have a fieldType that is using WDF with the following settings on the
 index side.  Both index and query analysis show it behaving correctly
 with terms that start with numbers, on versions 4.2.1 and 3.5.0:

 <filter class="solr.WordDelimiterFilterFactory"
   splitOnCaseChange="1"
   splitOnNumerics="1"
   stemEnglishPossessive="1"
   generateWordParts="1"
   generateNumberParts="1"
   catenateWords="1"
   catenateNumbers="1"
   catenateAll="0"
   preserveOriginal="1"
 />

 It has different settings on the query side, but generateNumberParts is
 1 for both:

 <filter class="solr.WordDelimiterFilterFactory"
   splitOnCaseChange="1"
   splitOnNumerics="1"
   stemEnglishPossessive="1"
   generateWordParts="1"
   generateNumberParts="1"
   catenateWords="0"
   catenateNumbers="0"
   catenateAll="0"
   preserveOriginal="0"
 />

 I haven't tried it with generateNumberParts set to 0.

 Thanks,
 Shawn



Re: in solrcoud, how to assign a schemaConf to a collection ?

2013-04-19 Thread sling
When I add a schema property to the core:
<core name="pic" instanceDir="pic/" loadOnStartup="true" transient="false"
  collection="picCollection"
  config="solrconfig.xml" schema="../picconf/schema.xml"/>
it seems there is a default path to the schema, which is /configs/docconf/.
The exception is:
[18:59:09.211] java.lang.IllegalArgumentException: Invalid path string
/configs/docconf/../picconf/schema.xml caused by relative paths not
allowed @18
[18:59:09.211]  at
org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:99)
[18:59:09.211]  at
org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1133)
[18:59:09.211]  at
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:253)
[18:59:09.211]  at
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:250)
[18:59:09.211]  at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
[18:59:09.211]  at
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:250)
[18:59:09.211]  at
org.apache.solr.cloud.ZkController.getConfigFileData(ZkController.java:388)
[18:59:09.211]  at
org.apache.solr.core.CoreContainer.getSchemaFromZk(CoreContainer.java:1659)
[18:59:09.211]  at
org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:948)
[18:59:09.211]  at
org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031)
[18:59:09.211]  at
org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
[18:59:09.211]  at
org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
[18:59:09.211]  at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
[18:59:09.211]  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
[18:59:09.211]  at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
[18:59:09.211]  at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
[18:59:09.211]  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
[18:59:09.211]  at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
[18:59:09.211]  at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
[18:59:09.211]  at java.lang.Thread.run(Thread.java:619)





RE: Indexing problems

2013-04-19 Thread GASPARD Joel
Hello

Thank you for your answer.
We have solved our problem now. I describe it for someone who could encounter a 
similar problem. 

Some of our fields are dynamic, and the name of one of these fields was not 
correct: it was sent to Solr as a Java object, e.g. 
solrInputDocument.addField(myObject, stringValue);

A string representation of this object was displayed in the Solr admin page, 
and that alerted us. We have replaced this wrong field name with the string we 
expected and no more OOMEs occur.

At least we got to test diverse Solr configurations.

Regards

Joel Gaspard 



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, January 31, 2013 2:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing problems

I'm really surprised you're hitting OOM errors, I suspect you have something 
else pathological in your system. So, I'd start checking things like
- how many concurrent warming searchers you allow
- How big your indexing RAM is set to (we find very little gain over 128M BTW).
- Other load on your Solr server. Are you, for instance, searching on it too?
- what your autocommit characteristics are (think about autocommitting fairly 
often with openSearcher=false).
- have you defined huge caches?
- .

How big are these documents anyway? With 12G of ram, they'd have to be 
absolutely _huge_ to matter much.

Multiple collections should work fine in ZK. I really think you have some 
innocent-looking configuration setting that's bollixing you up; this is not 
expected behavior.

If at all possible, I'd also go with 4.1. I don't really think it's relevant to 
your situation, but there have been a lot of improvements in the code

Best
Erick


Re: in solrcoud, how to assign a schemaConf to a collection ?

2013-04-19 Thread sling
I copied the 3 schema.xml and solrconfig.xml files to $solrhome/conf/*.xml, and
uploaded this file dir to zk like this:
args=-Xmn1000m -Xms2000m -Xmx2000m -XX:MaxPermSize=384m
-Dbootstrap_confdir=/home/app/workspace/solrcloud/solr/solrhome/conf
-Dcollection.configName=conf -DzkHost=zk1:2181,zk2:2181,zk3:2181
-DnumShards=2 -Dname=docCollection

then in solr.xml, it changes to:
 <core name="doc" instanceDir="doc/" loadOnStartup="true" transient="false"
   collection="docCollection" schema="s1.xml" config="sc1.xml" />

In this way, the schema.xml is separated.

It seems the schema and config properties have a relative path of
/configs/conf,
and this is what I uploaded from local;
$solrhome/conf is equal to /configs/conf.







Re: Solr using a ridiculous amount of memory

2013-04-19 Thread Erick Erickson
Hmmm. There has been quite a bit of work lately to support a couple of
things that might be of interest (4.3, which Simon cut today, probably
available to all mid next week at the latest). Basically, you can
choose to pre-define all the cores in solr.xml (so-called old style)
_or_ use the new-style solr.xml which uses auto-discover mode to
walk the indicated directory and find all the cores (indicated by the
presence of a 'core.properties' file). Don't know if this would make
your particular case easier, and I should warn you that this is
relatively new code (although there are some reasonable unit tests).

You also have the option to only load the cores when they are
referenced, and only keep N cores open at a time (loadOnStartup and
transient properties).
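
For example, in old-style solr.xml a core that is only loaded on first use and
can be unloaded again would be declared roughly like this (a sketch, names invented):

  <core name="core1" instanceDir="core1" loadOnStartup="false" transient="true"/>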

See: http://wiki.apache.org/solr/CoreAdmin#Configuration and
http://wiki.apache.org/solr/Solr.xml%204.3%20and%20beyond

Note, the docs are somewhat sketchy, so if you try to go down this
route let us know anything that should be improved (or you can be
added to the list of wiki page contributors and help out!)

Best
Erick

On Thu, Apr 18, 2013 at 8:31 AM, John Nielsen j...@mcb.dk wrote:
 You are missing an essential part: Both the facet and the sort
 structures needs to hold one reference for each document
 _in_the_full_index_, even when the document does not have any values in
 the fields.


 Wow, thank you for this awesome explanation! This is where the penny
 dropped for me.

 I will definitely move to a multi-core setup. It will take some time and a
 lot of re-coding. As soon as I know the result, I will let you know!






 --
 Med venlig hilsen / Best regards

 *John Nielsen*
 Programmer



 *MCB A/S*
 Enghaven 15
 DK-7500 Holstebro

 Kundeservice: +45 9610 2824
 p...@mcb.dk
 www.mcb.dk


Re: stats.facet not working for timestamp field

2013-04-19 Thread Erick Erickson
I'm guessing that your timestamp is a tdate, which stores extra
information in the index for fast range searches. What happens if you
try to facet on just a date field?
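
As a sketch (the extra field name is invented, and this assumes your schema
still has the stock date type with precisionStep=0), you could copy the
timestamp into a plain date field and point stats.facet at that instead:

  <field name="timestamp_plain" type="date" indexed="true" stored="false"/>
  <copyField source="timestamp" dest="timestamp_plain"/>

and then query with stats=true&stats.field=price&stats.facet=timestamp_plain.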

Best
Erick

On Thu, Apr 18, 2013 at 8:37 AM, J Mohamed Zahoor zah...@indix.com wrote:
 Hi

 I am using Solr 4.1 with 6 shards.

 I want to find out some price stats for all the days in my index.
 I ended up using the stats component like 
 stats=true&stats.field=price&stats.facet=timestamp.



 but it throws up error like

 <str name="msg">Invalid Date String:' &#1;&#0;&#0;&#0;'[my(&#0;'</str>



 My Question is : is timestamp supported as stats.facet ?

 ./zahoor




Re: solr4 : disable updateLog

2013-04-19 Thread Erick Erickson
updateLog is _required_ if you're in solrCloud mode. Assuming that
you're not using SolrCloud, then you can freely disable it.

Why do you want to? It's not a bad idea necessarily, but this might be
an XY problem.
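
For reference, the block in question lives inside the updateHandler in
solrconfig.xml; the stock example config looks roughly like this:

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
  </updateHandler>

Commenting out just the updateLog element (outside of SolrCloud) is what
disables it; note that you also lose features that depend on it, such as
realtime get.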

Best
Erick

On Thu, Apr 18, 2013 at 10:47 AM, Jamel ESSOUSSI
jamel.essou...@gmail.com wrote:
 Hi,

 If I disable (comment out) the updateLog block, will this affect the indexing result?








Update Request Processor Chains

2013-04-19 Thread Furkan KAMACI
I am trying to understand update request processor chains. Do they run one
by one when indexing a document? Can I define multiple update request
processor chains? Also, what are LogUpdateProcessorFactory and
RunUpdateProcessorFactory?
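
(For context, a chain is declared in solrconfig.xml as an ordered list of
processors that each document passes through; a minimal sketch, with an
invented chain name:)

  <updateRequestProcessorChain name="mychain">
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>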


Re: solr-cloud performance decrease day by day

2013-04-19 Thread Jack Krupansky
How are you committing data? With 4.0, CommitWithin is now a soft commit, 
which means that the transaction log will grow until you do a hard commit. 
You need to periodically do a hard commit if you are continually updating 
the index. How much updating are you doing?
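
A minimal autoCommit sketch for solrconfig.xml (the interval is an assumption,
tune it to your update rate):

  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit at most every 60 seconds -->
    <openSearcher>false</openSearcher>  <!-- flush to disk without opening a new searcher -->
  </autoCommit>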


Also, check how much heap is available after you first start the server and 
have done a few queries, and then monitor available heap over time. Maybe you 
are hitting garbage collections. Maybe you have too much heap allocated so 
that even a normal Java GC just takes a very long time because so much 
garbage accumulates - which is why you want only a modest amount of heap 
available above what the data needs after a few queries have loaded caches.


-- Jack Krupansky

-Original Message- 
From: qibaoyuan

Sent: Friday, April 19, 2013 3:15 AM
To: solr-user@lucene.apache.org
Subject: solr-cloud performance decrease day by day

Hello,
   I am using Solr 4.1.0 and I have used SolrCloud in my product. I have 
found that at first everything seems good, the search time is fast and latency is 
low, but it becomes very slow after a few days. Does anyone know if there are 
some params or optimizations to use with SolrCloud?



fuzzy search issue with PatternTokenizer Factory

2013-04-19 Thread meghana
I am using Solr 4.2. I have changed my text field definition to use
solr.PatternTokenizerFactory instead of solr.StandardTokenizerFactory, and
changed my schema definition as below:

<fieldType name="text_token" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory"
      pattern="[^a-zA-Z0-9&amp;\-']|\d{0,4}s: " />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
      words="stopwords.txt" enablePositionIncrements="false" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory"
      pattern="[^a-zA-Z0-9&amp;\-']|\d{0,4}s: " />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
      words="stopwords_extra_query.txt" enablePositionIncrements="false" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
      ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

After doing so, fuzzy search does not seem to work properly as it was
working before. 

I am searching with the search term: worde~1 

Before, the search was returning around 300 records, but now it's
returning only 5 records. Not sure what the issue can be.

Can anybody help me to make it work!!









Re: facet.method enum vs fc

2013-04-19 Thread Joel Bernstein
Faceting on a high cardinality string field, like url, on a 120 million
record index is going to be very memory intensive.

You will very likely need to shard the index to get the performance that
you need.

In Solr 4.2, you can make the url field a Disk based DocValue and shift the
memory from Solr to the file system cache. But to run efficiently this is
still going to take a lot of memory in the OS file cache.
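
A rough sketch of what that could look like in a 4.2 schema (the type and field
names are invented, and the docValuesFormat attribute is from memory of the 4.2
syntax; double-check it against the 4.2 example schema):

  <fieldType name="string_dvd" class="solr.StrField" docValues="true" docValuesFormat="Disk"/>
  <field name="url" type="string_dvd" indexed="true" stored="true"/>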




On Thu, Apr 18, 2013 at 12:00 PM, Mingfeng Yang mfy...@wisewindow.com wrote:

 20G is allocated to Solr already.

 Ming


 On Wed, Apr 17, 2013 at 11:56 PM, Toke Eskildsen t...@statsbiblioteket.dk
 wrote:

  On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote:
   I am doing faceting on an index of 120M documents,
   on the field of url[...]
 
  I would guess that you would need 3-4GB for that.
  How much memory do you allocate to Solr?
 
  - Toke Eskildsen
 
 




-- 
Joel Bernstein
Professional Services LucidWorks


Import in Solr

2013-04-19 Thread hassancrowdc
I want to update (delta-import) one specific item. Is there any query to do
that? 

Like I can delete a specific item with the following query: 

localhost:8080/solr/devices/update?stream.body=<delete><query>id:46</query></delete>&commit=true

Thanks.





Returning similarity values for more like this search

2013-04-19 Thread Achim Domma
Hi,

I'm executing a search including a search for similar documents 
(mlt=true&mlt.fl=) which works fine so far. I would like to get the 
similarity value for each document. I expected this to be quite common and 
simple, but I could not find a hint how to do it. Any hint how to do it would 
be very appreciated.

kind regards,
Achim

RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-19 Thread Dyer, James
I guess the first thing I'd do is to set maxCollationTries to zero.  This 
means it will only run your main query once and not re-run it to check the 
collations. Now see if your queries have consistent qtime.  One easy 
explanation is that with maxCollationTries=10, it may be running your query 
up to 11 times to check up to 10 possible collations.  If the query takes 50ms 
by itself, then you've got 550ms total to not find spelling corrections.  
Unfortunately, the worst case here is the one that gives the user nothing back. 
 

Another thing to look at, with maxCollationTries at zero, set maxCollations 
to 10.  This will give you a list of the 10 collations it would have tried.  
You can figure if the one that gets hits is far enough down the list to explain 
the high total qtime when maxCollationTries=10.  If this explains it, then 
the obvious solution is to set maxCollationTries to something lower than 10.  
(you'll need to weigh how long you're willing to make your users wait to 
possibly get spelling suggestions)  Or possibly, use spellcheck.q to give it 
an easier query to evaluate than the main query (but that can still give valid 
collations). Also, see https://issues.apache.org/jira/browse/SOLR-3240 which is 
an optimization for this feature.
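
In the /select defaults shown below, that first experiment would just be (values
as suggested above):

  <str name="spellcheck.maxCollationTries">0</str>
  <str name="spellcheck.maxCollations">10</str>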

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: SandeepM [mailto:skmi...@hotmail.com] 
Sent: Thursday, April 18, 2013 11:33 PM
To: solr-user@lucene.apache.org
Subject: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

Hi!

I am using SOLR 4.2.1.

My solrconfig.xml contains the following:

  <searchComponent name="MySpellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">text_spell</str>

    <lst name="spellchecker">
      <str name="name">MySpellchecker</str>
      <str name="field">spell</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <str name="distanceMeasure">internal</str>
      <float name="accuracy">0.5</float>
      <int name="maxEdits">2</int>
      <int name="minPrefix">1</int>
      <int name="maxInspections">5</int>
      <int name="minQueryLength">3</int>
      <float name="maxQueryFrequency">0.01</float>
    </lst>
  </searchComponent>

  <requestHandler name="/select" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <int name="rows">10</int>
      <str name="df">id</str>
      <str name="spellcheck.dictionary">MySpellchecker</str>
      <str name="spellcheck">on</str>
      <str name="spellcheck.extendedResults">false</str>
      <str name="spellcheck.count">10</str>
      <str name="spellcheck.alternativeTermCount">10</str>
      <str name="spellcheck.maxResultsForSuggest">35</str>
      <str name="spellcheck.onlyMorePopular">true</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.collateExtendedResults">false</str>
      <str name="spellcheck.maxCollationTries">10</str>
      <str name="spellcheck.maxCollations">1</str>
      <str name="spellcheck.collateParam.q.op">AND</str>
    </lst>
    <arr name="last-components">
      <str>MySpellcheck</str>
    </arr>
  </requestHandler>

schema.xml with the spell field looks like:

<fieldType name="text_spell" class="solr.TextField"
    positionIncrementGap="100" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
      words="lang/stopwords_en.txt" enablePositionIncrements="true" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
      words="lang/stopwords_en.txt" enablePositionIncrements="true" />
  </analyzer>
</fieldType>

<field name="spell" type="text_spell" indexed="true"
  stored="false" multiValued="true" />

<copyField source="title" dest="spell" />
<copyField source="artist" dest="spell" />
 
My query:
http://host/solr/select?q=&spellcheck.q=chocolat%20factry&spellcheck=true&df=spell&fl=&indent=on&wt=xml&rows=10&version=2.2&echoParams=explicit

In this case, the intent is to correct "chocolat factry" with "chocolate
factory", which exists in my spell field index. I see a QTime from the above
query of somewhere between 350-400ms.

I run a similar query replacing the spellcheck terms with "pursut hapyness",
whereas "pursuit happyness" actually exists in my spell field, and I see a
QTime of 15-17ms.

Both queries produce collations correctly, but there is an order of magnitude
difference in QTime.  There is one edit per term in both cases, or 2 edits in
each query. The lengths of the words in both these queries seem identical. I'd
like to understand why there is this vast difference in QTime.  I would
appreciate 

Re: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread Shawn Heisey
On 4/19/2013 3:48 AM, David Parks wrote:
 The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has dark grey
 allocation of 602MB, and light grey of an additional 108MB, for a JVM total
 of 710MB allocated. If I understand correctly, Solr memory utilization is
 *not* for caching (unless I configured document caches or some of the other
 cache options in Solr, which don't seem to apply in this case, and I haven't
 altered from their defaults).

Right.  Solr does have caches, but they serve specific purposes.  The OS
is much better at general large-scale caching than Solr is.  Solr caches
get cleared (and possibly re-warmed) whenever you issue a commit on your
index that makes new documents visible.

 So assuming this box was dedicated to 1 solr instance/shard. What JVM heap
 should I set? Does that matter? 24GB JVM heap? Or keep it lower and ensure
 the OS cache has plenty of room to operate? (this is an Ubuntu 12.10 server
 instance).

The JVM heap to use is highly dependent on the nature of your queries,
the number of documents, the number of unique terms, etc.  The best
thing to do is try it out with a relatively large heap, see how much
memory actually gets used inside the JVM.  The jvisualvm and jconsole
tools will give you nice graphs of JVM memory usage.  The jstat program
will give you raw numbers on the commandline that you'll need to add to
get the full picture.  Due to the garbage collection model that Java
uses, what you'll see is a sawtooth pattern - memory usage goes up to
max heap, then garbage collection reduces it to the actual memory used.
 Generally speaking, you want to have more heap available than the low
point of that sawtooth pattern.  If that low point is around 3GB when
you are hitting your index hard with queries and updates, then you would
want to give Solr a heap of 4 to 6 GB.

 Would I be wise to just put the index on a RAM disk and guarantee
 performance?  Assuming I installed sufficient RAM?

A RAM disk is a very good way to guarantee performance - but RAM disks
are ephemeral.  Reboot or have an OS crash and it's gone, you'll have to
reindex.  Also remember that you actually need at *least* twice the size
of your index so that Solr (Lucene) has enough room to do merges, and
the worst-case scenario is *three* times the index size.  Merging
happens during normal indexing, not just when you optimize.  If you have
enough RAM for three times your index size and it takes less than an
hour or two to rebuild the index, then a RAM disk might be a viable way
to go.  I suspect that this won't work for you.

Thanks,
Shawn



Re: Returning similarity values for more like this search

2013-04-19 Thread Koji Sekiguchi

(13/04/19 23:24), Achim Domma wrote:

Hi,

I'm executing a search including a search for similar documents 
(mlt=true&mlt.fl=) which works fine so far. I would like to get the 
similarity value for each document. I expected this to be quite common and simple, 
but I could not find a hint how to do it. Any hint how to do it would be very 
appreciated.

kind regards,
Achim



Using debugQuery=true, you can find explanations in the debug section of the 
response.

See:
https://issues.apache.org/jira/browse/SOLR-860

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


is phrase search possible in solr

2013-04-19 Thread vicky desai
I want to do a phrase search in solr without analyzers being applied to it. 
e.g. - If I search for "DelhiDareDevil" (i.e. - with inverted commas) it
should search the exact text and not apply any analyzers or tokenizers on
this field.
However if I search for DelhiDareDevil (without inverted commas) it should use tokenizers and
analyzers and split it to something like this: delhi dare devil

My schema definition for this is as follows

<fieldType name="text" class="solr.TextField"
    positionIncrementGap="100"
    autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1" generateNumberParts="1" catenateWords="1"
      catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
      preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1" generateNumberParts="1" catenateWords="1"
      catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
      preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilter``Factory" />
  </analyzer>
</fieldType>

<field name="cContent" type="text" indexed="true" stored="true"
  multiValued="false"/>

any help would be appreciated






Re: SEVERE: shard update error StdNode on SolrCloud 4.2.1

2013-04-19 Thread Steve Woodcock
On 16 April 2013 11:35, Steve Woodcock steve.woodc...@gmail.com wrote:

 We have a simple SolrCloud setup (4.2.1) running with a single shard and
 two nodes, and it's working fine except whenever we send an update request,
 the leader logs this error:

 SEVERE: shard update error StdNode:
 http://10.20.10.42:8080/solr/ts/:org.apache.solr.common.SolrException:
 Server at http://10.20.10.42:8080/solr/ts returned non ok status:500,
 message:Internal Server Error
  at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)


Turns out I think this was caused by having the wrong type for the
_version_ field in the schema. We had type="string", but it should be
type="long", i.e.

  <field name="_version_" type="long" indexed="true" stored="true"
    multiValued="false"/>

Which, to be fair, is well documented at
http://wiki.apache.org/solr/SolrCloud

Certainly seems to be working a lot better so far ...

Cheers, Steve


Re: is phrase search possible in solr

2013-04-19 Thread Raymond Wiker
On Apr 19, 2013, at 16:59 , vicky desai vicky.de...@germinait.com wrote:
 I want to do a phrase search in solr without analyzers being applied to it. 
 e.g. - If I search for "DelhiDareDevil" (i.e. - with inverted commas) it
 should search the exact text and not apply any analyzers or tokenizers on
 this field.
 However if I search for DelhiDareDevil (without inverted commas) it should use tokenizers and
 analyzers and split it to something like this: delhi dare devil
 
 My schema definition for this is as follows
 
 <fieldType name="text" class="solr.TextField"
     positionIncrementGap="100"
     autoGeneratePhraseQueries="false">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory" />
     <filter class="solr.WordDelimiterFilterFactory"
       generateWordParts="1" generateNumberParts="1" catenateWords="1"
       catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
       preserveOriginal="1"/>
     <filter class="solr.LowerCaseFilterFactory" />
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory" />
     <filter class="solr.WordDelimiterFilterFactory"
       generateWordParts="1" generateNumberParts="1" catenateWords="1"
       catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
       preserveOriginal="1"/>
     <filter class="solr.LowerCaseFilter``Factory" />
   </analyzer>
 </fieldType>

 <field name="cContent" type="text" indexed="true" stored="true"
   multiValued="false"/>
 
 any help would be appreciated


First of all, it appears that you have a typo in the definition for the 
LowerCaseFilter for the query analyzer.

Secondly, as the two analyzers appear to be identical (except for the probable 
typo), I think you could just specify it once, without specifying the type.




Re: is phrase search possible in solr

2013-04-19 Thread Jack Krupansky
By definition, phrase search is one of two things: 1) match on a string 
field literally, or 2) analyze as a sequence of tokens as per the field type 
index analyzer.


You could use the keyword tokenizer to store the whole field as one string, 
with filtering for the whole string. Or, just make it a string field and do 
literal and wildcard matches.


You can use copyField to make copies of the same input data in multiple 
fields, each with different analyzers. You would then need to specify which 
field you want to search, whether literal or keyword.


-- Jack Krupansky

-Original Message- 
From: vicky desai

Sent: Friday, April 19, 2013 10:59 AM
To: solr-user@lucene.apache.org
Subject: is phrase search possible in solr

I want to do a phrase search in solr without analyzers being applied to it.
e.g. - If I search for "DelhiDareDevil" (i.e. - with inverted commas) it
should search the exact text and not apply any analyzers or tokenizers on
this field.
However if I search for DelhiDareDevil (without inverted commas) it should use tokenizers and
analyzers and split it to something like this: delhi dare devil

My schema definition for this is as follows

   <fieldType name="text" class="solr.TextField"
     positionIncrementGap="100" autoGeneratePhraseQueries="false">
     <analyzer type="index">
       <tokenizer class="solr.WhitespaceTokenizerFactory" />
       <filter class="solr.WordDelimiterFilterFactory"
         generateWordParts="1" generateNumberParts="1" catenateWords="1"
         catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
         preserveOriginal="1"/>
       <filter class="solr.LowerCaseFilterFactory" />
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory" />
       <filter class="solr.WordDelimiterFilterFactory"
         generateWordParts="1" generateNumberParts="1" catenateWords="1"
         catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
         preserveOriginal="1"/>
       <filter class="solr.LowerCaseFilter``Factory" />
     </analyzer>
   </fieldType>

   <field name="cContent" type="text" indexed="true" stored="true"
     multiValued="false"/>

any help would be appreciated




--
View this message in context: 
http://lucene.472066.n3.nabble.com/is-phrase-search-possible-in-solr-tp4057312.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Pros and cons of using RAID or different RAIDS?

2013-04-19 Thread Furkan KAMACI
Is there any documentation that explains pros and cons of using RAID or
different RAIDS?


Re: is phrase search possible in solr

2013-04-19 Thread Jack Krupansky

Oops... that's query analyzer, not index analyzer, so it's:

By definition, phrase search is one of two things: 1) match on a string
field literally, or 2) analyze as a sequence of tokens as per the field type
query analyzer.

-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Friday, April 19, 2013 11:14 AM
To: solr-user@lucene.apache.org
Subject: Re: is phrase search possible in solr

By definition, phrase search is one of two things: 1) match on a string
field literally, or 2) analyze as a sequence of tokens as per the field type
index analyzer.

You could use the keyword tokenizer to store the whole field as one string,
with filtering for the whole string. Or, just make it a string field and do
literal and wildcard matches.

You can use copyField to make copies of the same input data in multiple
fields, each with different analyzers. You would then need to specify which
field you want to search, whether literal or keyword.

-- Jack Krupansky

-Original Message- 
From: vicky desai

Sent: Friday, April 19, 2013 10:59 AM
To: solr-user@lucene.apache.org
Subject: is phrase search possible in solr

I want to do a phrase search in solr without analyzers being applied to it
eg - If I search for *DelhiDareDevil* (i.e - with inverted commas)it
should search the exact text and not apply any analyzers or tokenizers on
this field
However if i search for *DelhiDareDevil* it should use tokenizers and
analyzers and split it to something like this *delhi dare devil*

My schema definition for this is as follows

   <fieldType name="text" class="solr.TextField"
       positionIncrementGap="100" autoGeneratePhraseQueries="false">
     <analyzer type="index">
       <tokenizer class="solr.WhitespaceTokenizerFactory" />
       <filter class="solr.WordDelimiterFilterFactory"
           generateWordParts="1" generateNumberParts="1" catenateWords="1"
           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
           preserveOriginal="1"/>
       <filter class="solr.LowerCaseFilterFactory" />
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory" />
       <filter class="solr.WordDelimiterFilterFactory"
           generateWordParts="1" generateNumberParts="1" catenateWords="1"
           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
           preserveOriginal="1"/>
       <filter class="solr.LowerCaseFilter``Factory" />
     </analyzer>
   </fieldType>

   <field name="cContent" type="text" indexed="true" stored="true" multiValued="false"/>

any help would be appreciated




--
View this message in context:
http://lucene.472066.n3.nabble.com/is-phrase-search-possible-in-solr-tp4057312.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Searching

2013-04-19 Thread hassancrowdc
I want to search so that:

- if i write an alphabet it returns all the items that start with that
alphabet(a returns apple, aspire etc).

- if i ask for a whole string, it returns me just the results with exact
string. (like search for Samsung S3 then only result is samsung s3)

-if i ask for something it returns me anything that is similar to what i m
asking.(like if i only write 'sam' it should return 'samsung') 

right now i m using text_en_splitting for my field type, it looks like this:

    <fieldType name="text_en_splitting" class="solr.TextField"
        positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.PositionFilterFactory" />
      </analyzer>
    </fieldType>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-tp4057328.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: facet.method enum vs fc

2013-04-19 Thread Mingfeng Yang
Joel,

Thanks for your kind reply.   The problem is solved with sharding and using
facet.method=enum.  I am curious about  what's the different between enum
and fc, so that enum works but fc does not.   Do you know something about
this?

Thank you!

Regards,
Ming


On Fri, Apr 19, 2013 at 6:18 AM, Joel Bernstein joels...@gmail.com wrote:

 Faceting on a high cardinality string field, like url, on a 120 million
 record index is going to be very memory intensive.

 You will very likely need to shard the index to get the performance that
 you need.

 In Solr 4.2, you can make the url field a Disk based DocValue and shift the
 memory from Solr to the file system cache. But to run efficiently this is
 still going to take a lot of memory in the OS file cache.




 On Thu, Apr 18, 2013 at 12:00 PM, Mingfeng Yang mfy...@wisewindow.com
 wrote:

  20G is allocated to Solr already.
 
  Ming
 
 
  On Wed, Apr 17, 2013 at 11:56 PM, Toke Eskildsen t...@statsbiblioteket.dk
  wrote:
 
   On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote:
I am doing faceting on an index of 120M documents,
on the field of url[...]
  
   I would guess that you would need 3-4GB for that.
   How much memory do you allocate to Solr?
  
   - Toke Eskildsen
  
  
 



 --
 Joel Bernstein
 Professional Services LucidWorks



Re: Searching

2013-04-19 Thread Jack Krupansky
Yes, you can do all of that... but it would be a non-trivial amount of 
effort - the kind of thing consultants get paid real money to do. You should 
also consider doing it in a middleware application layer, using possibly 
multiple queries of separate Solr collections. Otherwise, your index might 
become too large and unwieldy (and risk giving bad or misleading results), 
unless the number of products is rather small.
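
If the catalog is small enough to keep in one index, the usual building blocks are an edge-ngram field for the starts-with case plus a string copy of the name for the exact-match case. A rough sketch only (field names here are made up for illustration):

    <fieldType name="text_prefix" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="name_prefix" type="text_prefix" indexed="true" stored="false"/>
    <field name="name_exact" type="string" indexed="true" stored="false"/>
    <copyField source="name" dest="name_prefix"/>
    <copyField source="name" dest="name_exact"/>

With that, name_prefix:sam would match "samsung", and name_exact:"Samsung S3" would only match that exact value.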


-- Jack Krupansky

-Original Message- 
From: hassancrowdc

Sent: Friday, April 19, 2013 11:48 AM
To: solr-user@lucene.apache.org
Subject: Searching

I want to search so that:

- if i write an alphabet it returns all the items that start with that
alphabet(a returns apple, aspire etc).

- if i ask for a whole string, it returns me just the results with exact
string. (like search for Samsung S3 then only result is samsung s3)

-if i ask for something it returns me anything that is similar to what i m
asking.(like if i only write 'sam' it should return 'samsung')

right now i m using text_en_splitting for my field type, it looks like this:

    <fieldType name="text_en_splitting" class="solr.TextField"
        positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.PositionFilterFactory" />
      </analyzer>
    </fieldType>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-tp4057328.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Update Request Processor Chains

2013-04-19 Thread Erik Hatcher
You can have multiple update chains defined and use only one of them per update 
request.

LogUpdateProcessor logs the update request and the RunUpdateProcessor is where 
the actual index is updated.

Erik



On Apr 19, 2013, at 07:49 , Furkan KAMACI wrote:

 I am trying to understand update request processor chains. Do they runs one
 by one when indexing a ducument? Can I identify multiple update request
 processor chains? Also what are that LogUpdateProcessorFactory and
 RunUpdateProcessorFactory?



Re: WordDelimiterFactory

2013-04-19 Thread Ashok
Yes, thank you Erick. The analysis/document handlers hold the key to deciding
the type & order of the filters to employ given one's document set & 
subject matter at hand. The finalized terms they produce for SOLR search,
mlt etc... are crucial to the quality of the results.

- ashok



--
View this message in context: 
http://lucene.472066.n3.nabble.com/WordDelimiterFactory-tp4056529p4057349.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: fuzzy search issue with PatternTokenizer Factory

2013-04-19 Thread Jack Krupansky
Give us some examples of tokens that you are expecting that pattern to 
tokenize. And express the pattern in simple English as well. Show some 
actual input data.


I suspect that Solr is working fine - but you may not have precisely 
specified your pattern. But we don't know what your pattern is supposed to 
recognize.


Maybe some of your previous hits had punctuation adjacent to the terms 
that your pattern doesn't recognize.


And use the Solr Admin UI Analysis page to see how your sample input data is 
analyzed.

One other thing... without a group, the pattern specifies what delimiter 
sequence will split the rest of the input into tokens. I suspect you 
didn't mean this.


-- Jack Krupansky

-Original Message- 
From: meghana

Sent: Friday, April 19, 2013 9:01 AM
To: solr-user@lucene.apache.org
Subject: fuzzy search issue with PatternTokenizer Factory

I m using Solr4.2 , I have changed my text field definition, to use the
Solr.PatternTokenizerFactory instead of Solr.StandardTokenizerFactory , and
changed my schema defination as below

    <fieldType name="text_token" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.PatternTokenizerFactory"
            pattern="[^a-zA-Z0-9&amp;\-']|\d{0,4}s: " />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="false" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.PatternTokenizerFactory"
            pattern="[^a-zA-Z0-9&amp;\-']|\d{0,4}s: " />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_extra_query.txt" enablePositionIncrements="false" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

after doing so, fuzzy search do not seems to working properly as it was
working before.

I m searching with search term : worde~1

on search , before it was returning , around 300 records , but now its
returning only 5 records. not sure what can be issue.

Can anybody help me to make it work!!







--
View this message in context: 
http://lucene.472066.n3.nabble.com/fuzzy-search-issue-with-PatternTokenizer-Factory-tp4057275.html
Sent from the Solr - User mailing list archive at Nabble.com. 



RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-19 Thread SandeepM
James,
Thanks for the reply.  I see your point and sure enough, reducing
maxCollationTries does reduce time, however may not produce results.
It seems like the time is taken for the collations re-runs.  Is there any
way we can activate caching for collations.  The same query repeatedly takes
the same amount of time.  My queryCaches are activated, however don't
believe it gets used for spellchecks.
Thanks.
-- Sandeep



--
View this message in context: 
http://lucene.472066.n3.nabble.com/DirectSolrSpellChecker-vastly-varying-spellcheck-QTime-times-tp4057176p4057389.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Updating clusterstate from the zookeeper

2013-04-19 Thread Michael Della Bitta
I would like to know the answer to this as well.

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Thu, Apr 18, 2013 at 8:15 PM, Manuel Le Normand
manuel.lenorm...@gmail.com wrote:
 Hello,
 After creating a distributed collection on several different servers I
 sometimes get to deal with failing servers (cores appear not available =
 grey) or failing cores (Down / unable to recover = brown / red).
 In case i wish to delete this errorneous collection (through collection
 API) only the green nodes get erased, leaving a meaningless unavailable
 collection in the clusterstate.json.

 Is there any way to edit explicitly the clusterstate.json? If not, how do i
 update it so the collection as above gets deleted?

 Cheers,
 Manu


Re: Updating clusterstate from the zookeeper

2013-04-19 Thread mike st. john
you can use the eclipse plugin for zookeeper.


http://www.massedynamic.org/mediawiki/index.php?title=Eclipse_Plug-in_for_ZooKeeper


-Msj.


On Fri, Apr 19, 2013 at 1:53 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 I would like to know the answer to this as well.

 Michael Della Bitta

 
 Appinions
 18 East 41st Street, 2nd Floor
 New York, NY 10017-6271

 www.appinions.com

 Where Influence Isn’t a Game


 On Thu, Apr 18, 2013 at 8:15 PM, Manuel Le Normand
 manuel.lenorm...@gmail.com wrote:
  Hello,
  After creating a distributed collection on several different servers I
  sometimes get to deal with failing servers (cores appear not available
 =
  grey) or failing cores (Down / unable to recover = brown / red).
  In case i wish to delete this errorneous collection (through collection
  API) only the green nodes get erased, leaving a meaningless unavailable
  collection in the clusterstate.json.
 
  Is there any way to edit explicitly the clusterstate.json? If not, how
 do i
  update it so the collection as above gets deleted?
 
  Cheers,
  Manu



RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-19 Thread Dyer, James
I do not know what it would take to have the collation tests make better use of 
the QueryResultCache.  However, outside of a test scenario, I do not know if 
this would help a lot.

Hopefully you wouldn't have a lot of users issuing the exact same query with 
the exact same misspelled words over and over.  In the real world, if you find 
that a collation is a better query than the one the user initially issued, then 
when that user pages through results, etc, your application should use the 
corrected query and not re-run the incorrect query over and over again.  In the 
case of maxResultsForSuggest, if a user does the first query then rejects any 
did-you-mean suggestions, you can just turn spellcheck off if they page, 
facet, etc, so that you don't have to generate these suggestions over and over 
again.  

You do have to weigh when setting maxCollationTries whether or not it is 
acceptable to make a user with a misspelled query wait 1/2 second or so to 
(hopefully) get a correction, or if you want to simply reduce the maximum time 
someone will have to wait.  If you find that it usually needs 10 tries to find 
a good collation, then you probably need to try a different distance algorithm, 
or play with the various accuracy settings to see if you can get better 
corrections to be nearer the top of the individual-word lists.  Also, try 
setting alternativeTermCount lower than count (maybe set it to 1/2 of 
what you have for count).  This will reduce the number of terms it has to try 
combinations of.  If you set maxResultsForSuggest to a lower value (like 2-3, 
maybe), then it won't try to return did-you-mean suggestions for queries 
returning (was it 35?!) hits.
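
To make those trade-offs concrete, all of these knobs can be set as request handler defaults (the values below are only illustrative, not recommendations):

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.count">10</str>
        <str name="spellcheck.alternativeTermCount">5</str>
        <str name="spellcheck.maxResultsForSuggest">3</str>
        <str name="spellcheck.collate">true</str>
        <str name="spellcheck.maxCollationTries">5</str>
      </lst>
      <arr name="last-components">
        <str>spellcheck</str>
      </arr>
    </requestHandler>

They can also be overridden per request as plain query parameters.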

As I mentioned, SOLR-3240 does have promise of speeding this feature up so 
maybe we won't have to talk about these kinds of trade-offs so much in the 
future.

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: SandeepM [mailto:skmi...@hotmail.com] 
Sent: Friday, April 19, 2013 12:48 PM
To: solr-user@lucene.apache.org
Subject: RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

James,
Thanks for the reply.  I see your point and sure enough, reducing
maxCollationTries does reduce time, however may not produce results.
It seems like the time is taken for the collations re-runs.  Is there any
way we can activate caching for collations.  The same query repeatedly takes
the same amount of time.  My queryCaches are activated, however don't
believe it gets used for spellchecks.
Thanks.
-- Sandeep



--
View this message in context: 
http://lucene.472066.n3.nabble.com/DirectSolrSpellChecker-vastly-varying-spellcheck-QTime-times-tp4057176p4057389.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Updating clusterstate from the zookeeper

2013-04-19 Thread Mingfeng Yang
Right. I am wondering if/how we can download a specific file from the
zookeeper, modify it and then upload to rewrite it.  Anyone ?

Thanks,
Ming


On Fri, Apr 19, 2013 at 10:53 AM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 I would like to know the answer to this as well.

 Michael Della Bitta

 
 Appinions
 18 East 41st Street, 2nd Floor
 New York, NY 10017-6271

 www.appinions.com

 Where Influence Isn’t a Game


 On Thu, Apr 18, 2013 at 8:15 PM, Manuel Le Normand
 manuel.lenorm...@gmail.com wrote:
  Hello,
  After creating a distributed collection on several different servers I
  sometimes get to deal with failing servers (cores appear not available
 =
  grey) or failing cores (Down / unable to recover = brown / red).
  In case i wish to delete this errorneous collection (through collection
  API) only the green nodes get erased, leaving a meaningless unavailable
  collection in the clusterstate.json.
 
  Is there any way to edit explicitly the clusterstate.json? If not, how
 do i
  update it so the collection as above gets deleted?
 
  Cheers,
  Manu



Re: Updating clusterstate from the zookeeper

2013-04-19 Thread Nate Fox
I've used zookeeper's cli to do this. I doubt its the right way and I have
no idea if it'll work for clusterstate.json, but it seems to work for
certain things.

cd /opt/zookeeper/bin
./zkCli.sh -server 127.0.0.1:2183 set /configs/collection1/schema.xml `cat
/tmp/newschema.xml`
sleep 10  # give a lil time to get pushed out
curl "http://localhost:8080/solr/admin/cores?wt=json&action=RELOAD&core=collection1"


This is on zk 3.4.5



--
Nate Fox
Sr Systems Engineer

o: 310.658.5775
m: 714.248.5350

Follow us @NEOGOV http://twitter.com/NEOGOV and on
Facebookhttp://www.facebook.com/neogov

NEOGOV http://www.neogov.com/ is among the top fastest growing software
companies in the USA, recognized by Inc 500|5000, Deloitte Fast 500, and
the LA Business Journal. We are hiring!http://www.neogov.com/#/company/careers



On Fri, Apr 19, 2013 at 11:30 AM, Mingfeng Yang mfy...@wisewindow.comwrote:

 Right. I am wondering if/how we can download a specific file from the
 zookeeper, modify it and then upload to rewrite it.  Anyone ?

 Thanks,
 Ming


 On Fri, Apr 19, 2013 at 10:53 AM, Michael Della Bitta 
 michael.della.bi...@appinions.com wrote:

  I would like to know the answer to this as well.
 
  Michael Della Bitta
 
  
  Appinions
  18 East 41st Street, 2nd Floor
  New York, NY 10017-6271
 
  www.appinions.com
 
  Where Influence Isn’t a Game
 
 
  On Thu, Apr 18, 2013 at 8:15 PM, Manuel Le Normand
  manuel.lenorm...@gmail.com wrote:
   Hello,
   After creating a distributed collection on several different servers I
   sometimes get to deal with failing servers (cores appear not
 available
  =
   grey) or failing cores (Down / unable to recover = brown / red).
   In case i wish to delete this errorneous collection (through collection
   API) only the green nodes get erased, leaving a meaningless
 unavailable
   collection in the clusterstate.json.
  
   Is there any way to edit explicitly the clusterstate.json? If not, how
  do i
   update it so the collection as above gets deleted?
  
   Cheers,
   Manu
 



Weird query issues

2013-04-19 Thread Ravi Solr
Hello,
We are using Solr 3.6.2 single core ( both index and query on same machine)
and randomly the server fails to query correctly.  If we query from the
admin console the query is not even applied and it returns numFound count
equal to total docs in the index as if no query is made, and if use SOLRJ
to query it throws javabin error

Invalid version (expected 2, but 60) or the data in not in 'javabin' format

Once we restart the container everything is back to normal.

In the process of debugging the solr logs I found empty queries like the
one below. Can anybody tell me what can cause empty queries in the log as
given below? I am trying to see if it may be related to the solr issues.

[#|2013-04-19T14:10:20.308-0400|INFO|sun-appserver2.1.1|org.apache.solr.core.SolrCore|_ThreadID=19;_ThreadName=httpSSLWorkerThread-9001-0;|[core1]
webapp=/solr path=/select params={} hits=21727 status=0 QTime=24 |#]

Would Appreciate any pointers

Thanks

Ravi Kiran Bhaskar


Could not find an instance of QueryComponent. Disabling collation verification against the index.

2013-04-19 Thread balaji.gandhi
Hi Team,

I am trying to configure the Auto-suggest feature for the businessProvince
field in my schema.

I followed the instructions here:- http://wiki.apache.org/solr/Suggester

But then I got the following error:- INFO: Could not find an instance of
QueryComponent. Disabling collation verification against the index.

Based on this forum
(http://stackoverflow.com/questions/10547438/solr-returns-only-one-collation-for-suggester-component),
I added a query component.

So now all these queries work:-
http://localhost:8983/solr/collection1/cityProvinceSuggest?q=AZ - Searches
the default field
http://localhost:8983/solr/collection1/cityProvinceSuggest?q=businessProvince:AZ
Searches the businessProvince field
http://localhost:8983/solr/collection1/cityProvinceSuggest?q=businessCity:Phoenix
Searches the businessCity field
http://localhost:8983/solr/collection1/cityProvinceSuggest?q=name:Balaji
Searches the name field

So my question now is whether the field element is honored? Because holding
all the data in the lookup data structure may cause memory issues. Any help
will be appreciated.

 <searchComponent class="solr.SpellCheckComponent" name="suggest">
   <lst name="spellchecker">
     <str name="name">suggest</str>
     <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
     <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
     <str name="field">businessProvince</str>
     <float name="threshold">0.005</float>
     <str name="buildOnCommit">true</str>
   </lst>
 </searchComponent>
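
(For reference, the request handler wiring being described, i.e. the Suggester wiki example plus the added query component, would look roughly like this; the parameter values are illustrative only:)

    <requestHandler name="/cityProvinceSuggest" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.dictionary">suggest</str>
        <str name="spellcheck.onlyMorePopular">true</str>
        <str name="spellcheck.count">5</str>
        <str name="spellcheck.collate">true</str>
      </lst>
      <arr name="components">
        <str>query</str>
        <str>suggest</str>
      </arr>
    </requestHandler>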

Thanks,
Balaji



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Could-not-find-an-instance-of-QueryComponent-Disabling-collation-verification-against-the-index-tp4057417.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr-cloud performance decrease day by day

2013-04-19 Thread alxsss
How many segments does each shard have, and what is the reason for running 
multiple shards on one machine?

Alex.

 

 

 

-Original Message-
From: qibaoyuan qibaoy...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Fri, Apr 19, 2013 12:26 am
Subject: Re: solr-cloud performance decrease day by day


there are 6 shards and they are in one machine,and the jvm param is very 
big,the 
physical memory is 16GB,the total #docs is about 150k,the index size of each 
shard is about 1GB.AND there is indexing while searching,I USE auto commit  
each 
10min.and the data comes about 100 per minutes. 


On 2013-4-19, at 3:17 PM, Furkan KAMACI furkankam...@gmail.com wrote:

 Could you give more info about your index size and technical details of
 your machine? Maybe you are indexing more data day by day and your RAM
 capability is not enough anymore?
 
 2013/4/19 qibaoyuan qibaoy...@gmail.com
 
 Hello,
   i am using sold 4.1.0 and ihave used sold cloud in my product.I have
 found at first everything seems good,the search time is fast and delay is
 slow,but it becomes very slow after days.does any one knows if there maybe
 some params or optimization to use sold cloud?


 



Re: facet.method enum vs fc

2013-04-19 Thread Chris Hostetter

: Thanks for your kind reply.   The problem is solved with sharding and using
: facet.method=enum.  I am curious about  what's the different between enum
: and fc, so that enum works but fc does not.   Do you know something about
: this?

method=fc/fcs uses the field caches (or uninverted fields if they are 
multivalued) to build a large data structure that is reusable across 
many requests and allows faceting to happen very quickly even when the 
number of terms is large.

enum causes solr to walk the term enum for the field and generate a DocSet 
for each term which is then intersected with the main results -- basically 
doing facet.field just like facet.query with simple term queries.

these DocSets from using facet.method=enum will be cached in the 
filterCache, so there is some performance savings there if/when people 
filter on these facet constraints, but the regular rules about cache 
evictions apply.

So in a situation where the heap size is big enough not to matter 
method=fc should be faster and take up less ram than if you size your 
filterCache big enough to hold all of the DocSets involved if you use 
method=enum to not have cache evictions.  

In most cases, the only motivation for using method=enum is if you know 
the cardinality of your set of constraints is relatively small and fixed 
(ie: there are only 50 states in the US, so you might find that faceting 
on a state field with method=enum is just as fast as using method=fc and 
takes less ram -- this is why boolean fields default to method=enum, the 
cardinality is guaranteed to be 2).  But in some less common cases, you 
might care more about saving ram than speed, or you might be trying to 
facet on a huge index with fields containing lots of terms (ie: full text) 
so that method=fc just won't work with any conceivable amount of ram, so it 
could make sense to use method=enum with the filterCache disabled.
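
For example, the per-field syntax lets you pick the method for just the troublesome field (parameter values here are illustrative):

    facet=true&facet.field=url&f.url.facet.method=enum&facet.enum.cache.minDf=1000000

where a very high facet.enum.cache.minDf effectively keeps the per-term DocSets out of the filterCache.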


-Hoss


Re: Update Request Processor Chains

2013-04-19 Thread Chris Hostetter

: I am trying to understand update request processor chains. Do they runs one
: by one when indexing a ducument? Can I identify multiple update request
: processor chains? Also what are that LogUpdateProcessorFactory and
: RunUpdateProcessorFactory?

http://wiki.apache.org/solr/UpdateRequestProcessor

solrconfig.xml files can contain any number of UpdateRequestProcessorChains...

Once one or more update chains are defined, you may select one on the 
update request through the parameter update.chain


https://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/update/processor/LogUpdateProcessorFactory.html

 This keeps track of all commands that have passed through the chain and 
prints them on finish(). At the Debug (FINE) level, a message will be 
logged for each command prior to the next stage in the chain. 



https://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/update/processor/RunUpdateProcessorFactory.html

Executes the update commands using the underlying UpdateHandler. Allmost 
all processor chains should end with an instance of 
RunUpdateProcessorFactory unless the user is explicitly executing the 
update commands in an alternative custom UpdateRequestProcessorFactory
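
A minimal sketch of what a chain looks like in solrconfig.xml (the chain name and processor list here are just an example):

    <updateRequestProcessorChain name="mychain">
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

A specific chain is then selected per update request, e.g. with update.chain=mychain.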


-Hoss


Re: Searching

2013-04-19 Thread hassancrowdc
Thanks. I was expecting an answer that could help me choose analyzers or
tokenizers. Any help for any of the scenarios?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-tp4057328p4057465.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Update Request Processor Chains

2013-04-19 Thread Furkan KAMACI
Thanks for detailed answers.

2013/4/19 Chris Hostetter hossman_luc...@fucit.org


 : I am trying to understand update request processor chains. Do they runs
 one
 : by one when indexing a ducument? Can I identify multiple update request
 : processor chains? Also what are that LogUpdateProcessorFactory and
 : RunUpdateProcessorFactory?

 http://wiki.apache.org/solr/UpdateRequestProcessor

 solrconfig.xml files can contain any number of
 UpdateRequestProcessorChains...

 Once one or more update chains are defined, you may select one on the
 update request through the parameter update.chain



 https://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/update/processor/LogUpdateProcessorFactory.html

  This keeps track of all commands that have passed through the chain and
 prints them on finish(). At the Debug (FINE) level, a message will be
 logged for each command prior to the next stage in the chain. 




 https://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/update/processor/RunUpdateProcessorFactory.html

 Executes the update commands using the underlying UpdateHandler. Allmost
 all processor chains should end with an instance of
 RunUpdateProcessorFactory unless the user is explicitly executing the
 update commands in an alternative custom UpdateRequestProcessorFactory


 -Hoss



Re: Weird query issues

2013-04-19 Thread Shawn Heisey

On 4/19/2013 12:55 PM, Ravi Solr wrote:

We are using Solr 3.6.2 single core ( both index and query on same machine)
and randomly the server fails to query correctly.  If we query from the
admin console the query is not even applied and it returns numFound count
equal to total docs in the index as if no query is made, and if use SOLRJ
to query it throws javabin error

Invalid version (expected 2, but 60) or the data in not in 'javabin' format


The UI problem is likely a browser issue, but I could be wrong.  Some 
browsers, IE in particular, but not limited to that one, have problems 
with the admin UI.  Using a different browser or clearing the browser 
cache can sometimes fix those problems.


As for SolrJ, are you using a really old (1.x) SolrJ with Solr 3.6.2? 
Have you ever had Solr 1.x running on the same machine that's now 
running 3.6.2?


Because the javabin version changed between 1.4.1 and 3.1.0, SolrJ 1.x 
is not compatible with Solr 3.1 and later unless you set the response 
parser on the server object to XML before you try to use it.  If you 
have upgraded Solr from an old version, your servlet container 
(sun-appserver) may have some of the old jars remaining from the 1.x 
install.  They must be removed.


To change your SolrJ to use the XML response parser, use code like the 
following:


server.setParser(new XMLResponseParser());

When SolrJ and Solr are both version 3.x or 4.x, you can remove this line.

Another way that you can get the javabin error is when Solr is returning 
an error response, or returning a response that is not an error but is 
an HTML response reporting an unusual circumstance rather than the usual 
javabin.  These HTML responses should no longer exist in the newest 
versions of Solr.  Do you see any errors or warnings in your server log? 
 The server log line you included in your email is not an error.


Thanks,
Shawn



external values source

2013-04-19 Thread Maciej Liżewski
I need some explanation of how ValueSource and related classes work.

There is already an implemented ExternalFileField, and an example of how to load
data from a database
(http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html).

But they all fetch ALL data into memory which may consume large amounts of
this resource. Also documents are referenced by 'doc' integer value.

 

My questions:

1)  Does the 'doc' value point to a document in the whole index? If so, how
do I get the value of such a document's field (for example, a field named 'id')?

2)  Is there a possibility to create a ValueSource, FieldType (or a similar
interface which will provide external data for sorting and in query results)
which will work only on some subset of documents and use an external source's
capabilities to fetch document-related data?

3)  How does it all work (memory consumption, hashtable access speed,
etc), when there is a lot of documents in index (tens of millions for
example)?

4)  Are there any other examples on loading external data from database
(I want to have numerical 'rate' from simple table having two columns:
'document unique key' string, 'rate' integer/float) which are not just proof
of concept but real-life examples?

 

Any help and hints appreciated

TIA

 

--

Maciek



Rogue query killed several replicas with OOM, after recovering - match all docs query problem

2013-04-19 Thread Timothy Potter
We had a rogue query take out several replicas in a large 4.2.0 cluster
today, due to OOM's (we use the JVM args to kill the process on OOM).

After recovering, when I execute the match all docs query (*:*), I get a
different count each time.

In other words, if I execute q=*:* several times in a row, then I get a
different count back for numDocs.

This was not the case prior to the failure as that is one thing we monitor
for.

I think I should be worried ... any ideas on how to troubleshoot this? One
thing to mention is that several of my replicas had to do full recoveries
from the leader when they came back online. Indexing was happening when the
replicas failed.

Thanks.
Tim


Re: external values source

2013-04-19 Thread Timothy Potter
Hi Maciek,

I think a custom ValueSource is definitely what you want because you
need to compute some derived value based on an indexed field and some
external value.

The trick is figuring how to make the lookup to the external data
very, very fast. Here's a rough sketch of what we do:

We have a table in a database that contains a numeric value for a user
and an organization, such as query:

select num from table where userId='bob' and orgId=123 (similar to
what you stated in question #4)

On the Solr side, documents are indexed with user_id_s field, which is
half of what I need to do my lookup. The orgId is determined by the
Solr client at query construction time, so is passed to my custom
ValueSource (aka function) in the query. In our app, users can be
associated with many different orgIds and changes frequently so we
can't index the association.

To do the lookup to the database, we have a custom ValueSource,
something like: dbLookup(user_id_s, 123)

(note: user_id_s is the name of the field holding my userID values in
the index and 123 is the orgId)

Behind the scenes, the ValueSource will have access to the user_id_s
field values using FieldCache, something like:

final BinaryDocValues dv =
FieldCache.DEFAULT.getTerms(reader.reader(), "user_id_s");

This gives us fast access to the user_id_s value for any given doc
(question #1 above) So now we can return an IntDocValues instance by
doing:

@Override
public FunctionValues getValues(Map context, AtomicReaderContext
reader) throws IOException {
final BytesRef br = new BytesRef();
final BinaryDocValues dv =
FieldCache.DEFAULT.getTerms(reader.reader(), fieldName);
return new IntDocValues(this) {
@Override
public int intVal(int doc) {
dv.get(doc,br);
if (br.length == 0)
return 0;

final String user_id_s = br.utf8ToString(); // the
indexed userID for doc
int val = 0;
// todo: do custom lookup with orgID and user_id_s to
compute int value for doc
return val;
}
}
...
}

In this code, fieldName is set in the constructor (not shown) by
parsing it out of the parameters, something like:

this.fieldName =
((org.apache.solr.schema.StrFieldSource)source).getField();

The user_id_s field comes into your ValueSource as a StrFieldSource
(or whatever type you use) ... here is how the ValueSource gets
constructed at query time:

public class MyValueSourceParser extends ValueSourceParser {
public void init(NamedList namedList) {}

public ValueSource parse(FunctionQParser fqp) throws SyntaxError {
return new MyValueSource(fqp.parseValueSource(), fqp.parseArg());
}
}

There is one instance of your ValueSourceParser created per core. The
parse method gets called for every query that uses the ValueSource.
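
For completeness, the parser is what gets registered in solrconfig.xml so the function name is usable in queries (the class name below is hypothetical; the name must match the function used in the examples above):

    <valueSourceParser name="dbLookup" class="com.example.MyValueSourceParser"/>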

At query time, I might use the ValueSource to return this computed
value in my fl list, such as:

fl=id,looked_up:dbLookup(user_id_s,123),...

Or to sort by:

sort=dbLookup(user_id_s,123) desc

The data in our table doesn't change that frequently, so we export it
to a flat file in S3 and our custom ValueSource downloads from S3,
transforms it into an in-memory HashMap for fast lookups. We thought
about just issuing a query to load the data from the db directly but
we have many nodes and the query is expensive and result set is large
so we didn't want to hammer our database with N Solr nodes querying
for the same data at roughly the same time. So we do it once and post
the compressed results to a shared location. The data in the table is
sparse as compared to the number of documents and userIds we have.

We simply poll S3 for changes every few minutes, which is good enough
for us. This happens from many nodes in a large Solr Cloud cluster
running in EC2 so S3 works well for us as a distribution mechanism.
Admittedly polling kind of sucks so we tried using Zookeeper to notify
our custom watchers when a znode changes but a ValueSource doesn't get
notified when a core is reloaded so we ended up having many weird
issues with Zookeeper watchers in our custom ValueSource. For example,
new ValueSourceParsers get created when a core is reloaded but the
previous instance doesn't get notified that it's going out of service.
So this gives you an idea of how we load external data into a fast
lookup data structure in Solr (~question #2)

When filtering, we use PostFilter to tell Solr that our filter is
expensive so should be applied last (after all other criteria have
run), something like:

fq={!frange l=2 u=8 cost=200 cache=false}dbLookup(user_id_s,123)

This computes a function range query using our custom ValueSource but
tells Solr that it is expensive (cost of 100 or more; 200 here) so apply it after all
other filters have been applied.
http://yonik.wordpress.com/tag/post-filter/

Lastly, as for speed, the user_id_s field gets loaded into FieldCache
and the lookup 

RE: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread David Parks
Again, thank you for this incredible information, I feel on much firmer
footing now. I'm going to test distributing this across 10 servers,
borrowing a Hadoop cluster temporarily, and see how it does with enough
memory to have the whole index cached. But I'm thinking that we'll try the
SSD route as our index will probably rest in the 1/2 terabyte range
eventually, there's still a lot of active development.

I guess the RAM disk would work in our case also, as we only index in
batches, and eventually I'd like to do that off of Solr and just update the
index (I'm presuming this is doable in solr cloud, but I haven't put it to
task yet). If I could purpose Hadoop to index the shards, that would be
ideal, though I haven't quite figured out how to go about it yet.

David


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Friday, April 19, 2013 9:42 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On 4/19/2013 3:48 AM, David Parks wrote:
 The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has 
 dark grey allocation of 602MB, and light grey of an additional 108MB, 
 for a JVM total of 710MB allocated. If I understand correctly, Solr 
 memory utilization is
 *not* for caching (unless I configured document caches or some of the 
 other cache options in Solr, which don't seem to apply in this case, 
 and I haven't altered from their defaults).

Right.  Solr does have caches, but they serve specific purposes.  The OS is
much better at general large-scale caching than Solr is.  Solr caches get
cleared (and possibly re-warmed) whenever you issue a commit on your index
that makes new documents visible.

 So assuming this box was dedicated to 1 solr instance/shard. What JVM 
 heap should I set? Does that matter? 24GB JVM heap? Or keep it lower 
 and ensure the OS cache has plenty of room to operate? (this is an 
 Ubuntu 12.10 server instance).

The JVM heap to use is highly dependent on the nature of your queries, the
number of documents, the number of unique terms, etc.  The best thing to do
is try it out with a relatively large heap, see how much memory actually
gets used inside the JVM.  The jvisualvm and jconsole tools will give you
nice graphs of JVM memory usage.  The jstat program will give you raw
numbers on the commandline that you'll need to add to get the full picture.
Due to the garbage collection model that Java uses, what you'll see is a
sawtooth pattern - memory usage goes up to max heap, then garbage collection
reduces it to the actual memory used.
 Generally speaking, you want to have more heap available than the low
point of that sawtooth pattern.  If that low point is around 3GB when you
are hitting your index hard with queries and updates, then you would want to
give Solr a heap of 4 to 6 GB.

 Would I be wise to just put the index on a RAM disk and guarantee 
 performance?  Assuming I installed sufficient RAM?

A RAM disk is a very good way to guarantee performance - but RAM disks are
ephemeral.  Reboot or have an OS crash and it's gone, you'll have to
reindex.  Also remember that you actually need at *least* twice the size of
your index so that Solr (Lucene) has enough room to do merges, and the
worst-case scenario is *three* times the index size.  Merging happens during
normal indexing, not just when you optimize.  If you have enough RAM for
three times your index size and it takes less than an hour or two to rebuild
the index, then a RAM disk might be a viable way to go.  I suspect that this
won't work for you.

Thanks,
Shawn



Re: Pros and cons of using RAID or different RAIDS?

2013-04-19 Thread Otis Gospodnetic
Yeah, but as far as I know, there is nothing Solr-specific about that.

See http://www.acnc.com/raid

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Fri, Apr 19, 2013 at 11:19 AM, Furkan KAMACI furkankam...@gmail.com wrote:
 Is there any documentation that explains pros and cons of using RAID or
 different RAIDS?


Re: Import in Solr

2013-04-19 Thread Gora Mohanty
On 19 April 2013 19:50, hassancrowdc hassancrowdc...@gmail.com wrote:
 I want to update(delta-import) one specific item. Is there any query to do
 that?

No.

Regards,
Gora