Re: Solr cloud performance degradation with billions of documents

2014-08-16 Thread shushuai zhu
Erik,

---
I fear the problem will be this: you won't even be able to do basic searches as 
the number of shards on a particular machine increases. To test, fire off a 
simple search for each of your 60 days. I expect it'll blow you out of the 
water. This assumes that all your shards are hosted in the same JVM on each of 
your 32 machines. But that's totally a guess.
---

In this case, assuming there are 60 collections and only one collection is 
queried at a time, shouldn't the memory requirements be those for that collection 
only? My understanding is that when a new collection is queried, the indexes 
(cores) of the old collection are swapped out of the OS cache and the indexes 
of the new collection are brought in, but the memory requirements should be roughly 
the same as long as the two collections have similar sizes. 

I am interested in knowing: when you have multiple collections as in this case 
(60) and you query just one collection, should the other collections matter from 
a performance perspective? Since different collections contain different cores, 
if querying one collection involves cores in other collections, is that a bug?
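A quick bit of arithmetic (hypothetical sizes, in the spirit of the ~30GB-per-shard figure mentioned later in the thread) shows why querying one collection at a time keeps the working set bounded, while touching many at once does not:

```python
# Back-of-the-envelope working-set math for 60 daily collections on
# one machine. shard_gb=30.0 echoes the ~30GB shard size mentioned
# in the thread; the figure itself is an assumption.
def working_set_gb(collections_touched: int, shard_gb: float = 30.0) -> float:
    """OS page cache needed to keep the touched shards hot."""
    return collections_touched * shard_gb

print(working_set_gb(1))   # one collection queried at a time: 30.0
print(working_set_gb(60))  # all 60 touched at once: 1800.0 -- far past RAM
```

If queries really only touch one collection's cores, the page-cache working set stays near one shard's size; the open question in the thread is whether having the other 59 collections loaded in the same JVM degrades that.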

Thanks.

Shushuai

 
 From: Erick Erickson 
To: solr-user@lucene.apache.org 
Sent: Friday, August 15, 2014 7:30 PM
Subject: Re: Solr cloud performance degradation with billions of documents
  

Toke:

bq: I would have agreed with you fully an hour ago.

Well, I now disagree with myself too :) I don't mind
talking to myself. I don't even mind arguing with myself. I
really _do_ mind losing the arguments I have with
myself though.

Scott:

OK, that has a much better chance of working, I obviously
misunderstood. So you'll have 60 different collections and each
collection will have one shard on each machine.

When the time comes to roll some of the collections off the
end due to age, "collection aliasing" may be helpful. I still think
you're significantly undersized, but you know your problem
space better than I do.
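For the roll-off Erick mentions, the Collections API's CREATEALIAS command lets clients keep using one stable name while old collections are dropped. A minimal sketch (hypothetical host and collection names) of building that request:

```python
from urllib.parse import urlencode

# Build a Collections API CREATEALIAS request: the alias "logs" keeps
# pointing at whichever daily collections are currently live.
def create_alias_url(host: str, alias: str, collections: list) -> str:
    params = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    })
    return f"http://{host}/solr/admin/collections?{params}"

url = create_alias_url("localhost:8983", "logs",
                       ["logs_2014_08_15", "logs_2014_08_16"])
print(url)
```

Re-issuing CREATEALIAS with an updated collection list atomically repoints the alias, so dropping the oldest day never requires a client-side change.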

I fear the problem will be this: you won't even be able to do
basic searches as the number of shards on a particular
machine increases. To test, fire off a simple search for each of
your 60 days. I expect it'll blow you out of the water. This
assumes that all your shards are hosted in the same JVM
on each of your 32 machines. But that's totally a guess.
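That smoke test could be scripted along these lines (hypothetical daily collection names; assumes the standard /select endpoint):

```python
from datetime import date, timedelta

# One cheap query per daily collection; a sharp slowdown as more
# collections' indexes compete for OS cache would confirm undersizing.
def smoke_test_urls(host: str, start: date, days: int = 60) -> list:
    return [
        f"http://{host}/solr/logs_{start - timedelta(days=i):%Y_%m_%d}"
        f"/select?q=*:*&rows=0"
        for i in range(days)
    ]

urls = smoke_test_urls("localhost:8983", date(2014, 8, 16))
print(len(urls))  # 60 -- one request per collection
```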

Keep us posted!


On Fri, Aug 15, 2014 at 2:40 PM, Toke Eskildsen  
wrote:
> Erick Erickson [erickerick...@gmail.com] wrote:
>> I guess that my main issue is that from everything I've seen so far,
>> this project is doomed. You simply cannot put 7B documents in a single
>> shard, period. Lucene has a 2B hard limit.
>
> I would have agreed with you fully an hour ago and actually planned to ask 
> Wilburn to check if he had corrupted his indexes. However, his latest post 
> suggests that the scenario is more about having a larger number of more 
> reasonably sized shards in play than building gigantic shards.
>
>> For instance, Wilburn is talking about only using 6G of memory. Even
>> at 2B docs/shard, I'd be surprised to see it function at all. Don't
>> try sorting on a timestamp for instance.
>
> I haven't understood Wilburn's setup completely, as it seems to me that he 
> will quickly run out of memory when starting new shards. But if we are looking 
> at shards of 30GB and 160M documents, 6GB sounds a lot better.
>
> Regards,
> Toke Eskildsen

Re: Replication Issue with Repeater Please help

2014-08-16 Thread Shawn Heisey
On 8/16/2014 8:11 AM, waqas sarwar wrote:
>> Thank you so much. You helped a lot. One more question: can I use only
>> one zookeeper server to manage 3 solr servers, or do I have to configure 3
>> zookeeper servers? And should the zookeeper servers be standalone, or is it
>> better to use the same solr server machine?
>>
>> Best Regards,
>> Waqas

I think Erick basically said the same thing as this, in a slightly
different way:

If you want zookeeper to be fault tolerant, you must have at least three
servers running it.  One zookeeper will work, but if it goes down,
SolrCloud doesn't function properly.  Three are needed for full
redundancy.  If one of the three goes down, the other two will still
function as a quorum.
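The underlying arithmetic: an ensemble of n ZooKeeper servers needs a strict majority alive to hold a quorum, so the number of tolerated failures works out like this:

```python
# An ensemble of n ZooKeeper servers needs a strict majority
# (n // 2 + 1) running to hold a quorum.
def quorum_size(n: int) -> int:
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    return n - quorum_size(n)

for n in (1, 2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 1 and 2 servers tolerate no failures at all; 3 tolerate 1; 5 tolerate 2.
```

Note that 2 servers are no more fault tolerant than 1, which is why 3 is the practical minimum for redundancy.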

You can use the same servers for Zookeeper and Solr.  This *can* be a
source of performance problems, but that will usually only be a problem
if you put a major load on your SolrCloud.  If you do put them on the
same server, I would recommend putting the zk database on a separate
disk or disks -- the CPU requirements for Zookeeper are very small, but
it relies on extremely responsive I/O to/from its database.

As Erick said, we strongly recommend that you don't use the embedded ZK
-- this starts up a zookeeper server in the same Java process as Solr.
If Solr is stopped or goes down, you also lose zookeeper.

Thanks,
Shawn



Re: How to restore an index from a backup over HTTP

2014-08-16 Thread Shawn Heisey
On 8/16/2014 4:03 AM, Greg Solovyev wrote:
> Thanks Shawn, this is a pretty cool idea. Adding the handler seems pretty 
> straightforward, but the main concern I have is the internal data format 
> that ReplicationHandler and SnapPuller use. This new handler, as well as the 
> code that I've already written to download the index files from Solr, will 
> depend on that format. Unfortunately, this format is not documented and is 
> not abstracted by SolrJ, so I wonder what I can do to make sure it does not 
> change on us without notice.

I am not really sure what format you're referencing here, but I'm about
99% sure the format *over the wire* is javabin.  When the javabin format
changed between 1.4.1 and 3.1.0, replication between those versions
became impossible.

Historical: The Solr version made a huge leap after the Solr and Lucene
development was merged -- it was synchronized with the Lucene version.
There are no 1.5, 2.x, or 3.0 versions of Solr.

https://issues.apache.org/jira/browse/SOLR-2204

Thanks,
Shawn



Re: Replication Issue with Repeater Please help

2014-08-16 Thread Erick Erickson
It Depends (tm).

One ZooKeeper is a single point of failure. It goes away and your SolrCloud 
cluster is kinda hosed. OTOH, with only 3 servers, the chance that one of 
them is going down is low anyway. How lucky do you feel?

I would be cautious about running your ZK instances embedded, 
super-especially if there's only one ZK instance. That couples your ZK 
instances with your Solr instances. So if for any reason you want to 
stop/start Solr, you will stop/start ZK as well and it's easy to fall below a 
quorum. It's perfectly viable to run them embedded, especially on a very 
small cluster. You do have to think a bit more about sequencing Solr nodes 
going up/down is all.

Best,
Erick

On Sat, Aug 16, 2014 at 7:11 AM, waqas sarwar  wrote:
>
>
>> Date: Thu, 14 Aug 2014 06:51:02 -0600
>> From: s...@elyograg.org
>> To: solr-user@lucene.apache.org
>> Subject: Re: Replication Issue with Repeater Please help
>>
>> On 8/14/2014 2:09 AM, waqas sarwar wrote:
>> > Thanks Shawn. What I got is that circular replication is totally impossible &
>> > Solr fails in a distributed environment. Then why does the solr documentation
>> > say to configure a "REPEATER" for a distributed architecture, given that a
>> > "REPEATER" behaves like master and slave at the same time?
>> > Can I configure SolrCloud on a LAN, or do I have to configure zookeeper
>> > myself? Please provide me any solution for LAN distributed servers. If
>> > zookeeper is the only solution then please provide me a link to configuring
>> > it, to help me avoid going in the wrong direction.
>>
>> The repeater config is designed to avoid master overload from many
>> slaves.  So instead of configuring ten slaves to replicate from one
>> master, you configure two slaves to replicate directly from your master,
>> and then you configure those as repeaters.  The other eight slaves are
>> configured so that four of them replicate from each of the repeaters
>> instead of the true master, reducing the load.
>>
>> SolrCloud is the easiest way to build a fully distributed and redundant
>> solution.  It is designed for a LAN.  You configure three machines as
>> your zookeeper ensemble, using the zookeeper download and instructions
>> for a clustered setup:
>>
>> http://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html#sc_zkMulitServerSetup
>>
>> The way to start Solr in cloud mode is to give it a zkHost system
>> property.  That informs Solr about all of your ZK servers.  If you have
>> another way of setting that property, you can use that instead.  I
>> strongly recommend using a chroot with the zkHost parameter, but that is
>> not required.  Search the zookeeper page linked above for "chroot" to
>> find a link to additional documentation about chroot.
>>
>> You can use the same servers for ZK as you do for Solr, but be aware
>> that if Solr puts a large I/O load on the disks, you may want the ZK
>> database to be on its own disk(s) so that it responds quickly.
>> Separate servers is even better, but not strictly required unless the
>> servers are under extreme load.
>>
>> https://cwiki.apache.org/confluence/display/solr/SolrCloud
>>
>> You will find a "Getting Started" link on the page above.  Note that the
>> "Getting Started" page talks about a zkRun option, which starts an
>> embedded zookeeper as part of Solr.  I strongly recommend that you do
>> NOT take this route, except for *initial* testing.  SolrCloud works much
>> better if the Zookeeper ensemble is in its own process, separate from Solr.
>>
>> Thanks,
>> Shawn
>>
>> Thank you so much. You helped a lot. One more question: can I use only
>> one zookeeper server to manage 3 solr servers, or do I have to configure 3
>> zookeeper servers? And should the zookeeper servers be standalone, or is it
>> better to use the same solr server machine?
>>
>> Best Regards,
>> Waqas
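The zkHost string Shawn describes above can be sketched as follows (hypothetical host names; note the chroot path is appended once, after the last host:port pair, not per host):

```python
# zkHost is a comma-separated list of host:port pairs; the optional
# chroot path goes once at the very end, not on each host.
def zk_host_string(hosts, chroot="/solr"):
    return ",".join(hosts) + (chroot or "")

zk = zk_host_string(["zk1:2181", "zk2:2181", "zk3:2181"])
print(zk)  # zk1:2181,zk2:2181,zk3:2181/solr
# Then start Solr in cloud mode with e.g. the system property:
#   -DzkHost=zk1:2181,zk2:2181,zk3:2181/solr
```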


Re: Syntax unavailable for parameter substitution Solr 3.5

2014-08-16 Thread Yonik Seeley
You can't do this with stock solr, but a generic templating ability is
now in heliosearch (a fork of solr):
http://heliosearch.org/solr-query-parameter-substitution/
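As a rough illustration of what such templating does (a hypothetical sketch, not heliosearch's actual implementation): occurrences of ${name} in a stored template are expanded from the request parameters:

```python
import re

# Hypothetical templating sketch: expand ${name} placeholders in a
# stored template from a dict of request parameters.
def substitute(template: str, params: dict) -> str:
    return re.sub(r"\$\{(\w+)\}", lambda m: params[m.group(1)], template)

print(substitute("platform_${role}", {"role": "emp"}))       # platform_emp
print(substitute("share_class_${role}", {"role": "admin"}))  # share_class_admin
```

This is exactly the shape of Deepak's use case below: the prefix lives server-side, and only the short role suffix travels on the URL.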

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Fri, Aug 15, 2014 at 5:46 AM, deepaksshettigar
 wrote:
>
> Environment :-
> --
> Solr version 3.5
> Apache Web Server on Jboss AS 5.1.x
>
> ===
> Problem statement :-
> --
>
> I am using a single request handler to handle dynamic scenarios.
> So my UI decides at runtime which facet field (using a dynamic field of type
> String) to apply.
> E.g. depending on the currently logged-in user's usergroup (employee, admin, etc.)
> I apply the facet field as
> &facet.field=platform_emp OR &facet.field=platform_admin (it needed to be
> designed this way due to functionality).
>
> However, using this technique, I have many such dynamic facet fields & the Solr
> query string became too long, resulting in HTTP 413 (request entity too
> large).
>
> Now, I am looking to move these facet field declarations from the URL to the
> search request handler.
>
> Is there a way to have local params do this for me?
>
> =
> Workable Solution:-
> -
> I have tried local params, which work if the whole term is passed through a
> query string,
> but am stuck with a syntax which does not allow any concatenation of params to
> a prefix.
>
> My Request handler looks like this -
>
> <requestHandler name="..." class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="echoParams">explicit</str>
>     <str name="defType">edismax</str>
>     .
>     .
>     <str name="facet.field">{!v=$role}</str>
>   </lst>
> </requestHandler>
>
>
> If I pass &role=platform_emp or &role=platform_admin, it works for me, but I
> would like to move the prefix inside the handler, as I have more such facet
> fields to be declared dynamically, e.g.
> facet.field=share_class_emp, facet.field=share_class_admin, etc.
> However, I would like to avoid these multiple facet.field declarations
> through the URL to avoid running into HTTP 413 at runtime.
>
> ===
>
> Required Possible Solution:-
> ---
> Is there a way to have a configuration which might look like this -
>
> <requestHandler name="..." class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="echoParams">explicit</str>
>     <str name="defType">edismax</str>
>     .
>     .
>     <str name="prefixPlatform">platform_</str>
>     <str name="prefixShareClass">share_class_</str>
>     <str name="facet.field">{!v=$prefixPlatform$role}</str>
>     <str name="facet.field">{!v=$prefixShareClass$role}</str>
>   </lst>
> </requestHandler>
>
>
> & pass &role=emp from the URL at runtime.
>
> 
>
> Another query: is it possible to handle HTTP 413 by increasing the allowed HTTP
> request size on Apache/Jboss?
>
> -
>
> Any help will be highly appreciated.
>
> Regards
> Deepak
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Syntax-unavailable-for-parameter-substitution-Solr-3-5-tp4153197.html
> Sent from the Solr - User mailing list archive at Nabble.com.
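A quick sketch (hypothetical field names) of why the query string outgrows the server's request-size limits as dynamic facet fields accumulate:

```python
from urllib.parse import urlencode

# Each dynamically chosen facet field adds another facet.field pair.
# 400 hypothetical field names are enough to cross the 8190-byte
# request-line limit that Apache httpd ships with by default.
fields = [f"field_{i}_emp" for i in range(400)]
qs = urlencode([("q", "*:*")] + [("facet.field", f) for f in fields])
print(len(qs) > 8190)  # True -- long enough to trip the server limit
```

Raising the front-end limit only postpones the problem; moving the declarations into the request handler, as discussed above, removes the linear growth entirely.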


RE: Replication Issue with Repeater Please help

2014-08-16 Thread waqas sarwar


> Date: Thu, 14 Aug 2014 06:51:02 -0600
> From: s...@elyograg.org
> To: solr-user@lucene.apache.org
> Subject: Re: Replication Issue with Repeater Please help
> Thank you so much. You helped a lot. One more question: can I use only
> one zookeeper server to manage 3 solr servers, or do I have to configure 3
> zookeeper servers? And should the zookeeper servers be standalone, or is it
> better to use the same solr server machine?
>
> Best Regards,
> Waqas

Re: How to restore an index from a backup over HTTP

2014-08-16 Thread Greg Solovyev
Thanks Shawn, this is a pretty cool idea. Adding the handler seems pretty 
straightforward, but the main concern I have is the internal data format that 
ReplicationHandler and SnapPuller use. This new handler, as well as the code 
that I've already written to download the index files from Solr, will depend on 
that format. Unfortunately, this format is not documented and is not abstracted 
by SolrJ, so I wonder what I can do to make sure it does not change on us 
without notice.

Thanks,
Greg

- Original Message -
From: "Shawn Heisey" 
To: solr-user@lucene.apache.org
Sent: Friday, August 15, 2014 7:31:19 PM
Subject: Re: How to restore an index from a backup over HTTP

On 8/15/2014 5:51 AM, Greg Solovyev wrote:
> What I want to achieve is being able to send the backed up index to Solr 
> (either standalone or with ZooKeeper) in a way similar to creating a new 
> Collection. I.e. create a new collection and upload an existing index directly 
> into that Collection. I've looked through Solr code and so far I have not 
> found a handler that would allow this scenario. So, the last idea is to 
> implement a special handler for this case, perhaps extending 
> CoreAdminHandler. ReplicationHandler together with SnapPuller do pretty much 
> what I need to do, except that the action has to be initiated by the 
> receiving Solr server and I need to initiate the action externally. I.e., 
> instead of having Solr slave download an index from Solr master, I need to 
> feed the index to Solr master and ideally this would work the same way in 
> standalone and SolrCloud modes. 

I have not made any attempt to verify what I'm stating below.  It may
not work.

What I think I would *try* is setting up a standalone Solr (no cloud) on
the backup server.  Use scripted index/config copies and Solr start/stop
actions to get the index up and running on a known core in the
standalone Solr.  Then use the replication handler's HTTP API to
replicate the index from that standalone server to each of the replicas
in your cluster.

https://wiki.apache.org/solr/SolrReplication#HTTP_API
https://cwiki.apache.org/confluence/display/solr/Index+Replication#IndexReplication-HTTPAPICommandsfortheReplicationHandler
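A sketch of driving that replication pull (hypothetical host and core names; the fetchindex command and masterUrl parameter are documented at the links above):

```python
from urllib.parse import urlencode

# Tell the replica at core_url to pull its index from the standalone
# "restore" Solr core at source_url via the replication handler.
def fetchindex_url(core_url: str, source_url: str) -> str:
    params = urlencode({
        "command": "fetchindex",
        "masterUrl": source_url + "/replication",
    })
    return f"{core_url}/replication?{params}"

url = fetchindex_url("http://node1:8983/solr/mycore",
                     "http://backup-host:8983/solr/restored_core")
print(url)
```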

One thing that I do not know is whether SolrCloud itself might interfere
with these actions, or whether it might automatically take care of
additional replicas if you replicate to the shard leader.  If SolrCloud
*would* interfere, then this idea might need special support in
SolrCloud, perhaps as an extension to the Collections API.  If it won't
interfere, then the use-case would need to be documented (on the user
wiki at a minimum) so that committers will be aware of it and preserve
the capability in future versions.  An extension to the Collections API
might be a good idea either way -- I've seen a number of questions about
capability that falls under this basic heading.

Thanks,
Shawn