Re: Team please help

2018-04-30 Thread Greg Solovyev
Sujeet, what do you mean by migrating? E.g., are you moving your data from
Cloudera CDH to Azure HDI? Are you migrating your application code written on
top of Cloudera CDH to run on top of Azure HDI? As far as I know, Azure HDI
does not include Solr, so if your application on top of Cloudera CDH is
using Solr, it won't run on HDI.
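
As for the ClassNotFoundException: org.apache.solr.morphlines.solr.DocumentLoader
lives in the Solr morphlines contrib jars, which the CDH search-mr -job.jar
bundles but solr-map-reduce-4.9.0.jar alone does not, so those jars have to be
on the MapReduce job's classpath; lib entries in solrconfig.xml only affect Solr
itself, not a MapReduce job. The tool is a standard Hadoop Tool, so a driver
looks roughly like this (a sketch only; paths, ZK host and collection are
placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.solr.hadoop.MapReduceIndexerTool;

    public class RunIndexer {
        public static void main(String[] args) throws Exception {
            int rc = ToolRunner.run(new Configuration(), new MapReduceIndexerTool(),
                new String[] {
                    "--morphline-file", "morphline.conf", // pipeline that drives DocumentLoader
                    "--output-dir", "hdfs://namenode/tmp/indexer-out",
                    "--zk-host", "zk1:2181/solr",
                    "--collection", "collection1",
                    "hdfs://namenode/data/input"          // input files to index
                });
            System.exit(rc);
        }
    }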
Greg

On Sat, Apr 28, 2018 at 5:45 PM Sujeet Singh 
wrote:

> Adding Dev
>
>
>
> *From:* Sujeet Singh
> *Sent:* Sunday, April 29, 2018 12:14 AM
> *To:* 'solr-user@lucene.apache.org'
> *Subject:* Team please help
>
>
>
> Team, I am facing an issue right now. I am working to migrate
> Cloudera CDH to Azure HDI. Cloudera has a Solr implementation that uses the
> jar below:
>
> search-mr-1.0.0-cdh5.7.0-job.jar
> org.apache.solr.hadoop.MapReduceIndexerTool
>
>
>
> While looking into all options I found “solr-map-reduce-4.9.0.jar” and
> tried using it with the class “org.apache.solr.hadoop.MapReduceIndexerTool”. I
> tried adding the lib details in solrconfig.xml but it did not work. I am
> getting this error:
>
> “Caused by: java.lang.ClassNotFoundException:
> org.apache.solr.morphlines.solr.DocumentLoader”
>
>
>
> Please let me know the right way to use the MapReduceIndexerTool class.
>
>
>
> Regards,
> --
>
> *Sujeet Singh* | Sr. Software Analyst | cloudmoyo | *E.*
> sujeet.si...@cloudmoyo.com | *M.* +91 9860586055
>
> www.cloudmoyo.com
>
>
>


Re: CloudSolrServer, concurrency and too many connections

2014-12-10 Thread Greg Solovyev
I am seeing the same problem with 4.10.2 and 4.9.0. CloudSolrServer keeps 
opening connections to ZK and never closes them. Eventually (very soon) ZK runs 
out of connections and stops accepting new ones. 

Thanks,
Greg

- Original Message -
From: JoeSmith fidw...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Sunday, December 7, 2014 8:11:50 PM
Subject: Re: CloudSolrServer, concurrency and too many connections

I've upgraded to 4.10.2 on the client-side.  Still seeing this connection
problem when connecting to the Zookeeper port.  If I connect directly to
SolrServer, the connections do not increase.  But when connecting to
Zookeeper, the connections increase up to 60 and then start to fail.  I
understand Zookeeper is configured to fail after 60 connections to prevent
a DOS attack, but I don't see why we keep adding new connections (up to
60).  Does the client-side Zookeeper code also use HttpClient
connection pooling?  Below is the exception that
shows up in the log file when this happens.  When we execute queries we are
using the _route_ parameter; could this explain anything?

o.a.zookeeper.ClientCnxn - Session 0x0 for server
aweqca3utmtc10.cloud..com/10.22.10.107:9983, unexpected error, closing
socket connection and attempting reconnect

java.io.IOException: Connection reset by peer

at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_55]

at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
~[na:1.7.0_55]

at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
~[na:1.7.0_55]

at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.7.0_55]

at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
~[na:1.7.0_55]

at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
~[zookeeper-3.4.6.jar:3.4.6-1569965]

at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
~[zookeeper-3.4.6.jar:3.4.6-1569965]

at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
~[zookeeper-3.4.6.jar:3.4.6-1569965]


Will try to get the server code upgraded to 4.10.2.



On Sat, Dec 6, 2014 at 3:52 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 12/6/2014 12:09 PM, JoeSmith wrote:
  We are currently using CloudSolrServer, but it looks like this class is
 not
  thread-safe (setDefaultCollection). Should this instance be initialized
  once (at startup) and then re-used (in all threads) until shutdown when
 the
  process terminates?  Or should it be re-instantiated for each request?
 
  Currently, we are trying to use CloudSolrServer as a singleton, but it
  looks like the connections to the host are not being closed and under
 load
  we start getting failures, and in the Zookeeper logs we see this error:
 
  WARN  - 2014-12-04 10:09:14.364;
  org.apache.zookeeper.server.NIOServerCnxnFactory; Too many connections
 from
  /11.22.33.44 - max is 60
 
  netstat (on the Zookeeper host) shows that the connections are not being
  closed. What is the 'correct' way to fix this?   Apologies if I have
 missed
  any documentation that explains, pointers would be helpful.

 All SolrServer implementations in SolrJ, including CloudSolrServer, are
 supposed to be threadsafe.  If it turns out they're not actually
 threadsafe, then we treat that as a bug.  The discussion to determine
 that it's a bug takes place on this mailing list, and once we determine
 that, the next step is to file an issue in Jira.

 The general way to use SolrJ is to initialize the server instance at the
 beginning and re-use it for all client communication to Solr.  With
 CloudSolrServer, you normally only need a single server instance to talk
 to the entire cloud, because you can set the collection parameter on
 each request to indicate which collection to work on.  If you only have
 a handful of collections, you might want to use multiple instances and
 use setDefaultCollection  to specify the collection.  With
 HttpSolrServer, an instance is required for each core, because the core
 name is in the initialization URL.
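 
 For example, a minimal sketch of that pattern with SolrJ 4.x (the ZK
 addresses and collection names here are placeholders):
 
     import org.apache.solr.client.solrj.SolrQuery;
     import org.apache.solr.client.solrj.impl.CloudSolrServer;
     import org.apache.solr.client.solrj.response.QueryResponse;
 
     public class SearchClient {
       // One shared, threadsafe instance for the whole application.
       private static final CloudSolrServer SOLR =
           new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
 
       public QueryResponse search(String collection, String q) throws Exception {
         SolrQuery query = new SolrQuery(q);
         // Route per request instead of mutating setDefaultCollection.
         query.set("collection", collection);
         return SOLR.query(query);
       }
 
       public static void close() {
         SOLR.shutdown(); // release the ZK connection at application exit
       }
     }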

 I've not looked at the code, but I can't imagine that the client ever
 needs to make more than one connection to each server in the zookeeper
 ensemble.  Here's a list of the open connections on one of my zookeeper
 servers for my SolrCloud 4.2.1 install:

 java    21800 root   21u  IPv6   2836983  0t0  TCP
 10.8.0.151:50178->10.8.0.152:2888 (ESTABLISHED)
 java    21800 root   22u  IPv6   2661097  0t0  TCP
 10.8.0.151:3888->10.8.0.152:34116 (ESTABLISHED)
 java    21800 root   26u  IPv6  28065088  0t0  TCP
 10.8.0.151:2181->10.8.0.141:52583 (ESTABLISHED)
 java    21800 root   27u  IPv6  23967470  0t0  TCP
 10.8.0.151:2181->10.8.0.152:49436 (ESTABLISHED)
 java    21800 root   28r  IPv6  23969636  0t0  TCP
 10.8.0.151:2181->10.8.0.151:57290 (ESTABLISHED)
 java    21800 root   29r

Re: CloudSolrServer, concurrency and too many connections

2014-12-10 Thread Greg Solovyev
I am seeing this problem with Java 1.8.0_25-b17 on Ubuntu 14.04.1 LTS, ZK 3.4.6, 
Solr 4.10.2

Thanks,
Greg

- Original Message -
From: JoeSmith fidw...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Monday, December 8, 2014 6:19:08 PM
Subject: Re: CloudSolrServer, concurrency and too many connections

Thanks, Shawn.  I updated to 7u72 and was not able to reproduce the
problem. That was good.  But just to be sure about this, I backed down
to 7u55 and again was not able to reproduce.  So at least for now, this has
gone away even if the reason is inconclusive.


On Mon, Dec 8, 2014 at 7:37 AM, JoeSmith fidw...@gmail.com wrote:

 We will need to update to 7u72, we are using 7u55.  On the client side,
 this happens with zookeeper 3.4.6 and 4.10.2 solrj.  And we will need to
 update both on the server side.   What kind of config/setup information
 would you need to see if we do still have an issue after these updates?

 On Mon, Dec 8, 2014 at 12:40 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 12/7/2014 9:11 PM, JoeSmith wrote:
  I've upgraded to 4.10.2 on the client-side.  Still seeing this
 connection
  problem when connecting to the Zookeeper port.  If I connect directly to
  SolrServer, the connections do not increase.  But when connecting to
  Zookeeper, the connections increase up to 60 and then start to fail.  I
  understand Zookeeper is configured to fail after 60 connections to
 prevent
  a DOS attack, but I don't see why we keep adding new connections (up to
  60).  Does the client-side Zookeeper code also use HttpClient
  ConnectionPooling for its Connection Pool?  Below is the Exception that
  shows up in the log file when this happens.  When we execute queries we
 are
  using the _route_ parameter, could this explain anything?

 The docs say that Zookeeper uses NIO communication directly by default,
 so there's no layer like HttpClient.  I don't think it uses pooling ...
 it does everything over a single TCP connection that doesn't normally
 disconnect until the program exits.

 Basically, the Zookeeper authors built their own networking layer that
 uses TCP directly.  You have the option of using Netty instead:


 http://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html#Communication+using+the+Netty+framework

 Are you running version 3.4.6 for your zookeeper servers?  That's the
 version of ZK client code you'll find in Solr 4.10.x, and the
 recommended version for both the server and your SolrJ program.

 The most likely reasons for the connection problems you are seeing are:

 1) A bug in the networking layer of your JVM.
 1a) The latest Oracle Java 7 (currently 7u72) is highly recommended.
 2) A bug or misconfig in the OS TCP stack, or possibly its firewall.
 3) A bug or misconfig in zookeeper.

 I can't rule out the fourth possibility, but so far I think it's unlikely:

 4) A bug in SolrJ that has not yet been reported or fixed.

 Thanks,
 Shawn





Re: CloudSolrServer, concurrency and too many connections

2014-12-10 Thread Greg Solovyev
This was a user error. My code was re-instantiating CloudSolrServer for each 
request and never calling CloudSolrServer::shutdown(). 
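
For anyone who hits this later, the shape of the fix in SolrJ 4.x (a sketch, 
not our actual code):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;

    // Create one CloudSolrServer at startup and reuse it for every request.
    // A new instance per request leaks one ZooKeeper connection each time
    // unless shutdown() is called on it.
    CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr");
    try {
        // ... serve all requests from this single shared instance ...
    } finally {
        solr.shutdown(); // closes the ZooKeeper connection
    }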

Thanks,
Greg


Re: Consul instead of ZooKeeper anyone?

2014-11-04 Thread Greg Solovyev
Thanks for the answers Erick. I can see that this is a significant effort and I 
am certainly not asking the community to undertake this work. I was actually 
going to take a stab at it myself. Regarding the $$ savings from not requiring ZK, 
my assumption is that ZK in production demands a dedicated host and requires 
2GB RAM/instance, while Consul runs on less than 100MB RAM/instance. So, for 
ISPs, BSPs and large enterprise deployments, the savings would come from reduced 
resource requirements. 

Thanks,
Greg

- Original Message -
From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Sent: Monday, November 3, 2014 3:25:25 PM
Subject: Re: Consul instead of ZooKeeper anyone?

bq:  Do you think it would be possible to add an abstraction layer to
the Solr source code in the near future?

I strongly doubt it. As you've already noted, this is a large amount
of work. Without some super-compelling advantage I just don't see the
interest.

bq:  to avoid deploying ZK just for SolrCloud would save a bunch of $$
for large customers

How so? It's free.

Making this change would, IMO, require a compelling story to generate
much enthusiasm. So far I haven't seen that story, and Jürgen and
Walter raise valid points that haven't been addressed. I suspect
you're significantly underestimating the effort to get this stable in
the SolrCloud world as well.

I don't really want to be such a wet blanket, but you're asking about
a very significant amount of work from a bunch of people, all of whom
have lots of things on their plate. So without a _very_ good reason, I
think it's unlikely to generate much interest.

Best,
Erick

On Mon, Nov 3, 2014 at 11:17 AM, Greg Solovyev g...@zimbra.com wrote:
 Thanks Erick,
 after looking further into Solr's source code, I see that it's married to the ZK 
 libraries and it won't be possible to extend the existing code without diverging 
 from the trunk. At the same time, I don't see any reason for the lack of 
 abstraction in the cloud-related code of Solr and SolrJ. As far as I can see, 
 Consul provides all that SolrCloud needs, and so if the cloud code used more 
 abstraction, the ZK bindings could be substituted with another library. I am 
 willing to implement this functionality and the abstraction, but at the 
 same time, I don't want to maintain my own branch of Solr because of this 
 integration. Do you think it would be possible to add an abstraction layer to 
 the Solr source code in the near future?

 I think Consul has all the features that SolrCloud needs, and what's 
 especially attractive about Consul is that its memory footprint is 100X 
 smaller than ZK's. Mainly though, we are considering Consul as the main service 
 locator for a bunch of other moving parts within Zimbra, so being able to 
 avoid deploying ZK just for SolrCloud would save a bunch of $$ for large 
 customers.

 Thanks,
 Greg

 - Original Message -
 From: Erick Erickson erickerick...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Friday, October 31, 2014 5:15:09 PM
 Subject: Re: Consul instead of ZooKeeper anyone?

 Not that I know of, but look before you leap. I took a quick look at
 Consul and it really doesn't look like any kind of drop-in replacement.
 Also, the Zookeeper usage in SolrCloud isn't really pluggable
 AFAIK, so there'll be lots of places in the Solr code that need to be
 reworked etc., especially in the realm of collections and sharding.

 The Collections API will be challenging to port over I think.

 Not to mention SolrJ and CloudSolrServer for clients who want to interact
 with SolrCloud through Java.

 Not saying it won't work, I just suspect that getting it done would be
 a big job, and thereafter keeping those changes in sync with the
 changing SolrCloud code base would chew up a lot of time. So if
 I were putting my Product Manager hat on I'd ask "is the benefit
 worth the effort?"

 All that said, go for it if you've a mind to!

 Best,
 Erick

 On Fri, Oct 31, 2014 at 4:08 PM, Greg Solovyev g...@zimbra.com wrote:
 I am investigating a project to make SolrCloud run on Consul instead of 
 ZooKeeper. So far, my research revealed no such efforts, but I wanted to 
 check with this list to make sure I am not going to be reinventing the 
  wheel. Has anyone attempted using Consul instead of ZK to coordinate 
 SolrCloud nodes?

 Thanks,
 Greg


Re: Consul instead of ZooKeeper anyone?

2014-11-03 Thread Greg Solovyev
Thanks Erick, 
after looking further into Solr's source code, I see that it's married to the ZK 
libraries and it won't be possible to extend the existing code without diverging 
from the trunk. At the same time, I don't see any reason for the lack of 
abstraction in the cloud-related code of Solr and SolrJ. As far as I can see, Consul 
provides all that SolrCloud needs, and so if the cloud code used more 
abstraction, the ZK bindings could be substituted with another library. I am 
willing to implement this functionality and the abstraction, but at the same 
time, I don't want to maintain my own branch of Solr because of this 
integration. Do you think it would be possible to add an abstraction layer to 
the Solr source code in the near future? 
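
Purely as an illustration of the seam I have in mind (all names here are 
hypothetical; nothing like this exists in Solr today):

    // Hypothetical coordination-service abstraction; the ZK bindings and a
    // Consul client would each supply an implementation of the same interface.
    public interface ClusterCoordinator {
        byte[] getData(String path);                      // read cluster state
        void setData(String path, byte[] data);           // write cluster state
        void watch(String path, Runnable onChange);       // change notifications
        void registerEphemeral(String path, byte[] data); // live-node registration
        void close();
    }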

I think Consul has all the features that SolrCloud needs, and what's especially 
attractive about Consul is that its memory footprint is 100X smaller than ZK's. 
Mainly though, we are considering Consul as the main service locator for a bunch 
of other moving parts within Zimbra, so being able to avoid deploying ZK just 
for SolrCloud would save a bunch of $$ for large customers.

Thanks,
Greg



Consul instead of ZooKeeper anyone?

2014-10-31 Thread Greg Solovyev
I am investigating a project to make SolrCloud run on Consul instead of 
ZooKeeper. So far, my research revealed no such efforts, but I wanted to check 
with this list to make sure I am not going to be reinventing the wheel. Has 
anyone attempted using Consul instead of ZK to coordinate SolrCloud nodes? 

Thanks, 
Greg 


Re: Mongo DB Users

2014-09-15 Thread Greg Solovyev
Remove me from this thread please

Thanks,
Greg

- Original Message -
From: Jack Krupansky j...@basetechnology.com
To: solr-user@lucene.apache.org
Sent: Monday, September 15, 2014 10:44:00 AM
Subject: Re: Mongo DB Users

 Waiting for a positive response!

-1

-- Jack Krupansky

-Original Message- 
From: Rakesh Varna
Sent: Monday, September 15, 2014 10:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Mongo DB Users

Remove

Regards,
Rakesh Varna


On Mon, Sep 15, 2014 at 9:29 AM, Ed Smiley ed.smi...@proquest.com wrote:

 Remove

 On 9/15/14, 8:35 AM, Aaron Susan aaronsus...@gmail.com wrote:

 Hi,
 
 I am here to inform you that we are having a contact list of *Mongo DB
 Users *would you be interested in it?
 
  Data Fields Consist Of: Name, Job Title, Verified Phone Number, Verified
  Email Address, Company Name & Address, Employee Size, Revenue Size, SIC
 Code, Industry Type etc.,
 
 We also provide other technology users as well depends on your
 requirement.
 
 For Example:
 
 
 *Red Hat *
 
 *Terra data *
 
 *Net-app *
 
 *NuoDB*
 
 *MongoHQ ** and many more*
 
 
 We also provide IT Decision Makers, Sales and Marketing Decision Makers,
 C-level Titles and other titles as per your requirement.
 
 Please review and let me know your interest if you are looking for above
 mentioned users list or other contacts list for your campaigns.
 
 Waiting for a positive response!
 
 Thanks
 
 *Aaron Susan*
 Data Specialist
 
 If you are not the right person, feel free to forward this email to the
 right person in your organization. To opt out response Remove




Re: How to restore an index from a backup over HTTP

2014-09-04 Thread Greg Solovyev
Thanks Jeff!

Thanks,
Greg

- Original Message -
From: Jeff Wartes jwar...@whitepages.com
To: solr-user@lucene.apache.org
Sent: Wednesday, August 20, 2014 10:36:07 AM
Subject: Re: How to restore an index from a backup over HTTP

Here’s the repo:
https://github.com/whitepages/solrcloud_manager


Comments/Issues/Patches welcome.


On 8/18/14, 11:28 AM, Greg Solovyev g...@zimbra.com wrote:

Thanks Jeff, I'd be interested in taking a look at the code for this
tool. My github ID is grishick.

Thanks,
Greg

- Original Message -
From: Jeff Wartes jwar...@whitepages.com
To: solr-user@lucene.apache.org
Sent: Monday, August 18, 2014 9:49:28 PM
Subject: Re: How to restore an index from a backup over HTTP

I'm able to do cross-solrcloud-cluster index copy using nothing more than
careful use of the "fetchindex" replication handler command.

I'm using this as a build/deployment tool, so I manually create a
collection in two clusters, index into one, test, and then ask the other
cluster to fetchindex from it on each shard/replica.

Some caveats:
  1. It seems like fetchindex may silently decline if it thinks the index
it has is newer.
  2. I'm not doing this on an index that's currently receiving updates.
  3. SolrCloud replication doesn't come into this flow, even if you
fetchindex on a leader. (although once you're done, updates should get
replicated normally)
  4. Both collections must be created with the same number of shards and
sharding mechanism. (although replication factor can vary)

I've got a tool for automating this that I'd like to push to github at
some point, let me know if you're interested.
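
For reference, each per-replica fetchindex call boils down to something like
this in SolrJ 4.x (hosts and core names are made up):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    // Ask the target replica to pull the index from the source replica.
    HttpSolrServer target =
        new HttpSolrServer("http://cluster-b:8983/solr/mycoll_shard1_replica1");
    ModifiableSolrParams p = new ModifiableSolrParams();
    p.set("command", "fetchindex");
    p.set("masterUrl",
        "http://cluster-a:8983/solr/mycoll_shard1_replica1/replication");
    QueryRequest req = new QueryRequest(p);
    req.setPath("/replication"); // hit the replication handler, not /select
    target.request(req);
    target.shutdown();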





On 8/16/14, 3:03 AM, Greg Solovyev g...@zimbra.com wrote:

Thanks Shawn, this is a pretty cool idea. Adding the handler seems pretty
straightforward, but the main concern I have is the internal data format
that ReplicationHandler and SnapPuller use. This new handler as well as
the code that I've already written to download the index files from Solr
will depend on that format. Unfortunately, this format is not documented
and is not abstracted by SolrJ, so I wonder what I can do to make sure it
does not change on us without notice.

Thanks,
Greg

- Original Message -
From: Shawn Heisey s...@elyograg.org
To: solr-user@lucene.apache.org
Sent: Friday, August 15, 2014 7:31:19 PM
Subject: Re: How to restore an index from a backup over HTTP

On 8/15/2014 5:51 AM, Greg Solovyev wrote:
 What I want to achieve is being able to send the backed up index to
Solr (either standalone or with ZooKeeper) in a way similar to creating
a new Collection. I.e. create a new collection and upload an existing
index directly into that Collection. I've looked through Solr code and
so far I have not found a handler that would allow this scenario. So,
the last idea is to implement a special handler for this case, perhaps
extending CoreAdminHandler. ReplicationHandler together with SnapPuller
do pretty much what I need to do, except that the action has to be
initiated by the receiving Solr server and I need to initiate the action
externally. I.e., instead of having Solr slave download an index from
Solr master, I need to feed the index to Solr master and ideally this
would work the same way in standalone and SolrCloud modes.

I have not made any attempt to verify what I'm stating below.  It may
not work.

What I think I would *try* is setting up a standalone Solr (no cloud) on
the backup server.  Use scripted index/config copies and Solr start/stop
actions to get the index up and running on a known core in the
standalone Solr.  Then use the replication handler's HTTP API to
replicate the index from that standalone server to each of the replicas
in your cluster.

https://wiki.apache.org/solr/SolrReplication#HTTP_API
https://cwiki.apache.org/confluence/display/solr/Index+Replication#IndexReplication-HTTPAPICommandsfortheReplicationHandler

One thing that I do not know is whether SolrCloud itself might interfere
with these actions, or whether it might automatically take care of
additional replicas if you replicate to the shard leader.  If SolrCloud
*would* interfere, then this idea might need special support in
SolrCloud, perhaps as an extension to the Collections API.  If it won't
interfere, then the use-case would need to be documented (on the user
wiki at a minimum) so that committers will be aware of it and preserve
the capability in future versions.  An extension to the Collections API
might be a good idea either way -- I've seen a number of questions about
capability that falls under this basic heading.

Thanks,
Shawn


Re: How to restore an index from a backup over HTTP

2014-08-18 Thread Greg Solovyev
Thanks Jeff, I'd be interested in taking a look at the code for this tool. My 
github ID is grishick.

Thanks,
Greg



Re: How to restore an index from a backup over HTTP

2014-08-18 Thread Greg Solovyev
Shawn, the format that I am referencing is the filestream, which starts with 2 
bytes carrying the file size, then 4 bytes carrying a checksum (optional), and 
then the actual bits of the file.
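
To make the coupling concrete, a reader for that layout, exactly as I described 
it above (the field widths come from my description, not from any official 
spec), would look roughly like:

    import java.io.DataInputStream;
    import java.io.InputStream;

    // Reads one file from the replication filestream: a size header, an
    // optional checksum, then the raw bytes of the file.
    static byte[] readFile(InputStream in, boolean hasChecksum) throws Exception {
        DataInputStream din = new DataInputStream(in);
        int size = din.readUnsignedShort();              // 2-byte file size
        if (hasChecksum) {
            long checksum = din.readInt() & 0xFFFFFFFFL; // optional 4-byte checksum
        }
        byte[] data = new byte[size];
        din.readFully(data);
        return data;
    }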

Thanks,
Greg

- Original Message -
From: Shawn Heisey s...@elyograg.org
To: solr-user@lucene.apache.org
Sent: Sunday, August 17, 2014 12:28:12 AM
Subject: Re: How to restore an index from a backup over HTTP

On 8/16/2014 4:03 AM, Greg Solovyev wrote:
 Thanks Shawn, this is a pretty cool idea. Adding the handler seems pretty 
 straight forward, but the main concern I have is the internal data format 
 that ReplicationHandler and SnapPuller use. This new handler as well as the 
 code that I've already written to download the index files from Solr will 
 depend on that format. Unfortunately, this format is not documented and is 
 not abstracted by SolrJ, so I wonder what I can do to make sure it does not 
 change on us without notice.

I am not really sure what format you're referencing here, but I'm about
99% sure the format *over the wire* is javabin.  When the javabin format
changed between 1.4.1 and 3.1.0, replication between those versions
became impossible.

Historical: The Solr version made a huge leap after the Solr and Lucene
development was merged -- it was synchronized with the Lucene version.
There are no 1.5, 2.x, or 3.0 versions of Solr.

https://issues.apache.org/jira/browse/SOLR-2204

Thanks,
Shawn


Re: How to restore an index from a backup over HTTP

2014-08-16 Thread Greg Solovyev
Thanks Shawn, this is a pretty cool idea. Adding the handler seems pretty 
straightforward, but the main concern I have is the internal data format that 
ReplicationHandler and SnapPuller use. This new handler as well as the code 
that I've already written to download the index files from Solr will depend on 
that format. Unfortunately, this format is not documented and is not abstracted 
by SolrJ, so I wonder what I can do to make sure it does not change on us 
without notice.

Thanks,
Greg



How to restore an index from a backup over HTTP

2014-08-15 Thread Greg Solovyev
Hello, I am looking for advice on implementing the following backup/restore 
scenario. 
We are using Solr to index email. Each mailbox has its own Collection. We do 
not store emails in Solr, the emails are stored on disk in a blob store, meta 
data is stored in a database and Solr is used only for full text search. The 
scenario is restoring a mailbox from a backup. The backup of a mailbox contains 
blobs and meta data in a SQL file. We can also pull Lucene index files from Solr 
using ReplicationHandler in the same way Solr's SnapPuller does it on a slave 
server. We already have a restore utility that restores blobs and meta-data, but 
are working on a mechanism to back up and restore the Solr index in a way that 
allows us to package each mailbox into a separate backup folder/archive. 

An obvious first idea for restoring is to drop the index files into a new 
folder on one of the existing Solr servers and make it pick up the new 
collection - that's simple. However, this approach has two downsides: 1) it 
requires that SSH access is set up between the machine where the backup-and-restore 
script is running and the Solr server; 2) if Solr is running in SolrCloud mode, 
this approach bypasses ZooKeeper and we would have to pick the Solr instance 
for this new Collection without ZooKeeper. 

Another idea is to not include index files in backups and re-index mail upon 
restoring it. This isn't a good idea at all when restoring large mailboxes. 

What I want to achieve is being able to send the backed up index to Solr 
(either standalone or with ZooKeeper) in a way similar to creating a new 
Collection. I.e. create a new collection and upload an existing index directly 
into that Collection. I've looked through Solr code and so far I have not found 
a handler that would allow this scenario. So, the last idea is to implement a 
special handler for this case, perhaps extending CoreAdminHandler. 
ReplicationHandler together with SnapPuller do pretty much what I need to do, 
except that the action has to be initiated by the receiving Solr server and I 
need to initiate the action externally. I.e., instead of having Solr slave 
download an index from Solr master, I need to feed the index to Solr master and 
ideally this would work the same way in standalone and SolrCloud modes. 
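
Roughly what I have in mind, as a skeleton only (the handler name and 
parameters are invented for illustration):

    import org.apache.solr.handler.RequestHandlerBase;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;

    // Hypothetical handler that accepts index files pushed by an external
    // restore utility and installs them into the target collection.
    public class RestoreIndexHandler extends RequestHandlerBase {
        @Override
        public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
                throws Exception {
            String collection = req.getParams().required().get("collection");
            // 1. receive the uploaded segment files (e.g., as content streams)
            // 2. install them into the core's data dir and reopen the searcher
            // 3. in SolrCloud mode, register the restored core with ZooKeeper
            rsp.add("status", "restored " + collection);
        }

        @Override
        public String getDescription() {
            return "Restores a backed-up index into a collection (sketch)";
        }

        @Override
        public String getSource() {
            return null; // abstract in Solr 4.x RequestHandlerBase
        }
    }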

What are your thoughts and ideas on the subject? 

Thanks, 
Greg