Re: Replication for SolrCloud

2015-04-19 Thread gengmao
Thanks for the suggestion, Erick. However, what we need here is not a patch
but a clarification from a practical perspective.

I think Solr replication is a great feature to scale reads, and it increases
reliability to a degree. However, on HDFS it is not as useful as sharding
alone. Sharding can scale both reads and writes at the same time, and it
doesn't carry the consistency concerns that come with replication. So I
doubt Solr replication on HDFS has real value.

I will try to reach out to Mark Miller and would appreciate it if he or
anyone else can provide more convincing points on this.

Thanks,
Mao

On Sat, Apr 18, 2015 at 4:44 PM Erick Erickson erickerick...@gmail.com
wrote:

 AFAIK, the HDFS replication of Solr indexes isn't something that was
 designed; it just came along for the ride given HDFS replication.
 A shard with 1 leader and two followers keeping 9 copies of the
 index around _is_ overkill, nobody argues that at all.

 I know the folks at Cloudera (who contributed the original HDFS
 implementation) have discussed various options around this. In the
 grand scheme of things, there have been other priorities without
 tearing into the guts of Solr and/or HDFS, since disk space is
 relatively cheap.

 That said, I'm also sure that this will get some attention as
 priorities change. All patches welcome of course ;). But if you're
 inclined to work on this issue, I'd _really_ discuss it with Mark
 Miller etc. before investing too much effort in it. I don't quite
 know the tradeoffs well enough to have an opinion on the right
 implementation.

 Best
 Erick

 On Sat, Apr 18, 2015 at 1:59 AM, Shalin Shekhar Mangar
 shalinman...@gmail.com wrote:
  Some comments inline:
 
  On Sat, Apr 18, 2015 at 2:12 PM, gengmao geng...@gmail.com wrote:
 
  On Sat, Apr 18, 2015 at 12:20 AM Jürgen Wagner (DVT) 
  juergen.wag...@devoteam.com wrote:
 
    Replication on the storage layer will provide reliable storage for the
    index and other data of Solr. In particular, this replication does not
    guarantee your index files are consistent at any time, as there may be
    intermediate states that are only partially replicated. Replication is
    only a convergent process, not an instant, atomic operation. With
    frequent changes, this becomes an issue.
  
   Firstly, thanks for your reply. However, I can't agree with you on this.
   HDFS guarantees consistency even with replicas - you always read what
   you write; no partially replicated state will be read, which is
   guaranteed by the HDFS server and client. Hence HBase can rely on HDFS
   for consistency and availability, without implementing another
   replication mechanism - if I understand correctly.
 
 
  A Lucene index is not one file but a collection of files which are
  written independently. So if you replicate them out of order, Lucene
  might consider the index corrupted (because of missing files). I don't
  think HBase works in that way.
 
 
 
    Replication inside SolrCloud as an application will not only maintain
    the consistency of the search-level interfaces to your indexes, but
    also scale in the sense of the application (query throughput).
   
    Splitting one shard into two shards can increase the query throughput
    too.
 
 
    Imagine a database: if you change one record, this may also result in
    an index change. If the record and the index are stored in different
    storage blocks, one will get replicated first. However, the replication
    target will only be consistent again when both have been replicated.
    So, you would have to suspend all accesses until the entire replication
    has completed. That's undesirable. If you replicate on the application
    (database management system) level, the application will employ a more
    fine-grained approach to replication, guaranteeing application
    consistency.
  
   In HBase, a region is located on a single region server at any time,
   which guarantees its consistency. Because your reads and writes always
   land in one region, you don't have to worry about parallel writes
   happening on multiple replicas of the same region.
   The replication of HDFS is totally transparent to HBase. When an HDFS
   write call returns, HBase knows the data is written and replicated, so
   losing one copy of the data won't impact HBase at all.
   So HDFS means consistency and reliability for HBase. However, HBase
   doesn't use replicas (either HBase's own or HDFS's) to scale reads. If
   one region is too hot for reads or writes, you split that region into
   two regions, so that the reads and writes of that region can be
   distributed onto two region servers. Hence HBase scales.
   I think this is the simplicity and beauty of HBase. Again, I am curious
   whether SolrCloud has a better reason to use replication on HDFS. As I
   described, HDFS provides consistency and reliability, while scalability
   can be achieved via sharding, even without Solr replication.
 
 
  That's something that has been considered and may even be in the roadmap

Re: Replication for SolrCloud

2015-04-19 Thread juergen.wag...@devoteam.com
In simple words:

HDFS is good for file-oriented replication. Solr is good for index replication.

Consequently, if an application's logically atomic update operations (as in 
Solr) are not atomic on the file level, HDFS is not adequate - as is the case 
for Solr with live index updates. Running Solr on HDFS (as a file system) will 
pose limitations due to HDFS properties. Indexing, however, still won't use Hadoop.

If you produce indexes and distribute them as finalized, read-only structures 
(e.g., through Hadoop jobs), HDFS is fine. Solr does not need to be much aware 
of HDFS.

The third option in the picture is record-based replication handled by 
HBase, Cassandra or ZooKeeper, depending on requirements.

Cheers,
Jürgen

Re: Replication for SolrCloud

2015-04-19 Thread gengmao
Please see my responses inline:

On Fri, Apr 17, 2015 at 10:59 PM Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Some comments inline:

 On Sat, Apr 18, 2015 at 2:12 PM, gengmao geng...@gmail.com wrote:

  On Sat, Apr 18, 2015 at 12:20 AM Jürgen Wagner (DVT) 
  juergen.wag...@devoteam.com wrote:
 
    Replication on the storage layer will provide reliable storage for the
    index and other data of Solr. In particular, this replication does not
    guarantee your index files are consistent at any time, as there may be
    intermediate states that are only partially replicated. Replication is
    only a convergent process, not an instant, atomic operation. With
    frequent changes, this becomes an issue.
  
   Firstly, thanks for your reply. However, I can't agree with you on this.
   HDFS guarantees consistency even with replicas - you always read what
   you write; no partially replicated state will be read, which is
   guaranteed by the HDFS server and client. Hence HBase can rely on HDFS
   for consistency and availability, without implementing another
   replication mechanism - if I understand correctly.
 
 
 A Lucene index is not one file but a collection of files which are written
 independently. So if you replicate them out of order, Lucene might consider
 the index corrupted (because of missing files). I don't think HBase works
 in that way.

Again, HDFS replication is transparent to HBase. You can set the HDFS
replication factor to 1 and HBase will still work, but it will lose the
fault tolerance to disk failures that is provided by HDFS replicas. Also,
HBase doesn't directly utilize HDFS replicas: increasing the HDFS
replication factor won't improve HBase's scalability. To achieve better
read/write throughput, splitting shards is the only approach.
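
For what it's worth, SolrCloud exposes the same kind of splitting through
the Collections API SPLITSHARD action - a sketch, where the collection and
shard names are just placeholders:

curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1'

This splits shard1 into two sub-shards and redistributes its documents
between them.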



 
    Replication inside SolrCloud as an application will not only maintain
    the consistency of the search-level interfaces to your indexes, but
    also scale in the sense of the application (query throughput).
   
    Splitting one shard into two shards can increase the query throughput
    too.
 
 
    Imagine a database: if you change one record, this may also result in
    an index change. If the record and the index are stored in different
    storage blocks, one will get replicated first. However, the replication
    target will only be consistent again when both have been replicated.
    So, you would have to suspend all accesses until the entire replication
    has completed. That's undesirable. If you replicate on the application
    (database management system) level, the application will employ a more
    fine-grained approach to replication, guaranteeing application
    consistency.
  
   In HBase, a region is located on a single region server at any time,
   which guarantees its consistency. Because your reads and writes always
   land in one region, you don't have to worry about parallel writes
   happening on multiple replicas of the same region.
   The replication of HDFS is totally transparent to HBase. When an HDFS
   write call returns, HBase knows the data is written and replicated, so
   losing one copy of the data won't impact HBase at all.
   So HDFS means consistency and reliability for HBase. However, HBase
   doesn't use replicas (either HBase's own or HDFS's) to scale reads. If
   one region is too hot for reads or writes, you split that region into
   two regions, so that the reads and writes of that region can be
   distributed onto two region servers. Hence HBase scales.
   I think this is the simplicity and beauty of HBase. Again, I am curious
   whether SolrCloud has a better reason to use replication on HDFS. As I
   described, HDFS provides consistency and reliability, while scalability
   can be achieved via sharding, even without Solr replication.
 
 
 That's something that has been considered and may even be in the roadmap
 for the Cloudera guys. See https://issues.apache.org/jira/browse/SOLR-6237

 But one problem that isn't solved by HDFS replication is that of
 near-real-time indexing, where you want the documents to be available for
 searchers as fast as possible. SolrCloud replication supports that by
 replicating documents as they come in and indexing them in several
 replicas. A new index searcher is opened on the flushed index files as
 well as on the internal data structures of the index writer. If we switch
 to relying on HDFS replication then this will be awfully expensive.
 However, as Jürgen mentioned, HDFS can certainly help with replicating
 static indexes

My understanding is that near-real-time indexing does not need to rely on
replication.
https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching
just describes soft commits and doesn't mention replication. Also, Cloudera
Search, which is Solr on HDFS, claims near-real-time indexing but doesn't
mention replication either. Quote from
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-search/v1-latest/Cloudera-Search-User-Guide/csug_introducing.html

Re: Replication for SolrCloud

2015-04-18 Thread Erick Erickson
AFAIK, the HDFS replication of Solr indexes isn't something that was
designed; it just came along for the ride given HDFS replication.
A shard with 1 leader and two followers keeping 9 copies of the
index around _is_ overkill, nobody argues that at all.

I know the folks at Cloudera (who contributed the original HDFS
implementation) have discussed various options around this. In the
grand scheme of things, there have been other priorities without
tearing into the guts of Solr and/or HDFS, since disk space is
relatively cheap.

That said, I'm also sure that this will get some attention as
priorities change. All patches welcome of course ;). But if you're
inclined to work on this issue, I'd _really_ discuss it with Mark
Miller etc. before investing too much effort in it. I don't quite
know the tradeoffs well enough to have an opinion on the right
implementation.

Best
Erick

On Sat, Apr 18, 2015 at 1:59 AM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:
 Some comments inline:

 On Sat, Apr 18, 2015 at 2:12 PM, gengmao geng...@gmail.com wrote:

 On Sat, Apr 18, 2015 at 12:20 AM Jürgen Wagner (DVT) 
 juergen.wag...@devoteam.com wrote:

   Replication on the storage layer will provide reliable storage for the
  index and other data of Solr. In particular, this replication does not
  guarantee your index files are consistent at any time, as there may be
  intermediate states that are only partially replicated. Replication is
  only a convergent process, not an instant, atomic operation. With
  frequent changes, this becomes an issue.
 
 Firstly, thanks for your reply. However, I can't agree with you on this.
 HDFS guarantees consistency even with replicas - you always read what you
 write; no partially replicated state will be read, which is guaranteed by
 the HDFS server and client. Hence HBase can rely on HDFS for consistency
 and availability, without implementing another replication mechanism - if
 I understand correctly.


 A Lucene index is not one file but a collection of files which are written
 independently. So if you replicate them out of order, Lucene might consider
 the index corrupted (because of missing files). I don't think HBase works
 in that way.



  Replication inside SolrCloud as an application will not only maintain the
  consistency of the search-level interfaces to your indexes, but also
  scale in the sense of the application (query throughput).
 
  Splitting one shard into two shards can increase the query throughput too.


  Imagine a database: if you change one record, this may also result in an
  index change. If the record and the index are stored in different storage
  blocks, one will get replicated first. However, the replication target
  will only be consistent again when both have been replicated. So, you
  would have to suspend all accesses until the entire replication has
  completed. That's undesirable. If you replicate on the application
  (database management system) level, the application will employ a more
  fine-grained approach to replication, guaranteeing application
  consistency.
 
 In HBase, a region is located on a single region server at any time, which
 guarantees its consistency. Because your reads and writes always land in
 one region, you don't have to worry about parallel writes happening on
 multiple replicas of the same region.
 The replication of HDFS is totally transparent to HBase. When an HDFS
 write call returns, HBase knows the data is written and replicated, so
 losing one copy of the data won't impact HBase at all.
 So HDFS means consistency and reliability for HBase. However, HBase
 doesn't use replicas (either HBase's own or HDFS's) to scale reads. If one
 region is too hot for reads or writes, you split that region into two
 regions, so that the reads and writes of that region can be distributed
 onto two region servers. Hence HBase scales.
 I think this is the simplicity and beauty of HBase. Again, I am curious
 whether SolrCloud has a better reason to use replication on HDFS. As I
 described, HDFS provides consistency and reliability, while scalability
 can be achieved via sharding, even without Solr replication.


 That's something that has been considered and may even be in the roadmap
 for the Cloudera guys. See https://issues.apache.org/jira/browse/SOLR-6237

 But one problem that isn't solved by HDFS replication is that of
 near-real-time indexing, where you want the documents to be available for
 searchers as fast as possible. SolrCloud replication supports that by
 replicating documents as they come in and indexing them in several
 replicas. A new index searcher is opened on the flushed index files as
 well as on the internal data structures of the index writer. If we switch
 to relying on HDFS replication then this will be awfully expensive.
 However, as Jürgen mentioned, HDFS can certainly help with replicating
 static indexes.



  Consequently, HDFS will allow you to scale storage and possibly even
  replicate static indexes that won't change

Re: Replication for SolrCloud

2015-04-18 Thread gengmao
On Sat, Apr 18, 2015 at 12:20 AM Jürgen Wagner (DVT) 
juergen.wag...@devoteam.com wrote:

  Replication on the storage layer will provide reliable storage for the
 index and other data of Solr. In particular, this replication does not
 guarantee your index files are consistent at any time, as there may be
 intermediate states that are only partially replicated. Replication is only
 a convergent process, not an instant, atomic operation. With frequent
 changes, this becomes an issue.

Firstly, thanks for your reply. However, I can't agree with you on this.
HDFS guarantees consistency even with replicas - you always read what you
write; no partially replicated state will be read, which is guaranteed by
the HDFS server and client. Hence HBase can rely on HDFS for consistency
and availability, without implementing another replication mechanism - if I
understand correctly.


 Replication inside SolrCloud as an application will not only maintain the
 consistency of the search-level interfaces to your indexes, but also scale
 in the sense of the application (query throughput).

 Splitting one shard into two shards can increase the query throughput too.


 Imagine a database: if you change one record, this may also result in an
 index change. If the record and the index are stored in different storage
 blocks, one will get replicated first. However, the replication target will
 only be consistent again when both have been replicated. So, you would have
 to suspend all accesses until the entire replication has completed. That's
 undesirable. If you replicate on the application (database management
 system) level, the application will employ a more fine-grained approach to
 replication, guaranteeing application consistency.

In HBase, a region is located on a single region server at any time, which
guarantees its consistency. Because your reads and writes always land in
one region, you don't have to worry about parallel writes happening on
multiple replicas of the same region.
The replication of HDFS is totally transparent to HBase. When an HDFS write
call returns, HBase knows the data is written and replicated, so losing one
copy of the data won't impact HBase at all.
So HDFS means consistency and reliability for HBase. However, HBase doesn't
use replicas (either HBase's own or HDFS's) to scale reads. If one region
is too hot for reads or writes, you split that region into two regions, so
that the reads and writes of that region can be distributed onto two region
servers. Hence HBase scales.
I think this is the simplicity and beauty of HBase. Again, I am curious
whether SolrCloud has a better reason to use replication on HDFS. As I
described, HDFS provides consistency and reliability, while scalability can
be achieved via sharding, even without Solr replication.


 Consequently, HDFS will allow you to scale storage and possibly even
 replicate static indexes that won't change, but it won't help much with
 live index replication. That's where SolrCloud jumps in.


 Cheers,
 --Jürgen


 On 18.04.2015 08:44, gengmao wrote:

 I wonder why we need to use SolrCloud replication on HDFS at all, given
 that HDFS already provides replication and availability? The way to
 optimize performance and scalability should be tweaking shards, just like
 tweaking regions on HBase - which doesn't provide region replication
 either, does it?

 I have had this question for a while and haven't found a clear answer to
 it. Could some experts please explain a bit?

 Best regards,
 Mao Geng





 --

 Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
 уважением
 *i.A. Jürgen Wagner*
 Head of Competence Center Intelligence
  & Senior Cloud Consultant

 Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
 Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
 E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de
 --
 Managing Board: Jürgen Hatzipantelis (CEO)
 Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
 Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071





Re: Replication for SolrCloud

2015-04-18 Thread gengmao
I wonder why we need to use SolrCloud replication on HDFS at all, given
that HDFS already provides replication and availability? The way to
optimize performance and scalability should be tweaking shards, just like
tweaking regions on HBase - which doesn't provide region replication
either, does it?

I have had this question for a while and haven't found a clear answer to
it. Could some experts please explain a bit?

Best regards,
Mao Geng

On Thu, Apr 9, 2015 at 8:41 AM Erick Erickson erickerick...@gmail.com
wrote:

 Yes. 3 replicas and an HDFS replication factor of 3 means 9 copies of
 the index are lying around. You can change your HDFS replication
 factor, but that affects other applications using HDFS, so that may
 not be an option.

 Best,
 Erick

 On Thu, Apr 9, 2015 at 2:31 AM, Vijaya Narayana Reddy Bhoomi Reddy
 vijaya.bhoomire...@whishworks.com wrote:
  Hi,
 
  Can anyone please tell me how shard replication works when the indexes
  are stored in HDFS? I.e., with HDFS, the default replication factor is 3.
  Now, for the Solr shards, if I set the replication factor to 3 again,
  does that mean the index data is internally replicated thrice and then
  HDFS replication works on top of it again and duplicates the data across
  the HDFS cluster?
 
 
  Thanks & Regards
  Vijay
 
  --
  The contents of this e-mail are confidential and for the exclusive use of
  the intended recipient. If you receive this e-mail in error please delete
  it from your system immediately and notify us either by e-mail or
  telephone. You should not copy, forward or otherwise disclose the content
  of the e-mail. The views expressed in this communication may not
  necessarily be the view held by WHISHWORKS.



Re: Replication for SolrCloud

2015-04-18 Thread Shalin Shekhar Mangar
Some comments inline:

On Sat, Apr 18, 2015 at 2:12 PM, gengmao geng...@gmail.com wrote:

 On Sat, Apr 18, 2015 at 12:20 AM Jürgen Wagner (DVT) 
 juergen.wag...@devoteam.com wrote:

   Replication on the storage layer will provide reliable storage for the
  index and other data of Solr. In particular, this replication does not
  guarantee your index files are consistent at any time, as there may be
  intermediate states that are only partially replicated. Replication is
  only a convergent process, not an instant, atomic operation. With
  frequent changes, this becomes an issue.
 
 Firstly, thanks for your reply. However, I can't agree with you on this.
 HDFS guarantees consistency even with replicas - you always read what you
 write; no partially replicated state will be read, which is guaranteed by
 the HDFS server and client. Hence HBase can rely on HDFS for consistency
 and availability, without implementing another replication mechanism - if
 I understand correctly.


A Lucene index is not one file but a collection of files which are written
independently. So if you replicate them out of order, Lucene might consider
the index corrupted (because of missing files). I don't think HBase works
in that way.
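
To make that concrete: a Lucene index directory typically holds a
segments_N file plus many per-segment files, along the lines of the listing
below - exact names and extensions vary by Lucene version and merge state:

  segments_4  write.lock
  _0.si  _0.cfs  _0.cfe
  _1.si  _1.fnm  _1.fdt  _1.fdx  _1.tim  _1.tip  _1.doc  _1.pos

Copying these files in an arbitrary order can leave a segments_N that
references per-segment files which haven't arrived yet.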



  Replication inside SolrCloud as an application will not only maintain the
  consistency of the search-level interfaces to your indexes, but also
  scale in the sense of the application (query throughput).
 
  Splitting one shard into two shards can increase the query throughput too.


  Imagine a database: if you change one record, this may also result in an
  index change. If the record and the index are stored in different storage
  blocks, one will get replicated first. However, the replication target
  will only be consistent again when both have been replicated. So, you
  would have to suspend all accesses until the entire replication has
  completed. That's undesirable. If you replicate on the application
  (database management system) level, the application will employ a more
  fine-grained approach to replication, guaranteeing application
  consistency.
 
 In HBase, a region is located on a single region server at any time, which
 guarantees its consistency. Because your reads and writes always land in
 one region, you don't have to worry about parallel writes happening on
 multiple replicas of the same region.
 The replication of HDFS is totally transparent to HBase. When an HDFS
 write call returns, HBase knows the data is written and replicated, so
 losing one copy of the data won't impact HBase at all.
 So HDFS means consistency and reliability for HBase. However, HBase
 doesn't use replicas (either HBase's own or HDFS's) to scale reads. If one
 region is too hot for reads or writes, you split that region into two
 regions, so that the reads and writes of that region can be distributed
 onto two region servers. Hence HBase scales.
 I think this is the simplicity and beauty of HBase. Again, I am curious
 whether SolrCloud has a better reason to use replication on HDFS. As I
 described, HDFS provides consistency and reliability, while scalability
 can be achieved via sharding, even without Solr replication.


That's something that has been considered and may even be in the roadmap
for the Cloudera guys. See https://issues.apache.org/jira/browse/SOLR-6237

But one problem that isn't solved by HDFS replication is that of
near-real-time indexing, where you want the documents to be available for
searchers as fast as possible. SolrCloud replication supports that by
replicating documents as they come in and indexing them in several
replicas. A new index searcher is opened on the flushed index files as
well as on the internal data structures of the index writer. If we switch
to relying on HDFS replication then this will be awfully expensive.
However, as Jürgen mentioned, HDFS can certainly help with replicating
static indexes.
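
For context, the near-real-time behavior in question is driven by soft
commits, configured in solrconfig.xml roughly as follows - the interval is
just an illustrative value:

  <autoSoftCommit>
    <!-- open a new searcher on recent updates about once per second -->
    <maxTime>1000</maxTime>
  </autoSoftCommit>

With SolrCloud replication, each replica indexes incoming documents itself,
so a soft commit can expose them to searchers without copying index files.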



  Consequently, HDFS will allow you to scale storage and possibly even
  replicate static indexes that won't change, but it won't help much with
  live index replication. That's where SolrCloud jumps in.
 

  Cheers,
  --Jürgen
 
 
  On 18.04.2015 08:44, gengmao wrote:
 
  I wonder why we need to use SolrCloud replication on HDFS at all, given
  that HDFS already provides replication and availability? The way to
  optimize performance and scalability should be tweaking shards, just like
  tweaking regions on HBase - which doesn't provide region replication
  either, does it?
 
  I have had this question for a while and haven't found a clear answer to
  it. Could some experts please explain a bit?
 
  Best regards,
  Mao Geng
 
 
 
 
 
  --
 
  Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
  уважением
  *i.A. Jürgen Wagner*
  Head of Competence Center Intelligence
   & Senior Cloud Consultant
 
  Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
  Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864
 1543
  E-Mail: juergen.wag...@devoteam.com, URL

Replication for SolrCloud

2015-04-09 Thread Vijaya Narayana Reddy Bhoomi Reddy
Hi,

Can anyone please tell me how shard replication works when the indexes
are stored in HDFS? I.e., with HDFS, the default replication factor is 3.
Now, for the Solr shards, if I set the replication factor to 3 again, does
that mean the index data is internally replicated thrice and then HDFS
replication works on top of it again and duplicates the data across the
HDFS cluster?


Thanks & Regards
Vijay

-- 
The contents of this e-mail are confidential and for the exclusive use of 
the intended recipient. If you receive this e-mail in error please delete 
it from your system immediately and notify us either by e-mail or 
telephone. You should not copy, forward or otherwise disclose the content 
of the e-mail. The views expressed in this communication may not 
necessarily be the view held by WHISHWORKS.


Re: Replication for SolrCloud

2015-04-09 Thread Erick Erickson
Yes. 3 replicas and an HDFS replication factor of 3 means 9 copies of
the index are lying around. You can change your HDFS replication
factor, but that affects other applications using HDFS, so that may
not be an option.
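
If fewer copies of the index alone are acceptable, note that HDFS
replication can also be set per path rather than cluster-wide - a sketch,
assuming the Solr indexes live under /solr:

  hdfs dfs -setrep -w 2 /solr

That changes the factor only for files already under that path; new files
still get the configured default.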

Best,
Erick

On Thu, Apr 9, 2015 at 2:31 AM, Vijaya Narayana Reddy Bhoomi Reddy
vijaya.bhoomire...@whishworks.com wrote:
 Hi,

 Can anyone please tell me how shard replication works when the indexes
 are stored in HDFS? I.e., with HDFS, the default replication factor is 3.
 Now, for the Solr shards, if I set the replication factor to 3 again, does
 that mean the index data is internally replicated thrice and then HDFS
 replication works on top of it again and duplicates the data across the
 HDFS cluster?


 Thanks & Regards
 Vijay

 --
 The contents of this e-mail are confidential and for the exclusive use of
 the intended recipient. If you receive this e-mail in error please delete
 it from your system immediately and notify us either by e-mail or
 telephone. You should not copy, forward or otherwise disclose the content
 of the e-mail. The views expressed in this communication may not
 necessarily be the view held by WHISHWORKS.


Traditional replication behind SolrCloud

2013-01-29 Thread Mingfeng Yang
Our application of Solr is somewhat atypical. We constantly feed Solr
with lots of documents grabbed from the internet, and NRT searching is not
required. A typical search will return millions of results, and query
responses need to be as fast as possible.

Since in a SolrCloud environment indexing requests are constantly
distributed to all leaders and replicas, I think that may impact query
performance, since the replicas are doing indexing and searching at the
same time. I am thinking about setting up traditional replication behind
each shard of SolrCloud, and setting the replication interval to a few
minutes, to minimize the impact of indexing on system resources.
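
For reference, the traditional setup described above would be the old
ReplicationHandler slave config in solrconfig.xml, roughly like this - the
master URL and interval are only illustrative:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <!-- pull the index from the shard leader every 5 minutes -->
      <str name="masterUrl">http://master-host:8983/solr/core1/replication</str>
      <str name="pollInterval">00:05:00</str>
    </lst>
  </requestHandler>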

Or is there already some way to enforce the traditional type of replication
in the replicas of SolrCloud?

Thanks,
Ming


Re: Replication in SolrCloud

2012-12-03 Thread Arkadi Colson

  
  
Thanks for the explanation. It's clear now...

I expanded the setup to 4 hosts with 2 shards and 1 replica for each
shard. When I shut down Tomcat on solr01-dcg, which is the master of
shard 1 for both collections, the replica (solr01-gs) seems NOT to
take over. See logs below.
Dec 3, 2012 9:55:34 AM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: Running the leader process.
Dec 3, 2012 9:55:34 AM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: Checking if I should try and be the leader.
Dec 3, 2012 9:55:34 AM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: My last published State was Active, it's okay to be the leader.
Dec 3, 2012 9:55:34 AM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: I may be the new leader - try and sync
Dec 3, 2012 9:55:34 AM org.apache.solr.cloud.SyncStrategy sync
INFO: Sync replicas to http://solr01-gs:8983/solr/intradesk/
Dec 3, 2012 9:55:34 AM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=intradesk url=http://solr01-gs:8983/solr START replicas=[http://solr01-dcg:8983/solr/intradesk/] nUpdates=100
Dec 3, 2012 9:55:34 AM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=intradesk url=http://solr01-gs:8983/solr DONE. We have no versions. sync failed.
Dec 3, 2012 9:55:34 AM org.apache.solr.common.SolrException log
SEVERE: Sync Failed
Dec 3, 2012 9:55:34 AM org.apache.solr.cloud.ShardLeaderElectionContext rejoinLeaderElection
INFO: There is a better leader candidate than us - going back into recovery
Dec 3, 2012 9:55:35 AM org.apache.solr.update.DefaultSolrCoreState doRecovery
INFO: Running recovery - first canceling any ongoing recovery
Dec 3, 2012 9:55:35 AM org.apache.solr.cloud.RecoveryStrategy run
INFO: Starting recovery process. core=intradesk recoveringAfterStartup=false
Dec 3, 2012 9:55:35 AM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Attempting to PeerSync from http://solr01-dcg:8983/solr/intradesk/ core=intradesk - recoveringAfterStartup=false
Dec 3, 2012 9:55:35 AM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=intradesk url=http://solr01-gs:8983/solr START replicas=[http://solr01-dcg:8983/solr/intradesk/] nUpdates=100
Dec 3, 2012 9:55:35 AM org.apache.solr.update.PeerSync sync
WARNING: no frame of reference to tell of we've missed updates
Dec 3, 2012 9:55:35 AM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: PeerSync Recovery was not successful - trying replication. core=intradesk
Dec 3, 2012 9:55:35 AM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Starting Replication Recovery. core=intradesk
Dec 3, 2012 9:55:35 AM org.apache.solr.client.solrj.impl.HttpClientUtil createClient
INFO: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
Dec 3, 2012 9:55:35 AM org.apache.solr.common.SolrException log
SEVERE: Error while trying to recover. core=intradesk:org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://solr01-dcg:8983/solr
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:406)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
    at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:199)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:388)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:220)
Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://solr01-dcg:8983 refused
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158)
    at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:150)
    at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
    at
Re: Replication in SolrCloud

2012-12-03 Thread Arkadi Colson

Never mind, I think I found it.

There must be some documents in each shard so they have a version
number. Then everything seems to work...


On 11/30/2012 04:57 PM, Mark Miller wrote:

Thanks for all the detailed info!

Yes, that is confusing. One of the sore points we have while supporting both 
std Solr and SolrCloud mode.

In SolrCloud, every node is a Master when thinking about std Solr replication. 
However, as you see on the cloud page, only one of them is a *leader*. A leader 
is different than a master.

Being a Master when it comes to the replication handler simply means you can 
replicate the index to other nodes - in SolrCloud we need every node to be 
capable of doing that. Each shard only has one leader, but every node in your 
cluster will be a replication master.

- Mark


On Nov 30, 2012, at 10:32 AM, Arkadi Colson ark...@smartbit.be wrote:


This is my setup for SolrCloud 4.0 on Tomcat 7.0.33 and ZooKeeper 3.4.5

hosts:
- solr01-dcg (first started)
- solr01-gs (second started, so it becomes the replica)

collections:
- smsc

shards:
- mydoc

zookeeper:
- on solr01-dcg
- on solr01-gs

SOLR_OPTS="-Dsolr.solr.home=/opt/solr/ -Dport=8983 -Dcollection.configName=smsc 
-DzkClientTimeout=2 -DzkHost=solr01-dcg:2181,solr01-gs:2181"

solr.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores" zkClientTimeout="2" hostPort="8983">
    <core schema="schema.xml" shard="shard1" instanceDir="/solr/mydoc/" 
          name="mydoc" config="solrconfig.xml" collection="mydoc"/>
  </cores>
</solr>

I upload the config to zookeeper:
java -classpath .:/usr/local/tomcat/webapps/solr/WEB-INF/lib/* 
org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost 
solr01-dcg:2181,solr01-gs:2181 -confdir /opt/solr/conf -confname smsc

Linking the config to the collection:
java -classpath .:/usr/local/tomcat/webapps/solr/WEB-INF/lib/* 
org.apache.solr.cloud.ZkCLI -cmd linkconfig -collection mydoc -zkhost 
solr01-dcg.intnet.smartbit.be:2181,solr01-gs.intnet.smartbit.be:2181 -confname 
smsc
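
To sanity-check what actually landed in ZooKeeper, the same ZkCLI class can 
pull the config back down - a sketch using the hosts above, where the target 
directory is just an example:
java -classpath .:/usr/local/tomcat/webapps/solr/WEB-INF/lib/* 
org.apache.solr.cloud.ZkCLI -cmd downconfig -zkhost 
solr01-dcg:2181,solr01-gs:2181 -confdir /tmp/smsc-check -confname smsc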

cloud on both hosts:

[screenshot attachments: dcddagii.png (cloud view), hhfgdeab.png (solr01-dcg), daafhdef.png (solr01-gs)]

Any idea?

Thanks!

On 11/30/2012 03:15 PM, Mark Miller wrote:

On Nov 30, 2012, at 5:08 AM, Arkadi Colson ark...@smartbit.be
  wrote:



Hi

I've set up a simple 2-machine cloud with 1 shard, one replica and 2 
collections. Everything went fine. However, when I look at the interface:
http://localhost:8983/solr/#/coll1/replication
it is reporting that both machines are master. Did I do something wrong in my 
config, or is it a report for a manual replication configuration? Can someone 
else check this?


How? You don't really give anything to look at :)



Is it possible to link 2 collections to the same conf in zookeeper?



Yes, that is no problem.

- Mark









--
Kind regards

Arkadi Colson

Smartbit bvba . Hoogstraat 13 . 3670 Meeuwen
T +32 11 64 08 80 . F +32 11 64 08 81



Re: Replication in SolrCloud

2012-11-30 Thread Mark Miller

On Nov 30, 2012, at 5:08 AM, Arkadi Colson ark...@smartbit.be wrote:

 Hi
 
 I've set up a simple 2-machine cloud with 1 shard, one replica and 2 
 collections. Everything went fine. However, when I look at the interface: 
 http://localhost:8983/solr/#/coll1/replication it is reporting that both 
 machines are master. Did I do something wrong in my config, or is it a report 
 for a manual replication configuration? Can someone else check this?

How? You don't really give anything to look at :)

 
 Is it possible to link 2 collections to the same conf in zookeeper?
 

Yes, that is no problem.

- Mark



Re: Replication in SolrCloud

2012-11-30 Thread Mark Miller
Thanks for all the detailed info!

Yes, that is confusing. One of the sore points we have while supporting both 
std Solr and SolrCloud mode.

In SolrCloud, every node is a Master when thinking about std Solr replication. 
However, as you see on the cloud page, only one of them is a *leader*. A leader 
is different than a master.

Being a Master when it comes to the replication handler simply means you can 
replicate the index to other nodes - in SolrCloud we need every node to be 
capable of doing that. Each shard only has one leader, but every node in your 
cluster will be a replication master.
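
A quick way to see this distinction on a node is the replication handler's
details command - a hedged example, using the core name from the setup quoted
below:

curl 'http://solr01-dcg:8983/solr/mydoc/replication?command=details'

Every node will report itself as a master there, while the cloud page shows
which replica currently holds the leader role.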

- Mark


On Nov 30, 2012, at 10:32 AM, Arkadi Colson ark...@smartbit.be wrote:

 This is my setup for SolrCloud 4.0 on Tomcat 7.0.33 and ZooKeeper 3.4.5
 
 hosts:
 - solr01-dcg (first started)
 - solr01-gs (second started, so it becomes the replica)
 
 collections:
 - smsc
 
 shards:
 - mydoc
 
 zookeeper:
 - on solr01-dcg
 - on solr01-gs
 
 SOLR_OPTS="-Dsolr.solr.home=/opt/solr/ -Dport=8983 
 -Dcollection.configName=smsc -DzkClientTimeout=2 
 -DzkHost=solr01-dcg:2181,solr01-gs:2181"
 
 solr.xml:
 <?xml version="1.0" encoding="UTF-8" ?>
 <solr persistent="true">
   <cores adminPath="/admin/cores" zkClientTimeout="2" hostPort="8983">
     <core schema="schema.xml" shard="shard1" instanceDir="/solr/mydoc/" 
           name="mydoc" config="solrconfig.xml" collection="mydoc"/>
   </cores>
 </solr>
 
 I upload the config to zookeeper:
 java -classpath .:/usr/local/tomcat/webapps/solr/WEB-INF/lib/* 
 org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost 
 solr01-dcg:2181,solr01-gs:2181 -confdir /opt/solr/conf -confname smsc
 
 Linking the config to the collection:
 java -classpath .:/usr/local/tomcat/webapps/solr/WEB-INF/lib/* 
 org.apache.solr.cloud.ZkCLI -cmd linkconfig -collection mydoc -zkhost 
 solr01-dcg.intnet.smartbit.be:2181,solr01-gs.intnet.smartbit.be:2181 
 -confname smsc
 
 cloud on both hosts:
 
 [screenshot attachments: dcddagii.png (cloud view), hhfgdeab.png (solr01-dcg), daafhdef.png (solr01-gs)]
 
 Any idea?
 
 Thanks!
 
 On 11/30/2012 03:15 PM, Mark Miller wrote:
 On Nov 30, 2012, at 5:08 AM, Arkadi Colson ark...@smartbit.be
  wrote:
 
 
 Hi
 
 I've set up a simple 2-machine cloud with 1 shard, one replica and 2 
 collections. Everything went fine. However, when I look at the interface: 
 http://localhost:8983/solr/#/coll1/replication
 it is reporting that both machines are master. Did I do something wrong in my 
 config, or is it a report for a manual replication configuration? Can someone 
 else check this?
 
 How? You don't really give anything to look at :)
 
 
 Is it possible to link 2 collections to the same conf in zookeeper?
 
 
 Yes, that is no problem.
 
 - Mark
 
 
 
 



Zookeeper aware Replication in SolrCloud

2011-11-04 Thread prakash chandrasekaran


hi,
I'm using SolrCloud and I wanted to add the replication feature to it.
I followed the steps in the Solr Wiki, but when the client tried to poll for
data from the server I got the error messages below.

In the master log:
Nov 3, 2011 8:34:00 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.cloud.ZooKeeperException: ZkSolrResourceLoader does not support getConfigDir() - likely, what you are trying to do is not supported in ZooKeeper mode
    at org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:99)
    at org.apache.solr.handler.ReplicationHandler.getConfFileInfoFromCache(ReplicationHandler.java:378)
    at org.apache.solr.handler.ReplicationHandler.getFileList(ReplicationHandler.java:364)

In the slave log:
Nov 3, 2011 8:34:00 PM org.apache.solr.handler.ReplicationHandler doFetch
SEVERE: SnapPull failed org.apache.solr.common.SolrException: Request failed for the url org.apache.commons.httpclient.methods.PostMethod@18eabf6
    at org.apache.solr.handler.SnapPuller.getNamedListResponse(SnapPuller.java:197)
    at org.apache.solr.handler.SnapPuller.fetchFileList(SnapPuller.java:219)
    at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:281)
    at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:284)

But I can see the slave pointing to the correct master from the link:
http://localhost:7574/solr/replication?command=details

I am also seeing these values in the replication details
(http://localhost:7574/solr/replication?command=details):

<arr name="indexReplicatedAtList">
  <str>Thu Nov 03 20:28:00 PDT 2011</str><str>Thu Nov 03 20:27:00 PDT 2011</str>
  <str>Thu Nov 03 20:26:00 PDT 2011</str><str>Thu Nov 03 20:25:00 PDT 2011</str>
</arr>
<arr name="replicationFailedAtList">
  <str>Thu Nov 03 20:28:00 PDT 2011</str><str>Thu Nov 03 20:27:00 PDT 2011</str>
  <str>Thu Nov 03 20:26:00 PDT 2011</str><str>Thu Nov 03 20:25:00 PDT 2011</str>
</arr>

Thanks,
Prakash