[jira] [Commented] (SOLR-12509) Improve SplitShardCmd performance and reliability

2018-08-02 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567389#comment-16567389
 ] 

ASF subversion and git services commented on SOLR-12509:


Commit b5ed6350a0ea444553242ef2b141161c0fc3830b in lucene-solr's branch 
refs/heads/master from Andrzej Bialecki
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b5ed635 ]

SOLR-12509: Fix a bug when using round-robin doc assignment.


> Improve SplitShardCmd performance and reliability
> -
>
> Key: SOLR-12509
> URL: https://issues.apache.org/jira/browse/SOLR-12509
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Components: SolrCloud
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Priority: Major
> Fix For: 7.5
>
> Attachments: SOLR-12509.patch, SOLR-12509.patch
>
>
> {{SplitShardCmd}} is currently quite complex.
> Shard splitting occurs on active shards, which are still being updated, so 
> the splitting has to involve several carefully orchestrated steps, making 
> sure that new sub-shard placeholders are properly created and visible, and 
> then also applying buffered updates to the split leaders and performing 
> recovery on sub-shard replicas.
> This process could be simplified in cases where collections are not actively 
> being updated or can tolerate a little downtime - we could put the shard 
> "offline", i.e. disable writing while the splitting is in progress (to avoid 
> confusing users, we should disable writing to the whole collection).
> The actual index splitting could perhaps be improved to use 
> {{HardLinkCopyDirectoryWrapper}} for creating a copy of the index by 
> hard-linking existing index segments, and then applying deletes to the 
> documents that don't belong in a sub-shard. However, the resulting index 
> slices that replicas would have to pull would be the same size as the whole 
> shard.
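
The "apply deletes to the documents that don't belong in a sub-shard" idea from the description can be sketched as follows. This is only an illustration, not Solr's implementation: the hash function (CRC32), the range boundaries, and the document ids are assumptions chosen for the example. Each sub-shard starts as a full copy of the parent and then drops every document whose hash falls outside its own range.

```python
import zlib


def doc_hash(doc_id: str) -> int:
    # Stand-in hash for illustration only; not the hash Solr uses.
    return zlib.crc32(doc_id.encode("utf-8"))


def docs_to_delete(doc_ids, lo, hi):
    """Return the ids whose hash falls outside [lo, hi]."""
    return [d for d in doc_ids if not (lo <= doc_hash(d) <= hi)]


def split_by_deletes(doc_ids, ranges):
    # Every sub-shard begins as a complete copy of the parent;
    # deleting the complement of its range leaves only its own docs.
    result = []
    for lo, hi in ranges:
        deleted = set(docs_to_delete(doc_ids, lo, hi))
        result.append([d for d in doc_ids if d not in deleted])
    return result


# Hypothetical parent shard with ten documents, split into two hash ranges.
parent = [f"doc{i}" for i in range(10)]
mid = 2 ** 31
sub_shards = split_by_deletes(parent, [(0, mid - 1), (mid, 2 ** 32 - 1)])
```

Because the two ranges cover the whole hash space without overlapping, every parent document ends up in exactly one sub-shard.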



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12509) Improve SplitShardCmd performance and reliability

2018-08-02 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567380#comment-16567380
 ] 

ASF subversion and git services commented on SOLR-12509:


Commit 724a65a60ab7537ab9f0c49cf0a93d2504553ae1 in lucene-solr's branch 
refs/heads/branch_7x from Andrzej Bialecki
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=724a65a ]

SOLR-12509: Fix a bug when using round-robin doc assignment.





[jira] [Commented] (SOLR-12509) Improve SplitShardCmd performance and reliability

2018-08-02 Thread Steve Rowe (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566931#comment-16566931
 ] 

Steve Rowe commented on SOLR-12509:
---

Reproducing {{SolrIndexSplitterTest.testSplitAlternatelyLink()}} failure, from 
[https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/2469/]:

{noformat}
Checking out Revision 600c15d14e73274d4152e8ef1b8c0d0aae69fd18 (refs/remotes/origin/branch_7x)
[...]
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=SolrIndexSplitterTest -Dtests.method=testSplitAlternatelyLink -Dtests.seed=2EC831F1D9B21D7D -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=pl -Dtests.timezone=CTT -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] FAILURE 1.02s J1 | SolrIndexSplitterTest.testSplitAlternatelyLink <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: split index1 has wrong number of documents expected:<5> but was:<6>
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([2EC831F1D9B21D7D:9992FDD95CED0639]:0)
   [junit4]    >    at org.apache.solr.update.SolrIndexSplitterTest.doTestSplitAlternately(SolrIndexSplitterTest.java:272)
   [junit4]    >    at org.apache.solr.update.SolrIndexSplitterTest.testSplitAlternatelyLink(SolrIndexSplitterTest.java:247)
   [junit4]    >    at java.lang.Thread.run(Thread.java:748)
[...]
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {id=PostingsFormat(name=Memory)}, docValues:{_version_=DocValuesFormat(name=Lucene70), id=DocValuesFormat(name=Asserting)}, maxPointsInLeafNode=419, maxMBSortInHeap=6.5586874621200195, sim=RandomSimilarity(queryNorm=false): {}, locale=pl, timezone=CTT
   [junit4]   2> NOTE: Linux 4.15.0-29-generic amd64/Oracle Corporation 1.8.0_172 (64-bit)/cpus=8,threads=1,free=160622208,total=536870912
{noformat}
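
For context, the failing assertion checks how many documents each sub-index receives when documents are assigned alternately. The sketch below shows the general round-robin idea only. It is not the code in SolrIndexSplitter, and the document count of 11 is a hypothetical value, not one taken from the test.

```python
def round_robin_split(doc_ids, num_parts):
    """Assign documents to partitions in turn: doc i goes to partition i % num_parts."""
    parts = [[] for _ in range(num_parts)]
    for i, doc_id in enumerate(doc_ids):
        parts[i % num_parts].append(doc_id)
    return parts


# Hypothetical input: 11 documents split across 2 sub-indexes.
docs = list(range(11))
parts = round_robin_split(docs, 2)
sizes = [len(p) for p in parts]
```

With an odd number of documents the partitions differ in size by one, so an off-by-one in the assignment shows up directly as a wrong per-index document count.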




[jira] [Commented] (SOLR-12509) Improve SplitShardCmd performance and reliability

2018-08-01 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565690#comment-16565690
 ] 

ASF subversion and git services commented on SOLR-12509:


Commit 7faa803a7c9699f38b8a6b3ddd3a88c4729c5e5f in lucene-solr's branch 
refs/heads/branch_7x from Andrzej Bialecki
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7faa803 ]

SOLR-12509: Improve SplitShardCmd performance and reliability.





[jira] [Commented] (SOLR-12509) Improve SplitShardCmd performance and reliability

2018-08-01 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565410#comment-16565410
 ] 

ASF subversion and git services commented on SOLR-12509:


Commit 1133bf98a5fd075173efecfb75a51493fceb62b3 in lucene-solr's branch 
refs/heads/master from Andrzej Bialecki
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1133bf9 ]

SOLR-12509: Improve SplitShardCmd performance and reliability.





[jira] [Commented] (SOLR-12509) Improve SplitShardCmd performance and reliability

2018-07-20 Thread Andrzej Bialecki (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550905#comment-16550905
 ] 

Andrzej Bialecki  commented on SOLR-12509:
--

Thanks Shalin for the review! I attached a new patch that fixes these issues.




[jira] [Commented] (SOLR-12509) Improve SplitShardCmd performance and reliability

2018-07-20 Thread Shalin Shekhar Mangar (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550662#comment-16550662
 ] 

Shalin Shekhar Mangar commented on SOLR-12509:
--

Awesome speedups!

A few minor issues:
# SolrIndexSplitter.findDocsToDelete uses the wrong key to look up inside the 
synchronized block -- {{docsToDelete.get(readerContext.ord);}}
# There is a new {{DefaultSolrCoreState.getIndexWriterLock}} method which isn't 
used anywhere?
# Typo {{changepostd}} in {{ReplicaMutator}}
# We should rename {{index.split}} to follow the {{index.}} 
convention; otherwise dangling "index.split" directories won't be cleaned up by 
{{DirectoryFactory.cleanupOldIndexDirectories}}




[jira] [Commented] (SOLR-12509) Improve SplitShardCmd performance and reliability

2018-07-19 Thread Andrzej Bialecki (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549474#comment-16549474
 ] 

Andrzej Bialecki  commented on SOLR-12509:
--

This patch implements a new method for shard splitting that uses 
{{HardLinkCopyDirectoryWrapper}}. The old method is still available and is used 
by default; the new method may be selected with the {{splitMethod=link}} 
request parameter (the old method can be explicitly selected with 
{{splitMethod=rewrite}}).

There's also support for a new {{timing}} parameter - when set to true the 
SPLITSHARD command returns a "timing" section with elapsed times for each 
internal phase of the command execution.
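
As a rough illustration, a SPLITSHARD request using the two parameters described above could be built like this. The host, collection name, and shard name are placeholders, and the helper function is hypothetical; only {{splitMethod}} and {{timing}} come from this patch description.

```python
from urllib.parse import urlencode


def splitshard_url(base_url, collection, shard, split_method="link", timing=True):
    """Build a Collections API SPLITSHARD URL (hypothetical helper)."""
    if split_method not in ("link", "rewrite"):
        raise ValueError("splitMethod must be 'link' or 'rewrite'")
    params = {
        "action": "SPLITSHARD",
        "collection": collection,
        "shard": shard,
        "splitMethod": split_method,
        "timing": "true" if timing else "false",
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Placeholder host and names, for illustration only.
url = splitshard_url("http://localhost:8983/solr", "test_collection", "shard1")
```

Omitting {{splitMethod}} from the request would leave the default (the old rewrite method) in effect.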

I've been testing the new implementation locally and on a cluster of 5 physical 
nodes, using collections ranging from 2 million to 22 million documents (15 GB 
index size). The new method consistently outperforms the old method by a factor 
of 3 to 5, depending on the index size and number of replicas.

The downside of the new method is that the resulting sub-shards initially have 
the same size as the original shard. On the shard leader these files are 
hard-linked, so they don't consume additional space, but replica nodes still 
need to fetch all that data, which increases network I/O and the initial disk 
consumption on replica nodes.
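
The reason hard-linked files take no extra space on the leader can be seen with plain filesystem calls. The sketch below uses only the Python standard library and made-up directory and file names; it demonstrates hard-link semantics in general, not the behaviour of {{HardLinkCopyDirectoryWrapper}} itself, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def hard_link_copy(src_dir, dst_dir):
    """Create dst_dir containing hard links to every regular file in src_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            os.link(src, os.path.join(dst_dir, name))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical parent index directory with one segment file.
    parent = os.path.join(tmp, "index")
    os.makedirs(parent)
    with open(os.path.join(parent, "_0.cfs"), "wb") as f:
        f.write(b"x" * 4096)

    sub = os.path.join(tmp, "index_sub")
    hard_link_copy(parent, sub)

    a = os.stat(os.path.join(parent, "_0.cfs"))
    b = os.stat(os.path.join(sub, "_0.cfs"))
    same_inode = a.st_ino == b.st_ino  # both names point at the same data
    link_count = a.st_nlink            # number of names for that data
```

Both directory entries refer to one inode, so the data blocks are stored once. A replica on another node has no such shortcut and must copy the full files over the network.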

Here are example timings for the old method:
{code}
  "timing":{
    "time":1547111.0,
    "checkDiskSpace":{
      "time":14.0},
    "fillRanges":{
      "time":2.0},
    "createSubSlicesAndLeadersInState":{
      "time":4439.0},
    "waitForSubSliceLeadersAlive":{
      "time":1009.0},
    "splitParentCore":{
      "time":1538986.0},
    "applyBufferedUpdates":{
      "time":7.0},
    "identifyNodesForReplicas":{
      "time":1.0},
    "createReplicaPlaceholders":{
      "time":7.0},
    "createCoresForReplicas":{
      "time":2173.0},
    "finalCommit":{
      "time":462.0}},
{code}
After that, sub-shard shard1_0 recovered in 220753 ms, so the total time was 
ca. 1770 sec.

 
And the timings for the new method, with exactly the same initial data layout, 
hardware, etc:
{code}
  "timing":{
    "time":15633.0,
    "checkDiskSpace":{
      "time":5.0},
    "fillRanges":{
      "time":2.0},
    "createSubSlicesAndLeadersInState":{
      "time":4411.0},
    "waitForSubSliceLeadersAlive":{
      "time":2.0},
    "splitParentCore":{
      "time":9005.0},
    "identifyNodesForReplicas":{
      "time":0.0},
    "createReplicaPlaceholders":{
      "time":2.0},
    "createCoresForReplicas":{
      "time":2105.0},
    "finalCommit":{
      "time":95.0}},
{code}
After that, sub-shard shard1_0 recovered in 443350 ms, so the total time was 
ca. 600 sec.
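
To compare the two runs, the reported numbers can be combined as below. The sketch adds only the overall command time and the shard1_0 recovery time quoted above; the dictionary keys are names chosen for this example. For the new method these two figures alone sum to about 459 sec, so the rounded total quoted above may include time that is not itemized here.

```python
# Figures copied from the two timing sections and recovery times above (milliseconds).
old = {"command_ms": 1547111.0, "split_parent_core_ms": 1538986.0, "recovery_ms": 220753}
new = {"command_ms": 15633.0, "split_parent_core_ms": 9005.0, "recovery_ms": 443350}


def total_seconds(run):
    """Command time plus sub-shard recovery time, in seconds."""
    return (run["command_ms"] + run["recovery_ms"]) / 1000.0


old_total = total_seconds(old)
new_total = total_seconds(new)

# Speedup of the whole operation, and of the splitParentCore phase alone.
end_to_end_speedup = old_total / new_total
split_speedup = old["split_parent_core_ms"] / new["split_parent_core_ms"]
```

The split phase itself shrinks by more than two orders of magnitude, while recovery roughly doubles because replicas fetch full-size indexes, which is why the end-to-end gain is much smaller than the gain in {{splitParentCore}}.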

