[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-12-27 Thread Lars Bohl (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13857930#comment-13857930
 ] 

Lars Bohl commented on CASSANDRA-6268:
--

I built from trunk today and restarted the cluster using the generated 
build/dist folder in place of the 2.0.3 tarball. There were still over 5 
map tasks in Hadoop, the same number as before. Maybe some setting in 
cassandra.yaml needs to change, or some setting in the cascading-cassandra 
source tap?

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Fix For: 1.2.13, 2.0.4

 Attachments: 6268-src-1.2.txt, 6268-src-2.0.txt, 6268-thrift-1.2.txt, 
 6268-thrift-2.0.txt


 Some customers are complaining about the huge number of splits in Hadoop caused 
 by VNodes. Disabling vnodes only in the Hadoop DC does not fix it. Splits are 
 generated from the results of describe_ring, which returns a huge number of 
 ranges anyway, and doesn't take into account that there will be a huge number 
 of consecutive ranges residing on the nodes we'd like the M/R job to run on.
 The proposed fix:
 1. allows for specifying the DC(s) the Hadoop job should be run in (in DSE - 
 defaults to all Hadoop DCs)
 2. merges consecutive ranges before generating Hadoop splits, so we don't 
 have artificial range splitting caused by vnodes in the other DCs
 For non-DSE users this feature is turned off by default and doesn't change 
 the old behaviour.
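
 The two steps of the proposed fix can be pictured with a minimal sketch. This 
 is not the code from the attached patches: the class SplitRangeMerger, the 
 TokenRange type, the host-to-DC map and the rule "merge when the remaining 
 replica sets are equal" are all assumptions made for illustration, and ring 
 wrap-around is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only; all names here are hypothetical, not Cassandra's API. */
public class SplitRangeMerger {

    /** A token range (start, end] together with the hosts that hold replicas for it. */
    public static final class TokenRange {
        public final long start;
        public final long end;
        public final Set<String> endpoints;

        public TokenRange(long start, long end, Set<String> endpoints) {
            this.start = start;
            this.end = end;
            this.endpoints = endpoints;
        }
    }

    /** Step 1: keep only the endpoints that live in the DC(s) the job should run in. */
    public static List<TokenRange> restrictToDcs(List<TokenRange> ring,
                                                 Map<String, String> hostToDc,
                                                 Set<String> jobDcs) {
        List<TokenRange> result = new ArrayList<>();
        for (TokenRange r : ring) {
            Set<String> kept = new TreeSet<>();
            for (String host : r.endpoints) {
                String dc = hostToDc.get(host);
                if (dc != null && jobDcs.contains(dc)) {
                    kept.add(host);
                }
            }
            result.add(new TokenRange(r.start, r.end, kept));
        }
        return result;
    }

    /** Step 2: merge neighbours that touch and share the same remaining endpoints. */
    public static List<TokenRange> mergeConsecutive(List<TokenRange> ring) {
        List<TokenRange> merged = new ArrayList<>();
        for (TokenRange r : ring) {
            TokenRange last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && last.end == r.start && last.endpoints.equals(r.endpoints)) {
                // Extend the previous range instead of emitting a new one.
                merged.set(merged.size() - 1, new TokenRange(last.start, r.end, last.endpoints));
            } else {
                merged.add(r);
            }
        }
        return merged;
    }

    /** Small made-up ring: h1/h2 are in the "hadoop" DC, c1/c2 in a vnode DC. */
    public static List<TokenRange> example() {
        Map<String, String> hostToDc = new HashMap<>();
        hostToDc.put("h1", "hadoop");
        hostToDc.put("h2", "hadoop");
        hostToDc.put("c1", "cassandra");
        hostToDc.put("c2", "cassandra");

        List<TokenRange> ring = Arrays.asList(
            new TokenRange(0, 10, new TreeSet<>(Arrays.asList("h1", "c1"))),
            new TokenRange(10, 20, new TreeSet<>(Arrays.asList("h1", "c2"))),
            new TokenRange(20, 30, new TreeSet<>(Arrays.asList("h2", "c1"))),
            new TokenRange(30, 40, new TreeSet<>(Arrays.asList("h2", "c2"))));

        Set<String> jobDcs = new TreeSet<>(Arrays.asList("hadoop"));
        return mergeConsecutive(restrictToDcs(ring, hostToDc, jobDcs));
    }

    public static void main(String[] args) {
        for (TokenRange r : example()) {
            System.out.println("(" + r.start + ", " + r.end + "] -> " + r.endpoints);
        }
    }
}
```

 In this made-up ring the four ranges only differ in which vnode-DC host holds 
 a replica, so once those hosts are filtered out the ranges collapse into one 
 per Hadoop node.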



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-12-05 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13840857#comment-13840857
 ] 

Jeremiah Jordan commented on CASSANDRA-6268:


[~jbellis] we good to commit this now?



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-12-05 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13840883#comment-13840883
 ] 

Jonathan Ellis commented on CASSANDRA-6268:
---

Unless I'm missing something the 2.0 patch does not bump the Thrift version to 
19.39.0.



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-12-05 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13840899#comment-13840899
 ] 

Jeremiah Jordan commented on CASSANDRA-6268:


[~pkolaczk] Looks like 
interface/thrift/gen-java/org/apache/cassandra/thrift/Constants.java is 
missing from the thrift-2.0 patch



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13821351#comment-13821351
 ] 

Piotr Kołaczkowski commented on CASSANDRA-6268:
---

Updated to 19.36.2 and 19.39.0



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-11 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13819043#comment-13819043
 ] 

Jonathan Ellis commented on CASSANDRA-6268:
---

So I'm back to bumping to 19.36.2 and 19.39.0 as the best option.



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-11 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13819049#comment-13819049
 ] 

Jeremiah Jordan commented on CASSANDRA-6268:


bq. So I'm back to bumping to 19.36.2 and 19.39.0 as the best option.

+1 to those.  I think those are the least likely to cause issues.  And if we 
change something for 2.1, we should bump the version by 3 or something, to 
leave some extra numbers for 2.0.



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-08 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13817668#comment-13817668
 ] 

Jonathan Ellis commented on CASSANDRA-6268:
---

So, hmm.  What do we do about the Thrift version?

We should bump 1.2 version to 19.37, but we've already used 19.37 in 
2.0.something.  At least I assume we have because it's currently 19.38.

We could:
1. freeze thrift at 19.36.1 and 19.38.0, respectively
2. bump the patch version even though this is not a bugfix
3. see if the 19.37 release was actually public, and if not, go ahead and 
reuse that
4. just commit to 2.0 and leave 1.2 alone



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-08 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13817675#comment-13817675
 ] 

Jonathan Ellis commented on CASSANDRA-6268:
---

Sorry, Piotr; can you rebuild the Thrift patches w/ that change?



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-08 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13817673#comment-13817673
 ] 

Jonathan Ellis commented on CASSANDRA-6268:
---

Looks like we went with bump patch in 1.2, minor in 2.0 for CASSANDRA-6202.



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-08 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13817702#comment-13817702
 ] 

Brandon Williams commented on CASSANDRA-6268:
-

Aleksey pointed out we can use the nuclear option and just bump them both above 
the highest one: 1.2 to 39, and 2.0 to 41 (to avoid having to use this option 
again.)



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-08 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13817781#comment-13817781
 ] 

Jonathan Ellis commented on CASSANDRA-6268:
---

Isn't it confusing to have 1.2.x a higher version than 2.0.y?



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-08 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13817789#comment-13817789
 ] 

Jeremiah Jordan commented on CASSANDRA-6268:


So why not just 19.36.2 and 19.39.0?



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-08 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13817811#comment-13817811
 ] 

Brandon Williams commented on CASSANDRA-6268:
-

bq. Isn't it confusing to have 1.2.x a higher version than 2.0.y?

What? 1.2 would be 39, and 2.0 would be 41.  1.2 is unlikely to ever get a new 
feature that 2.0 wouldn't, so that's fairly safe.

bq. So why not just 19.36.2 and 19.39.0

Because technically, api-wise, this isn't a bugfix.  We bent the rule slightly 
on CASSANDRA-6202 to avoid this conflict, but I like Aleksey's idea better.



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-08 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13817824#comment-13817824
 ] 

Jonathan Ellis commented on CASSANDRA-6268:
---

1.2.12 (39) would be higher than 2.0.0 (37), which is semantically incorrect 
since it implies that 1.2.12 supports a superset of what 2.0.0 does.
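
The ordering concern can be made concrete with a small sketch. It assumes a 
client that gates features on a plain numeric comparison of the dotted Thrift 
version string; the class ThriftVersionCheck and its methods are hypothetical 
and are not part of Cassandra or any client library. The version numbers are 
the ones quoted in this thread.

```java
/** Illustrative sketch only; not an actual Cassandra or client-library class. */
public class ThriftVersionCheck {

    /** Compares dotted versions such as "19.36.2" numerically, field by field. */
    public static int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        int n = Math.max(x.length, y.length);
        for (int i = 0; i < n; i++) {
            // Missing trailing fields count as 0, so "19.37" equals "19.37.0".
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** What a naive client would conclude: "the server has everything 'required' has". */
    public static boolean looksAtLeast(String serverVersion, String required) {
        return compare(serverVersion, required) >= 0;
    }

    public static void main(String[] args) {
        String required = "19.37.0"; // the version attributed to 2.0.0 in this thread

        // Bumping 1.2 to 39: a 1.2.x server would pass a check written for 2.0.0.
        System.out.println("1.2 at 19.39.0 passes: " + looksAtLeast("19.39.0", required));

        // Bumping only the patch field: a 1.2.x server still sorts below 2.0.0.
        System.out.println("1.2 at 19.36.2 passes: " + looksAtLeast("19.36.2", required));
    }
}
```

Under that assumption, bumping 1.2 to 39 makes a 1.2.x server pass a check 
written for 2.0.0, while 19.36.2 keeps every 1.2 release ordered below 2.0.0.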



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-08 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13817836#comment-13817836
 ] 

Brandon Williams commented on CASSANDRA-6268:
-

Ah, I see.  Well, shit.



[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815758#comment-13815758
 ] 

Piotr Kołaczkowski commented on CASSANDRA-6268:
---

OK. I will also generate a separate diff for the Thrift changes.

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 6268-2.txt, 6268-cassandra-2.0.txt




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815908#comment-13815908
 ] 

Piotr Kołaczkowski commented on CASSANDRA-6268:
---

OK, done: separate Thrift and source patches are attached.
I double-checked that the Thrift versions were right: 0.7.0 for the 1.2 branch 
and 0.9.1 for the 2.0 branch.




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13814832#comment-13814832
 ] 

Piotr Kołaczkowski commented on CASSANDRA-6268:
---

There was a bug in the patch that caused describe_local_ring to be swapped with 
describe_ring.
Attached fixed patch.

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 6268-2.txt




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-06 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815270#comment-13815270
 ] 

Jonathan Ellis commented on CASSANDRA-6268:
---

Should I be worried that the 2.0 patch is half the size of the other?

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 6268-2.txt, 6268-cassandra-2.0.txt




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-06 Thread Piotr Kołaczkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815340#comment-13815340
 ] 

Piotr Kołaczkowski commented on CASSANDRA-6268:
---

Hmm, most of it is changes generated by Thrift. But let me double-check that. 
Maybe I messed something up.

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 6268-2.txt, 6268-cassandra-2.0.txt




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-06 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815352#comment-13815352
 ] 

Jeremiah Jordan commented on CASSANDRA-6268:


Double-check that you are using the right version of Thrift for each one.  A 
bunch of the stuff in the original patch looks like formatting changes.

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 6268-2.txt, 6268-cassandra-2.0.txt




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-04 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13813224#comment-13813224
 ] 

Jeremiah Jordan commented on CASSANDRA-6268:


Something else here, getLocation in the record readers doesn't take DC into 
account:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java#L187
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/hadoop/cql3/CqlPagingRecordReader.java#L209

So even after we get the splits right, we need to make sure the record reader 
connects to the local node.
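
A rough sketch of what a DC-aware choice of endpoint could look like: prefer 
the task's own host, then any replica in the same DC. This is not the code 
at the lines linked above; pick_location, dc_of and local_dc are 
hypothetical names, and how the reader would learn each endpoint's DC is 
left open.

```python
from typing import Dict, List, Optional


def pick_location(split_endpoints: List[str], local_host: str,
                  dc_of: Dict[str, str], local_dc: str) -> Optional[str]:
    """Pick an endpoint for a split, preferring local host, then local DC."""
    # 1. A replica on the machine running the task is the best choice.
    if local_host in split_endpoints:
        return local_host
    # 2. Otherwise take the first replica that lives in the local DC.
    for endpoint in split_endpoints:
        if dc_of.get(endpoint) == local_dc:
            return endpoint
    # 3. No local-DC replica: let the caller decide (fail, or go remote).
    return None


# Hypothetical topology: two nodes in DC1, one in DC2.
dc_of = {"10.0.0.1": "DC1", "10.0.0.2": "DC1", "10.1.0.1": "DC2"}
chosen = pick_location(["10.1.0.1", "10.0.0.2"], "10.0.0.1", dc_of, "DC1")
```

Here the task on 10.0.0.1 holds no replica itself, so the sketch picks 
10.0.0.2 from its own DC rather than the DC2 node listed first.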

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 6268.txt




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-02 Thread Piotr Kołaczkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811914#comment-13811914
 ] 

Piotr Kołaczkowski commented on CASSANDRA-6268:
---

[~alexliu68] Can you provide steps to reproduce? I tested it with another DC 
configured with VNodes and didn't observe what you describe.

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 6268.txt




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-11-01 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811788#comment-13811788
 ] 

Alex Liu commented on CASSANDRA-6268:
-

We may need to merge the ranges; there are still many small ranges returned 
by describe_local_ring if another DC is configured with vnodes.

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 6268.txt




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-10-31 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13810571#comment-13810571
 ] 

Jonathan Ellis commented on CASSANDRA-6268:
---

LGTM.  Please provide a patch against 2.0 as well.  (Note that the Thrift 
version changes from 0.7 to 0.9.1.)

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 6268.txt




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-10-30 Thread Piotr Kołaczkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13809224#comment-13809224
 ] 

Piotr Kołaczkowski commented on CASSANDRA-6268:
---

Attached a new patch.

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 6268.txt




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-10-29 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808368#comment-13808368
 ] 

Jonathan Ellis commented on CASSANDRA-6268:
---

We closed CASSANDRA-6124 in favor of adding LOCAL_ONE; I can't think of a case 
where you'd want to span more than one DC but less than all.

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 
 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch


 Some customers are complaining about huge number of splits in Hadoop caused 
 by VNodes. Disabling vnodes only in Hadoop DC does not fix it, because splits 
 are generated from the results of describe_ring, which returns a huge number 
 of ranges. 
 The proposed fix:
 - allows for specifying the DCs the Hadoop job should be run
 - merges the consecutive ranges before generating Hadoop splits, so we don't 
 have artificial range splitting caused by vnodes in the other DCs





[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-10-29 Thread Piotr Kołaczkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808381#comment-13808381
 ] 

Piotr Kołaczkowski commented on CASSANDRA-6268:
---

I know, but LOCAL_ONE does not fix the problem. The DC restriction is there so 
that we don't go to the DCs with vnodes enabled, as their ranges are split. 
If I could get the name of the local DC at the place where splits are 
generated, then I could do without this setting.

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 
 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-10-29 Thread Jeremy Hanna (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808383#comment-13808383
 ] 

Jeremy Hanna commented on CASSANDRA-6268:
-

The problem is that describe ring will still include ranges from all 
datacenters.  So even isolating the queries to a single non-vnode DC, you can 
still get crazy split sizes based on the ranges from the other datacenters.
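
To make the effect concrete, here is a small sketch under the assumption 
that the ring yields one range per token, whichever DC owns the token. The 
ring_ranges helper and the token values are made up for illustration.

```python
from typing import List, Tuple


def ring_ranges(tokens: List[int]) -> List[Tuple[int, int]]:
    """Return the (start, end] ranges between consecutive ring tokens."""
    ordered = sorted(tokens)
    # Index -1 on the first step gives the wrap-around range.
    return [(ordered[i - 1], ordered[i]) for i in range(len(ordered))]


# Hypothetical cluster: DC1 has 3 nodes with a single token each,
# DC2 has 2 nodes with 8 vnodes each (16 tokens in total).
dc1_tokens = [0, 1000, 2000]
dc2_tokens = [37 * k + 5 for k in range(16)]

dc1_only = ring_ranges(dc1_tokens)
whole_ring = ring_ranges(dc1_tokens + dc2_tokens)
```

Looked at alone, DC1 would give 3 ranges; with the DC2 vnode tokens in the 
same ring there are 19, and a job restricted to DC1 still sees all of them.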

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 
 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-10-29 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808396#comment-13808396
 ] 

Jonathan Ellis commented on CASSANDRA-6268:
---

So wouldn't the right fix be to address that root cause?

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 
 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-10-29 Thread Piotr Kołaczkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808412#comment-13808412
 ] 

Piotr Kołaczkowski commented on CASSANDRA-6268:
---

Right, but where do I get the DC name from? In order to merge ranges, I need 
to know in which DC. 

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 
 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-10-29 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808433#comment-13808433
 ] 

Brandon Williams commented on CASSANDRA-6268:
-

We could make something like describe_local_ring where it would only take into 
account the node's local dc.
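
One way to picture the idea, as a sketch only: take the ranges as 
describe_ring would return them and keep, for each range, just the endpoints 
in the node's own DC. The local_ring function, the tuple layout and the 
dc_of mapping are assumptions for illustration, not an existing API.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int, List[str]]  # (start token, end token, endpoints)


def local_ring(ranges: List[Range], dc_of: Dict[str, str],
               local_dc: str) -> List[Range]:
    """Keep only the endpoints that live in local_dc for every range."""
    return [
        (start, end, [ep for ep in endpoints if dc_of[ep] == local_dc])
        for start, end, endpoints in ranges
    ]


# Hypothetical ring with replicas in two DCs.
dc_of = {"a1": "DC1", "a2": "DC1", "b1": "DC2", "b2": "DC2"}
ranges = [
    (0, 10, ["a1", "b1"]),
    (10, 20, ["a2", "b1"]),
    (20, 30, ["a1", "b2"]),
]
local = local_ring(ranges, dc_of, "DC1")
```

In this sketch the filter changes which endpoints are listed but not how 
many ranges there are.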

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 
 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-10-29 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808437#comment-13808437
 ] 

Jonathan Ellis commented on CASSANDRA-6268:
---

Right, that's where I was going.

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 
 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-10-29 Thread Piotr Kołaczkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808439#comment-13808439
 ] 

Piotr Kołaczkowski commented on CASSANDRA-6268:
---

I'm fine with it. I can create another patch. 
One more detail: with LOCAL_ONE, is it now guaranteed that a Hadoop job won't 
leave the current DC? Or is using LOCAL_ONE only a user setting?

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 
 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-10-29 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808449#comment-13808449
 ] 

Jonathan Ellis commented on CASSANDRA-6268:
---

bq. with LOCAL_ONE is it guaranteed that Hadoop job won't go out of current DC

Unless you override CL to something else, that's what it means.

 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 
 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch




[jira] [Commented] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

2013-10-29 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808497#comment-13808497
 ] 

Alex Liu commented on CASSANDRA-6268:
-

bq. Are there any valid use cases where we'd like to run an M/R job in more 
than one DC (e.g. all)? describe_local_ring would disable that totally.

The local DC has at least one replica, so running M/R in the local DC is 
sufficient for data locality. describe_local_ring should return a virtual 
ring of the local DC built from those replicas.



 Poor performance of Hadoop if any DC is using VNodes
 

 Key: CASSANDRA-6268
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
 Attachments: 
 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch

