[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)

2015-04-03 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394147#comment-14394147
 ] 

Florian Verhein commented on SPARK-6664:


I guess the other thing is - we can union RDDs, so why not be able to 'undo' 
that?

 Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
 --

 Key: SPARK-6664
 URL: https://issues.apache.org/jira/browse/SPARK-6664
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Florian Verhein

 I can't find this functionality (if I missed something, apologies!), but it 
 would be very useful for evaluating ml models.  
 *Use case example* 
 suppose you have pre-processed web logs for a few months, and now want to 
 split it into a training set (where you train a model to predict some aspect 
 of site accesses, perhaps per user) and an out of time test set (where you 
 evaluate how well your model performs in the future). This example has just a 
 single split, but in general you could want more for cross validation. You 
 may also want to have multiple overlapping intervals.
 *Specification* 
 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
 return n+1 RDDs such that values in the ith RDD are within the (i-1)th and 
 ith boundary.
 2. More complex alternative (but similar under the hood): provide a sequence 
 of possibly overlapping intervals (ordered by the start key of the interval), 
 and return the RDDs containing values within those intervals. 
 *Implementation ideas / notes for 1*
 - The ordered RDDs are likely RangePartitioned (or there should be a simple 
 way to find ranges from partitions in an ordered RDD)
 - Find the partitions containing the boundary, and split them in two.  
 - Construct the new RDDs from the original partitions (and any split ones)
 I suspect this could be done by launching only a few jobs to split the 
 partitions containing the boundaries. 
 Alternatively, it might be possible to decorate these partitions and use them 
 in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
 Apply two decorators p' and p'', where p' masks out values above the ith 
 boundary, and p'' masks out values below the ith boundary. Any operations on 
 these partitions apply only to values not masked out. Then assign p' to the 
 ith output RDD and p'' to the (i+1)th output RDD.
 If I understand Spark correctly, this should not require any jobs. Not sure 
 whether it's worth trying this optimisation.
 *Implementation ideas / notes for 2*
 This is very similar, except that we have to handle entire partitions (or 
 parts of them) belonging to more than one output RDD, since they are no longer 
 mutually exclusive. But since RDDs are immutable(??), the decorator idea 
 should still work?
 Thoughts?
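 A minimal sketch of the single-split use case with the existing API, assuming an 
 RDD keyed by event time (the names logsByTime, LogLine and cutoff are made up for 
 illustration); the point of the proposal is to get the same result without 
 filtering the whole RDD once per output:
{code:scala}
// Hypothetical names: logsByTime is an RDD[(Long, LogLine)] keyed by event time,
// and cutoff is the key separating in-time data from out-of-time data.
val training      = logsByTime.filter { case (t, _) => t <  cutoff }
val outOfTimeTest = logsByTime.filter { case (t, _) => t >= cutoff }
{code}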






[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)

2015-04-03 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394141#comment-14394141
 ] 

Florian Verhein commented on SPARK-6664:


Thanks [~sowen]. I disagree :-) 

...If you think there's non-stationarity you most certainly want to see how 
well a model trained in the past holds up in the future (possibly with more 
than one out of time sample if one is used for pruning, etc), and you can do 
this for temporal data by adjusting the way you do cross validation... 
actually, the exact method you describe is one common approach in time series 
data, e.g. see 
http://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection
Doing this multiple times does exactly what it does for normal cross-validation 
- gives you a distribution of your error estimate, rather than a single value 
(a sample of it). So it's quite important. The size of the data isn't really 
relevant to this argument (also consider that I might like to employ larger 
datasets to remove the risk of overfitting a more complex but better fitting 
model, rather than to improve my error estimates). 

Note that this proposal doesn't define how the split RDDs are used (i.e. 
unioned) to create training sets and test sets. So the test set can be a single 
RDD, or multiple ones. It's entirely up to the user.

Allowing overlapping partitions (i.e. part 2) is a little different, because 
you probably wouldn't union the resulting RDDs due to duplication. It would be 
more useful as a primitive for bootstrapping the performance measures of 
streaming models or simulations (so, you're not resampling records, but 
resampling subsequences). 
Alternatively if you have big data but a class imbalance problem, you might 
need to resort to overlaps in the training sets to get multiple test sets with 
enough examples of your minority class.

From what I understand MLUtils.kFold is standard randomised k-fold cross 
validation *but without shuffling* (from a cursory look at the code, it looks 
like ordering will always be maintained... which should probably be documented 
if it is the case because it can lead to bad things... and adds another 
argument for #6665). Either way, since elements of its splits are 
non-consecutive, it's not applicable for time series. 

Do you know how the performance of filterByRange would compare? It should be 
pretty performant if and only if the data is RangePartitioned right? 
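A rough sketch of that intuition, with made-up names (logsByTime, windowStart, 
trainEnd): sortByKey attaches a RangePartitioner, which is what lets filterByRange 
prune partitions rather than scan everything.
{code:scala}
import org.apache.spark.RangePartitioner

// logsByTime: RDD[(Long, LogLine)] -- hypothetical input keyed by event time.
val sorted = logsByTime.sortByKey()
// After sortByKey the RDD carries a RangePartitioner...
val rangePartitioned = sorted.partitioner.exists(_.isInstanceOf[RangePartitioner[_, _]])
// ...so filterByRange can skip whole partitions; without one it filters every partition.
val trainingSet = sorted.filterByRange(windowStart, trainEnd)
{code}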


 Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
 --

 Key: SPARK-6664
 URL: https://issues.apache.org/jira/browse/SPARK-6664
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Florian Verhein

 I can't find this functionality (if I missed something, apologies!), but it 
 would be very useful for evaluating ml models.  
 *Use case example* 
 suppose you have pre-processed web logs for a few months, and now want to 
 split it into a training set (where you train a model to predict some aspect 
 of site accesses, perhaps per user) and an out of time test set (where you 
 evaluate how well your model performs in the future). This example has just a 
 single split, but in general you could want more for cross validation. You 
 may also want to have multiple overlapping intervals.
 *Specification* 
 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
 return n+1 RDDs such that values in the ith RDD are within the (i-1)th and 
 ith boundary.
 2. More complex alternative (but similar under the hood): provide a sequence 
 of possibly overlapping intervals (ordered by the start key of the interval), 
 and return the RDDs containing values within those intervals. 
 *Implementation ideas / notes for 1*
 - The ordered RDDs are likely RangePartitioned (or there should be a simple 
 way to find ranges from partitions in an ordered RDD)
 - Find the partitions containing the boundary, and split them in two.  
 - Construct the new RDDs from the original partitions (and any split ones)
 I suspect this could be done by launching only a few jobs to split the 
 partitions containing the boundaries. 
 Alternatively, it might be possible to decorate these partitions and use them 
 in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
 Apply two decorators p' and p'', where p' masks out values above the ith 
 boundary, and p'' masks out values below the ith boundary. Any operations on 
 these partitions apply only to values not masked out. Then assign p' to the 
 ith output RDD and p'' to the (i+1)th output RDD.
 If I understand Spark correctly, this should not require any jobs. Not sure 
 whether it's worth trying this optimisation.
 *Implementation ideas / notes for 2*
 

[jira] [Commented] (SPARK-6665) Randomly Shuffle an RDD

2015-04-03 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394291#comment-14394291
 ] 

Florian Verhein commented on SPARK-6665:


Fair enough. I'll have to implement it because I need it, so I may as well report 
back when I've had the chance (perhaps there's a better place for it, e.g. not in 
the core API). 


 Randomly Shuffle an RDD 
 

 Key: SPARK-6665
 URL: https://issues.apache.org/jira/browse/SPARK-6665
 Project: Spark
  Issue Type: New Feature
  Components: Spark Shell
Reporter: Florian Verhein
Priority: Minor

 *Use case* 
 RDD created in a way that has some ordering, but you need to shuffle it 
 because the ordering would cause problems downstream. E.g.
 - will be used to train an ML algorithm that makes stochastic assumptions 
 (like SGD) 
 - used as input for cross validation, e.g. after the shuffle, you could just 
 grab partitions (or part files if saved to HDFS) as folds
 Related question in mailing list: 
 http://apache-spark-user-list.1001560.n3.nabble.com/random-shuffle-streaming-RDDs-td17965.html
 *Possible implementation*
 As mentioned by [~sowen] in the above thread, we could sort by (a good hash of 
 (the element (or key if it's paired) and a random salt)). 
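 A minimal sketch of that idea (a hypothetical helper, not an agreed API): impose a 
 pseudo-random order by sorting on a salted hash of each element.
{code:scala}
import scala.util.Random
import scala.util.hashing.MurmurHash3
import org.apache.spark.rdd.RDD

// Hypothetical helper: sort by a hash of (element, seed). The seed makes repeated
// shuffles of the same RDD come out in different orders; hash ties just mean an
// arbitrary relative order for those elements, which is fine for shuffling.
def randomShuffle[T](rdd: RDD[T], seed: Long = Random.nextLong()): RDD[T] =
  rdd.sortBy(x => MurmurHash3.productHash((x, seed)))
{code}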






[jira] [Commented] (SPARK-6665) Randomly Shuffle an RDD

2015-04-02 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394089#comment-14394089
 ] 

Florian Verhein commented on SPARK-6665:


Thanks for the quick response [~sowen].

Agree with your observation, but consider a) distributing the cross validation 
itself (so one job will achieve all the training and scoring on the k fold 
selections) and b) using the pre-processed and shuffled data for non-spark 
modelling, such as in R or python or vowpal wabbit (perhaps all running within 
Spark jobs, using something like sc.parallelize(jobs, jobs.size).map(_()) to 
treat spark as a grid). So if the splits already exist on hdfs it is very easy 
to use them -- and since you can control the number of partitions easily this 
gives a very simple way to quickly get something up and running in R or python, 
even if the data is big. But this is really just a nice data science hacking 
side effect of this feature, rather than a driving use case. I don't really 
agree that taking random subsamples is better because you run the risk of never 
selecting some instances.
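As an aside, a small sketch of that "treat Spark as a grid" pattern; folds and 
fitAndScore are invented stand-ins, and each task simply runs one closure:
{code:scala}
// Hypothetical: folds is a Seq of pre-built train/test splits and fitAndScore
// trains a (possibly non-Spark) model on one fold and returns its score.
val jobs: Seq[() => Double] = folds.map(fold => () => fitAndScore(fold))
val scores = sc.parallelize(jobs, jobs.size).map(job => job()).collect()
{code}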

Agree that the most important use case is random order for subsequent serial 
access (but disagree it's limited to small RDDs). For example, if you use spark 
for pre-processing followed by a large scale learner like vowpal wabbit (note 
that vw has features that mllib SGD doesn't have yet), the data should be 
shuffled since vw processes out of core, so cannot perform the randomisation 
itself through order selection (and it would really slow down the algorithm if 
it did).

It's worth pointing out that shuffling a dataset is a common enough operation 
for it to exist in other big data frameworks - e.g. I've used it in ML 
pipelines written in Scoobi and Scalding. 
I haven't implemented it myself, but I'm pretty sure it's non-trivial to make 
it performant and have good randomness properties. 
So I think there's a good case to add it. 

 Randomly Shuffle an RDD 
 

 Key: SPARK-6665
 URL: https://issues.apache.org/jira/browse/SPARK-6665
 Project: Spark
  Issue Type: New Feature
  Components: Spark Shell
Reporter: Florian Verhein
Priority: Minor

 *Use case* 
 RDD created in a way that has some ordering, but you need to shuffle it 
 because the ordering would cause problems downstream. E.g.
 - will be used to train an ML algorithm that makes stochastic assumptions 
 (like SGD) 
 - used as input for cross validation, e.g. after the shuffle, you could just 
 grab partitions (or part files if saved to HDFS) as folds
 Related question in mailing list: 
 http://apache-spark-user-list.1001560.n3.nabble.com/random-shuffle-streaming-RDDs-td17965.html
 *Possible implementation*
 As mentioned by [~sowen] in the above thread, we could sort by (a good hash of 
 (the element (or key if it's paired) and a random salt)). 






[jira] [Updated] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)

2015-04-01 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-6664:
---
Description: 
I can't find this functionality (if I missed something, apologies!), but it 
would be very useful for evaluating ml models.  

*Use case example* 
suppose you have pre-processed web logs for a few months, and now want to split 
it into a training set (where you train a model to predict some aspect of site 
accesses, perhaps per user) and an out of time test set (where you evaluate how 
well your model performs in the future). This example has just a single split, 
but in general you could want more for cross validation. You may also want to 
have multiple overlapping intervals.

*Specification* 

1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
return n+1 RDDs such that values in the ith RDD are within the (i-1)th and ith 
boundary.

2. More complex alternative (but similar under the hood): provide a sequence of 
possibly overlapping intervals (ordered by the start key of the interval), and 
return the RDDs containing values within those intervals. 

*Implementation ideas / notes for 1*

- The ordered RDDs are likely RangePartitioned (or there should be a simple way 
to find ranges from partitions in an ordered RDD)
- Find the partitions containing the boundary, and split them in two.  
- Construct the new RDDs from the original partitions (and any split ones)

I suspect this could be done by launching only a few jobs to split the 
partitions containing the boundaries. 
Alternatively, it might be possible to decorate these partitions and use them 
in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
Apply two decorators p' and p'', where p' masks out values above the ith 
boundary, and p'' masks out values below the ith boundary. Any operations on 
these partitions apply only to values not masked out. Then assign p' to the ith 
output RDD and p'' to the (i+1)th output RDD.
If I understand Spark correctly, this should not require any jobs. Not sure 
whether it's worth trying this optimisation.

*Implementation ideas / notes for 2*
This is very similar, except that we have to handle entire partitions (or 
parts of them) belonging to more than one output RDD, since they are no longer 
mutually exclusive. But since RDDs are immutable(??), the decorator idea should 
still work?

Thoughts?


  was:

I can't find this functionality (if I missed something, apologies!), but it 
would be very useful for evaluating ml models.  

Use case example: 
suppose you have pre-processed web logs for a few months, and now want to split 
it into a training set (where you train a model to predict some aspect of site 
accesses, perhaps per user) and an out of time test set (where you evaluate how 
well your model performs in the future). This example has just a single split, 
but in general you could want more for cross validation. You may also want to 
have multiple overlapping intervals.

Specification: 

1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
return n+1 RDDs such that values in the ith RDD are within the (i-1)th and ith 
boundary.

2. More complex alternative (but similar under the hood): provide a sequence of 
possibly overlapping intervals, and return the RDDs containing values within 
those intervals. 

Implementation ideas / notes for 1:

- The ordered RDDs are likely RangePartitioned (or there should be a simple way 
to find ranges from partitions in an ordered RDD)
- Find the partitions containing the boundary, and split them in two.  
- Construct the new RDDs from the original partitions (and any split ones)

I suspect this could be done by launching only a few jobs to split the 
partitions containing the boundaries. 
Alternatively, it might be possible to decorate these partitions and use them 
in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
Apply two decorators p' and p'', where p' masks out values above the ith 
boundary, and p'' masks out values below the ith boundary. Any operations on 
these partitions apply only to values not masked out. Then assign p' to the ith 
output RDD and p'' to the (i+1)th output RDD.
If I understand Spark correctly, this should not require any jobs. Not sure 
whether it's worth trying this optimisation.

Implementation ideas / notes for 2:
This is very similar, except that we have to handle entire partitions (or 
parts of them) belonging to more than one output RDD, since they are no longer 
mutually exclusive. But since RDDs are immutable(?), the decorator idea should 
still work?

Thoughts?



 Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
 --

 Key: SPARK-6664
 URL: https://issues.apache.org/jira/browse/SPARK-6664
   

[jira] [Created] (SPARK-6665) Randomly Shuffle an RDD

2015-04-01 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-6665:
--

 Summary: Randomly Shuffle an RDD 
 Key: SPARK-6665
 URL: https://issues.apache.org/jira/browse/SPARK-6665
 Project: Spark
  Issue Type: New Feature
  Components: Spark Shell
Reporter: Florian Verhein
Priority: Minor


*Use case* 
RDD created in a way that has some ordering, but you need to shuffle it because 
the ordering would cause problems downstream. E.g.
- will be used to train an ML algorithm that makes stochastic assumptions (like 
SGD) 
- used as input for cross validation, e.g. after the shuffle, you could just 
grab partitions (or part files if saved to HDFS) as folds

Related question in mailing list: 
http://apache-spark-user-list.1001560.n3.nabble.com/random-shuffle-streaming-RDDs-td17965.html

*Possible implementation*
As mentioned by [~sowen] in the above thread, we could sort by (a good hash of 
(the element (or key if it's paired) and a random salt)). 







[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)

2015-04-01 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391950#comment-14391950
 ] 

Florian Verhein commented on SPARK-6664:


The closest approach I've found that should achieve the same result is calling 
OrderedRDDFunctions.filterByRange n+1 times. I assume this approach will be 
much slower, but it may not be if it's completely lazy (??). I don't know 
Spark well enough yet to be anywhere near sure of this.
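For reference, a minimal sketch of that approach: a hypothetical splitByBoundaries 
helper built on filterByRange, with caller-supplied minKey/maxKey sentinels (an 
assumption here) so the first and last slices can be expressed as closed ranges.
{code:scala}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper: n ordered boundary keys -> n+1 RDDs, by calling
// filterByRange once per interval on an RDD that is already sorted by key.
// Note filterByRange is inclusive at both ends, so each boundary key lands in
// both adjacent slices here; the proposal leaves that semantics open.
def splitByBoundaries[K: Ordering: ClassTag, V: ClassTag](
    sorted: RDD[(K, V)],
    boundaries: Seq[K],
    minKey: K,
    maxKey: K): Seq[RDD[(K, V)]] = {
  val edges = (minKey +: boundaries) :+ maxKey
  edges.zip(edges.tail).map { case (lo, hi) => sorted.filterByRange(lo, hi) }
}
{code}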

 Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
 --

 Key: SPARK-6664
 URL: https://issues.apache.org/jira/browse/SPARK-6664
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Florian Verhein

 I can't find this functionality (if I missed something, apologies!), but it 
 would be very useful for evaluating ml models.  
 *Use case example* 
 suppose you have pre-processed web logs for a few months, and now want to 
 split it into a training set (where you train a model to predict some aspect 
 of site accesses, perhaps per user) and an out of time test set (where you 
 evaluate how well your model performs in the future). This example has just a 
 single split, but in general you could want more for cross validation. You 
 may also want to have multiple overlapping intervals.
 *Specification* 
 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
 return n+1 RDDs such that values in the ith RDD are within the (i-1)th and 
 ith boundary.
 2. More complex alternative (but similar under the hood): provide a sequence 
 of possibly overlapping intervals (ordered by the start key of the interval), 
 and return the RDDs containing values within those intervals. 
 *Implementation ideas / notes for 1*
 - The ordered RDDs are likely RangePartitioned (or there should be a simple 
 way to find ranges from partitions in an ordered RDD)
 - Find the partitions containing the boundary, and split them in two.  
 - Construct the new RDDs from the original partitions (and any split ones)
 I suspect this could be done by launching only a few jobs to split the 
 partitions containing the boundaries. 
 Alternatively, it might be possible to decorate these partitions and use them 
 in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
 Apply two decorators p' and p'', where p' masks out values above the ith 
 boundary, and p'' masks out values below the ith boundary. Any operations on 
 these partitions apply only to values not masked out. Then assign p' to the 
 ith output RDD and p'' to the (i+1)th output RDD.
 If I understand Spark correctly, this should not require any jobs. Not sure 
 whether it's worth trying this optimisation.
 *Implementation ideas / notes for 2*
 This is very similar, except that we have to handle entire partitions (or 
 parts of them) belonging to more than one output RDD, since they are no longer 
 mutually exclusive. But since RDDs are immutable(??), the decorator idea 
 should still work?
 Thoughts?






[jira] [Created] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)

2015-04-01 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-6664:
--

 Summary: Split Ordered RDD into multiple RDDs by keys (boundaries 
or intervals)
 Key: SPARK-6664
 URL: https://issues.apache.org/jira/browse/SPARK-6664
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Florian Verhein



I can't find this functionality (if I missed something, apologies!), but it 
would be very useful for evaluating ml models.  

Use case example: 
suppose you have pre-processed web logs for a few months, and now want to split 
it into a training set (where you train a model to predict some aspect of site 
accesses, perhaps per user) and an out of time test set (where you evaluate how 
well your model performs in the future). This example has just a single split, 
but in general you could want more for cross validation. You may also want to 
have multiple overlapping intervals.

Specification: 

1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
return n+1 RDDs such that values in the ith RDD are within the (i-1)th and ith 
boundary.

2. More complex alternative (but similar under the hood): provide a sequence of 
possibly overlapping intervals, and return the RDDs containing values within 
those intervals. 

Implementation ideas / notes for 1:

- The ordered RDDs are likely RangePartitioned (or there should be a simple way 
to find ranges from partitions in an ordered RDD)
- Find the partitions containing the boundary, and split them in two.  
- Construct the new RDDs from the original partitions (and any split ones)

I suspect this could be done by launching only a few jobs to split the 
partitions containing the boundaries. 
Alternatively, it might be possible to decorate these partitions and use them 
in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
Apply two decorators p' and p'', where p' masks out values above the ith 
boundary, and p'' masks out values below the ith boundary. Any operations on 
these partitions apply only to values not masked out. Then assign p' to the ith 
output RDD and p'' to the (i+1)th output RDD.
If I understand Spark correctly, this should not require any jobs. Not sure 
whether it's worth trying this optimisation.

Implementation ideas / notes for 2:
This is very similar, except that we have to handle entire partitions (or 
parts of them) belonging to more than one output RDD, since they are no longer 
mutually exclusive. But since RDDs are immutable(?), the decorator idea should 
still work?

Thoughts?







[jira] [Updated] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2

2015-03-29 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-6601:
---
Description: 
Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.

Note: For nfs to be available outside AWS, also requires [#6600]

  was:

Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.

Note: For nfs to be available outside AWS, also requires #6600


 Add HDFS NFS gateway module to spark-ec2
 

 Key: SPARK-6601
 URL: https://issues.apache.org/jira/browse/SPARK-6601
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
 ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.
 Note: For nfs to be available outside AWS, also requires [#6600]






[jira] [Created] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2

2015-03-29 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-6601:
--

 Summary: Add HDFS NFS gateway module to spark-ec2
 Key: SPARK-6601
 URL: https://issues.apache.org/jira/browse/SPARK-6601
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein



Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.

Note: For nfs to be available outside AWS, also requires #6600






[jira] [Updated] (SPARK-6600) Open ports in spark-ec2.py to allow HDFS NFS gateway

2015-03-29 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-6600:
---
Description: 
Use case: User has set up the hadoop hdfs nfs gateway service on their 
spark-ec2.py launched cluster, and wants to mount that on their local machine. 

Requires the following ports to be opened on incoming rule set for MASTER for 
both UDP and TCP: 111, 2049, 4242.
(I have tried this and it works)

Note that this issue *does not* cover the implementation of a hdfs nfs gateway 
module in the spark-ec2 project. That should be a separate issue (TODO).  

Reference:
https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html

  was:

Use case: User has set up the hadoop hdfs nfs gateway service on their 
spark-ec2.py launched cluster, and wants to mount that on their local machine. 

Requires the following ports to be opened on incoming rule set for MASTER for 
both UDP and TCP: 111, 2049, 4242.
(I have tried this and it works)

Note that this issue *does not* cover the implementation of a hdfs nfs gateway 
module in the spark-ec2 project. That should be a separate issue (TODO).  

Reference:
https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html


 Open ports in spark-ec2.py to allow HDFS NFS gateway  
 --

 Key: SPARK-6600
 URL: https://issues.apache.org/jira/browse/SPARK-6600
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Use case: User has set up the hadoop hdfs nfs gateway service on their 
 spark-ec2.py launched cluster, and wants to mount that on their local 
 machine. 
 Requires the following ports to be opened on incoming rule set for MASTER for 
 both UDP and TCP: 111, 2049, 4242.
 (I have tried this and it works)
 Note that this issue *does not* cover the implementation of a hdfs nfs 
 gateway module in the spark-ec2 project. That should be a separate issue 
 (TODO).  
 Reference:
 https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html






[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway

2015-03-29 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-6600:
---
Summary: Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway  (was: 
Open ports in spark-ec2.py to allow HDFS NFS gateway)

 Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway  
 --

 Key: SPARK-6600
 URL: https://issues.apache.org/jira/browse/SPARK-6600
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Use case: User has set up the hadoop hdfs nfs gateway service on their 
 spark-ec2.py launched cluster, and wants to mount that on their local 
 machine. 
 Requires the following ports to be opened on incoming rule set for MASTER for 
 both UDP and TCP: 111, 2049, 4242.
 (I have tried this and it works)
 Note that this issue *does not* cover the implementation of a hdfs nfs 
 gateway module in the spark-ec2 project. That should be a separate issue 
 (TODO).  
 Reference:
 https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html






[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway

2015-03-29 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-6600:
---
Description: 
Use case: User has set up the hadoop hdfs nfs gateway service on their 
spark_ec2.py launched cluster, and wants to mount that on their local machine. 

Requires the following ports to be opened on incoming rule set for MASTER for 
both UDP and TCP: 111, 2049, 4242.
(I have tried this and it works)

Note that this issue *does not* cover the implementation of a hdfs nfs gateway 
module in the spark-ec2 project. See [#6601] for this.  

Reference:
https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html

  was:
Use case: User has set up the hadoop hdfs nfs gateway service on their 
spark_ec2.py launched cluster, and wants to mount that on their local machine. 

Requires the following ports to be opened on incoming rule set for MASTER for 
both UDP and TCP: 111, 2049, 4242.
(I have tried this and it works)

Note that this issue *does not* cover the implementation of a hdfs nfs gateway 
module in the spark-ec2 project. That should be a separate issue (TODO).  

Reference:
https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html


 Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway  
 --

 Key: SPARK-6600
 URL: https://issues.apache.org/jira/browse/SPARK-6600
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Use case: User has set up the hadoop hdfs nfs gateway service on their 
 spark_ec2.py launched cluster, and wants to mount that on their local 
 machine. 
 Requires the following ports to be opened on incoming rule set for MASTER for 
 both UDP and TCP: 111, 2049, 4242.
 (I have tried this and it works)
 Note that this issue *does not* cover the implementation of a hdfs nfs 
 gateway module in the spark-ec2 project. See [#6601] for this.  
 Reference:
 https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html






[jira] [Updated] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2

2015-03-29 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-6601:
---
Description: 
Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.

Note: For nfs to be available outside AWS, also requires #6600

  was:
Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.

Note: For nfs to be available outside AWS, also requires [#6600]


 Add HDFS NFS gateway module to spark-ec2
 

 Key: SPARK-6601
 URL: https://issues.apache.org/jira/browse/SPARK-6601
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
 ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.
 Note: For nfs to be available outside AWS, also requires #6600






[jira] [Commented] (SPARK-5879) spary_ec2.py should expose/return master and slave lists (e.g. write to file)

2015-02-19 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328612#comment-14328612
 ] 

Florian Verhein commented on SPARK-5879:


cc [~shivaram], any opinions on how to best do this?

 spary_ec2.py should expose/return master and slave lists (e.g. write to file)
 -

 Key: SPARK-5879
 URL: https://issues.apache.org/jira/browse/SPARK-5879
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Florian Verhein

 After running spark_ec2.py, it is often useful/necessary to know the master's 
 IP / DNS name, particularly if running spark_ec2.py is part of a larger pipeline.
 For example, consider a wrapper that launches a cluster, then waits for 
 completion of some application running on it (e.g. polling via ssh), before 
 destroying the cluster.
 Some options: 
 - write `launch-variables.sh` with MASTERS and SLAVES exports (i.e. basically 
 a subset of the ec2_variables.sh that is temporarily created as part of 
 deploy_files variable substitution)
 - launch-variables.json (same info but as json) 
 Both would be useful depending on the wrapper language. 
 I think we should incorporate the cluster name for the case that multiple 
 clusters are launched. E.g. cluster_name_variables.sh/.json
 Thoughts?






[jira] [Created] (SPARK-5879) spary_ec2.py should expose/return master and slave lists (e.g. write to file)

2015-02-17 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-5879:
--

 Summary: spary_ec2.py should expose/return master and slave lists 
(e.g. write to file)
 Key: SPARK-5879
 URL: https://issues.apache.org/jira/browse/SPARK-5879
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Florian Verhein



After running spark_ec2.py, it is often useful/necessary to know the master's 
IP / DNS name, particularly if running spark_ec2.py is part of a larger pipeline.

For example, consider a wrapper that launches a cluster, then waits for 
completion of some application running on it (e.g. polling via ssh), before 
destroying the cluster.

Some options: 
- write `launch-variables.sh` with MASTERS and SLAVES exports (i.e. basically a 
subset of the ec2_variables.sh that is temporarily created as part of 
deploy_files variable substitution)
- launch-variables.json (same info but as json) 

Both would be useful depending on the wrapper language. 

I think we should incorporate the cluster name for the case that multiple 
clusters are launched. E.g. cluster_name_variables.sh/.json

Thoughts?







[jira] [Commented] (SPARK-5851) spark_ec2.py ssh failure retry handling not always appropriate

2015-02-17 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324986#comment-14324986
 ] 

Florian Verhein commented on SPARK-5851:


That makes sense.

Yeah, I ran into it yesterday. My spark-ec2/setup.sh failed (I had set -u set in 
a new component I was testing), resulting in looping over setup.sh calls. 
In this case, spark_ec2.py shouldn't retry, but fail gracefully (ideally after 
performing cleanup of the cluster, and returning a failure code).

 spark_ec2.py ssh failure retry handling not always appropriate
 --

 Key: SPARK-5851
 URL: https://issues.apache.org/jira/browse/SPARK-5851
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Florian Verhein
Priority: Minor

 The following function doesn't distinguish between the ssh failing (e.g. 
 presumably a connection issue) and the remote command that it executes 
 failing (e.g. setup.sh). The latter should probably not result in a retry. 
 Perhaps tries could be an argument that is set to 1 for certain usages. 
 # Run a command on a host through ssh, retrying up to five times
 # and then throwing an exception if ssh continues to fail.
 spark-ec2: [{{def ssh(host, opts, 
 command)}}|https://github.com/apache/spark/blob/d8f69cf78862d13a48392a0b94388b8d403523da/ec2/spark_ec2.py#L953-L975]






[jira] [Created] (SPARK-5851) spark_ec2.py ssh failure retry handling not always appropriate

2015-02-16 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-5851:
--

 Summary: spark_ec2.py ssh failure retry handling not always 
appropriate
 Key: SPARK-5851
 URL: https://issues.apache.org/jira/browse/SPARK-5851
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Florian Verhein
Priority: Minor



The following function doesn't distinguish between the ssh failing (e.g. 
presumably a connection issue) and the remote command that it executes failing 
(e.g. setup.sh). The latter should probably not result in a retry. 

Perhaps tries could be an argument that is set to 1 for certain usages. 

# Run a command on a host through ssh, retrying up to five times
# and then throwing an exception if ssh continues to fail.
def ssh(host, opts, command):






[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK

2015-02-16 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14322611#comment-14322611
 ] 

Florian Verhein commented on SPARK-5813:


I think it's a good idea to stick to vendor recommendations, but since I can't 
point to any concrete benefits and there is complexity around handling 
licensing issues, I don't think there's a good argument for tackling this.

 Spark-ec2: Switch to OracleJDK
 --

 Key: SPARK-5813
 URL: https://issues.apache.org/jira/browse/SPARK-5813
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Florian Verhein
Priority: Minor

 Currently using OpenJDK, however it is generally recommended to use Oracle 
 JDK, esp for Hadoop deployments, etc. 






[jira] [Closed] (SPARK-5813) Spark-ec2: Switch to OracleJDK

2015-02-16 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein closed SPARK-5813.
--
Resolution: Won't Fix

 Spark-ec2: Switch to OracleJDK
 --

 Key: SPARK-5813
 URL: https://issues.apache.org/jira/browse/SPARK-5813
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Florian Verhein
Priority: Minor

 Currently using OpenJDK, however it is generally recommended to use Oracle 
 JDK, esp for Hadoop deployments, etc. 






[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK

2015-02-15 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321764#comment-14321764
 ] 

Florian Verhein commented on SPARK-5813:


IANAL, but here are my thoughts:

The user ends up downloading it from Oracle and accepting the license terms in 
that process, so as long as they are (or made) aware then I don't really see a 
problem. It's just providing a mechanism for them to do this. i.e. It's not a 
redistribution issue.
I think a reasonable solution to this would be to have OpenJDK as the default, 
with OracleJDK as an option that the user must specifically request (and the 
option's documentation indicating that this entails acceptance of a license... 
etc)

At least, *the above is true in the case where the user builds their own AMI 
(that's the approach I take since it best suits my requirements). With provided 
AMIs I think this is more complex, because I would assume that is 
redistribution*. I guess that applies to any software that is put on the AMI 
actually... so this may be an issue that needs looking at more generally... 
I don't know how to best approach that case other than adhering to any 
redistribution terms and including these as part of an EULA for spark-ec2/AMIs or 
something? 

But with the work [~nchammas] has done, I suppose the easiest way would be to 
provide the public AMIs with OpenJDK, and add an option to build ones with 
OracleJDK if the user is inclined to do this themselves.
 
Hmmm... is this worthwhile?

 Spark-ec2: Switch to OracleJDK
 --

 Key: SPARK-5813
 URL: https://issues.apache.org/jira/browse/SPARK-5813
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Florian Verhein
Priority: Minor

 Currently using OpenJDK, however it is generally recommended to use Oracle 
 JDK, esp for Hadoop deployments, etc. 






[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK

2015-02-15 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14322208#comment-14322208
 ] 

Florian Verhein commented on SPARK-5813:


Good point. I think you're right re: scripting away - I understand it's 
sometimes done by sysadmins/ops to automate their installation processes 
in-house, but that is a different situation. Thanks for that. 

spark_ec2 works by looking up an existing ami and using it to instantiate ec2 
instances. I don't know who currently maintains these. 



 Spark-ec2: Switch to OracleJDK
 --

 Key: SPARK-5813
 URL: https://issues.apache.org/jira/browse/SPARK-5813
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Florian Verhein
Priority: Minor

 Currently using OpenJDK, however it is generally recommended to use Oracle 
 JDK, esp for Hadoop deployments, etc. 






[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK

2015-02-14 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321748#comment-14321748
 ] 

Florian Verhein commented on SPARK-5813:


No specific technical reason esp WRT Spark... It's more of an attempt to keep 
in line with recommendations for Hadoop in production (relevant since hadoop is 
included in spark-ec2 - and cdh seems to be favoured). For example, CDH 
supports OracleJDK, Horton didn't support OpenJDK before 1.7 and OracleJDK 
still seems to be the favoured choice in production deployments, e.g. 
http://wiki.apache.org/hadoop/HadoopJavaVersions. 

I don't have first-hand data about how they compare performance-wise. I've heard 
OracleJDK being preferred for Hadoop on that front, but I also found this 
http://www.slideshare.net/PrincipledTechnologies/big-data-technology-on-red-hat-enterprise-linux-openjdk-vs-oracle-jdk,
 so perhaps performance is less of a reason these days?

Do you know of any performance analysis done with Spark, Tachyon on OpenJDK vs 
OracleJDK?

In terms of difficulty, it's not hard to script installation of OracleJDK. E.g. 
I've gone down the path of supporting both for the above reasons here (link may 
break in future): 
https://github.com/florianverhein/spark-ec2/blob/packer/packer/java-setup.sh

Aside: Based on bugs you mentioned, is there a list somewhere of which JDK 
versions to avoid WRT Spark?

 Spark-ec2: Switch to OracleJDK
 --

 Key: SPARK-5813
 URL: https://issues.apache.org/jira/browse/SPARK-5813
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Florian Verhein
Priority: Minor

 Currently using OpenJDK, however it is generally recommended to use Oracle 
 JDK, esp for Hadoop deployments, etc. 






[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-02-13 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320995#comment-14320995
 ] 

Florian Verhein commented on SPARK-3821:


RE: Java, that reminds me... We should probably be using OracleJDK rather than 
OpenJDK. But I think this should be a separate issue, so just created 
#SPARK-5813.

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.






[jira] [Created] (SPARK-5813) Spark-ec2: Switch to OracleJDK

2015-02-13 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-5813:
--

 Summary: Spark-ec2: Switch to OracleJDK
 Key: SPARK-5813
 URL: https://issues.apache.org/jira/browse/SPARK-5813
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Florian Verhein
Priority: Minor


Currently using OpenJDK, however it is generally recommended to use Oracle JDK, 
esp for Hadoop deployments, etc. 







[jira] [Updated] (SPARK-5641) Allow spark_ec2.py to copy arbitrary files to cluster

2015-02-12 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-5641:
---
Description: 
*Updated - no longer via deploy.generic, no substitutions*

Essentially, give users an easy way to rcp a directory structure to the 
master's / as part of the cluster launch, at a useful point in the workflow 
(before setup.sh is called on the master).

Useful if binary files need to be uploaded. E.g. I use this for rpm transfer to 
install extra stuff at cluster deployment time.

However note that it could also be used to override / add to either:
- what's on the image
- what gets cloned from spark-ec2 (e.g. add new module)


  was:

Useful if binary files need to be uploaded. E.g. I use this for rpm transfer to 
install extra stuff at cluster deployment time.

However note that it could also be used to override either:
- what's on the image
- what gets cloned from spark-ec2 (since deploy_files runs afterwards)

The idea is that the user can just dump the files into ec2/deploy.generic/. 

This can be implemented by modifying deploy_files so that it simply copies the 
file (if it is of certain types), rather than treating it as a text file and 
attempting to replace template variables.

Detecting binary files is non-trivial. So the proposal is to have a list of 
file extensions that will trigger simple file copying.  



 Allow spark_ec2.py to copy arbitrary files to cluster
 -

 Key: SPARK-5641
 URL: https://issues.apache.org/jira/browse/SPARK-5641
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Florian Verhein
Priority: Minor

 *Updated - no longer via deploy.generic, no substitutions*
 Essentially, give users an easy way to rcp a directory structure to the 
 master's / as part of the cluster launch, at a useful point in the workflow 
 (before setup.sh is called on the master).
 Useful if binary files need to be uploaded. E.g. I use this for rpm transfer 
 to install extra stuff at cluster deployment time.
 However note that it could also be used to override / add to either:
 - what's on the image
 - what gets cloned from spark-ec2 (e.g. add new module)






[jira] [Updated] (SPARK-5641) Allow spark_ec2.py to copy arbitrary files to cluster via deploy.generic

2015-02-09 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-5641:
---
Description: 

Useful if binary files need to be uploaded. E.g. I use this for rpm transfer to 
install extra stuff at cluster deployment time.

However note that it could also be used to override either:
- what's on the image
- what gets cloned from spark-ec2 (since deploy_files runs afterwards)

The idea is that the user can just dump the files into ec2/deploy.generic/. 

This can be implemented by modifying deploy_files so that it simply copies the 
file (if it is of certain types), rather than treating it as a text file and 
attempting to replace template variables.

Detecting binary files is non-trivial. So the proposal is to have a list of 
file extensions that will trigger simple file copying.  


  was:

Useful if binary files need to be uploaded. E.g. I use this for rpm transfer to 
install extra stuff at cluster deployment time.

Could also be used to override what's on the image, etc.

The idea is that the user can just dump the files into deploy.generic. 

This can be implemented by modifying deploy_templates so that it simply copies 
the file (if it is of certain types), rather than treating it as a text file 
and replacing template variables. 



 Allow spark_ec2.py to copy arbitrary files to cluster via deploy.generic
 

 Key: SPARK-5641
 URL: https://issues.apache.org/jira/browse/SPARK-5641
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Florian Verhein
Priority: Minor

 Useful if binary files need to be uploaded. E.g. I use this for rpm transfer 
 to install extra stuff at cluster deployment time.
 However note that it could also be used to override either:
 - what's on the image
 - what gets cloned from spark-ec2 (since deploy_files runs afterwards)
 The idea is that the user can just dump the files into ec2/deploy.generic/. 
 This can be implemented by modifying deploy_files so that it simply copies 
 the file (if it is of certain types), rather than treating it as a text file 
 and attempting to replace template variables.
 Detecting binary files is non-trivial. So the proposal is to have a list of 
 file extensions that will trigger simple file copying.  






[jira] [Comment Edited] (SPARK-5676) License missing from spark-ec2 repo

2015-02-09 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313102#comment-14313102
 ] 

Florian Verhein edited comment on SPARK-5676 at 2/9/15 11:06 PM:
-

[~srowen] Yep, that's the one.

True. However it is the key part in providing the functionality of spark 
deployment on EC2, which is documented quite prominently on the Spark site, and 
the entry point of which is in the spark repo (ec2/spark_ec2.py). Bugs against 
this functionality are therefore also filed here under EC2 module.

I assume the decision to have a separate repo was for implementation/design 
reasons ( ?? ). Having spark_ec2.py cause this repo to be cloned and executed 
on EC2 is a really nice way of providing the functionality. But that's an 
assumption on my part and [~shivaram] would know best.

So from a user perspective, it would appear to be part of Spark (users may not 
even be aware that part of the functionality lives in a separate repo).

Since it's a great way to get Spark running on EC2, it would be great to get 
the licensing sorted out. This appears to be the best place to raise this issue.



was (Author: florianverhein):
[~srowen] Yep, that's the one.

True. However it is the key part in providing the functionality of spark 
deployment on EC2, which is documented quite prominently on the Spark site, and 
the entry point of which is in the spark repo (ec2/spark_ec2.py). Bugs against 
this functionality are therefore also filed here under EC2 module.
I assume the decision to have a separate repo was for implementation/design 
reasons ( ?? ). Having spark_ec2.py cause this repo to be cloned and executed 
on EC2 is a really nice way of providing the functionality. But that's an 
assumption on my part and [~shivaram] would know best.

So from a user perspective, it would appear to be part of Spark (users may not 
even be aware that part of the functionality lives in a separate repo).

Since it's a great way to get Spark running on EC2, it would be great to get 
the licencing sorted out. This appears to be the best place to raise this issue.


 License missing from spark-ec2 repo
 ---

 Key: SPARK-5676
 URL: https://issues.apache.org/jira/browse/SPARK-5676
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Florian Verhein

 There is no LICENSE file or licence headers in the code in the spark-ec2 
 repo. Also, I believe there is no contributor license agreement notification 
 in place (like there is in the main spark repo).
 It would be great to fix this (sooner better than later while contributors 
 list is small), so that users wishing to use this part of Spark are not in 
 doubt over licensing issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo

2015-02-09 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313102#comment-14313102
 ] 

Florian Verhein commented on SPARK-5676:


[~srowen] Yep, that's the one.

True. However, it is the key part in providing Spark deployment on EC2, which is documented quite prominently on the Spark site, and whose entry point lives in the Spark repo (ec2/spark_ec2.py). Bugs against this functionality are therefore also filed here under the EC2 component.
I assume the decision to have a separate repo was made for implementation/design reasons(?). Having spark_ec2.py clone this repo and execute it on EC2 is a really nice way of providing the functionality. But that's an assumption on my part, and [~shivaram] would know best.

So from a user's perspective, it would appear to be part of Spark (users may not even be aware that part of the functionality lives in a separate repo).

Since it's a great way to get Spark running on EC2, it would be good to get the licensing sorted out. This appears to be the best place to raise this issue.


 License missing from spark-ec2 repo
 ---

 Key: SPARK-5676
 URL: https://issues.apache.org/jira/browse/SPARK-5676
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Florian Verhein

 There is no LICENSE file or licence headers in the code in the spark-ec2 
 repo. Also, I believe there is no contributor license agreement notification 
 in place (like there is in the main spark repo).
 It would be great to fix this (sooner better than later while contributors 
 list is small), so that users wishing to use this part of Spark are not in 
 doubt over licensing issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo

2015-02-09 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313135#comment-14313135
 ] 

Florian Verhein commented on SPARK-5676:


Makes sense. Thanks.

 License missing from spark-ec2 repo
 ---

 Key: SPARK-5676
 URL: https://issues.apache.org/jira/browse/SPARK-5676
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Florian Verhein

 There is no LICENSE file or licence headers in the code in the spark-ec2 
 repo. Also, I believe there is no contributor license agreement notification 
 in place (like there is in the main spark repo).
 It would be great to fix this (sooner better than later while contributors 
 list is small), so that users wishing to use this part of Spark are not in 
 doubt over licensing issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5676) License missing from spark-ec2 repo

2015-02-08 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-5676:
--

 Summary: License missing from spark-ec2 repo
 Key: SPARK-5676
 URL: https://issues.apache.org/jira/browse/SPARK-5676
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Florian Verhein



There is no LICENSE file or licence headers in the code in the spark-ec2 repo. 
Also, I believe there is no contributor license agreement notification in place 
(like there is in the main spark repo).

It would be great to fix this (sooner better than later while contributors list 
is small), so that users wishing to use this part of Spark are not in doubt 
over licensing issues.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER

2015-02-05 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308644#comment-14308644
 ] 

Florian Verhein commented on SPARK-3185:


[~dvohra] Sure, but the exception is thrown by Tachyon... so you're not going to be able to fix it by changing the Spark build.

 SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting 
 JOURNAL_FOLDER
 ---

 Key: SPARK-3185
 URL: https://issues.apache.org/jira/browse/SPARK-3185
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.0.2
 Environment: Amazon Linux AMI
 [ec2-user@ip-172-30-1-145 ~]$ uname -a
 Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 
 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
 https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/
 The build I used (and MD5 verified):
 [ec2-user@ip-172-30-1-145 ~]$ wget 
 http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
Reporter: Jeremy Chambers

 {code}
 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 {code}
 When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon 
 exception is thrown when Formatting JOURNAL_FOLDER.
 No exception occurs when I launch on Hadoop 1.
 Launch used:
 {code}
 ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk 
 --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch 
 sparkProd
 {code}
 {code}
 log snippet
 Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
 Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
 Exception in thread main java.lang.RuntimeException: 
 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73)
 at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
 at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
 at tachyon.Format.main(Format.java:54)
 Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at org.apache.hadoop.ipc.Client.call(Client.java:1070)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
 at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
 at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69)
 ... 3 more
 Killed 0 processes
 Killed 0 processes
 ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
 ---end snippet---
 {code}
 *I don't have this problem when I launch without the 
 --hadoop-major-version=2 (which defaults to Hadoop 1.x).*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5641) Allow spark_ec2.py to copy arbitrary files to cluster via deploy.generic

2015-02-05 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-5641:
--

 Summary: Allow spark_ec2.py to copy arbitrary files to cluster via 
deploy.generic
 Key: SPARK-5641
 URL: https://issues.apache.org/jira/browse/SPARK-5641
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Florian Verhein
Priority: Minor



Useful if binary files need to be uploaded. E.g. I use this for rpm transfer to 
install extra stuff at cluster deployment time.

Could also be used to override what's on the image, etc.

The idea is that the user can just dump the files into deploy.generic. 

This can be implemented by modifying deploy_templates so that it simply copies 
the file (if it is of certain types), rather than treating it as a text file 
and replacing template variables. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5552) Automated data science AMI creation and data science cluster deployment on EC2

2015-02-03 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304412#comment-14304412
 ] 

Florian Verhein commented on SPARK-5552:


Thanks [~sowen]. 

So it wouldn't fit in the Spark repo itself (the only change there would be to add an option to spark_ec2.py to use an alternate spark-ec2 repo/branch; a sketch of that invocation follows below). It would naturally live in spark-ec2, as it involves changes to spark-ec2 for both use cases:
- Image creation is based on the work soon to be added to spark-ec2 for this: https://issues.apache.org/jira/browse/SPARK-3821
- Cluster deployment and configuration are done using the spark-ec2 scripts themselves (but with many modifications/fixes).

Since there is a dependency between the image and the configuration scripts (init.sh and setup.sh), it's not possible to solve this with just an AMI.

The extra components (actually just Vowpal Wabbit and more Python libraries; the rest already exists in the spark-ec2 AMI) are added to the image purely for data science convenience.
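
For illustration, from the user's side the spark_ec2.py change mentioned above might look like this (the two --spark-ec2-git-* flags are the proposed options, not ones that exist today; the other flags are existing spark-ec2 options):

{code}
# Hypothetical: --spark-ec2-git-repo / --spark-ec2-git-branch are the options
# proposed above, not existing spark_ec2.py flags.
./spark-ec2 --key-pair=my_key --identity-file=my_key.pem \
  --spark-ec2-git-repo=https://github.com/florianverhein/spark-ec2 \
  --spark-ec2-git-branch=packer \
  -s 3 launch my-data-science-cluster
{code}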


 Automated data science AMI creation and data science cluster deployment on EC2
 --

 Key: SPARK-5552
 URL: https://issues.apache.org/jira/browse/SPARK-5552
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Issue created RE: 
 https://github.com/mesos/spark-ec2/pull/90#issuecomment-72597154 (please read 
 for background)
 Goal:
 Extend spark-ec2 scripts to create an automated data science cluster 
 deployment on EC2, suitable for almost(?)-production use.
 Use cases: 
 - A user can build their own custom data science AMIs from a CentOS minimal 
 image by calling a packer configuration (good defaults should be provided, 
 some options for flexibility)
 - A user can then easily deploy a new (correctly configured) cluster using 
 these AMIs, and do so as quickly as possible.
 Components/modules: Spark + tachyon + hdfs (on instance storage) + python + R 
 + vowpal wabbit + any rpms + ... + ganglia
 Focus is on reliability (rather than e.g. supporting many versions / dev 
 testing) and speed of deployment.
 Use hadoop 2 so option to lift into yarn later.
 My current solution is here: 
 https://github.com/florianverhein/spark-ec2/tree/packer. It includes other 
 fixes/improvements as needed to get it working.
 Now that it seems to work (but has deviated a lot more from the existing code 
 base than I was expecting), I'm wondering what to do with it...
 Keen to hear ideas if anyone is interested. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5552) Automated data science AMIs creation and cluster deployment on EC2

2015-02-02 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-5552:
--

 Summary: Automated data science AMIs creation and cluster 
deployment on EC2
 Key: SPARK-5552
 URL: https://issues.apache.org/jira/browse/SPARK-5552
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein


Issue created RE: 
https://github.com/mesos/spark-ec2/pull/90#issuecomment-72597154 (please read 
for background)

Goal:
Extend the spark-ec2 scripts to create an automated data science cluster deployment on EC2, suitable for almost(?)-production use.

Use cases: 
- A user can build their own custom data science AMIs from a CentOS minimal image by calling a Packer configuration (good defaults should be provided, with some options for flexibility); a sketch of such an invocation is given below.
- A user can then easily deploy a new (correctly configured) cluster using these AMIs, and do so as quickly as possible.

Components/modules: Spark + Tachyon + HDFS (on instance storage) + Python + R + Vowpal Wabbit + any RPMs + ... + Ganglia

The focus is on reliability (rather than e.g. supporting many versions / dev testing) and speed of deployment.
Use Hadoop 2 so there is the option to lift into YARN later.

My current solution is here: 
https://github.com/florianverhein/spark-ec2/tree/packer. It includes other 
fixes/improvements as needed to get it working.

Now that it seems to work (but has deviated a lot more from the existing code 
base than I was expecting), I'm wondering what to do with it...

Keen to hear ideas if anyone is interested. 
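
As a rough illustration of the first use case above, the AMI build could be a single Packer invocation along these lines (template and variable names here are hypothetical; the actual configuration lives in the packer branch linked above):

{code}
# Hypothetical template and variable names, for illustration only.
packer validate spark-ami.json
packer build \
  -var 'region=us-east-1' \
  -var 'source_ami=<centos6-minimal-ami-id>' \
  spark-ami.json
{code}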



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5552) Automated data science AMI creation and data science cluster deployment on EC2

2015-02-02 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-5552:
---
Summary: Automated data science AMI creation and data science cluster 
deployment on EC2  (was: Automated data science AMIs creation and cluster 
deployment on EC2)

 Automated data science AMI creation and data science cluster deployment on EC2
 --

 Key: SPARK-5552
 URL: https://issues.apache.org/jira/browse/SPARK-5552
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Issue created RE: 
 https://github.com/mesos/spark-ec2/pull/90#issuecomment-72597154 (please read 
 for background)
 Goal:
 Extend spark-ec2 scripts to create an automated data science cluster 
 deployment on EC2, suitable for almost(?)-production use.
 Use cases: 
 - A user can build their own custom data science AMIs from a CentOS minimal 
 image by calling a packer configuration (good defaults should be provided, 
 some options for flexibility)
 - A user can then easily deploy a new (correctly configured) cluster using 
 these AMIs, and do so as quickly as possible.
 Components/modules: Spark + tachyon + hdfs (on instance storage) + python + R 
 + vowpal wabbit + any rpms + ... + ganglia
 Focus is on reliability (rather than e.g. supporting many versions / dev 
 testing) and speed of deployment.
 Use hadoop 2 so option to lift into yarn later.
 My current solution is here: 
 https://github.com/florianverhein/spark-ec2/tree/packer. It includes other 
 fixes/improvements as needed to get it working.
 Now that it seems to work (but has deviated a lot more from the existing code 
 base than I was expecting), I'm wondering what to do with it...
 Keen to hear ideas if anyone is interested. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER

2015-01-24 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290923#comment-14290923
 ] 

Florian Verhein commented on SPARK-3185:



Sure [~grzegorz-dubicki]. You need to build with the correct version profiles. 
See for example:

https://github.com/florianverhein/spark-ec2/blob/packer/spark/init.sh
https://github.com/florianverhein/spark-ec2/blob/packer/tachyon/init.sh

Note that I'm using Hadoop 2.4.1 (which I install on the image).
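
For reference, the relevant parts of those init.sh scripts boil down to passing the intended Hadoop version to both builds, roughly as follows (treat this as a sketch; the exact profiles and flags depend on the Spark and Tachyon versions):

{code}
# Sketch only: build Spark 1.2.x against Hadoop 2.4.1
mvn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package

# Sketch only: build Tachyon 0.5.0 against the same Hadoop version
mvn -Dhadoop.version=2.4.1 -DskipTests clean package
{code}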


 SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting 
 JOURNAL_FOLDER
 ---

 Key: SPARK-3185
 URL: https://issues.apache.org/jira/browse/SPARK-3185
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Amazon Linux AMI
 [ec2-user@ip-172-30-1-145 ~]$ uname -a
 Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 
 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
 https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/
 The build I used (and MD5 verified):
 [ec2-user@ip-172-30-1-145 ~]$ wget 
 http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
Reporter: Jeremy Chambers

 {code}
 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 {code}
 When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon 
 exception is thrown when Formatting JOURNAL_FOLDER.
 No exception occurs when I launch on Hadoop 1.
 Launch used:
 {code}
 ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk 
 --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch 
 sparkProd
 {code}
 {code}
 log snippet
 Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
 Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
 Exception in thread main java.lang.RuntimeException: 
 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73)
 at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
 at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
 at tachyon.Format.main(Format.java:54)
 Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at org.apache.hadoop.ipc.Client.call(Client.java:1070)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
 at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
 at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69)
 ... 3 more
 Killed 0 processes
 Killed 0 processes
 ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
 ---end snippet---
 {code}
 *I don't have this problem when I launch without the 
 --hadoop-major-version=2 (which defaults to Hadoop 1.x).*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5331) Spark workers can't find tachyon master as spark-ec2 doesn't set spark.tachyonStore.url

2015-01-20 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-5331:
---
Component/s: EC2
Description: 
ps -ef | grep Tachyon 
shows Tachyon running on the master (and the slave) node with correct setting:
-Dtachyon.master.hostname=ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com

However from stderr log on worker running the SparkTachyonPi example:

15/01/20 06:00:56 INFO CacheManager: Partition rdd_0_0 not found, computing it
15/01/20 06:00:56 INFO : Trying to connect master @ localhost/127.0.0.1:19998
15/01/20 06:00:56 ERROR : Failed to connect (1) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:57 ERROR : Failed to connect (2) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:58 ERROR : Failed to connect (3) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:59 ERROR : Failed to connect (4) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:01:00 ERROR : Failed to connect (5) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:01:01 WARN TachyonBlockManager: Attempt 1 to create tachyon dir 
null failed
java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998 
after 5 attempts
at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
at 
org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
at 
org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
at 
org.apache.spark.storage.TachyonBlockManager.init(TachyonBlockManager.scala:57)
at 
org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:94)
at 
org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:88)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:773)
at 
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: tachyon.org.apache.thrift.TException: Failed to connect to master 
localhost/127.0.0.1:19998 after 5 attempts
at tachyon.master.MasterClient.connect(MasterClient.java:178)
at tachyon.client.TachyonFS.connect(TachyonFS.java:290)
... 28 more
Caused by: tachyon.org.apache.thrift.transport.TTransportException: 
java.net.ConnectException: Connection refused
at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at 
tachyon.org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
at tachyon.master.MasterClient.connect(MasterClient.java:156)
... 29 more
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
   

[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER

2015-01-19 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283493#comment-14283493
 ] 

Florian Verhein commented on SPARK-3185:


I built Tachyon with the correct Hadoop version; that fixed this problem for me.
Correction: Spark 1.2.0 uses Tachyon 0.5.0 as far as I can see, but the spark-ec2 config is for Tachyon 0.4.1 (and this causes a few problems when actually trying to use Tachyon).


 SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting 
 JOURNAL_FOLDER
 ---

 Key: SPARK-3185
 URL: https://issues.apache.org/jira/browse/SPARK-3185
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Amazon Linux AMI
 [ec2-user@ip-172-30-1-145 ~]$ uname -a
 Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 
 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
 https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/
 The build I used (and MD5 verified):
 [ec2-user@ip-172-30-1-145 ~]$ wget 
 http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
Reporter: Jeremy Chambers

 {code}
 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 {code}
 When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon 
 exception is thrown when Formatting JOURNAL_FOLDER.
 No exception occurs when I launch on Hadoop 1.
 Launch used:
 {code}
 ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk 
 --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch 
 sparkProd
 {code}
 {code}
 log snippet
 Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
 Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
 Exception in thread main java.lang.RuntimeException: 
 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73)
 at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
 at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
 at tachyon.Format.main(Format.java:54)
 Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at org.apache.hadoop.ipc.Client.call(Client.java:1070)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
 at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
 at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69)
 ... 3 more
 Killed 0 processes
 Killed 0 processes
 ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
 ---end snippet---
 {code}
 *I don't have this problem when I launch without the 
 --hadoop-major-version=2 (which defaults to Hadoop 1.x).*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5331) Tachyon workers seem to ignore tachyon.master.hostname and use localhost instead

2015-01-19 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-5331:
--

 Summary: Tachyon workers seem to ignore tachyon.master.hostname 
and use localhost instead
 Key: SPARK-5331
 URL: https://issues.apache.org/jira/browse/SPARK-5331
 Project: Spark
  Issue Type: Bug
 Environment: Running on EC2 via modified spark-ec2 scripts (to get dependencies right so Tachyon starts)
Using Tachyon 0.5.0 built against Hadoop 2.4.1
Spark 1.2.0 built against Tachyon 0.5.0 and Hadoop 2.4.1
Tachyon configured using the template in 0.5.0, but updated with the slave list, master variables, etc.

Reporter: Florian Verhein



ps -ef | grep Tachyon 
shows Tachyon running on the master (and the slave) node with correct setting:
-Dtachyon.master.hostname=ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com

However from stderr log on worker running the SparkTachyonPi example:

15/01/20 06:00:56 INFO CacheManager: Partition rdd_0_0 not found, computing it
15/01/20 06:00:56 INFO : Trying to connect master @ localhost/127.0.0.1:19998
15/01/20 06:00:56 ERROR : Failed to connect (1) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:57 ERROR : Failed to connect (2) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:58 ERROR : Failed to connect (3) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:59 ERROR : Failed to connect (4) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:01:00 ERROR : Failed to connect (5) to master 
localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:01:01 WARN TachyonBlockManager: Attempt 1 to create tachyon dir 
null failed
java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998 
after 5 attempts
at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
at 
org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
at 
org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
at 
org.apache.spark.storage.TachyonBlockManager.init(TachyonBlockManager.scala:57)
at 
org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:94)
at 
org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:88)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:773)
at 
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: tachyon.org.apache.thrift.TException: Failed to connect to master 
localhost/127.0.0.1:19998 after 5 attempts
at tachyon.master.MasterClient.connect(MasterClient.java:178)
at tachyon.client.TachyonFS.connect(TachyonFS.java:290)
... 28 more
Caused by: tachyon.org.apache.thrift.transport.TTransportException: 
java.net.ConnectException: Connection refused
at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at 
tachyon.org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
at tachyon.master.MasterClient.connect(MasterClient.java:156)
... 29 more
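
The repeated localhost connection attempts above are consistent with Spark's default for spark.tachyonStore.url (tachyon://localhost:19998) never being overridden. A minimal illustration of the setting spark-ec2 would need to template in on each node, using the master hostname from the environment above:

{code}
# Illustrative only, e.g. in conf/spark-defaults.conf on each node
spark.tachyonStore.url  tachyon://ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com:19998
{code}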

[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER

2015-01-13 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276436#comment-14276436
 ] 

Florian Verhein commented on SPARK-3185:


I'm also getting this, though with Server IPC version 9 now that I'm using Hadoop 2.4.1 (via modifications to the various hadoop init.sh scripts). I'm also using Spark 1.2.0.

My understanding is that spark-1.2.0-bin-hadoop2.4.tgz is built against Hadoop 2.4 and Tachyon 0.4.1. 
But I suspect the Tachyon 0.4.1 that is installed by the spark-ec2 scripts is built against Hadoop 1...

Does this mean building Tachyon against Hadoop 2.4.1 would fix this?

 SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting 
 JOURNAL_FOLDER
 ---

 Key: SPARK-3185
 URL: https://issues.apache.org/jira/browse/SPARK-3185
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Amazon Linux AMI
 [ec2-user@ip-172-30-1-145 ~]$ uname -a
 Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 
 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
 https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/
 The build I used (and MD5 verified):
 [ec2-user@ip-172-30-1-145 ~]$ wget 
 http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
Reporter: Jeremy Chambers

 {code}
 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 {code}
 When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon 
 exception is thrown when Formatting JOURNAL_FOLDER.
 No exception occurs when I launch on Hadoop 1.
 Launch used:
 {code}
 ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk 
 --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch 
 sparkProd
 {code}
 {code}
 log snippet
 Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
 Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
 Exception in thread main java.lang.RuntimeException: 
 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73)
 at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
 at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
 at tachyon.Format.main(Format.java:54)
 Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at org.apache.hadoop.ipc.Client.call(Client.java:1070)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
 at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
 at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69)
 ... 3 more
 Killed 0 processes
 Killed 0 processes
 ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
 ---end snippet---
 {code}
 *I don't have this problem when I launch without the 
 --hadoop-major-version=2 (which defaults to Hadoop 1.x).*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-13 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276572#comment-14276572
 ] 

Florian Verhein commented on SPARK-3821:


Thanks [~nchammas], that makes sense.

Created SPARK-5241.
I'm not sure about the pre-built scenario, but am guessing that e.g. 
http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-hadoop2.4.tgz != 
http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-cdh4.tgz. So 
perhaps the intent is that the spark-ec2 scripts only support CDH 
distributions...  

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5241) spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) dependencies correctly

2015-01-13 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-5241:
--

 Summary: spark-ec2 spark init scripts do not handle all hadoop (or 
tachyon?) dependencies correctly
 Key: SPARK-5241
 URL: https://issues.apache.org/jira/browse/SPARK-5241
 Project: Spark
  Issue Type: Bug
  Components: Build, EC2
Reporter: Florian Verhein



spark-ec2/spark/init.sh doesn't fully respect the Hadoop version dependencies. This may also be an issue for the Tachyon dependencies. Related: Tachyon appears to require builds against the right version of Hadoop as well (probably the cause of SPARK-3185). 

This applies to the Spark build from a git checkout in spark/init.sh (I suspect this should also be changed to use mvn, as that's the reference build according to the docs?).

It may apply to the pre-built Spark in spark/init.sh as well, but I'm not sure about this. E.g. I thought that the hadoop2.4 and cdh4.2 builds of Spark are different.

Also note that the Hadoop native libraries are built from Hadoop 2.4.1 on the AMI, and these are used regardless of HADOOP_MAJOR_VERSION in the *-hdfs modules.

Tachyon is hard-coded to 0.4.1 (which is probably built against Hadoop 1.x?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-13 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276263#comment-14276263
 ] 

Florian Verhein commented on SPARK-3821:


This is great stuff! It'll also help serve as some documentation for AMI 
requirements when using the spark-ec2 scripts.  

Re the above, I think everything in create_image.sh can be refactored into Packer (plus some duplicate removal, e.g. root login). I've attempted to do this in a fork of [~nchammas]'s work, but my use case is a bit different in that I need to start from a fresh CentOS 6 minimal image (rather than an Amazon Linux AMI) and then add other things.

Possibly related to AMI generation in general: I've noticed that the version 
dependencies in the spark-ec2 scripts are broken. I suspect this will need to 
be handled in both the image and the setup. For example:
- It looks like Spark needs to be built with the right Hadoop profile to work, but this isn't enforced. This applies whether Spark is built from a git checkout or from an existing build, and it is likely the case for Tachyon too. Probably the cause of https://issues.apache.org/jira/browse/SPARK-3185
- The Hadoop native libs are built on the image using 2.4.1, but are then copied into whatever Hadoop build is downloaded by the ephemeral-hdfs and persistent-hdfs scripts. I suspect that could cause issues too. Since building Hadoop is very time-consuming, it's something you'd want on the image, which creates a dependency. 
- The version dependencies for other things like Ganglia aren't documented (I believe this is installed on the image but duplicated again in spark-ec2/ganglia). I've found that the Ganglia config doesn't work for me (but recall I'm using a different base AMI, so I'll likely get a different Ganglia version). I have a sneaking suspicion that the Hadoop configs in spark-ec2 won't work across Hadoop versions either (but fingers crossed!).

Re the above, I might try keeping the entire Hadoop build (from the image creation) for the HDFS setup.

Sorry for the sidetrack, but I'm struggling through all this, so I'm hoping it might ring a bell for someone.  

P.S. With the image automation, it might also be worth considering putting more on the image as an option (especially for people happy to build their own AMIs). For example, I see no reason why the module init.sh scripts couldn't be run from Packer in order to speed up cluster start-up times :) 


 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org