[jira] [Updated] (SPARK-5167) Move Row into sql package and make it usable for Java

2015-01-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5167:
---
Assignee: Reynold Xin

 Move Row into sql package and make it usable for Java
 -

 Key: SPARK-5167
 URL: https://issues.apache.org/jira/browse/SPARK-5167
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 This will help us eliminate the duplicated Java code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3299) [SQL] Public API in SQLContext to list tables

2015-01-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3299:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-5166

 [SQL] Public API in SQLContext to list tables
 -

 Key: SPARK-3299
 URL: https://issues.apache.org/jira/browse/SPARK-3299
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.0.2
Reporter: Evan Chan
Assignee: Bill Bejeck
Priority: Minor
  Labels: newbie

 There is no public API in SQLContext to list the current tables.  This would 
 be pretty helpful.
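 Purely as an illustration of the kind of API being requested (every name below is
 hypothetical, not an existing or proposed Spark method), a listing call could look
 roughly like this:
 {code}
 // Hypothetical sketch only: one possible shape for a public table-listing API.
 // The TableCatalog trait and all names here are assumptions for illustration.
 trait TableCatalog {
   def registeredTables: Seq[String]
 }

 class ListingSQLContext(catalog: TableCatalog) {
   // Public API: return the names of the currently registered tables.
   def tableNames(): Array[String] = catalog.registeredTables.sorted.toArray
 }
 {code}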



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2096) Correctly parse dot notations for accessing an array of structs

2015-01-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2096:
---
Target Version/s: 1.3.0  (was: 1.2.0)

 Correctly parse dot notations for accessing an array of structs
 ---

 Key: SPARK-2096
 URL: https://issues.apache.org/jira/browse/SPARK-2096
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yin Huai
Priority: Minor
  Labels: starter
 Fix For: 1.2.0


 For example, arrayOfStruct is an array of structs and every element of this 
 array has a field called field1. arrayOfStruct[0].field1 means to access 
 the value of field1 for the first element of arrayOfStruct, but the SQL 
 parser (in sql-core) treats field1 as an alias. Also, 
 arrayOfStruct.field1 means to access all values of field1 in this array 
 of structs and return those values as an array. But the SQL parser 
 cannot resolve it.
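 To make the two access patterns concrete, here is a small illustrative snippet; it
 assumes an existing SQLContext named sqlContext and a registered table nestedTable
 with an arrayOfStruct column, neither of which comes from the ticket itself:
 {code}
 // Illustrative only: the two dot-notation accesses described above.
 // Assumes sqlContext is in scope and "nestedTable" has an array-of-structs
 // column named arrayOfStruct whose elements contain a field1 field.
 val firstField1 = sqlContext.sql("SELECT arrayOfStruct[0].field1 FROM nestedTable")
 val allField1   = sqlContext.sql("SELECT arrayOfStruct.field1 FROM nestedTable")
 {code}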



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5166) Stabilize Spark SQL APIs

2015-01-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5166:
---
Assignee: Reynold Xin

 Stabilize Spark SQL APIs
 

 Key: SPARK-5166
 URL: https://issues.apache.org/jira/browse/SPARK-5166
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 Before we take Spark SQL out of alpha, we need to audit the APIs and 
 stabilize them. 
 As a general rule, everything under org.apache.spark.sql.catalyst should not 
 be exposed.
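 As a rough sketch of the access rule stated above (not actual Spark code), internals
 under the catalyst package can be kept off the public surface with package-qualified
 visibility:
 {code}
 // Sketch only: a catalyst-internal helper restricted to the sql package so it
 // is not exposed publicly. The object and package suffix are made up.
 package org.apache.spark.sql.catalyst.example

 private[sql] object InternalHelper {
   def normalizeName(name: String): String = name.trim.toLowerCase
 }
 {code}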



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5166) Stabilize Spark SQL APIs

2015-01-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5166:
---
Priority: Critical  (was: Major)

 Stabilize Spark SQL APIs
 

 Key: SPARK-5166
 URL: https://issues.apache.org/jira/browse/SPARK-5166
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical

 Before we take Spark SQL out of alpha, we need to audit the APIs and 
 stabilize them. 
 As a general rule, everything under org.apache.spark.sql.catalyst should not 
 be exposed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API

2015-01-11 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5193:
--

 Summary: Make Spark SQL API usable in Java and remove the 
Java-specific API
 Key: SPARK-5193
 URL: https://issues.apache.org/jira/browse/SPARK-5193
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Java version of the SchemaRDD API causes high maintenance burden for Spark SQL 
itself and downstream libraries (e.g. MLlib pipeline API needs to support both 
JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it usable for 
Java, and then we can remove the Java specific version. 

Things to remove include (Java version of):
- data type
- Row
- SQLContext
- HiveContext

Things to consider:
- Scala and Java have a different collection library.
- Scala and Java (8) have different closure interface.
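A minimal sketch of how a single Scala class can stay usable from Java despite the
collection and closure differences listed above; this is illustrative only and not
the proposed design:
{code}
// Sketch only: positional getters and java.util collection views keep a Scala
// class callable from Java without a parallel Java-specific API. Names are
// illustrative, not the actual Row design.
import scala.collection.JavaConverters._

class SimpleRow(values: Seq[Any]) {
  def get(i: Int): Any = values(i)                        // callable from Java: row.get(0)
  def getString(i: Int): String = values(i).asInstanceOf[String]
  def asJavaList: java.util.List[Any] = values.asJava     // Java-friendly collection view
}
{code}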





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4861) Refactor command in spark sql

2015-01-11 Thread wangfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272862#comment-14272862
 ] 

wangfei commented on SPARK-4861:


[~yhuai] Of course, if possible, but I have not found a way to remove it: in 
HiveCommandStrategy we need to distinguish Hive metastore tables from temporary 
tables, so for now HiveCommandStrategy stays there. Any idea here?

 Refactor command in spark sql
 --

 Key: SPARK-4861
 URL: https://issues.apache.org/jira/browse/SPARK-4861
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1
Reporter: wangfei
 Fix For: 1.3.0


 Fix a todo in spark sql:  remove ```Command``` and use ```RunnableCommand``` 
 instead.
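 For readers who have not seen the pattern, a simplified sketch of the
 RunnableCommand idea (a command that carries its own run logic); the types below
 are stand-ins, not the actual Spark SQL trait:
 {code}
 // Simplified sketch of the pattern only, with made-up stand-in types.
 case class ResultRow(values: Seq[Any])
 trait ContextLike

 trait RunnableCommandSketch {
   // Each command executes itself and returns its result rows.
   def run(context: ContextLike): Seq[ResultRow]
 }

 case class SetCommandSketch(key: String, value: String) extends RunnableCommandSketch {
   def run(context: ContextLike): Seq[ResultRow] = Seq(ResultRow(Seq(key, value)))
 }
 {code}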



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4508) Native Date type for SQL92 Date

2015-01-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-4508:
---
Target Version/s: 1.3.0

 Native Date type for SQL92 Date
 ---

 Key: SPARK-4508
 URL: https://issues.apache.org/jira/browse/SPARK-4508
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Adrian Wang
Assignee: Adrian Wang

 Store daysSinceEpoch as an Int(4 bytes), instead of using java.sql.Date(8 
 bytes as Long) in catalyst row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4508) build native date type to conform behavior to Hive

2015-01-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-4508:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-5166

 build native date type to conform behavior to Hive
 --

 Key: SPARK-4508
 URL: https://issues.apache.org/jira/browse/SPARK-4508
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Adrian Wang

 Store daysSinceEpoch as an Int(4 bytes), instead of using java.sql.Date(8 
 bytes as Long) in catalyst row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4508) Native Date type for SQL92 Date

2015-01-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-4508:
---
Assignee: Adrian Wang

 Native Date type for SQL92 Date
 ---

 Key: SPARK-4508
 URL: https://issues.apache.org/jira/browse/SPARK-4508
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Adrian Wang
Assignee: Adrian Wang

 Store daysSinceEpoch as an Int(4 bytes), instead of using java.sql.Date(8 
 bytes as Long) in catalyst row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API

2015-01-11 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272860#comment-14272860
 ] 

Reynold Xin commented on SPARK-5193:


cc [~marmbrus]

 Make Spark SQL API usable in Java and remove the Java-specific API
 --

 Key: SPARK-5193
 URL: https://issues.apache.org/jira/browse/SPARK-5193
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 Java version of the SchemaRDD API causes high maintenance burden for Spark 
 SQL itself and downstream libraries (e.g. MLlib pipeline API needs to support 
 both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it 
 usable for Java, and then we can remove the Java specific version. 
 Things to remove include (Java version of):
 - data type
 - Row
 - SQLContext
 - HiveContext
 Things to consider:
 - Scala and Java have a different collection library.
 - Scala and Java (8) have different closure interface.
 - Scala and Java can have duplicate definitions of common classes, such as 
 BigDecimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API

2015-01-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5193:
---
Description: 
Java version of the SchemaRDD API causes high maintenance burden for Spark SQL 
itself and downstream libraries (e.g. MLlib pipeline API needs to support both 
JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it usable for 
Java, and then we can remove the Java specific version. 

Things to remove include (Java version of):
- data type
- Row
- SQLContext
- HiveContext

Things to consider:
- Scala and Java have a different collection library.
- Scala and Java (8) have different closure interface.
- Scala and Java can have duplicate definitions of common classes, such as 
BigDecimal.


  was:
Java version of the SchemaRDD API causes high maintenance burden for Spark SQL 
itself and downstream libraries (e.g. MLlib pipeline API needs to support both 
JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it usable for 
Java, and then we can remove the Java specific version. 

Things to remove include (Java version of):
- data type
- Row
- SQLContext
- HiveContext

Things to consider:
- Scala and Java have a different collection library.
- Scala and Java (8) have different closure interface.




 Make Spark SQL API usable in Java and remove the Java-specific API
 --

 Key: SPARK-5193
 URL: https://issues.apache.org/jira/browse/SPARK-5193
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 Java version of the SchemaRDD API causes high maintenance burden for Spark 
 SQL itself and downstream libraries (e.g. MLlib pipeline API needs to support 
 both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it 
 usable for Java, and then we can remove the Java specific version. 
 Things to remove include (Java version of):
 - data type
 - Row
 - SQLContext
 - HiveContext
 Things to consider:
 - Scala and Java have a different collection library.
 - Scala and Java (8) have different closure interface.
 - Scala and Java can have duplicate definitions of common classes, such as 
 BigDecimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5194) ADD JAR doesn't update classpath until reconnect

2015-01-11 Thread Oleg Danilov (JIRA)
Oleg Danilov created SPARK-5194:
---

 Summary: ADD JAR doesn't update classpath until reconnect
 Key: SPARK-5194
 URL: https://issues.apache.org/jira/browse/SPARK-5194
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Oleg Danilov


Steps to reproduce:

beeline> !connect jdbc:hive2://vmhost-vm0:1
0: jdbc:hive2://vmhost-vm0:1> add jar 
./target/nexr-hive-udf-0.2-SNAPSHOT.jar
0: jdbc:hive2://vmhost-vm0:1> CREATE TEMPORARY FUNCTION nvl AS 
'com.nexr.platform.hive.udf.GenericUDFNVL';
0: jdbc:hive2://vmhost-vm0:1> select nvl(imsi,'test') from 
ps_cei_index_1_week limit 1;
Error: java.lang.ClassNotFoundException: 
com.nexr.platform.hive.udf.GenericUDFNVL (state=,code=0)
0: jdbc:hive2://vmhost-vm0:1> !reconnect
Reconnecting to jdbc:hive2://vmhost-vm0:1...
Closing: org.apache.hive.jdbc.HiveConnection@3f18dc75: {1}
Connected to: Spark SQL (version 1.2.0)
Driver: null (version null)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://vmhost-vm0:1> select nvl(imsi,'test') from 
ps_cei_index_1_week limit 1;
+--+
| _c0  |
+--+
| -1   |
+--+
1 row selected (1.605 seconds)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5195) When a Hive table is queried with an alias, the cached data loses effectiveness.

2015-01-11 Thread yixiaohua (JIRA)
yixiaohua created SPARK-5195:


 Summary: When a Hive table is queried with an alias, the cached data 
loses effectiveness.
 Key: SPARK-5195
 URL: https://issues.apache.org/jira/browse/SPARK-5195
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: yixiaohua


Override MetastoreRelation's sameResult method to compare only the database name 
and the table name.

Previously:
cache table t1;
select count(*) from t1;
reads data from memory, but the query below does not; instead it reads from 
hdfs:
select count(*) from t1 t;

Cached data is keyed by the logical plan and looked up with sameResult, so when 
the table is referenced with an alias its logical plan is not the same as the 
plan without the alias. Modifying sameResult to compare only the database name 
and the table name fixes this.
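A short sketch of the reported scenario, assuming a HiveContext named hiveContext 
is in scope and table t1 exists; the cache miss is the behaviour being reported, 
not guaranteed semantics:
{code}
// Illustration of the reported behaviour (assumes hiveContext and table t1 exist).
hiveContext.cacheTable("t1")
hiveContext.sql("select count(*) from t1").collect()    // served from the cached data
hiveContext.sql("select count(*) from t1 t").collect()  // alias changes the logical plan,
                                                        // so the cache is reportedly missed
{code}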



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5195) When a Hive table is queried with an alias, the cached data loses effectiveness.

2015-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272934#comment-14272934
 ] 

Apache Spark commented on SPARK-5195:
-

User 'seayi' has created a pull request for this issue:
https://github.com/apache/spark/pull/3898

 When a Hive table is queried with an alias, the cached data loses effectiveness.
 

 Key: SPARK-5195
 URL: https://issues.apache.org/jira/browse/SPARK-5195
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: yixiaohua

 Override MetastoreRelation's sameResult method to compare only the database name 
 and the table name.
 Previously:
 cache table t1;
 select count(*) from t1;
 reads data from memory, but the query below does not; instead it reads from 
 hdfs:
 select count(*) from t1 t;
 Cached data is keyed by the logical plan and looked up with sameResult, so when 
 the table is referenced with an alias its logical plan is not the same as the 
 plan without the alias. Modifying sameResult to compare only the database name 
 and the table name fixes this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5192) Parquet fails to parse schema contains '\r'

2015-01-11 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-5192:
-
Summary: Parquet fails to parse schema contains '\r'  (was: Parquet fails 
to parse schemas contains '\r')

 Parquet fails to parse schema contains '\r'
 ---

 Key: SPARK-5192
 URL: https://issues.apache.org/jira/browse/SPARK-5192
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
 Environment: windows7 + Intellj idea 13.0.2 
Reporter: cen yuhai
Priority: Critical
 Fix For: 1.3.0


 I think this is actually a bug in Parquet. When I debugged 'ParquetTestData', 
 I found the exception below. So I downloaded the source of MessageTypeParser; 
 the function 'isWhitespace' does not check for '\r':
 private boolean isWhitespace(String t) {
   return t.equals(" ") || t.equals("\t") || t.equals("\n");
 }
 So I replaced all '\r' to work around this issue:
   val subTestSchema =
 """
   message myrecord {
   optional boolean myboolean;
   optional int64 mylong;
   }
 """.replaceAll("\r", "")
 at line 0: message myrecord {
   at 
 parquet.schema.MessageTypeParser.asRepetition(MessageTypeParser.java:203)
   at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:101)
   at 
 parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:96)
   at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:89)
   at 
 parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:79)
   at 
 org.apache.spark.sql.parquet.ParquetTestData$.writeFile(ParquetTestData.scala:221)
   at 
 org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:92)
   at 
 org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
   at 
 org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:85)
   at 
 org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
   at 
 org.apache.spark.sql.parquet.ParquetQuerySuite.run(ParquetQuerySuite.scala:85)
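 For clarity, a small sketch of the whitespace test the reporter expects, with '\r' 
 included; this is not Parquet's code, only an illustration of the missing case:
 {code}
 // Sketch only: a token whitespace check that also treats '\r' as whitespace,
 // which is what the report says MessageTypeParser is missing.
 def isWhitespaceToken(t: String): Boolean =
   t == " " || t == "\t" || t == "\n" || t == "\r"
 {code}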
   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5196) Add comment field in StructField

2015-01-11 Thread shengli (JIRA)
shengli created SPARK-5196:
--

 Summary: Add comment field in StructField
 Key: SPARK-5196
 URL: https://issues.apache.org/jira/browse/SPARK-5196
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: shengli
 Fix For: 1.3.0


StructField should contain name, type, nullable, comment, etc.

Add support for a comment field in StructField.
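A minimal sketch of a field descriptor that carries a comment, purely for 
illustration; it is not the actual StructField definition and the names are made up:
{code}
// Illustrative only: name, type, nullability and an optional comment together.
case class FieldWithComment(
    name: String,
    dataTypeName: String,
    nullable: Boolean = true,
    comment: Option[String] = None)

val example = FieldWithComment("user_id", "string", nullable = false,
  comment = Some("primary key from the upstream system"))
{code}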



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5196) Add comment field in StructField

2015-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272937#comment-14272937
 ] 

Apache Spark commented on SPARK-5196:
-

User 'OopsOutOfMemory' has created a pull request for this issue:
https://github.com/apache/spark/pull/3991

 Add comment field in StructField
 

 Key: SPARK-5196
 URL: https://issues.apache.org/jira/browse/SPARK-5196
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: shengli
 Fix For: 1.3.0


 StructField should contain name, type, nullable, comment, etc.
 Add support for a comment field in StructField.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5162) Python yarn-cluster mode

2015-01-11 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272943#comment-14272943
 ] 

Lianhui Wang commented on SPARK-5162:
-

[~dklassen] i submit a PR for this 
issue.https://github.com/apache/spark/pull/3976
so i think you can try it. if there are any questions or suggestions,please 
tell me.

 Python yarn-cluster mode
 

 Key: SPARK-5162
 URL: https://issues.apache.org/jira/browse/SPARK-5162
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, YARN
Reporter: Dana Klassen
  Labels: cluster, python, yarn

 Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would 
 be great to be able to submit python applications to the cluster and (just 
 like java classes) have the resource manager setup an AM on any node in the 
 cluster. Does anyone know the issues blocking this feature? I was snooping 
 around with enabling python apps:
 Removing the logic stopping python and yarn-cluster from SparkSubmit.scala
 ...
 // The following modes are not supported or applicable
 (clusterManager, deployMode) match {
   ...
   case (_, CLUSTER) if args.isPython =>
     printErrorAndExit("Cluster deploy mode is currently not supported for python applications.")
   ...
 }
 …
 and submitting the application via:
 HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster 
 --num-executors 2 --py-files {{insert location of egg here}} 
 --executor-cores 1 ../tools/canary.py
 Everything looks to run alright, pythonRunner is picked up as main class, 
 resources get setup, yarn client gets launched but falls flat on its face:
 2015-01-08 18:48:03,444 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  DEBUG: FAILED { 
 {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 
 1420742868009, FILE, null }, Resource 
 {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed 
 on src filesystem (expected 1420742868009, was 1420742869284
 and
 2015-01-08 18:48:03,446 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
  Resource 
 {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(-/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py)
  transitioned from DOWNLOADING to FAILED
 Tracked this down to the apache hadoop code(FSDownload.java line 249) related 
 to container localization of files upon downloading. At this point thought it 
 would be best to raise the issue here and get input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5172) spark-examples-***.jar shades a wrong Hadoop distribution

2015-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273041#comment-14273041
 ] 

Apache Spark commented on SPARK-5172:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3992

 spark-examples-***.jar shades a wrong Hadoop distribution
 -

 Key: SPARK-5172
 URL: https://issues.apache.org/jira/browse/SPARK-5172
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Shixiong Zhu
Priority: Minor

 Steps to check it:
 1. Download  spark-1.2.0-bin-hadoop2.4.tgz from 
 http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz
 2. unzip `spark-examples-1.2.0-hadoop2.4.0.jar`.
 3. There is a file called `org/apache/hadoop/package-info.class` in the jar. 
 It doesn't exist in hadoop 2.4. 
 4. Run javap -classpath . -private -c -v  org.apache.hadoop.package-info
 {code}
 Compiled from "package-info.java"
 interface org.apache.hadoop.package-info
   SourceFile: package-info.java
   RuntimeVisibleAnnotations: length = 0x24
00 01 00 06 00 06 00 07 73 00 08 00 09 73 00 0A
00 0B 73 00 0C 00 0D 73 00 0E 00 0F 73 00 10 00
11 73 00 12 
   minor version: 0
   major version: 50
   Constant pool:
 const #1 = Asciz  org/apache/hadoop/package-info;
 const #2 = class  #1; //  org/apache/hadoop/package-info
 const #3 = Asciz  java/lang/Object;
 const #4 = class  #3; //  java/lang/Object
 const #5 = Asciz  package-info.java;
 const #6 = Asciz  Lorg/apache/hadoop/HadoopVersionAnnotation;;
 const #7 = Asciz  version;
 const #8 = Asciz  1.2.1;
 const #9 = Asciz  revision;
 const #10 = Asciz 1503152;
 const #11 = Asciz user;
 const #12 = Asciz mattf;
 const #13 = Asciz date;
 const #14 = Asciz Wed Jul 24 13:39:35 PDT 2013;
 const #15 = Asciz url;
 const #16 = Asciz 
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2;
 const #17 = Asciz srcChecksum;
 const #18 = Asciz 6923c86528809c4e7e6f493b6b413a9a;
 const #19 = Asciz SourceFile;
 const #20 = Asciz RuntimeVisibleAnnotations;
 {
 }
 {code}
 The version is {{1.2.1}}
 It comes from a wrong hbase version setting in the examples project. Here is 
 a part of the dependency tree when running mvn -Pyarn -Phadoop-2.4 
 -Dhadoop.version=2.4.0 -pl examples dependency:tree
 {noformat}
 [INFO] +- org.apache.hbase:hbase-testing-util:jar:0.98.7-hadoop1:compile
 [INFO] |  +- 
 org.apache.hbase:hbase-common:test-jar:tests:0.98.7-hadoop1:compile
 [INFO] |  +- 
 org.apache.hbase:hbase-server:test-jar:tests:0.98.7-hadoop1:compile
 [INFO] |  |  +- com.sun.jersey:jersey-core:jar:1.8:compile
 [INFO] |  |  +- com.sun.jersey:jersey-json:jar:1.8:compile
 [INFO] |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
 [INFO] |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
 [INFO] |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.7.1:compile
 [INFO] |  |  \- com.sun.jersey:jersey-server:jar:1.8:compile
 [INFO] |  | \- asm:asm:jar:3.3.1:test
 [INFO] |  +- org.apache.hbase:hbase-hadoop1-compat:jar:0.98.7-hadoop1:compile
 [INFO] |  +- 
 org.apache.hbase:hbase-hadoop1-compat:test-jar:tests:0.98.7-hadoop1:compile
 [INFO] |  +- org.apache.hadoop:hadoop-core:jar:1.2.1:compile
 [INFO] |  |  +- xmlenc:xmlenc:jar:0.52:compile
 [INFO] |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
 [INFO] |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
 [INFO] |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
 [INFO] |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
 [INFO] |  |  \- commons-el:commons-el:jar:1.0:compile
 [INFO] |  +- org.apache.hadoop:hadoop-test:jar:1.2.1:compile
 [INFO] |  |  +- org.apache.ftpserver:ftplet-api:jar:1.0.0:compile
 [INFO] |  |  +- org.apache.mina:mina-core:jar:2.0.0-M5:compile
 [INFO] |  |  +- org.apache.ftpserver:ftpserver-core:jar:1.0.0:compile
 [INFO] |  |  \- org.apache.ftpserver:ftpserver-deprecated:jar:1.0.0-M2:compile
 [INFO] |  +- 
 com.github.stephenc.findbugs:findbugs-annotations:jar:1.3.9-1:compile
 [INFO] |  \- junit:junit:jar:4.10:test
 [INFO] | \- org.hamcrest:hamcrest-core:jar:1.1:test
 {noformat}
 If I run `mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -pl examples -am 
 dependency:tree -Dhbase.profile=hadoop2`, the dependency tree is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5008) Persistent HDFS does not recognize EBS Volumes

2015-01-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273007#comment-14273007
 ] 

Nicholas Chammas commented on SPARK-5008:
-

Use [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/v4/copy-dir.sh], 
which is installed by default, from the master.

 Persistent HDFS does not recognize EBS Volumes
 --

 Key: SPARK-5008
 URL: https://issues.apache.org/jira/browse/SPARK-5008
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
 Environment: 8 Node Cluster Generated from 1.2.0 spark-ec2 script.
 -m c3.2xlarge -t c3.8xlarge --ebs-vol-size 300 --ebs-vol-type gp2 
 --ebs-vol-num 1
Reporter: Brad Willard

 Cluster is built with correct size EBS volumes. It creates the volume at 
 /dev/xvds and it is mounted to /vol0. However, when you start persistent hdfs 
 with the start-all script, it starts but it isn't correctly configured to use the 
 EBS volume.
 I'm assuming some sym links or expected mounts are not correctly configured.
 This has worked flawlessly on all previous versions of spark.
 I have a stupid workaround: installing pssh and mucking with it by mounting 
 it to /vol, which worked; however it does not work between restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5008) Persistent HDFS does not recognize EBS Volumes

2015-01-11 Thread Brad Willard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272991#comment-14272991
 ] 

Brad Willard commented on SPARK-5008:
-

[~nchammas] I can try that once I get back into the office. Probably by 
Wednesday. Once I update the core-site.xml, what's the correct way to sync it 
to all the slaves?

 Persistent HDFS does not recognize EBS Volumes
 --

 Key: SPARK-5008
 URL: https://issues.apache.org/jira/browse/SPARK-5008
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
 Environment: 8 Node Cluster Generated from 1.2.0 spark-ec2 script.
 -m c3.2xlarge -t c3.8xlarge --ebs-vol-size 300 --ebs-vol-type gp2 
 --ebs-vol-num 1
Reporter: Brad Willard

 Cluster is built with correct size EBS volumes. It creates the volume at 
 /dev/xvds and it is mounted to /vol0. However, when you start persistent hdfs 
 with the start-all script, it starts but it isn't correctly configured to use the 
 EBS volume.
 I'm assuming some sym links or expected mounts are not correctly configured.
 This has worked flawlessly on all previous versions of spark.
 I have a stupid workaround: installing pssh and mucking with it by mounting 
 it to /vol, which worked; however it does not work between restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4159) Maven build doesn't run JUnit test suites

2015-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273062#comment-14273062
 ] 

Apache Spark commented on SPARK-4159:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3993

 Maven build doesn't run JUnit test suites
 -

 Key: SPARK-4159
 URL: https://issues.apache.org/jira/browse/SPARK-4159
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell
Assignee: Sean Owen
Priority: Critical
  Labels: backport-needed
 Fix For: 1.3.0


 It turns out our Maven build isn't running any Java test suites, and likely 
 hasn't ever.
 After some fishing I believe the following is the issue. We use scalatest [1] 
 in our maven build which, by default, can't automatically detect JUnit tests. 
 Scalatest will allow you to enumerate a list of suites via JUnitClasses, 
 but I can't find a way for it to auto-detect all JUnit tests. It turns out 
 this works in SBT because of our use of the junit-interface [2], which does 
 this for you. 
 An okay fix for this might be to simply enable the normal (surefire) maven 
 tests in addition to our scalatest in the maven build. The only thing to 
 watch out for is that they don't overlap in some way. We'd also have to copy 
 over environment variables, memory settings, etc to that plugin.
 [1] http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin
 [2] https://github.com/sbt/junit-interface



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4033) Integer overflow when SparkPi is called with more than 25000 slices

2015-01-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4033.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: SaintBacchus
Target Version/s: 1.3.0

 Integer overflow when SparkPi is called with more than 25000 slices
 ---

 Key: SPARK-4033
 URL: https://issues.apache.org/jira/browse/SPARK-4033
 Project: Spark
  Issue Type: Bug
  Components: Examples
Affects Versions: 1.2.0
Reporter: SaintBacchus
Assignee: SaintBacchus
 Fix For: 1.3.0


 If the SparkPi slices argument is larger than 25000, the integer 'n' inside 
 the code overflows and may become a negative number.
 That makes the (0 until n) Seq an empty seq, and the subsequent 'reduce' action 
 throws an UnsupportedOperationException (empty collection).
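 A small self-contained sketch of the overflow; the 100000-per-slice constant 
 mirrors the SparkPi style but is an assumption here, and the guarded variant is 
 just one possible fix:
 {code}
 // Demonstrates the Int overflow: with enough slices the product wraps negative,
 // so (0 until n) is empty and reduce over it throws UnsupportedOperationException.
 val slices = 25001
 val n = 100000 * slices                       // exceeds Int.MaxValue and wraps negative
 println(n)                                    // prints a negative number
 println((0 until n).isEmpty)                  // true: the range is empty
 val safeN = math.min(100000L * slices, Int.MaxValue).toInt   // one possible guard
 {code}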



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode

2015-01-11 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-5198:

Component/s: Mesos

 Change executorId more unique on mesos fine-grained mode
 

 Key: SPARK-5198
 URL: https://issues.apache.org/jira/browse/SPARK-5198
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Jongyoul Lee
 Fix For: 1.3.0, 1.2.1


 In fine-grained mode, the SchedulerBackend sets the executor name to the same 
 value as the slave id, regardless of the task id. This makes it hard to track a 
 specific job, because different tasks log into the same log file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode

2015-01-11 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-5198:

Fix Version/s: 1.2.1
   1.3.0

 Change executorId more unique on mesos fine-grained mode
 

 Key: SPARK-5198
 URL: https://issues.apache.org/jira/browse/SPARK-5198
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Jongyoul Lee
 Fix For: 1.3.0, 1.2.1


 In fine-grained mode, the SchedulerBackend sets the executor name to the same 
 value as the slave id, regardless of the task id. This makes it hard to track a 
 specific job, because different tasks log into the same log file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5198) Change executorId more unique on mesos fine-grained mode

2015-01-11 Thread Jongyoul Lee (JIRA)
Jongyoul Lee created SPARK-5198:
---

 Summary: Change executorId more unique on mesos fine-grained mode
 Key: SPARK-5198
 URL: https://issues.apache.org/jira/browse/SPARK-5198
 Project: Spark
  Issue Type: Improvement
Reporter: Jongyoul Lee


In fine-grained mode, the SchedulerBackend sets the executor name to the same 
value as the slave id, regardless of the task id. This makes it hard to track a 
specific job, because different tasks log into the same log file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause

2015-01-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-4296:

 Target Version/s: 1.3.0, 1.2.1  (was: 1.2.0)
Affects Version/s: 1.1.1
   1.2.0
Fix Version/s: (was: 1.2.0)

 Throw Expression not in GROUP BY when using same expression in group by 
 clause and  select clause
 ---

 Key: SPARK-4296
 URL: https://issues.apache.org/jira/browse/SPARK-4296
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1, 1.2.0
Reporter: Shixiong Zhu
Assignee: Cheng Lian
Priority: Blocker

 When the input data has a complex structure, using the same expression in the 
 group by clause and the select clause will throw "Expression not in GROUP BY".
 {code:java}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Birthday(date: String)
 case class Person(name: String, birthday: Birthday)
 val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")),
   Person("Jim", Birthday("1980-02-28"))))
 people.registerTempTable("people")
 val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
 year.collect
 {code}
 Here is the plan of year:
 {code:java}
 SchemaRDD[3] at RDD at SchemaRDD.scala:105
 == Query Plan ==
 == Physical Plan ==
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
 not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
 Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date 
 AS date#9) AS c1#3]
  Subquery people
   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at 
 ExistingRDD.scala:36
 {code}
 The bug is the equality test for `Upper(birthday#1.date)` and 
 `Upper(birthday#1.date AS date#9)`.
 Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias 
 expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause

2015-01-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273196#comment-14273196
 ] 

Yin Huai commented on SPARK-4296:
-

I was wondering if we can also find this issue at other places. Maybe we can 
resolve this issue thoroughly.

 Throw Expression not in GROUP BY when using same expression in group by 
 clause and  select clause
 ---

 Key: SPARK-4296
 URL: https://issues.apache.org/jira/browse/SPARK-4296
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1, 1.2.0
Reporter: Shixiong Zhu
Assignee: Cheng Lian
Priority: Blocker

 When the input data has a complex structure, using the same expression in the 
 group by clause and the select clause will throw "Expression not in GROUP BY".
 {code:java}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Birthday(date: String)
 case class Person(name: String, birthday: Birthday)
 val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")),
   Person("Jim", Birthday("1980-02-28"))))
 people.registerTempTable("people")
 val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
 year.collect
 {code}
 Here is the plan of year:
 {code:java}
 SchemaRDD[3] at RDD at SchemaRDD.scala:105
 == Query Plan ==
 == Physical Plan ==
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
 not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
 Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date 
 AS date#9) AS c1#3]
  Subquery people
   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at 
 ExistingRDD.scala:36
 {code}
 The bug is the equality test for `Upper(birthday#1.date)` and 
 `Upper(birthday#1.date AS date#9)`.
 Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias 
 expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4924) Factor out code to launch Spark applications into a separate library

2015-01-11 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4924:
--
Assignee: Marcelo Vanzin

 Factor out code to launch Spark applications into a separate library
 

 Key: SPARK-4924
 URL: https://issues.apache.org/jira/browse/SPARK-4924
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Attachments: spark-launcher.txt


 One of the questions we run into rather commonly is "how to start a Spark 
 application from my Java/Scala program?". There currently isn't a good answer 
 to that:
 - Instantiating SparkContext has limitations (e.g., you can only have one 
 active context at the moment, plus you lose the ability to submit apps in 
 cluster mode)
 - Calling SparkSubmit directly is doable but you lose a lot of the logic 
 handled by the shell scripts
 - Calling the shell script directly is doable,  but sort of ugly from an API 
 point of view.
 I think it would be nice to have a small library that handles that for users. 
 On top of that, this library could be used by Spark itself to replace a lot 
 of the code in the current shell scripts, which have a lot of duplication.
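 Purely to illustrate the kind of programmatic entry point described above, a 
 hypothetical builder that shells out to spark-submit; every name and method here 
 is made up, not an existing API:
 {code}
 // Hypothetical sketch: collect spark-submit arguments and launch it as a child
 // process. All names are invented for illustration.
 import scala.collection.JavaConverters._
 import scala.collection.mutable.ArrayBuffer

 class AppLauncherSketch {
   private val args = ArrayBuffer[String]()
   def setMaster(master: String): this.type = { args += "--master" += master; this }
   def setMainClass(cls: String): this.type = { args += "--class" += cls; this }
   def setAppResource(jar: String): this.type = { args += jar; this }

   // Start spark-submit with the accumulated arguments.
   def launch(): Process =
     new ProcessBuilder((Seq("./bin/spark-submit") ++ args).asJava).start()
 }
 {code}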



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5197) Support external shuffle service in fine-grained mode on mesos cluster

2015-01-11 Thread Jongyoul Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273137#comment-14273137
 ] 

Jongyoul Lee commented on SPARK-5197:
-

Please, assign it to me.

[~andrewor14] [~adav] Please review my description

 Support external shuffle service in fine-grained mode on mesos cluster
 --

 Key: SPARK-5197
 URL: https://issues.apache.org/jira/browse/SPARK-5197
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Mesos, Shuffle
Reporter: Jongyoul Lee

 I think dynamic allocation is almost satisfied on mesos' fine-grained mode, 
 which already offers resources dynamically and returns them automatically when a 
 task is finished. It, however, doesn't have a mechanism to support an external 
 shuffle service like yarn's AuxiliaryService. Because mesos 
 doesn't support AuxiliaryService, we need a different way to do this.
 - Launching a shuffle service like a spark job on same cluster
 -- Pros
 --- Support multi-tenant environment
 --- Almost same way like yarn
 -- Cons
 --- Control long running 'background' job - service - when mesos runs
 --- Satisfy all slave - or host - to have one shuffle service all the time
 - Launching jobs within shuffle service
 -- Pros
 --- Easy to implement
 --- Don't consider whether shuffle service exists or not.
 -- Cons
 --- exists multiple shuffle services under multi-tenant environment
 --- Control shuffle service port dynamically on multi-user environment
 In my opinion, the first one is better idea to support external shuffle 
 service. Please leave comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause

2015-01-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273194#comment-14273194
 ] 

Yin Huai commented on SPARK-4296:
-

[~lian cheng] Seems this issues is similar with [this 
one|https://issues.apache.org/jira/browse/SPARK-2063?focusedCommentId=14055193page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14055193].
 The main problem is that we use the last part of a reference of a field in a 
struct as the alias. Is it possible that we can fix that one as well?

 Throw Expression not in GROUP BY when using same expression in group by 
 clause and  select clause
 ---

 Key: SPARK-4296
 URL: https://issues.apache.org/jira/browse/SPARK-4296
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Shixiong Zhu
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.2.0


 When the input data has a complex structure, using the same expression in the 
 group by clause and the select clause will throw "Expression not in GROUP BY".
 {code:java}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Birthday(date: String)
 case class Person(name: String, birthday: Birthday)
 val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")),
   Person("Jim", Birthday("1980-02-28"))))
 people.registerTempTable("people")
 val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
 year.collect
 {code}
 Here is the plan of year:
 {code:java}
 SchemaRDD[3] at RDD at SchemaRDD.scala:105
 == Query Plan ==
 == Physical Plan ==
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
 not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
 Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date 
 AS date#9) AS c1#3]
  Subquery people
   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at 
 ExistingRDD.scala:36
 {code}
 The bug is the equality test for `Upper(birthday#1.date)` and 
 `Upper(birthday#1.date AS date#9)`.
 Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias 
 expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3340) Deprecate ADD_JARS and ADD_FILES

2015-01-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3340:
---
Labels: starter  (was: )

 Deprecate ADD_JARS and ADD_FILES
 

 Key: SPARK-3340
 URL: https://issues.apache.org/jira/browse/SPARK-3340
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
  Labels: starter

 These were introduced before Spark submit even existed. Now that there are 
 many better ways of setting jars and python files through Spark submit, we 
 should deprecate these environment variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3450) Enable specifying the --jars CLI option multiple times

2015-01-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3450.

Resolution: Won't Fix

I'd prefer not to do this one, it complicates our parsing substantially. It's 
possible to just write a bash loop that creates a single long list of jars.

 Enable specifying the --jars CLI option multiple times
 ---

 Key: SPARK-3450
 URL: https://issues.apache.org/jira/browse/SPARK-3450
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.2
Reporter: wolfgang hoschek

 spark-submit should support specifying the --jars option multiple times, e.g. 
 --jars foo.jar,bar.jar --jars baz.jar,oops.jar should be equivalent to --jars 
 foo.jar,bar.jar,baz.jar,oops.jar
 This would allow using wrapper scripts that simplify usage for enterprise 
 customers along the following lines:
 {code}
 my-spark-submit.sh:
 jars=""
 for i in /opt/myapp/*.jar; do
   if [ -n "$jars" ]
   then
     jars=$jars,
   fi
   jars=$jars$i
 done
 spark-submit --jars "$jars" "$@"
 {code}
 Example usage:
 {code}
 my-spark-submit.sh --jars myUserDefinedFunction.jar 
 {code}
 The relevant enhancement code might go into SparkSubmitArguments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5073) spark.storage.memoryMapThreshold has two default values

2015-01-11 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-5073.
---
Resolution: Fixed

 spark.storage.memoryMapThreshold has two default values
 -

 Key: SPARK-5073
 URL: https://issues.apache.org/jira/browse/SPARK-5073
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Jianhui Yuan
Priority: Minor

 In org.apache.spark.storage.DiskStore:
  val minMemoryMapBytes = 
    blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L)
 In org.apache.spark.network.util.TransportConf:
  public int memoryMapBytes() {
    return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * 1024);
  }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5197) Support external shuffle service in fine-grained mode on mesos cluster

2015-01-11 Thread Jongyoul Lee (JIRA)
Jongyoul Lee created SPARK-5197:
---

 Summary: Support external shuffle service in fine-grained mode on 
mesos cluster
 Key: SPARK-5197
 URL: https://issues.apache.org/jira/browse/SPARK-5197
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Mesos, Shuffle
Reporter: Jongyoul Lee


I think dynamic allocation is almost satisfied on mesos' fine-grained mode, 
which already offers resources dynamically and returns them automatically when a 
task is finished. We, however, don't have a mechanism to support an external 
shuffle service like yarn's AuxiliaryService. Because mesos doesn't support 
AuxiliaryService, we need a different way to do this.

- Launching a shuffle service like a spark job on same cluster
-- Pros
--- Support multi-tenant environment
--- Almost same way like yarn
-- Cons
--- Control long running 'background' job - service - when mesos runs
--- Satisfy all slave - or host - to have one shuffle service all the time
- Launching jobs within shuffle service
-- Pros
--- Easy to implement
--- Don't consider whether shuffle service exists or not.
-- Cons
--- exists multiple shuffle services under multi-tenant environment
--- Control shuffle service port dynamically on multi-user environment

In my opinion, the first one is better idea to support external shuffle 
service. Please leave comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4689) Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java

2015-01-11 Thread Bibudh Lahiri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268720#comment-14268720
 ] 

Bibudh Lahiri edited comment on SPARK-4689 at 1/12/15 2:13 AM:
---

I'd like to work on this issue, but would need some details. I looked into 
./sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala where the 
unionAll method is defined as 

def unionAll(otherPlan: SchemaRDD) =
new SchemaRDD(sqlContext, Union(logicalPlan, otherPlan.logicalPlan))

There is no implementation of union() in SchemaRDD itself, and the API says 
it is inherited from RDD. I took two different SchemaRDD objects and applied 
union on them (it is in my fork at 
https://github.com/bibudhlahiri/spark/blob/master/dev/audit-release/sbt_app_schema_rdd/src/main/scala/SchemaRDDApp.scala
 ) , and the resultant object is of class UnionRDD. I am thinking of overriding 
union() in SchemaRDD to return a SchemaRDD, please let me know what you think. 


was (Author: bibudh):
I'd like to work on this issue, but would need some details. I looked into 
./sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala where the 
unionAll method is defined as 

def unionAll(otherPlan: SchemaRDD) =
new SchemaRDD(sqlContext, Union(logicalPlan, otherPlan.logicalPlan))

Are we looking for an implementation of union here (keeping duplicates only 
once), in addition to unionAll (keeping duplicates both the times)?

 Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java
 --

 Key: SPARK-4689
 URL: https://issues.apache.org/jira/browse/SPARK-4689
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Chris Fregly
Priority: Minor
  Labels: starter

 Currently, you need to use unionAll() in Scala.  
 Python does not expose this functionality at the moment.
 The current work around is to use the UNION ALL HiveQL functionality detailed 
 here:  https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union
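 A minimal sketch of the override discussed in the comment thread above, with 
 simplified stand-in types; whether union should also de-duplicate (as the earlier 
 comment version asks) is left aside:
 {code}
 // Sketch only: make union on a SchemaRDD-like wrapper return the wrapper type
 // instead of a plain RDD. Types are simplified stand-ins, not Spark's classes.
 sealed trait LogicalPlanSketch
 case class TablePlan(name: String) extends LogicalPlanSketch
 case class UnionPlan(left: LogicalPlanSketch, right: LogicalPlanSketch) extends LogicalPlanSketch

 class SchemaRDDSketch(val logicalPlan: LogicalPlanSketch) {
   def unionAll(other: SchemaRDDSketch): SchemaRDDSketch =
     new SchemaRDDSketch(UnionPlan(logicalPlan, other.logicalPlan))

   // The proposal: union keeps the schema-aware type as well.
   def union(other: SchemaRDDSketch): SchemaRDDSketch = unionAll(other)
 }
 {code}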



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5198) Change executorId more unique on mesos fine-grained mode

2015-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273165#comment-14273165
 ] 

Apache Spark commented on SPARK-5198:
-

User 'jongyoul' has created a pull request for this issue:
https://github.com/apache/spark/pull/3994

 Change executorId more unique on mesos fine-grained mode
 

 Key: SPARK-5198
 URL: https://issues.apache.org/jira/browse/SPARK-5198
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Jongyoul Lee
 Fix For: 1.3.0, 1.2.1


 In fine-grained mode, the SchedulerBackend sets the executor name to the same 
 value as the slave id, regardless of the task id. This makes it hard to track a 
 specific job, because different tasks log into the same log file. This is the 
 same value used when launching a job in coarse-grained mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5197) Support external shuffle service in fine-grained mode on mesos cluster

2015-01-11 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-5197:

Description: 
I think dynamic allocation is almost satisfied by Mesos' fine-grained mode, 
which already offers resources dynamically and returns them automatically when 
a task finishes. It doesn't, however, have a mechanism to support an external 
shuffle service the way YARN does with its AuxiliaryService. Because Mesos 
doesn't support AuxiliaryService, we need to think of a different way to do this.

- Launching a shuffle service like a Spark job on the same cluster
-- Pros
--- Supports a multi-tenant environment
--- Works almost the same way as YARN
-- Cons
--- Need to control a long-running 'background' job - the service - while Mesos runs
--- Need to ensure every slave - or host - has one shuffle service running at all times
- Launching jobs within the shuffle service
-- Pros
--- Easy to implement
--- No need to check whether the shuffle service exists or not
-- Cons
--- Multiple shuffle services exist under a multi-tenant environment
--- Need to control the shuffle service port dynamically in a multi-user environment

In my opinion, the first one is the better idea for supporting an external 
shuffle service. Please leave comments.

  was:
I think dynamic allocation is almost satisfied by Mesos' fine-grained mode, 
which already offers resources dynamically and returns them automatically when 
a task finishes. We don't, however, have a mechanism to support an external 
shuffle service the way YARN does with its AuxiliaryService. Because Mesos 
doesn't support AuxiliaryService, we need to think of a different way to do this.

- Launching a shuffle service like a Spark job on the same cluster
-- Pros
--- Supports a multi-tenant environment
--- Works almost the same way as YARN
-- Cons
--- Need to control a long-running 'background' job - the service - while Mesos runs
--- Need to ensure every slave - or host - has one shuffle service running at all times
- Launching jobs within the shuffle service
-- Pros
--- Easy to implement
--- No need to check whether the shuffle service exists or not
-- Cons
--- Multiple shuffle services exist under a multi-tenant environment
--- Need to control the shuffle service port dynamically in a multi-user environment

In my opinion, the first one is the better idea for supporting an external 
shuffle service. Please leave comments.


 Support external shuffle service in fine-grained mode on mesos cluster
 --

 Key: SPARK-5197
 URL: https://issues.apache.org/jira/browse/SPARK-5197
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Mesos, Shuffle
Reporter: Jongyoul Lee

 I think dynamic allocation is almost satisfied by Mesos' fine-grained mode, 
 which already offers resources dynamically and returns them automatically when 
 a task finishes. It doesn't, however, have a mechanism to support an external 
 shuffle service the way YARN does with its AuxiliaryService. Because Mesos 
 doesn't support AuxiliaryService, we need to think of a different way to do this.
 - Launching a shuffle service like a Spark job on the same cluster
 -- Pros
 --- Supports a multi-tenant environment
 --- Works almost the same way as YARN
 -- Cons
 --- Need to control a long-running 'background' job - the service - while Mesos runs
 --- Need to ensure every slave - or host - has one shuffle service running at all times
 - Launching jobs within the shuffle service
 -- Pros
 --- Easy to implement
 --- No need to check whether the shuffle service exists or not
 -- Cons
 --- Multiple shuffle services exist under a multi-tenant environment
 --- Need to control the shuffle service port dynamically in a multi-user environment
 In my opinion, the first one is the better idea for supporting an external 
 shuffle service. Please leave comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode

2015-01-11 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-5198:

Description: In fine-grained mode, the SchedulerBackend sets the executor name 
to the slave id regardless of the task id. That makes it hard to track a 
specific job, because different jobs log into the same log file. The same value 
is used when launching a job in coarse-grained mode.  (was: In fine-grained 
mode, the SchedulerBackend sets the executor name to the slave id regardless of 
the task id. That makes it hard to track a specific job, because different jobs 
log into the same log file.)

 Change executorId more unique on mesos fine-grained mode
 

 Key: SPARK-5198
 URL: https://issues.apache.org/jira/browse/SPARK-5198
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Jongyoul Lee
 Fix For: 1.3.0, 1.2.1


 In fine-grained mode, the SchedulerBackend sets the executor name to the slave 
 id regardless of the task id. That makes it hard to track a specific job, 
 because different jobs log into the same log file. The same value is used when 
 launching a job in coarse-grained mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode

2015-01-11 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-5198:

Description: 
In fine-grained mode, the SchedulerBackend sets the executor name to the slave 
id regardless of the task id. That makes it hard to track a specific job, 
because different jobs log into the same log file. The same value is used when 
launching a job in coarse-grained mode.

!Screen Shot 2015-01-12 at 11.14.39 AM.png!
!

  was:
In fine-grained mode, the SchedulerBackend sets the executor name to the slave 
id regardless of the task id. That makes it hard to track a specific job, 
because different jobs log into the same log file. The same value is used when 
launching a job in coarse-grained mode.

[


 Change executorId more unique on mesos fine-grained mode
 

 Key: SPARK-5198
 URL: https://issues.apache.org/jira/browse/SPARK-5198
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Jongyoul Lee
 Fix For: 1.3.0, 1.2.1

 Attachments: Screen Shot 2015-01-12 at 11.14.39 AM.png, Screen Shot 
 2015-01-12 at 11.34.30 AM.png, Screen Shot 2015-01-12 at 11.34.41 AM.png


 In fine-grained mode, the SchedulerBackend sets the executor name to the slave 
 id regardless of the task id. That makes it hard to track a specific job, 
 because different jobs log into the same log file. The same value is used when 
 launching a job in coarse-grained mode.
 !Screen Shot 2015-01-12 at 11.14.39 AM.png!
 !



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode

2015-01-11 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-5198:

Description: 
In fine-grained mode, the SchedulerBackend sets the executor name to the slave 
id regardless of the task id. That makes it hard to track a specific job, 
because different jobs log into the same log file. The same value is used when 
launching a job in coarse-grained mode.

[

  was:In fine-grained mode, the SchedulerBackend sets the executor name to the 
slave id regardless of the task id. That makes it hard to track a specific job, 
because different jobs log into the same log file. The same value is used when 
launching a job in coarse-grained mode.


 Change executorId more unique on mesos fine-grained mode
 

 Key: SPARK-5198
 URL: https://issues.apache.org/jira/browse/SPARK-5198
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Jongyoul Lee
 Fix For: 1.3.0, 1.2.1

 Attachments: Screen Shot 2015-01-12 at 11.14.39 AM.png, Screen Shot 
 2015-01-12 at 11.34.30 AM.png, Screen Shot 2015-01-12 at 11.34.41 AM.png


 In fine-grained mode, the SchedulerBackend sets the executor name to the slave 
 id regardless of the task id. That makes it hard to track a specific job, 
 because different jobs log into the same log file. The same value is used when 
 launching a job in coarse-grained mode.
 [



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode

2015-01-11 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-5198:

Attachment: Screen Shot 2015-01-12 at 11.34.41 AM.png
Screen Shot 2015-01-12 at 11.34.30 AM.png
Screen Shot 2015-01-12 at 11.14.39 AM.png

Example screenshots

 Change executorId more unique on mesos fine-grained mode
 

 Key: SPARK-5198
 URL: https://issues.apache.org/jira/browse/SPARK-5198
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Jongyoul Lee
 Fix For: 1.3.0, 1.2.1

 Attachments: Screen Shot 2015-01-12 at 11.14.39 AM.png, Screen Shot 
 2015-01-12 at 11.34.30 AM.png, Screen Shot 2015-01-12 at 11.34.41 AM.png


 In fine-grained mode, the SchedulerBackend sets the executor name to the slave 
 id regardless of the task id. That makes it hard to track a specific job, 
 because different jobs log into the same log file. The same value is used when 
 launching a job in coarse-grained mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273187#comment-14273187
 ] 

Nicholas Chammas commented on SPARK-3821:
-

Updated launch stats:
* Launching cluster with 50 slaves in {{us-east-1}}.
* Stats for best of 3 runs.

{{branch-1.3}} @ 
[{{3a95101}}|https://github.com/mesos/spark-ec2/tree/3a95101c70e6892a8a48cc54094adaed1458487a]:
{code}
Cluster is now in 'ssh-ready' state. Waited 460 seconds.
[timing] rsync /root/spark-ec2:  00h 00m 07s
[timing] setup-slave:  00h 00m 28s
[timing] scala init:  00h 00m 11s
[timing] spark init:  00h 00m 07s
[timing] ephemeral-hdfs init:  00h 12m 40s
[timing] persistent-hdfs init:  00h 12m 35s
[timing] spark-standalone init:  00h 00m 00s
[timing] tachyon init:  00h 00m 08s
[timing] ganglia init:  00h 00m 53s
[timing] scala setup:  00h 03m 11s
[timing] spark setup:  00h 21m 20s
[timing] ephemeral-hdfs setup:  00h 00m 48s
[timing] persistent-hdfs setup:  00h 00m 43s
[timing] spark-standalone setup:  00h 01m 19s
[timing] tachyon setup:  00h 03m 06s
[timing] ganglia setup:  00h 00m 32s
{code}


{{packer}} @ 
[{{273c8c5}}|https://github.com/nchammas/spark-ec2/tree/273c8c518fbc6e86e0fb4410efbe77a4d4e4ff5b]:

{code}
Cluster is now in 'ssh-ready' state. Waited 292 seconds.
[timing] rsync /root/spark-ec2:  00h 00m 20s
[timing] setup-slave:  00h 00m 19s
[timing] scala init:  00h 00m 12s
[timing] spark init:  00h 00m 08s
[timing] ephemeral-hdfs init:  00h 12m 58s
[timing] persistent-hdfs init:  00h 12m 55s
[timing] spark-standalone init:  00h 00m 00s
[timing] tachyon init:  00h 00m 10s
[timing] ganglia init:  00h 00m 15s
[timing] scala setup:  00h 03m 19s
[timing] spark setup:  00h 20m 32s
[timing] ephemeral-hdfs setup:  00h 00m 34s
[timing] persistent-hdfs setup:  00h 00m 27s
[timing] spark-standalone setup:  00h 00m 47s
[timing] tachyon setup:  00h 03m 15s
[timing] ganglia setup:  00h 00m 23s
{code}

As you can see, with the exception of time-to-SSH-availability, things are 
mostly the same across the current and Packer-generated AMIs. I've proposed 
improvements to cut down the launch times of large clusters in [a separate 
issue|SPARK-5189].

[~shivaram] - At this point I think it's safe to say that the approach proposed 
here is straightforward and worth pursuing. All we need now is a review of [the 
scripts that install various 
stuff|https://github.com/nchammas/spark-ec2/blob/273c8c518fbc6e86e0fb4410efbe77a4d4e4ff5b/packer/spark-packer.json#L63-L66]
 (e.g. Ganglia, Python 2.7, etc.) on the AMI to make sure it all makes sense.

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1422) Add scripts for launching Spark on Google Compute Engine

2015-01-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273197#comment-14273197
 ] 

Nicholas Chammas commented on SPARK-1422:
-

[~pwendell] - I would consider doing this as well for the parent task, 
[SPARK-4399].

 Add scripts for launching Spark on Google Compute Engine
 

 Key: SPARK-1422
 URL: https://issues.apache.org/jira/browse/SPARK-1422
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Matei Zaharia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1422) Add scripts for launching Spark on Google Compute Engine

2015-01-11 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273178#comment-14273178
 ] 

Patrick Wendell commented on SPARK-1422:


Good call Nick - yeah, let's close this as being out of scope since it's being 
maintained elsewhere.

 Add scripts for launching Spark on Google Compute Engine
 

 Key: SPARK-1422
 URL: https://issues.apache.org/jira/browse/SPARK-1422
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Matei Zaharia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1422) Add scripts for launching Spark on Google Compute Engine

2015-01-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1422.

Resolution: Won't Fix

 Add scripts for launching Spark on Google Compute Engine
 

 Key: SPARK-1422
 URL: https://issues.apache.org/jira/browse/SPARK-1422
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Matei Zaharia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5199) Input metrics should show up for InputFormats that return CombineFileSplits

2015-01-11 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-5199:
-

 Summary: Input metrics should show up for InputFormats that return 
CombineFileSplits
 Key: SPARK-5199
 URL: https://issues.apache.org/jira/browse/SPARK-5199
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sandy Ryza
Assignee: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2621) Update task InputMetrics incrementally

2015-01-11 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273192#comment-14273192
 ] 

Sandy Ryza commented on SPARK-2621:
---

Definitely - just filed SPARK-5199 for this.

 Update task InputMetrics incrementally
 --

 Key: SPARK-2621
 URL: https://issues.apache.org/jira/browse/SPARK-2621
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4399) Support multiple cloud providers

2015-01-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4399.

Resolution: Won't Fix

We'll let the community take this one on.

 Support multiple cloud providers
 

 Key: SPARK-4399
 URL: https://issues.apache.org/jira/browse/SPARK-4399
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Affects Versions: 1.2.0
Reporter: Andrew Ash

 We currently have Spark startup scripts for Amazon EC2 but not for various 
 other cloud providers.  This ticket is an umbrella to support multiple cloud 
 providers in the bundled scripts, not just Amazon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5166) Stabilize Spark SQL APIs

2015-01-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5166:
---
Priority: Blocker  (was: Critical)

 Stabilize Spark SQL APIs
 

 Key: SPARK-5166
 URL: https://issues.apache.org/jira/browse/SPARK-5166
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker

 Before we take Spark SQL out of alpha, we need to audit the APIs and 
 stabilize them. 
 As a general rule, everything under org.apache.spark.sql.catalyst should not 
 be exposed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5197) Support external shuffle service in fine-grained mode on mesos cluster

2015-01-11 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-5197:

Fix Version/s: 1.3.0

 Support external shuffle service in fine-grained mode on mesos cluster
 --

 Key: SPARK-5197
 URL: https://issues.apache.org/jira/browse/SPARK-5197
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Mesos, Shuffle
Reporter: Jongyoul Lee
 Fix For: 1.3.0


 I think dynamic allocation is almost satisfied by Mesos' fine-grained mode, 
 which already offers resources dynamically and returns them automatically when 
 a task finishes. It doesn't, however, have a mechanism to support an external 
 shuffle service the way YARN does with its AuxiliaryService. Because Mesos 
 doesn't support AuxiliaryService, we need to think of a different way to do this.
 - Launching a shuffle service like a Spark job on the same cluster
 -- Pros
 --- Supports a multi-tenant environment
 --- Works almost the same way as YARN
 -- Cons
 --- Need to control a long-running 'background' job - the service - while Mesos runs
 --- Need to ensure every slave - or host - has one shuffle service running at all times
 - Launching jobs within the shuffle service
 -- Pros
 --- Easy to implement
 --- No need to check whether the shuffle service exists or not
 -- Cons
 --- Multiple shuffle services exist under a multi-tenant environment
 --- Need to control the shuffle service port dynamically in a multi-user environment
 In my opinion, the first one is the better idea for supporting an external 
 shuffle service. Please leave comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode

2015-01-11 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-5198:

Description: 
In fine-grained mode, the SchedulerBackend sets the executor name to the slave 
id regardless of the task id. That makes it hard to track a specific job, 
because different jobs log into the same log file. The same value is used when 
launching a job in coarse-grained mode.

!Screen Shot 2015-01-12 at 11.14.39 AM.png!
!Screen Shot 2015-01-12 at 11.34.30 AM.png!
!Screen Shot 2015-01-12 at 11.34.41 AM.png!

  was:
In fine-grained mode, the SchedulerBackend sets the executor name to the slave 
id regardless of the task id. That makes it hard to track a specific job, 
because different jobs log into the same log file. The same value is used when 
launching a job in coarse-grained mode.

!Screen Shot 2015-01-12 at 11.14.39 AM.png!
!


 Change executorId more unique on mesos fine-grained mode
 

 Key: SPARK-5198
 URL: https://issues.apache.org/jira/browse/SPARK-5198
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Jongyoul Lee
 Fix For: 1.3.0, 1.2.1

 Attachments: Screen Shot 2015-01-12 at 11.14.39 AM.png, Screen Shot 
 2015-01-12 at 11.34.30 AM.png, Screen Shot 2015-01-12 at 11.34.41 AM.png


 In fine-grained mode, the SchedulerBackend sets the executor name to the slave 
 id regardless of the task id. That makes it hard to track a specific job, 
 because different jobs log into the same log file. The same value is used when 
 launching a job in coarse-grained mode.
 !Screen Shot 2015-01-12 at 11.14.39 AM.png!
 !Screen Shot 2015-01-12 at 11.34.30 AM.png!
 !Screen Shot 2015-01-12 at 11.34.41 AM.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5198) Change executorId more unique on mesos fine-grained mode

2015-01-11 Thread Jongyoul Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273169#comment-14273169
 ] 

Jongyoul Lee edited comment on SPARK-5198 at 1/12/15 2:38 AM:
--

Uploaded example screenshots


was (Author: jongyoul):
Example screenshots

 Change executorId more unique on mesos fine-grained mode
 

 Key: SPARK-5198
 URL: https://issues.apache.org/jira/browse/SPARK-5198
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Jongyoul Lee
 Fix For: 1.3.0, 1.2.1

 Attachments: Screen Shot 2015-01-12 at 11.14.39 AM.png, Screen Shot 
 2015-01-12 at 11.34.30 AM.png, Screen Shot 2015-01-12 at 11.34.41 AM.png


 In fine-grained mode, the SchedulerBackend sets the executor name to the slave 
 id regardless of the task id. That makes it hard to track a specific job, 
 because different jobs log into the same log file. The same value is used when 
 launching a job in coarse-grained mode.
 !Screen Shot 2015-01-12 at 11.14.39 AM.png!
 !Screen Shot 2015-01-12 at 11.34.30 AM.png!
 !Screen Shot 2015-01-12 at 11.34.41 AM.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4951) A busy executor may be killed when dynamicAllocation is enabled

2015-01-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4951.

  Resolution: Fixed
   Fix Version/s: 1.2.1
  1.3.0
Target Version/s: 1.3.0, 1.2.1

 A busy executor may be killed when dynamicAllocation is enabled
 ---

 Key: SPARK-4951
 URL: https://issues.apache.org/jira/browse/SPARK-4951
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
 Fix For: 1.3.0, 1.2.1


 If a task runs longer than `spark.dynamicAllocation.executorIdleTimeout`, the 
 executor running this task will be killed.
 The following steps (yarn-client mode) can reproduce this bug:
 1. Start `spark-shell` using
 {code}
 ./bin/spark-shell --conf spark.shuffle.service.enabled=true \
 --conf spark.dynamicAllocation.minExecutors=1 \
 --conf spark.dynamicAllocation.maxExecutors=4 \
 --conf spark.dynamicAllocation.enabled=true \
 --conf spark.dynamicAllocation.executorIdleTimeout=30 \
 --master yarn-client \
 --driver-memory 512m \
 --executor-memory 512m \
 --executor-cores 1
 {code}
 2. Wait more than 30 seconds until there is only one executor.
 3. Run the following code (a task needs at least 50 seconds to finish)
 {code}
  val r = sc.parallelize(1 to 1000, 20).map { t => Thread.sleep(1000); t }.groupBy(_ % 2).collect()
 {code}
 4. Executors will be killed and allocated all the time, which makes the Job 
 fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly on mesos

2015-01-11 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-5088:

Fix Version/s: 1.2.1
   1.3.0

 Use spark-class for running executors directly on mesos
 ---

 Key: SPARK-5088
 URL: https://issues.apache.org/jira/browse/SPARK-5088
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Mesos
Affects Versions: 1.2.0
Reporter: Jongyoul Lee
Priority: Minor
 Fix For: 1.3.0, 1.2.1


 - sbin/spark-executor is only used for running executors in a Mesos environment.
 - spark-executor internally calls spark-class without any specific parameters.
 - The PYTHONPATH setup is moved into spark-class.
 - Remove the now-redundant file to ease code maintenance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5197) Support external shuffle service in fine-grained mode on mesos cluster

2015-01-11 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-5197:

Target Version/s: 1.3.0  (was: 1.3.0, 1.2.1)

 Support external shuffle service in fine-grained mode on mesos cluster
 --

 Key: SPARK-5197
 URL: https://issues.apache.org/jira/browse/SPARK-5197
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Mesos, Shuffle
Reporter: Jongyoul Lee
 Fix For: 1.3.0


 I think dynamic allocation is almost satisfied on mesos' fine-grained mode, 
 which already offers resources dynamically, and returns automatically when a 
 task is finished. It, however, doesn't have a mechanism on support external 
 shuffle service like yarn's way, which is AuxiliaryService. Because mesos 
 doesn't support AusiliaryService, we think a different way to do this.
 - Launching a shuffle service like a spark job on same cluster
 -- Pros
 --- Support multi-tenant environment
 --- Almost same way like yarn
 -- Cons
 --- Control long running 'background' job - service - when mesos runs
 --- Satisfy all slave - or host - to have one shuffle service all the time
 - Launching jobs within shuffle service
 -- Pros
 --- Easy to implement
 --- Don't consider whether shuffle service exists or not.
 -- Cons
 --- exists multiple shuffle services under multi-tenant environment
 --- Control shuffle service port dynamically on multi-user environment
 In my opinion, the first one is better idea to support external shuffle 
 service. Please leave comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5124) Standardize internal RPC interface

2015-01-11 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273246#comment-14273246
 ] 

Reynold Xin commented on SPARK-5124:


Thanks for the response.

1. Let's not rely on the fact that a local actor does not pass messages through 
a socket as the way to speed up local calls. Conceptually, there is no reason to 
tie the local actor implementation to RPC. DAGScheduler's actor used to be a 
simple queue + event loop (before it was turned into an actor for no good 
reason). We can restore it to that; a minimal sketch of that pattern is included 
below.

2. Have you thought about how the fate sharing stuff would work with 
alternative RPC implementations? 
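
For context on point 1, this is the "queue + event loop" pattern being referred 
to (an illustration only, not Spark's actual code): events are appended to a 
blocking queue and handled by a single dedicated thread, with no actor or RPC 
layer involved.

{code}
import java.util.concurrent.LinkedBlockingQueue

class SimpleEventLoop[E](name: String)(handle: E => Unit) {
  private val queue = new LinkedBlockingQueue[E]()
  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit =
      try {
        while (true) handle(queue.take())   // blocks until an event arrives
      } catch {
        case _: InterruptedException => ()  // stop() interrupts the thread
      }
  }
  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event)
  def stop(): Unit = thread.interrupt()
}
{code}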

 Standardize internal RPC interface
 --

 Key: SPARK-5124
 URL: https://issues.apache.org/jira/browse/SPARK-5124
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Shixiong Zhu
 Attachments: Pluggable RPC - draft 1.pdf


 In Spark we use Akka as the RPC layer. It would be great if we can 
 standardize the internal RPC interface to facilitate testing. This will also 
 provide the foundation to try other RPC implementations in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5200) Disable web UI in Hive Thriftserver tests

2015-01-11 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-5200:
-

 Summary: Disable web UI in Hive Thriftserver tests
 Key: SPARK-5200
 URL: https://issues.apache.org/jira/browse/SPARK-5200
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


In our unit tests, we should disable the Spark Web UI when starting the Hive 
Thriftserver, since port contention during this test has been a cause of test 
failures on Jenkins.
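
A sketch of the idea (an assumption about the shape of the fix, not the actual 
patch): set {{spark.ui.enabled}} to false in the test's SparkConf so parallel 
Jenkins runs don't contend for the UI port.

{code}
import org.apache.spark.SparkConf

// Disable the web UI for the Thriftserver started in tests.
val conf = new SparkConf().set("spark.ui.enabled", "false")
{code}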



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range

2015-01-11 Thread Ye Xianjin (JIRA)
Ye Xianjin created SPARK-5201:
-

 Summary: ParallelCollectionRDD.slice(seq, numSlices) has int 
overflow when dealing with inclusive range
 Key: SPARK-5201
 URL: https://issues.apache.org/jira/browse/SPARK-5201
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ye Xianjin
 Fix For: 1.2.1


{code}
 sc.makeRDD(1 to (Int.MaxValue)).count   // result = 0
 sc.makeRDD(1 to (Int.MaxValue - 1)).count   // result = 2147483646 = 
Int.MaxValue - 1
 sc.makeRDD(1 until (Int.MaxValue)).count// result = 2147483646 = 
Int.MaxValue - 1
{code}
More details on the discussion https://github.com/apache/spark/pull/2874
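
The likely mechanism (a sketch of the overflow itself; see the linked PR for the 
authoritative discussion): if the slice code converts the inclusive range to an 
exclusive one by adding 1 to its end, the end wraps around and the range becomes 
empty, which matches the count of 0 above.

{code}
val end = Int.MaxValue + 1   // wraps around to Int.MinValue
val r   = 1 until end        // empty range: the upper bound is now negative
println(r.isEmpty)           // true
{code}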



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range

2015-01-11 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273277#comment-14273277
 ] 

Ye Xianjin commented on SPARK-5201:
---

I will send a pr for this.

 ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing 
 with inclusive range
 --

 Key: SPARK-5201
 URL: https://issues.apache.org/jira/browse/SPARK-5201
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ye Xianjin
  Labels: rdd
 Fix For: 1.2.1

   Original Estimate: 2h
  Remaining Estimate: 2h

 {code}
  sc.makeRDD(1 to (Int.MaxValue)).count   // result = 0
  sc.makeRDD(1 to (Int.MaxValue - 1)).count   // result = 2147483646 = 
 Int.MaxValue - 1
  sc.makeRDD(1 until (Int.MaxValue)).count// result = 2147483646 = 
 Int.MaxValue - 1
 {code}
 More details on the discussion https://github.com/apache/spark/pull/2874



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5018) Make MultivariateGaussian public

2015-01-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5018.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3923
[https://github.com/apache/spark/pull/3923]

 Make MultivariateGaussian public
 

 Key: SPARK-5018
 URL: https://issues.apache.org/jira/browse/SPARK-5018
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Travis Galoppo
Priority: Critical
 Fix For: 1.3.0


 MultivariateGaussian is currently private[ml], but it would be a useful 
 public class.  This JIRA will require defining a good public API for 
 distributions.
 This JIRA will be needed for finalizing the GaussianMixtureModel API, which 
 should expose MultivariateGaussian instances instead of the means and 
 covariances.
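
 One possible public surface (a hypothetical sketch, not the merged API): expose 
 the distribution's parameters plus density evaluation.
 {code}
 import org.apache.spark.mllib.linalg.{Matrix, Vector}

 // Hypothetical trait sketching what a public API could expose.
 trait MultivariateGaussianLike {
   def mu: Vector              // mean vector
   def sigma: Matrix           // covariance matrix
   def pdf(x: Vector): Double
   def logpdf(x: Vector): Double
 }
 {code}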



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5202) HiveContext doesn't support the Variables Substitution

2015-01-11 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-5202:


 Summary: HiveContext doesn't support the Variables Substitution
 Key: SPARK-5202
 URL: https://issues.apache.org/jira/browse/SPARK-5202
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao


https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution

This is a blocking issue for CLI users, which will impact existing HQL 
scripts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5196) Add comment field in StructField

2015-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273251#comment-14273251
 ] 

Apache Spark commented on SPARK-5196:
-

User 'OopsOutOfMemory' has created a pull request for this issue:
https://github.com/apache/spark/pull/3999

 Add comment field in StructField
 

 Key: SPARK-5196
 URL: https://issues.apache.org/jira/browse/SPARK-5196
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: shengli
 Fix For: 1.3.0


 StructField should contain name, type, nullable, comment, etc.
 Add support for a comment field in StructField.
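
 A sketch of the requested shape only (the names here are hypothetical 
 stand-ins, not Spark SQL's actual types): the point is simply to carry an 
 optional comment alongside the existing fields.
 {code}
 // Hypothetical stand-in for StructField, parameterized on the data type.
 case class FieldWithComment[DataType](
     name: String,
     dataType: DataType,
     nullable: Boolean = true,
     comment: Option[String] = None)
 {code}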



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries

2015-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273271#comment-14273271
 ] 

Apache Spark commented on SPARK-4908:
-

User 'baishuo' has created a pull request for this issue:
https://github.com/apache/spark/pull/4001

 Spark SQL built for Hive 13 fails under concurrent metadata queries
 ---

 Key: SPARK-4908
 URL: https://issues.apache.org/jira/browse/SPARK-4908
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: David Ross
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.3.0, 1.2.1


 We are on trunk: {{1.3.0-SNAPSHOT}}, as of this commit: 
 https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6
 We are using Spark built for Hive 13, using this option:
 {{-Phive-0.13.1}}
 In single-threaded mode, normal operations look fine. However, under 
 concurrency, with at least 2 concurrent connections, metadata queries fail.
 For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} 
 statement when you pass a default schema in the JDBC URL, all fail.
 {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue.
 Here is some example code:
 {code}
 object main extends App {
   import java.sql._
   import scala.concurrent._
   import scala.concurrent.duration._
   import scala.concurrent.ExecutionContext.Implicits.global
   Class.forName("org.apache.hive.jdbc.HiveDriver")
   val host = "localhost" // update this
   val url = s"jdbc:hive2://${host}:10511/some_db" // update this
   val future = Future.traverse(1 to 3) { i =>
     Future {
       println("Starting: " + i)
       try {
         val conn = DriverManager.getConnection(url)
       } catch {
         case e: Throwable =>
           e.printStackTrace()
           println("Failed: " + i)
       }
       println("Finishing: " + i)
     }
   }
   Await.result(future, 2.minutes)
   println("done!")
 }
 {code}
 Here is the output:
 {code}
 Starting: 1
 Starting: 3
 Starting: 2
 java.sql.SQLException: 
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation 
 cancelled
   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
   at 
 org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:664)
   at java.sql.DriverManager.getConnection(DriverManager.java:270)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
   at 
 scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
   at 
 scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Failed: 3
 Finishing: 3
 java.sql.SQLException: 
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation 
 cancelled
   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
   at 
 org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:664)
   at java.sql.DriverManager.getConnection(DriverManager.java:270)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 

[jira] [Commented] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient

2015-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273224#comment-14273224
 ] 

Apache Spark commented on SPARK-5186:
-

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/3997

 Vector.equals  and Vector.hashCode are very inefficient
 ---

 Key: SPARK-5186
 URL: https://issues.apache.org/jira/browse/SPARK-5186
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Derrick Burns
   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 The implementations of Vector.equals and Vector.hashCode are correct but slow 
 for SparseVectors that are truly sparse.
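
 An illustrative sketch of a sparse-aware comparison (not the submitted patch, 
 and it assumes neither vector stores explicit zero entries): compare only the 
 active (index, value) pairs instead of materializing both vectors.
 {code}
 import org.apache.spark.mllib.linalg.{SparseVector, Vector}

 def sparseAwareEquals(a: Vector, b: Vector): Boolean = (a, b) match {
   case (s1: SparseVector, s2: SparseVector) =>
     s1.size == s2.size &&
       java.util.Arrays.equals(s1.indices, s2.indices) &&
       java.util.Arrays.equals(s1.values, s2.values)
   case _ =>
     a.size == b.size && java.util.Arrays.equals(a.toArray, b.toArray)
 }
 {code}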



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5200) Disable web UI in Hive Thriftserver tests

2015-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273247#comment-14273247
 ] 

Apache Spark commented on SPARK-5200:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3998

 Disable web UI in Hive Thriftserver tests
 -

 Key: SPARK-5200
 URL: https://issues.apache.org/jira/browse/SPARK-5200
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen
  Labels: flaky-test

 In our unit tests, we should disable the Spark Web UI when starting the Hive 
 Thriftserver, since port contention during this test has been a cause of test 
 failures on Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark

2015-01-11 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273225#comment-14273225
 ] 

Patrick Wendell commented on SPARK-3561:


So if the question is: Is Spark only API or is it an integrated API/execution 
engine... we've taken a fairly clear stance over the history of the project 
that it's an integrated engine. I.e. Spark is not something like Pig where it's 
intended primarily as a user API and we expect there to be different physical 
execution engines plugged in underneath.

In the past we haven't found this prevents Spark from working well in different 
environments. For instance, with Mesos, on YARN, etc. And for this we've 
integrated at different layers such as the storage layer and the scheduling 
layer, where there were well defined API's and integration points in the 
broader ecosystem. Compared with alternatives Spark is far more flexible in 
terms of runtime environments. The RDD API is so generic that it's very easy to 
customize and integrate.

For this reason, my feeling with decoupling execution from the rest of Spark is 
that it would tie our hands architecturally and not add much benefit. I don't 
see a good reason to make this broader change in the strategy of the project.

If there are specific improvements you see for making Spark work well on YARN, 
then we can definitely look at them.

 Allow for pluggable execution contexts in Spark
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal: 
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@Experimental) not exposed to end users of Spark. 
 The trait will define 6 operations: 
 * hadoopFile 
 * newAPIHadoopFile 
 * broadcast 
 * runJob 
 * persist
 * unpersist
 Each method directly maps to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL of the form 
 execution-context:foo.bar.MyJobExecutionContext, with a default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to such an implementation. 
 An integrator will now have the option to provide a custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 from DefaultExecutionContext. 
 Please see the attached design doc for more details. 
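
 A very rough sketch of such a trait, with simplified signatures (these 
 signatures are assumptions for illustration; the authoritative design is in the 
 attached document):
 {code}
 import org.apache.spark.SparkContext
 import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.rdd.RDD
 import org.apache.spark.storage.StorageLevel
 import scala.reflect.ClassTag

 trait JobExecutionContext {
   def hadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
   def newAPIHadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
   def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
   def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T], func: Iterator[T] => U): Array[U]
   def persist[T](rdd: RDD[T], level: StorageLevel): RDD[T]
   def unpersist[T](rdd: RDD[T]): RDD[T]
 }
 {code}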



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode

2015-01-11 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273244#comment-14273244
 ] 

Timothy Chen commented on SPARK-5095:
-

[~joshdevins] [~gmaas] indeed capping the cores is actually to fix 4940, and we 
can use that to address the number of executors.

I'm trying not to have just a set of configurations that can achieve both, 
otherwise it becomes a lot harder to maintain.

I'm working on the patch now and I'll add you both on github for review.

 Support launching multiple mesos executors in coarse grained mesos mode
 ---

 Key: SPARK-5095
 URL: https://issues.apache.org/jira/browse/SPARK-5095
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen

 Currently in coarse-grained Mesos mode, it's expected that we only launch one 
 Mesos executor that launches one JVM process to launch multiple Spark 
 executors.
 However, this becomes a problem when the launched JVM process is larger than 
 an ideal size (30gb is the recommended value from Databricks), which causes GC 
 problems reported on the mailing list.
 We should support launching multiple executors when large enough resources 
 are available for Spark to use, and these resources are still under the 
 configured limit.
 This is also applicable when users want to specify the number of executors to 
 be launched on each node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5202) HiveContext doesn't support the Variables Substitution

2015-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273293#comment-14273293
 ] 

Apache Spark commented on SPARK-5202:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/4003

 HiveContext doesn't support the Variables Substitution
 --

 Key: SPARK-5202
 URL: https://issues.apache.org/jira/browse/SPARK-5202
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution
 This is a blocking issue for CLI users, which will impact existing HQL 
 scripts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range

2015-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273289#comment-14273289
 ] 

Apache Spark commented on SPARK-5201:
-

User 'advancedxy' has created a pull request for this issue:
https://github.com/apache/spark/pull/4002

 ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing 
 with inclusive range
 --

 Key: SPARK-5201
 URL: https://issues.apache.org/jira/browse/SPARK-5201
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ye Xianjin
  Labels: rdd
 Fix For: 1.2.1

   Original Estimate: 2h
  Remaining Estimate: 2h

 {code}
  sc.makeRDD(1 to (Int.MaxValue)).count   // result = 0
  sc.makeRDD(1 to (Int.MaxValue - 1)).count   // result = 2147483646 = 
 Int.MaxValue - 1
  sc.makeRDD(1 until (Int.MaxValue)).count// result = 2147483646 = 
 Int.MaxValue - 1
 {code}
 More details on the discussion https://github.com/apache/spark/pull/2874



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5202) HiveContext doesn't support the Variables Substitution

2015-01-11 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-5202:
-
Description: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution

This is a blocking issue for CLI users; it impacts existing HQL scripts from 
Hive.

  was:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution

This is a blocking issue for CLI users, which will impact existing HQL 
scripts.


 HiveContext doesn't support the Variables Substitution
 --

 Key: SPARK-5202
 URL: https://issues.apache.org/jira/browse/SPARK-5202
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution
 This is a blocking issue for CLI users; it impacts existing HQL scripts 
 from Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org