[jira] [Updated] (SPARK-5167) Move Row into sql package and make it usable for Java
[ https://issues.apache.org/jira/browse/SPARK-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5167: --- Assignee: Reynold Xin Move Row into sql package and make it usable for Java - Key: SPARK-5167 URL: https://issues.apache.org/jira/browse/SPARK-5167 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin This will help us eliminate the duplicated Java code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3299) [SQL] Public API in SQLContext to list tables
[ https://issues.apache.org/jira/browse/SPARK-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3299: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-5166 [SQL] Public API in SQLContext to list tables - Key: SPARK-3299 URL: https://issues.apache.org/jira/browse/SPARK-3299 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.0.2 Reporter: Evan Chan Assignee: Bill Bejeck Priority: Minor Labels: newbie There is no public API in SQLContext to list the current tables. This would be pretty helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2096) Correctly parse dot notations for accessing an array of structs
[ https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2096: --- Target Version/s: 1.3.0 (was: 1.2.0) Correctly parse dot notations for accessing an array of structs --- Key: SPARK-2096 URL: https://issues.apache.org/jira/browse/SPARK-2096 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Yin Huai Priority: Minor Labels: starter Fix For: 1.2.0 For example, arrayOfStruct is an array of structs and every element of this array has a field called field1. arrayOfStruct[0].field1 means to access the value of field1 for the first element of arrayOfStruct, but the SQL parser (in sql-core) treats field1 as an alias. Also, arrayOfStruct.field1 means to access all values of field1 in this array of structs and return those values as an array. However, the SQL parser cannot resolve it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
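For illustration, the two access patterns described above would look roughly like this in a Spark SQL session. This is a sketch only: the table name "records" and the SparkContext `sc` in scope are assumptions for the example; arrayOfStruct and field1 are the names used in the issue description.
{code}
// Illustrative only: the two dot-notation access patterns described in SPARK-2096.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// field1 of the first struct in the array -- the sql-core parser currently
// treats "field1" here as an alias instead of a field access.
sqlContext.sql("SELECT arrayOfStruct[0].field1 FROM records")

// field1 of every struct in the array, returned as an array -- currently
// unresolved by the sql-core parser.
sqlContext.sql("SELECT arrayOfStruct.field1 FROM records")
{code}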
[jira] [Updated] (SPARK-5166) Stabilize Spark SQL APIs
[ https://issues.apache.org/jira/browse/SPARK-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5166: --- Assignee: Reynold Xin Stabilize Spark SQL APIs Key: SPARK-5166 URL: https://issues.apache.org/jira/browse/SPARK-5166 Project: Spark Issue Type: Task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Before we take Spark SQL out of alpha, we need to audit the APIs and stabilize them. As a general rule, everything under org.apache.spark.sql.catalyst should not be exposed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5166) Stabilize Spark SQL APIs
[ https://issues.apache.org/jira/browse/SPARK-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5166: --- Priority: Critical (was: Major) Stabilize Spark SQL APIs Key: SPARK-5166 URL: https://issues.apache.org/jira/browse/SPARK-5166 Project: Spark Issue Type: Task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Before we take Spark SQL out of alpha, we need to audit the APIs and stabilize them. As a general rule, everything under org.apache.spark.sql.catalyst should not be exposed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API
Reynold Xin created SPARK-5193: -- Summary: Make Spark SQL API usable in Java and remove the Java-specific API Key: SPARK-5193 URL: https://issues.apache.org/jira/browse/SPARK-5193 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Java version of the SchemaRDD API causes high maintenance burden for Spark SQL itself and downstream libraries (e.g. MLlib pipeline API needs to support both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it usable for Java, and then we can remove the Java specific version. Things to remove include (Java version of): - data type - Row - SQLContext - HiveContext Things to consider: - Scala and Java have a different collection library. - Scala and Java (8) have different closure interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
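As a rough illustration of the first consideration (different collection libraries), a single Scala-defined class can expose both Scala-friendly and Java-friendly accessors via JavaConverters. This is only a sketch of the kind of bridging the audit would involve; the hypothetical MiniRow below is a stand-in, not Spark's Row and not the API that was adopted.
{code}
// Sketch only: a Scala type that is also convenient to call from Java.
import scala.collection.JavaConverters._

case class MiniRow(values: Seq[Any]) {
  // Scala-friendly view
  def toSeq: Seq[Any] = values
  // Java-friendly view, returning a java.util.List wrapper without copying
  def toJavaList: java.util.List[Any] = values.asJava
}
{code}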
[jira] [Commented] (SPARK-4861) Refactor command in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272862#comment-14272862 ] wangfei commented on SPARK-4861: [~yhuai] Of course, if possible, but I have not found a way to remove it yet: in HiveCommandStrategy we need to distinguish between Hive metastore tables and temporary tables, so HiveCommandStrategy is still kept there for now. Any ideas? Refactor command in Spark SQL -- Key: SPARK-4861 URL: https://issues.apache.org/jira/browse/SPARK-4861 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Reporter: wangfei Fix For: 1.3.0 Fix a TODO in Spark SQL: remove ```Command``` and use ```RunnableCommand``` instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4508) Native Date type for SQL92 Date
[ https://issues.apache.org/jira/browse/SPARK-4508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4508: --- Target Version/s: 1.3.0 Native Date type for SQL92 Date --- Key: SPARK-4508 URL: https://issues.apache.org/jira/browse/SPARK-4508 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Store daysSinceEpoch as an Int(4 bytes), instead of using java.sql.Date(8 bytes as Long) in catalyst row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4508) build native date type to conform behavior to Hive
[ https://issues.apache.org/jira/browse/SPARK-4508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4508: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-5166 build native date type to conform behavior to Hive -- Key: SPARK-4508 URL: https://issues.apache.org/jira/browse/SPARK-4508 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Adrian Wang Store daysSinceEpoch as an Int(4 bytes), instead of using java.sql.Date(8 bytes as Long) in catalyst row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4508) Native Date type for SQL92 Date
[ https://issues.apache.org/jira/browse/SPARK-4508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4508: --- Assignee: Adrian Wang Native Date type for SQL92 Date --- Key: SPARK-4508 URL: https://issues.apache.org/jira/browse/SPARK-4508 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Store daysSinceEpoch as an Int(4 bytes), instead of using java.sql.Date(8 bytes as Long) in catalyst row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API
[ https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272860#comment-14272860 ] Reynold Xin commented on SPARK-5193: cc [~marmbrus] Make Spark SQL API usable in Java and remove the Java-specific API -- Key: SPARK-5193 URL: https://issues.apache.org/jira/browse/SPARK-5193 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Java version of the SchemaRDD API causes high maintenance burden for Spark SQL itself and downstream libraries (e.g. MLlib pipeline API needs to support both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it usable for Java, and then we can remove the Java specific version. Things to remove include (Java version of): - data type - Row - SQLContext - HiveContext Things to consider: - Scala and Java have a different collection library. - Scala and Java (8) have different closure interface. - Scala and Java can have duplicate definitions of common classes, such as BigDecimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API
[ https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5193: --- Description: Java version of the SchemaRDD API causes high maintenance burden for Spark SQL itself and downstream libraries (e.g. MLlib pipeline API needs to support both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it usable for Java, and then we can remove the Java specific version. Things to remove include (Java version of): - data type - Row - SQLContext - HiveContext Things to consider: - Scala and Java have a different collection library. - Scala and Java (8) have different closure interface. - Scala and Java can have duplicate definitions of common classes, such as BigDecimal. was: Java version of the SchemaRDD API causes high maintenance burden for Spark SQL itself and downstream libraries (e.g. MLlib pipeline API needs to support both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it usable for Java, and then we can remove the Java specific version. Things to remove include (Java version of): - data type - Row - SQLContext - HiveContext Things to consider: - Scala and Java have a different collection library. - Scala and Java (8) have different closure interface. Make Spark SQL API usable in Java and remove the Java-specific API -- Key: SPARK-5193 URL: https://issues.apache.org/jira/browse/SPARK-5193 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Java version of the SchemaRDD API causes high maintenance burden for Spark SQL itself and downstream libraries (e.g. MLlib pipeline API needs to support both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it usable for Java, and then we can remove the Java specific version. Things to remove include (Java version of): - data type - Row - SQLContext - HiveContext Things to consider: - Scala and Java have a different collection library. - Scala and Java (8) have different closure interface. - Scala and Java can have duplicate definitions of common classes, such as BigDecimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5194) ADD JAR doesn't update classpath until reconnect
Oleg Danilov created SPARK-5194: --- Summary: ADD JAR doesn't update classpath until reconnect Key: SPARK-5194 URL: https://issues.apache.org/jira/browse/SPARK-5194 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Oleg Danilov Steps to reproduce:
beeline
!connect jdbc:hive2://vmhost-vm0:1
0: jdbc:hive2://vmhost-vm0:1 add jar ./target/nexr-hive-udf-0.2-SNAPSHOT.jar
0: jdbc:hive2://vmhost-vm0:1 CREATE TEMPORARY FUNCTION nvl AS 'com.nexr.platform.hive.udf.GenericUDFNVL';
0: jdbc:hive2://vmhost-vm0:1 select nvl(imsi,'test') from ps_cei_index_1_week limit 1;
Error: java.lang.ClassNotFoundException: com.nexr.platform.hive.udf.GenericUDFNVL (state=,code=0)
0: jdbc:hive2://vmhost-vm0:1 !reconnect
Reconnecting to jdbc:hive2://vmhost-vm0:1...
Closing: org.apache.hive.jdbc.HiveConnection@3f18dc75: {1}
Connected to: Spark SQL (version 1.2.0)
Driver: null (version null)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://vmhost-vm0:1 select nvl(imsi,'test') from ps_cei_index_1_week limit 1;
+------+
| _c0  |
+------+
| -1   |
+------+
1 row selected (1.605 seconds)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5195) when a Hive table is queried with an alias, the cached data loses effectiveness
yixiaohua created SPARK-5195: Summary: when a Hive table is queried with an alias, the cached data loses effectiveness Key: SPARK-5195 URL: https://issues.apache.org/jira/browse/SPARK-5195 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: yixiaohua Override MetastoreRelation's sameResult method to compare only the database name and table name. Previously, after cache table t1; select count(*) from t1; reads the data from memory, but the query below does not and instead reads from HDFS: select count(*) from t1 t; The cached data is keyed by the logical plan and looked up with sameResult, so when the table is referenced with an alias its logical plan is not the same as the logical plan without the alias. Hence the proposal to modify sameResult to compare only the database name and table name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
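A minimal sketch of that proposal, written as it might appear inside MetastoreRelation; the databaseName and tableName field names are assumptions for illustration, and the actual change is in the pull request linked in the next message.
{code}
// Sketch only: treat two MetastoreRelations as producing the same result when
// they point at the same Hive table, so a table alias in the query no longer
// defeats the cache lookup.
override def sameResult(plan: LogicalPlan): Boolean = plan match {
  case other: MetastoreRelation =>
    other.databaseName == databaseName && other.tableName == tableName
  case _ => false
}
{code}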
[jira] [Commented] (SPARK-5195) when a Hive table is queried with an alias, the cached data loses effectiveness
[ https://issues.apache.org/jira/browse/SPARK-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272934#comment-14272934 ] Apache Spark commented on SPARK-5195: - User 'seayi' has created a pull request for this issue: https://github.com/apache/spark/pull/3898 when a Hive table is queried with an alias, the cached data loses effectiveness Key: SPARK-5195 URL: https://issues.apache.org/jira/browse/SPARK-5195 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: yixiaohua Override MetastoreRelation's sameResult method to compare only the database name and table name. Previously, after cache table t1; select count(*) from t1; reads the data from memory, but the query below does not and instead reads from HDFS: select count(*) from t1 t; The cached data is keyed by the logical plan and looked up with sameResult, so when the table is referenced with an alias its logical plan is not the same as the logical plan without the alias. Hence the proposal to modify sameResult to compare only the database name and table name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5192) Parquet fails to parse schema contains '\r'
[ https://issues.apache.org/jira/browse/SPARK-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cen yuhai updated SPARK-5192: - Summary: Parquet fails to parse schema contains '\r' (was: Parquet fails to parse schemas contains '\r') Parquet fails to parse schema contains '\r' --- Key: SPARK-5192 URL: https://issues.apache.org/jira/browse/SPARK-5192 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: windows7 + Intellij idea 13.0.2 Reporter: cen yuhai Priority: Critical Fix For: 1.3.0 I think this is actually a bug in parquet. When I debugged 'ParquetTestData', I found the exception below. So I downloaded the source of MessageTypeParser; the function 'isWhitespace' does not check for '\r':
private boolean isWhitespace(String t) { return t.equals(" ") || t.equals("\t") || t.equals("\n"); }
So I replaced all '\r' to work around this issue:
val subTestSchema =
  """
  message myrecord {
    optional boolean myboolean;
    optional int64 mylong;
  }
  """.replaceAll("\r", "")
at line 0: message myrecord {
at parquet.schema.MessageTypeParser.asRepetition(MessageTypeParser.java:203)
at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:101)
at parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:96)
at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:89)
at parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:79)
at org.apache.spark.sql.parquet.ParquetTestData$.writeFile(ParquetTestData.scala:221)
at org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:92)
at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
at org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:85)
at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
at org.apache.spark.sql.parquet.ParquetQuerySuite.run(ParquetQuerySuite.scala:85)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
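A minimal sketch of that workaround in isolation, assuming parquet's MessageTypeParser is on the classpath (as it is for spark-sql at this version); the schema text is the one quoted above:
{code}
import parquet.schema.MessageTypeParser

// Strip carriage returns so the parser only sees the whitespace characters it
// actually recognizes (' ', '\t', '\n'); on Windows the source file's line
// endings would otherwise leak '\r' into the schema string.
val rawSchema =
  """message myrecord {
    |  optional boolean myboolean;
    |  optional int64 mylong;
    |}""".stripMargin
val schema = MessageTypeParser.parseMessageType(rawSchema.replaceAll("\r", ""))
{code}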
[jira] [Created] (SPARK-5196) Add comment field in StructField
shengli created SPARK-5196: -- Summary: Add comment field in StructField Key: SPARK-5196 URL: https://issues.apache.org/jira/browse/SPARK-5196 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: shengli Fix For: 1.3.0 StructField should contain name, type, nullable, comment, etc. Add support for a comment field in StructField. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
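One possible shape for such a change, sketched under the assumption of the org.apache.spark.sql.types package layout in 1.3; the field order, defaults, and any extra fields of the real StructField may differ, and this is not necessarily the signature that was merged.
{code}
import org.apache.spark.sql.types.DataType

// Sketch: add an optional comment that defaults to None so existing call
// sites keep compiling. Named MyStructField to make clear it is only an
// illustration, not Spark's actual StructField definition.
case class MyStructField(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    comment: Option[String] = None)
{code}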
[jira] [Commented] (SPARK-5196) Add comment field in StructField
[ https://issues.apache.org/jira/browse/SPARK-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272937#comment-14272937 ] Apache Spark commented on SPARK-5196: - User 'OopsOutOfMemory' has created a pull request for this issue: https://github.com/apache/spark/pull/3991 Add comment field in StructField Key: SPARK-5196 URL: https://issues.apache.org/jira/browse/SPARK-5196 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: shengli Fix For: 1.3.0 StructField should contain name, type, nullable, comment, etc. Add support for a comment field in StructField. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5162) Python yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272943#comment-14272943 ] Lianhui Wang commented on SPARK-5162: - [~dklassen] I have submitted a PR for this issue: https://github.com/apache/spark/pull/3976, so I think you can try it. If there are any questions or suggestions, please tell me. Python yarn-cluster mode Key: SPARK-5162 URL: https://issues.apache.org/jira/browse/SPARK-5162 Project: Spark Issue Type: New Feature Components: PySpark, YARN Reporter: Dana Klassen Labels: cluster, python, yarn Running pyspark in yarn is currently limited to 'yarn-client' mode. It would be great to be able to submit python applications to the cluster and (just like java classes) have the resource manager set up an AM on any node in the cluster. Does anyone know the issues blocking this feature? I was snooping around with enabling python apps: Removing the logic stopping python and yarn-cluster from SparkSubmit.scala ...
// The following modes are not supported or applicable
(clusterManager, deployMode) match {
  ...
  case (_, CLUSTER) if args.isPython =>
    printErrorAndExit("Cluster deploy mode is currently not supported for python applications.")
  ...
}
... and submitting the application via:
HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster --num-executors 2 --py-files {{insert location of egg here}} --executor-cores 1 ../tools/canary.py
Everything looks to run alright, PythonRunner is picked up as the main class, resources get set up, the yarn client gets launched but falls flat on its face:
2015-01-08 18:48:03,444 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 1420742868009, FILE, null }, Resource {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed on src filesystem (expected 1420742868009, was 1420742869284
and
2015-01-08 18:48:03,446 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(-/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py) transitioned from DOWNLOADING to FAILED
Tracked this down to the Apache Hadoop code (FSDownload.java line 249) related to container localization of files upon downloading. At this point thought it would be best to raise the issue here and get input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5172) spark-examples-***.jar shades a wrong Hadoop distribution
[ https://issues.apache.org/jira/browse/SPARK-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273041#comment-14273041 ] Apache Spark commented on SPARK-5172: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/3992 spark-examples-***.jar shades a wrong Hadoop distribution - Key: SPARK-5172 URL: https://issues.apache.org/jira/browse/SPARK-5172 Project: Spark Issue Type: Bug Components: Build Reporter: Shixiong Zhu Priority: Minor Steps to check it:
1. Download spark-1.2.0-bin-hadoop2.4.tgz from http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz
2. Unzip `spark-examples-1.2.0-hadoop2.4.0.jar`.
3. There is a file called `org/apache/hadoop/package-info.class` in the jar. It doesn't exist in hadoop 2.4.
4. Run javap -classpath . -private -c -v org.apache.hadoop.package-info
{code}
Compiled from "package-info.java"
interface org.apache.hadoop.package-info
  SourceFile: "package-info.java"
  RuntimeVisibleAnnotations: length = 0x24
   00 01 00 06 00 06 00 07 73 00 08 00 09 73 00 0A 00 0B 73 00 0C 00 0D 73 00 0E 00 0F 73 00 10 00 11 73 00 12
  minor version: 0
  major version: 50
  Constant pool:
  const #1 = Asciz        org/apache/hadoop/package-info;
  const #2 = class        #1;     // org/apache/hadoop/package-info
  const #3 = Asciz        java/lang/Object;
  const #4 = class        #3;     // java/lang/Object
  const #5 = Asciz        package-info.java;
  const #6 = Asciz        Lorg/apache/hadoop/HadoopVersionAnnotation;;
  const #7 = Asciz        version;
  const #8 = Asciz        1.2.1;
  const #9 = Asciz        revision;
  const #10 = Asciz       1503152;
  const #11 = Asciz       user;
  const #12 = Asciz       mattf;
  const #13 = Asciz       date;
  const #14 = Asciz       Wed Jul 24 13:39:35 PDT 2013;
  const #15 = Asciz       url;
  const #16 = Asciz       https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2;
  const #17 = Asciz       srcChecksum;
  const #18 = Asciz       6923c86528809c4e7e6f493b6b413a9a;
  const #19 = Asciz       SourceFile;
  const #20 = Asciz       RuntimeVisibleAnnotations;

{
}
{code}
The version is {{1.2.1}}. This comes from wrong HBase version settings in the examples project.
Here is a part of the dependency tree when running mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -pl examples dependency:tree
{noformat}
[INFO] +- org.apache.hbase:hbase-testing-util:jar:0.98.7-hadoop1:compile
[INFO] |  +- org.apache.hbase:hbase-common:test-jar:tests:0.98.7-hadoop1:compile
[INFO] |  +- org.apache.hbase:hbase-server:test-jar:tests:0.98.7-hadoop1:compile
[INFO] |  |  +- com.sun.jersey:jersey-core:jar:1.8:compile
[INFO] |  |  +- com.sun.jersey:jersey-json:jar:1.8:compile
[INFO] |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
[INFO] |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.7.1:compile
[INFO] |  |  \- com.sun.jersey:jersey-server:jar:1.8:compile
[INFO] |  |     \- asm:asm:jar:3.3.1:test
[INFO] |  +- org.apache.hbase:hbase-hadoop1-compat:jar:0.98.7-hadoop1:compile
[INFO] |  +- org.apache.hbase:hbase-hadoop1-compat:test-jar:tests:0.98.7-hadoop1:compile
[INFO] |  +- org.apache.hadoop:hadoop-core:jar:1.2.1:compile
[INFO] |  |  +- xmlenc:xmlenc:jar:0.52:compile
[INFO] |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO] |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
[INFO] |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO] |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO] |  |  \- commons-el:commons-el:jar:1.0:compile
[INFO] |  +- org.apache.hadoop:hadoop-test:jar:1.2.1:compile
[INFO] |  |  +- org.apache.ftpserver:ftplet-api:jar:1.0.0:compile
[INFO] |  |  +- org.apache.mina:mina-core:jar:2.0.0-M5:compile
[INFO] |  |  +- org.apache.ftpserver:ftpserver-core:jar:1.0.0:compile
[INFO] |  |  \- org.apache.ftpserver:ftpserver-deprecated:jar:1.0.0-M2:compile
[INFO] |  +- com.github.stephenc.findbugs:findbugs-annotations:jar:1.3.9-1:compile
[INFO] |  \- junit:junit:jar:4.10:test
[INFO] |     \- org.hamcrest:hamcrest-core:jar:1.1:test
{noformat}
If I run `mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -pl examples -am dependency:tree -Dhbase.profile=hadoop2`, the dependency tree is right. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5008) Persistent HDFS does not recognize EBS Volumes
[ https://issues.apache.org/jira/browse/SPARK-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273007#comment-14273007 ] Nicholas Chammas commented on SPARK-5008: - Use [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/v4/copy-dir.sh], which is installed by default, from the master. Persistent HDFS does not recognize EBS Volumes -- Key: SPARK-5008 URL: https://issues.apache.org/jira/browse/SPARK-5008 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.0 Environment: 8 Node Cluster Generated from 1.2.0 spark-ec2 script. -m c3.2xlarge -t c3.8xlarge --ebs-vol-size 300 --ebs-vol-type gp2 --ebs-vol-num 1 Reporter: Brad Willard The cluster is built with correctly sized EBS volumes. It creates the volume at /dev/xvds, and it is mounted at /vol0. However, when you start persistent HDFS with the start-all script, it starts but isn't correctly configured to use the EBS volume. I'm assuming some symlinks or expected mounts are not correctly configured. This worked flawlessly on all previous versions of Spark. I have a crude workaround of installing pssh and mucking with it by mounting it to /vol, which worked; however, it doesn't work between restarts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5008) Persistent HDFS does not recognize EBS Volumes
[ https://issues.apache.org/jira/browse/SPARK-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272991#comment-14272991 ] Brad Willard commented on SPARK-5008: - [~nchammas] I can try that once I get back into the office, probably by Wednesday. Once I update the core-site.xml, what's the correct way to sync it to all the slaves? Persistent HDFS does not recognize EBS Volumes -- Key: SPARK-5008 URL: https://issues.apache.org/jira/browse/SPARK-5008 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.0 Environment: 8 Node Cluster Generated from 1.2.0 spark-ec2 script. -m c3.2xlarge -t c3.8xlarge --ebs-vol-size 300 --ebs-vol-type gp2 --ebs-vol-num 1 Reporter: Brad Willard The cluster is built with correctly sized EBS volumes. It creates the volume at /dev/xvds, and it is mounted at /vol0. However, when you start persistent HDFS with the start-all script, it starts but isn't correctly configured to use the EBS volume. I'm assuming some symlinks or expected mounts are not correctly configured. This worked flawlessly on all previous versions of Spark. I have a crude workaround of installing pssh and mucking with it by mounting it to /vol, which worked; however, it doesn't work between restarts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4159) Maven build doesn't run JUnit test suites
[ https://issues.apache.org/jira/browse/SPARK-4159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273062#comment-14273062 ] Apache Spark commented on SPARK-4159: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/3993 Maven build doesn't run JUnit test suites - Key: SPARK-4159 URL: https://issues.apache.org/jira/browse/SPARK-4159 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Sean Owen Priority: Critical Labels: backport-needed Fix For: 1.3.0 It turns out our Maven build isn't running any Java test suites, and likely never has. After some fishing I believe the following is the issue. We use scalatest [1] in our Maven build which, by default, can't automatically detect JUnit tests. Scalatest will allow you to enumerate a list of suites via JUnitClasses, but I can't find a way for it to auto-detect all JUnit tests. It turns out this works in SBT because of our use of the junit-interface [2], which does this for you. An okay fix for this might be to simply enable the normal (surefire) Maven tests in addition to our scalatest in the Maven build. The only thing to watch out for is that they don't overlap in some way. We'd also have to copy over environment variables, memory settings, etc. to that plugin. [1] http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin [2] https://github.com/sbt/junit-interface -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4033) Integer overflow when SparkPi is called with more than 25000 slices
[ https://issues.apache.org/jira/browse/SPARK-4033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4033. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: SaintBacchus Target Version/s: 1.3.0 Integer overflow when SparkPi is called with more than 25000 slices --- Key: SPARK-4033 URL: https://issues.apache.org/jira/browse/SPARK-4033 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.2.0 Reporter: SaintBacchus Assignee: SaintBacchus Fix For: 1.3.0 If the SparkPi slices argument is larger than 25000, the integer 'n' inside the code overflows and may become a negative number. That makes the (0 until n) Seq empty, so the 'reduce' action throws UnsupportedOperationException (empty collection). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
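The overflow is easy to see in isolation. The sketch below assumes the example multiplies the slice count by 100000, as the stock SparkPi does; the Long-based guard at the end is one possible fix, not necessarily the one that was merged.
{code}
// With enough slices, 100000 * slices exceeds Int.MaxValue and wraps negative.
val slices = 25000
val n = 100000 * slices
println(n)                    // -1794967296: wrapped around Int.MaxValue
println((0 until n).isEmpty)  // true, so .reduce(_ + _) would throw

// One way to guard: do the multiplication in Long and cap at Int.MaxValue.
val safeN = math.min(100000L * slices, Int.MaxValue).toInt
println((0 until safeN).isEmpty) // false
{code}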
[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5198: Component/s: Mesos Change executorId more unique on mesos fine-grained mode Key: SPARK-5198 URL: https://issues.apache.org/jira/browse/SPARK-5198 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Jongyoul Lee Fix For: 1.3.0, 1.2.1 In fine-grained mode, the SchedulerBackend sets the executor ID to the slave ID regardless of the task ID. This makes it hard to track a specific job because different jobs are logged into the same log file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5198: Fix Version/s: 1.2.1 1.3.0 Change executorId more unique on mesos fine-grained mode Key: SPARK-5198 URL: https://issues.apache.org/jira/browse/SPARK-5198 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Jongyoul Lee Fix For: 1.3.0, 1.2.1 In fine-grained mode, the SchedulerBackend sets the executor ID to the slave ID regardless of the task ID. This makes it hard to track a specific job because different jobs are logged into the same log file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5198) Change executorId more unique on mesos fine-grained mode
Jongyoul Lee created SPARK-5198: --- Summary: Change executorId more unique on mesos fine-grained mode Key: SPARK-5198 URL: https://issues.apache.org/jira/browse/SPARK-5198 Project: Spark Issue Type: Improvement Reporter: Jongyoul Lee In fine-grained mode, the SchedulerBackend sets the executor ID to the slave ID regardless of the task ID. This makes it hard to track a specific job because different jobs are logged into the same log file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-4296: Target Version/s: 1.3.0, 1.2.1 (was: 1.2.0) Affects Version/s: 1.1.1 1.2.0 Fix Version/s: (was: 1.2.0) Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0 Reporter: Shixiong Zhu Assignee: Cheng Lian Priority: Blocker When the input data has a complex structure, using same expression in group by clause and select clause will throw Expression not in GROUP BY.
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)
val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")
val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
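As an illustration of the kind of mechanism the last sentence asks for, the comparison could be done on alias-stripped expressions. This is purely a sketch against catalyst's Alias node, not the actual fix that was committed.
{code}
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression}

// Unwrap every Alias (including nested ones, such as the one inside
// Upper(birthday#1.date AS date#9)) before testing two expressions for
// equality, so aliased and non-aliased forms of the same expression match.
def stripAliases(e: Expression): Expression =
  e.transform { case Alias(child, _) => child }

def sameIgnoringAliases(a: Expression, b: Expression): Boolean =
  stripAliases(a) == stripAliases(b)
{code}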
[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273196#comment-14273196 ] Yin Huai commented on SPARK-4296: - I was wondering if we can also find this issue at other places. Maybe we can resolve this issue thoroughly. Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0 Reporter: Shixiong Zhu Assignee: Cheng Lian Priority: Blocker When the input data has a complex structure, using same expression in group by clause and select clause will throw Expression not in GROUP BY.
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)
val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")
val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4924: -- Assignee: Marcelo Vanzin Factor out code to launch Spark applications into a separate library Key: SPARK-4924 URL: https://issues.apache.org/jira/browse/SPARK-4924 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Attachments: spark-launcher.txt One of the questions we run into rather commonly is "how to start a Spark application from my Java/Scala program?". There currently isn't a good answer to that: - Instantiating SparkContext has limitations (e.g., you can only have one active context at the moment, plus you lose the ability to submit apps in cluster mode) - Calling SparkSubmit directly is doable but you lose a lot of the logic handled by the shell scripts - Calling the shell script directly is doable, but sort of ugly from an API point of view. I think it would be nice to have a small library that handles that for users. On top of that, this library could be used by Spark itself to replace a lot of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5197) Support external shuffle service in fine-grained mode on mesos cluster
[ https://issues.apache.org/jira/browse/SPARK-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273137#comment-14273137 ] Jongyoul Lee commented on SPARK-5197: - Please assign it to me. [~andrewor14] [~adav] Please review my description. Support external shuffle service in fine-grained mode on mesos cluster -- Key: SPARK-5197 URL: https://issues.apache.org/jira/browse/SPARK-5197 Project: Spark Issue Type: Improvement Components: Deploy, Mesos, Shuffle Reporter: Jongyoul Lee I think dynamic allocation is almost satisfied in Mesos' fine-grained mode, which already offers resources dynamically and returns them automatically when a task is finished. It doesn't, however, have a mechanism to support an external shuffle service the way YARN does with its AuxiliaryService. Because Mesos doesn't support an AuxiliaryService, we need to think of a different way to do this.
- Launching a shuffle service like a Spark job on the same cluster
-- Pros
--- Supports a multi-tenant environment
--- Almost the same way as YARN
-- Cons
--- Need to control a long-running 'background' job - the service - while Mesos runs
--- Need every slave - or host - to have one shuffle service running all the time
- Launching jobs within the shuffle service
-- Pros
--- Easy to implement
--- No need to consider whether a shuffle service exists or not
-- Cons
--- Multiple shuffle services exist in a multi-tenant environment
--- Need to control the shuffle service port dynamically in a multi-user environment
In my opinion, the first one is the better idea to support an external shuffle service. Please leave comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273194#comment-14273194 ] Yin Huai commented on SPARK-4296: - [~lian cheng] Seems this issue is similar to [this one|https://issues.apache.org/jira/browse/SPARK-2063?focusedCommentId=14055193page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14055193]. The main problem is that we use the last part of a reference of a field in a struct as the alias. Is it possible that we can fix that one as well? Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Cheng Lian Priority: Blocker Fix For: 1.2.0 When the input data has a complex structure, using same expression in group by clause and select clause will throw Expression not in GROUP BY.
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)
val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")
val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3340) Deprecate ADD_JARS and ADD_FILES
[ https://issues.apache.org/jira/browse/SPARK-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3340: --- Labels: starter (was: ) Deprecate ADD_JARS and ADD_FILES Key: SPARK-3340 URL: https://issues.apache.org/jira/browse/SPARK-3340 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Labels: starter These were introduced before Spark submit even existed. Now that there are many better ways of setting jars and python files through Spark submit, we should deprecate these environment variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3450) Enable specifying the --jars CLI option multiple times
[ https://issues.apache.org/jira/browse/SPARK-3450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3450. Resolution: Won't Fix I'd prefer not to do this one; it complicates our parsing substantially. It's possible to just write a bash loop that creates a single long list of jars. Enable specifying the --jars CLI option multiple times --- Key: SPARK-3450 URL: https://issues.apache.org/jira/browse/SPARK-3450 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.0.2 Reporter: wolfgang hoschek spark-submit should support specifying the --jars option multiple times, e.g. --jars foo.jar,bar.jar --jars baz.jar,oops.jar should be equivalent to --jars foo.jar,bar.jar,baz.jar,oops.jar This would allow using wrapper scripts that simplify usage for enterprise customers along the following lines:
{code}
# my-spark-submit.sh:
jars=""
for i in /opt/myapp/*.jar; do
  # append a comma before every entry after the first
  if [ ${#jars} -gt 0 ]
  then
    jars="$jars,"
  fi
  jars="$jars$i"
done
spark-submit --jars "$jars" "$@"
{code}
Example usage:
{code}
my-spark-submit.sh --jars myUserDefinedFunction.jar
{code}
The relevant enhancement code might go into SparkSubmitArguments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5073) spark.storage.memoryMapThreshold has two default values
[ https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-5073. --- Resolution: Fixed spark.storage.memoryMapThreshold has two default values - Key: SPARK-5073 URL: https://issues.apache.org/jira/browse/SPARK-5073 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Jianhui Yuan Priority: Minor In org.apache.spark.storage.DiskStore:
val minMemoryMapBytes = blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L)
In org.apache.spark.network.util.TransportConf:
public int memoryMapBytes() { return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * 1024); }
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
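One way to keep the two call sites from drifting again is a single shared constant. This is only a sketch: the object name, its location, and the 2 MB value (the larger of the two defaults quoted above) are assumptions for illustration, not the fix that was actually applied.
{code}
import org.apache.spark.SparkConf

// Hypothetical single source of truth for the default threshold.
object StorageDefaults {
  val MemoryMapThresholdBytes: Long = 2L * 1024 * 1024
}

// Both DiskStore and TransportConf would then read the same default.
val conf = new SparkConf()
val threshold = conf.getLong("spark.storage.memoryMapThreshold",
  StorageDefaults.MemoryMapThresholdBytes)
{code}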
[jira] [Created] (SPARK-5197) Support external shuffle service in fine-grained mode on mesos cluster
Jongyoul Lee created SPARK-5197: --- Summary: Support external shuffle service in fine-grained mode on mesos cluster Key: SPARK-5197 URL: https://issues.apache.org/jira/browse/SPARK-5197 Project: Spark Issue Type: Improvement Components: Deploy, Mesos, Shuffle Reporter: Jongyoul Lee I think dynamic allocation is almost satisfied in Mesos' fine-grained mode, which already offers resources dynamically and returns them automatically when a task is finished. We don't, however, have a mechanism to support an external shuffle service the way YARN does with its AuxiliaryService. Because Mesos doesn't support an AuxiliaryService, we need to think of a different way to do this.
- Launching a shuffle service like a Spark job on the same cluster
-- Pros
--- Supports a multi-tenant environment
--- Almost the same way as YARN
-- Cons
--- Need to control a long-running 'background' job - the service - while Mesos runs
--- Need every slave - or host - to have one shuffle service running all the time
- Launching jobs within the shuffle service
-- Pros
--- Easy to implement
--- No need to consider whether a shuffle service exists or not
-- Cons
--- Multiple shuffle services exist in a multi-tenant environment
--- Need to control the shuffle service port dynamically in a multi-user environment
In my opinion, the first one is the better idea to support an external shuffle service. Please leave comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4689) Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java
[ https://issues.apache.org/jira/browse/SPARK-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268720#comment-14268720 ] Bibudh Lahiri edited comment on SPARK-4689 at 1/12/15 2:13 AM: --- I'd like to work on this issue, but would need some details. I looked into ./sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala where the unionAll method is defined as def unionAll(otherPlan: SchemaRDD) = new SchemaRDD(sqlContext, Union(logicalPlan, otherPlan.logicalPlan)) There is no implementation of union() in SchemaRDD itself and and the API says it is inherited from RDD. I took two different SchemaRDD objects and applied union on them (it is in my fork at https://github.com/bibudhlahiri/spark/blob/master/dev/audit-release/sbt_app_schema_rdd/src/main/scala/SchemaRDDApp.scala ) , and the resultant object is of class UnionRDD. I am thinking of overriding union() in SchemaRDD to return a SchemaRDD, please let me know what you think. was (Author: bibudh): I'd like to work on this issue, but would need some details. I looked into ./sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala where the unionAll method is defined as def unionAll(otherPlan: SchemaRDD) = new SchemaRDD(sqlContext, Union(logicalPlan, otherPlan.logicalPlan)) Are we looking for an implementation of union here (keeping duplicates only once), in addition to unionAll (keeping duplicates both the times)? Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java -- Key: SPARK-4689 URL: https://issues.apache.org/jira/browse/SPARK-4689 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Chris Fregly Priority: Minor Labels: starter Currently, you need to use unionAll() in Scala. Python does not expose this functionality at the moment. The current work around is to use the UNION ALL HiveQL functionality detailed here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
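A rough sketch of the override floated in that comment, written as it might appear inside SchemaRDD itself (the exact signature is an assumption; sqlContext and logicalPlan are the SchemaRDD members already used by the unionAll definition quoted above, and Row/Union are assumed to be in scope there):
{code}
// Sketch only: when both sides are SchemaRDDs, keep the result a SchemaRDD by
// rebuilding it from a Union logical plan; otherwise fall back to RDD.union.
// Like RDD.union and unionAll, this keeps duplicates.
override def union(other: RDD[Row]): RDD[Row] = other match {
  case that: SchemaRDD => new SchemaRDD(sqlContext, Union(logicalPlan, that.logicalPlan))
  case _               => super.union(other)
}
{code}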
[jira] [Commented] (SPARK-5198) Change executorId more unique on mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273165#comment-14273165 ] Apache Spark commented on SPARK-5198: - User 'jongyoul' has created a pull request for this issue: https://github.com/apache/spark/pull/3994 Change executorId more unique on mesos fine-grained mode Key: SPARK-5198 URL: https://issues.apache.org/jira/browse/SPARK-5198 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Jongyoul Lee Fix For: 1.3.0, 1.2.1 In fine-grained mode, the SchedulerBackend sets the executor ID to the slave ID regardless of the task ID. This makes it hard to track a specific job because different jobs are logged into the same log file. This is the same value as the one used when launching a job in coarse-grained mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5197) Support external shuffle service in fine-grained mode on mesos cluster
[ https://issues.apache.org/jira/browse/SPARK-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5197: Description: I think dynamic allocation is almost satisfied in Mesos' fine-grained mode, which already offers resources dynamically and returns them automatically when a task is finished. It doesn't, however, have a mechanism to support an external shuffle service the way YARN does with its AuxiliaryService. Because Mesos doesn't support an AuxiliaryService, we need to think of a different way to do this.
- Launching a shuffle service like a Spark job on the same cluster
-- Pros
--- Supports a multi-tenant environment
--- Almost the same way as YARN
-- Cons
--- Need to control a long-running 'background' job - the service - while Mesos runs
--- Need every slave - or host - to have one shuffle service running all the time
- Launching jobs within the shuffle service
-- Pros
--- Easy to implement
--- No need to consider whether a shuffle service exists or not
-- Cons
--- Multiple shuffle services exist in a multi-tenant environment
--- Need to control the shuffle service port dynamically in a multi-user environment
In my opinion, the first one is the better idea to support an external shuffle service. Please leave comments.
was: I think dynamic allocation is almost satisfied in Mesos' fine-grained mode, which already offers resources dynamically and returns them automatically when a task is finished. We don't, however, have a mechanism to support an external shuffle service the way YARN does with its AuxiliaryService. Because Mesos doesn't support an AuxiliaryService, we need to think of a different way to do this.
- Launching a shuffle service like a Spark job on the same cluster
-- Pros
--- Supports a multi-tenant environment
--- Almost the same way as YARN
-- Cons
--- Need to control a long-running 'background' job - the service - while Mesos runs
--- Need every slave - or host - to have one shuffle service running all the time
- Launching jobs within the shuffle service
-- Pros
--- Easy to implement
--- No need to consider whether a shuffle service exists or not
-- Cons
--- Multiple shuffle services exist in a multi-tenant environment
--- Need to control the shuffle service port dynamically in a multi-user environment
In my opinion, the first one is the better idea to support an external shuffle service. Please leave comments.
Support external shuffle service in fine-grained mode on mesos cluster -- Key: SPARK-5197 URL: https://issues.apache.org/jira/browse/SPARK-5197 Project: Spark Issue Type: Improvement Components: Deploy, Mesos, Shuffle Reporter: Jongyoul Lee I think dynamic allocation is almost satisfied in Mesos' fine-grained mode, which already offers resources dynamically and returns them automatically when a task is finished. It doesn't, however, have a mechanism to support an external shuffle service the way YARN does with its AuxiliaryService. Because Mesos doesn't support an AuxiliaryService, we need to think of a different way to do this.
- Launching a shuffle service like a Spark job on the same cluster
-- Pros
--- Supports a multi-tenant environment
--- Almost the same way as YARN
-- Cons
--- Need to control a long-running 'background' job - the service - while Mesos runs
--- Need every slave - or host - to have one shuffle service running all the time
- Launching jobs within the shuffle service
-- Pros
--- Easy to implement
--- No need to consider whether a shuffle service exists or not
-- Cons
--- Multiple shuffle services exist in a multi-tenant environment
--- Need to control the shuffle service port dynamically in a multi-user environment
In my opinion, the first one is the better idea to support an external shuffle service. Please leave comments.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
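For context on the proposal above, the application side of either option is already expressible with existing configuration keys (spark.shuffle.service.enabled and spark.shuffle.service.port); the sketch below shows only that side. How the service itself gets started on every Mesos slave is exactly what the issue is debating, so that part is intentionally not shown.
{code}
// Minimal sketch of the client side: executors hand shuffle files off to an
// already-running external shuffle service instead of serving them directly.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true") // register shuffle output with the service
  .set("spark.shuffle.service.port", "7337")    // service port; must be reachable on every slave
{code}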
[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5198: Description: In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file. The same value is used when launching jobs in coarse-grained mode. (was: In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file.) Change executorId more unique on mesos fine-grained mode Key: SPARK-5198 URL: https://issues.apache.org/jira/browse/SPARK-5198 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Jongyoul Lee Fix For: 1.3.0, 1.2.1 In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file. The same value is used when launching jobs in coarse-grained mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
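For illustration, one way to get a more distinguishable ID in fine-grained mode is to combine the slave ID with the Mesos task ID. The helper below is a hypothetical sketch; the names and exact format are not taken from the actual patch.
{code}
// Hypothetical sketch: derive an executor ID that differs per task, so log lines
// from different tasks on the same slave can be told apart.
def executorId(slaveId: String, taskId: Int): String = s"$slaveId/$taskId"
{code}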
[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5198: Description: In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file. The same value is used when launching jobs in coarse-grained mode. !Screen Shot 2015-01-12 at 11.14.39 AM.png! ! was: In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file. The same value is used when launching jobs in coarse-grained mode. [ Change executorId more unique on mesos fine-grained mode Key: SPARK-5198 URL: https://issues.apache.org/jira/browse/SPARK-5198 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Jongyoul Lee Fix For: 1.3.0, 1.2.1 Attachments: Screen Shot 2015-01-12 at 11.14.39 AM.png, Screen Shot 2015-01-12 at 11.34.30 AM.png, Screen Shot 2015-01-12 at 11.34.41 AM.png In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file. The same value is used when launching jobs in coarse-grained mode. !Screen Shot 2015-01-12 at 11.14.39 AM.png! ! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5198: Description: In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file. The same value is used when launching jobs in coarse-grained mode. [ was: In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file. The same value is used when launching jobs in coarse-grained mode. Change executorId more unique on mesos fine-grained mode Key: SPARK-5198 URL: https://issues.apache.org/jira/browse/SPARK-5198 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Jongyoul Lee Fix For: 1.3.0, 1.2.1 Attachments: Screen Shot 2015-01-12 at 11.14.39 AM.png, Screen Shot 2015-01-12 at 11.34.30 AM.png, Screen Shot 2015-01-12 at 11.34.41 AM.png In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file. The same value is used when launching jobs in coarse-grained mode. [ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5198: Attachment: Screen Shot 2015-01-12 at 11.34.41 AM.png Screen Shot 2015-01-12 at 11.34.30 AM.png Screen Shot 2015-01-12 at 11.14.39 AM.png Example screenshots Change executorId more unique on mesos fine-grained mode Key: SPARK-5198 URL: https://issues.apache.org/jira/browse/SPARK-5198 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Jongyoul Lee Fix For: 1.3.0, 1.2.1 Attachments: Screen Shot 2015-01-12 at 11.14.39 AM.png, Screen Shot 2015-01-12 at 11.34.30 AM.png, Screen Shot 2015-01-12 at 11.34.41 AM.png In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file. The same value is used when launching jobs in coarse-grained mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273187#comment-14273187 ] Nicholas Chammas commented on SPARK-3821: - Updated launch stats: * Launching cluster with 50 slaves in {{us-east-1}}. * Stats for best of 3 runs. {{branch-1.3}} @ [{{3a95101}}|https://github.com/mesos/spark-ec2/tree/3a95101c70e6892a8a48cc54094adaed1458487a]: {code} Cluster is now in 'ssh-ready' state. Waited 460 seconds. [timing] rsync /root/spark-ec2: 00h 00m 07s [timing] setup-slave: 00h 00m 28s [timing] scala init: 00h 00m 11s [timing] spark init: 00h 00m 07s [timing] ephemeral-hdfs init: 00h 12m 40s [timing] persistent-hdfs init: 00h 12m 35s [timing] spark-standalone init: 00h 00m 00s [timing] tachyon init: 00h 00m 08s [timing] ganglia init: 00h 00m 53s [timing] scala setup: 00h 03m 11s [timing] spark setup: 00h 21m 20s [timing] ephemeral-hdfs setup: 00h 00m 48s [timing] persistent-hdfs setup: 00h 00m 43s [timing] spark-standalone setup: 00h 01m 19s [timing] tachyon setup: 00h 03m 06s [timing] ganglia setup: 00h 00m 32s {code} {{packer}} @ [{{273c8c5}}|https://github.com/nchammas/spark-ec2/tree/273c8c518fbc6e86e0fb4410efbe77a4d4e4ff5b]: {code} Cluster is now in 'ssh-ready' state. Waited 292 seconds. [timing] rsync /root/spark-ec2: 00h 00m 20s [timing] setup-slave: 00h 00m 19s [timing] scala init: 00h 00m 12s [timing] spark init: 00h 00m 08s [timing] ephemeral-hdfs init: 00h 12m 58s [timing] persistent-hdfs init: 00h 12m 55s [timing] spark-standalone init: 00h 00m 00s [timing] tachyon init: 00h 00m 10s [timing] ganglia init: 00h 00m 15s [timing] scala setup: 00h 03m 19s [timing] spark setup: 00h 20m 32s [timing] ephemeral-hdfs setup: 00h 00m 34s [timing] persistent-hdfs setup: 00h 00m 27s [timing] spark-standalone setup: 00h 00m 47s [timing] tachyon setup: 00h 03m 15s [timing] ganglia setup: 00h 00m 23s {code} As you can see, with the exception of time-to-SSH-availability, things are mostly the same across the current and Packer-generated AMIs. I've proposed improvements to cut down the launch times of large clusters in [a separate issue|SPARK-5189]. [~shivaram] - At this point I think it's safe to say that the approach proposed here is straightforward and worth pursuing. All we need now is a review of [the scripts that install various stuff|https://github.com/nchammas/spark-ec2/blob/273c8c518fbc6e86e0fb4410efbe77a4d4e4ff5b/packer/spark-packer.json#L63-L66] (e.g. Ganglia, Python 2.7, etc.) on the AMI to make sure it all makes sense. Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1422) Add scripts for launching Spark on Google Compute Engine
[ https://issues.apache.org/jira/browse/SPARK-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273197#comment-14273197 ] Nicholas Chammas commented on SPARK-1422: - [~pwendell] - I would consider doing this as well for the parent task, [SPARK-4399]. Add scripts for launching Spark on Google Compute Engine Key: SPARK-1422 URL: https://issues.apache.org/jira/browse/SPARK-1422 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1422) Add scripts for launching Spark on Google Compute Engine
[ https://issues.apache.org/jira/browse/SPARK-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273178#comment-14273178 ] Patrick Wendell commented on SPARK-1422: Good call Nick - yeah let's close this as being out of scope since it's being maintained elsewhere. Add scripts for launching Spark on Google Compute Engine Key: SPARK-1422 URL: https://issues.apache.org/jira/browse/SPARK-1422 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1422) Add scripts for launching Spark on Google Compute Engine
[ https://issues.apache.org/jira/browse/SPARK-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1422. Resolution: Won't Fix Add scripts for launching Spark on Google Compute Engine Key: SPARK-1422 URL: https://issues.apache.org/jira/browse/SPARK-1422 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5199) Input metrics should show up for InputFormats that return CombineFileSplits
Sandy Ryza created SPARK-5199: - Summary: Input metrics should show up for InputFormats that return CombineFileSplits Key: SPARK-5199 URL: https://issues.apache.org/jira/browse/SPARK-5199 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
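The gist of the improvement, sketched under the assumption that the reading code can see the Hadoop split object: a CombineFileSplit wraps several file chunks, so a bytes-read figure for input metrics can be derived from the chunk lengths instead of being skipped as it is for non-FileSplit splits today. This is a hedged sketch only; the real change would live in the Hadoop RDD implementations.
{code}
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit

// Total input size of a combined split, summing the length of each wrapped chunk.
def splitSizeInBytes(split: CombineFileSplit): Long = split.getLengths.sum
{code}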
[jira] [Commented] (SPARK-2621) Update task InputMetrics incrementally
[ https://issues.apache.org/jira/browse/SPARK-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273192#comment-14273192 ] Sandy Ryza commented on SPARK-2621: --- Definitely - just filed SPARK-5199 for this. Update task InputMetrics incrementally -- Key: SPARK-2621 URL: https://issues.apache.org/jira/browse/SPARK-2621 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4399) Support multiple cloud providers
[ https://issues.apache.org/jira/browse/SPARK-4399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4399. Resolution: Won't Fix We'll let the community take this one on. Support multiple cloud providers Key: SPARK-4399 URL: https://issues.apache.org/jira/browse/SPARK-4399 Project: Spark Issue Type: New Feature Components: EC2 Affects Versions: 1.2.0 Reporter: Andrew Ash We currently have Spark startup scripts for Amazon EC2 but not for various other cloud providers. This ticket is an umbrella to support multiple cloud providers in the bundled scripts, not just Amazon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5166) Stabilize Spark SQL APIs
[ https://issues.apache.org/jira/browse/SPARK-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5166: --- Priority: Blocker (was: Critical) Stabilize Spark SQL APIs Key: SPARK-5166 URL: https://issues.apache.org/jira/browse/SPARK-5166 Project: Spark Issue Type: Task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Before we take Spark SQL out of alpha, we need to audit the APIs and stabilize them. As a general rule, everything under org.apache.spark.sql.catalyst should not be exposed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5197) Support external shuffle service in fine-grained mode on mesos cluster
[ https://issues.apache.org/jira/browse/SPARK-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5197: Fix Version/s: 1.3.0 Support external shuffle service in fine-grained mode on mesos cluster -- Key: SPARK-5197 URL: https://issues.apache.org/jira/browse/SPARK-5197 Project: Spark Issue Type: Improvement Components: Deploy, Mesos, Shuffle Reporter: Jongyoul Lee Fix For: 1.3.0 I think dynamic allocation is almost satisfied in Mesos' fine-grained mode, which already offers resources dynamically and returns them automatically when a task finishes. It does not, however, have a mechanism to support an external shuffle service the way YARN does with its AuxiliaryService. Because Mesos doesn't provide anything like AuxiliaryService, we need to think of a different way to do this. - Launching a shuffle service as a Spark-like job on the same cluster -- Pros --- Supports a multi-tenant environment --- Works almost the same way as on YARN -- Cons --- Must control a long-running 'background' job - the service - while Mesos is running --- Must ensure every slave - or host - has one shuffle service running at all times - Launching jobs within the shuffle service -- Pros --- Easy to implement --- Jobs don't need to check whether a shuffle service exists -- Cons --- Multiple shuffle services exist under a multi-tenant environment --- Shuffle service ports must be managed dynamically in a multi-user environment In my opinion, the first option is the better way to support an external shuffle service. Please leave comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5198) Change executorId more unique on mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5198: Description: In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file. The same value is used when launching jobs in coarse-grained mode. !Screen Shot 2015-01-12 at 11.14.39 AM.png! !Screen Shot 2015-01-12 at 11.34.30 AM.png! !Screen Shot 2015-01-12 at 11.34.41 AM.png! was: In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file. The same value is used when launching jobs in coarse-grained mode. !Screen Shot 2015-01-12 at 11.14.39 AM.png! ! Change executorId more unique on mesos fine-grained mode Key: SPARK-5198 URL: https://issues.apache.org/jira/browse/SPARK-5198 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Jongyoul Lee Fix For: 1.3.0, 1.2.1 Attachments: Screen Shot 2015-01-12 at 11.14.39 AM.png, Screen Shot 2015-01-12 at 11.34.30 AM.png, Screen Shot 2015-01-12 at 11.34.41 AM.png In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file. The same value is used when launching jobs in coarse-grained mode. !Screen Shot 2015-01-12 at 11.14.39 AM.png! !Screen Shot 2015-01-12 at 11.34.30 AM.png! !Screen Shot 2015-01-12 at 11.34.41 AM.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5198) Change executorId more unique on mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273169#comment-14273169 ] Jongyoul Lee edited comment on SPARK-5198 at 1/12/15 2:38 AM: -- Uploaded example screenshots was (Author: jongyoul): Example screenshots Change executorId more unique on mesos fine-grained mode Key: SPARK-5198 URL: https://issues.apache.org/jira/browse/SPARK-5198 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Jongyoul Lee Fix For: 1.3.0, 1.2.1 Attachments: Screen Shot 2015-01-12 at 11.14.39 AM.png, Screen Shot 2015-01-12 at 11.34.30 AM.png, Screen Shot 2015-01-12 at 11.34.41 AM.png In fine-grained mode, SchedulerBackend sets the executor ID to the slave ID, regardless of the task ID. This makes it hard to track a specific job, because log lines from different tasks end up mixed in the same log file. The same value is used when launching jobs in coarse-grained mode. !Screen Shot 2015-01-12 at 11.14.39 AM.png! !Screen Shot 2015-01-12 at 11.34.30 AM.png! !Screen Shot 2015-01-12 at 11.34.41 AM.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4951) A busy executor may be killed when dynamicAllocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4951. Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Target Version/s: 1.3.0, 1.2.1 A busy executor may be killed when dynamicAllocation is enabled --- Key: SPARK-4951 URL: https://issues.apache.org/jira/browse/SPARK-4951 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.3.0, 1.2.1 If a task runs for longer than `spark.dynamicAllocation.executorIdleTimeout`, the executor running this task will be killed. The following steps (yarn-client mode) can reproduce this bug: 1. Start `spark-shell` using {code} ./bin/spark-shell --conf spark.shuffle.service.enabled=true \ --conf spark.dynamicAllocation.minExecutors=1 \ --conf spark.dynamicAllocation.maxExecutors=4 \ --conf spark.dynamicAllocation.enabled=true \ --conf spark.dynamicAllocation.executorIdleTimeout=30 \ --master yarn-client \ --driver-memory 512m \ --executor-memory 512m \ --executor-cores 1 {code} 2. Wait more than 30 seconds until there is only one executor. 3. Run the following code (a task needs at least 50 seconds to finish) {code} val r = sc.parallelize(1 to 1000, 20).map { t => Thread.sleep(1000); t }.groupBy(_ % 2).collect() {code} 4. Executors will be killed and allocated all the time, which makes the job fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
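A rough sketch of the intended behaviour after the fix (names are illustrative, not Spark's actual ExecutorAllocationManager fields): an executor is only scheduled for removal once it has no running tasks, rather than whenever no new task has started within the idle timeout.
{code}
import scala.collection.mutable

// Hypothetical helper: start the idle countdown only for executors with zero running tasks.
def maybeScheduleRemoval(executorId: String,
                         runningTasks: Map[String, Int],
                         removeTimes: mutable.Map[String, Long],
                         nowMs: Long,
                         idleTimeoutMs: Long): Unit = {
  if (runningTasks.getOrElse(executorId, 0) == 0) {
    removeTimes(executorId) = nowMs + idleTimeoutMs // truly idle: start the countdown
  } else {
    removeTimes.remove(executorId)                  // busy: never treat it as idle
  }
}
{code}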
[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly on mesos
[ https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5088: Fix Version/s: 1.2.1 1.3.0 Use spark-class for running executors directly on mesos --- Key: SPARK-5088 URL: https://issues.apache.org/jira/browse/SPARK-5088 Project: Spark Issue Type: Improvement Components: Deploy, Mesos Affects Versions: 1.2.0 Reporter: Jongyoul Lee Priority: Minor Fix For: 1.3.0, 1.2.1 - sbin/spark-executor is used only for running executors in a Mesos environment. - spark-executor internally just calls spark-class without any specific parameters. - The PYTHONPATH setup is moved into the spark-class invocation. - Removing this redundant file makes the code easier to maintain. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
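To make the change concrete, a hypothetical sketch of building the Mesos executor command directly against spark-class (MesosExecutorBackend is a real class; the helper shape and the way PYTHONPATH is threaded through are assumptions, not the actual patch):
{code}
import org.apache.mesos.Protos.{CommandInfo, Environment}

// Sketch: invoke bin/spark-class directly instead of the sbin/spark-executor wrapper.
def executorCommand(sparkHome: String, pythonPath: String): CommandInfo = {
  val env = Environment.newBuilder()
    .addVariables(Environment.Variable.newBuilder()
      .setName("PYTHONPATH").setValue(pythonPath)) // PYTHONPATH handling moves out of the wrapper
  CommandInfo.newBuilder()
    .setEnvironment(env)
    .setValue(s"$sparkHome/bin/spark-class org.apache.spark.executor.MesosExecutorBackend")
    .build()
}
{code}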
[jira] [Updated] (SPARK-5197) Support external shuffle service in fine-grained mode on mesos cluster
[ https://issues.apache.org/jira/browse/SPARK-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5197: Target Version/s: 1.3.0 (was: 1.3.0, 1.2.1) Support external shuffle service in fine-grained mode on mesos cluster -- Key: SPARK-5197 URL: https://issues.apache.org/jira/browse/SPARK-5197 Project: Spark Issue Type: Improvement Components: Deploy, Mesos, Shuffle Reporter: Jongyoul Lee Fix For: 1.3.0 I think dynamic allocation is almost satisfied in Mesos' fine-grained mode, which already offers resources dynamically and returns them automatically when a task finishes. It does not, however, have a mechanism to support an external shuffle service the way YARN does with its AuxiliaryService. Because Mesos doesn't provide anything like AuxiliaryService, we need to think of a different way to do this. - Launching a shuffle service as a Spark-like job on the same cluster -- Pros --- Supports a multi-tenant environment --- Works almost the same way as on YARN -- Cons --- Must control a long-running 'background' job - the service - while Mesos is running --- Must ensure every slave - or host - has one shuffle service running at all times - Launching jobs within the shuffle service -- Pros --- Easy to implement --- Jobs don't need to check whether a shuffle service exists -- Cons --- Multiple shuffle services exist under a multi-tenant environment --- Shuffle service ports must be managed dynamically in a multi-user environment In my opinion, the first option is the better way to support an external shuffle service. Please leave comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273246#comment-14273246 ] Reynold Xin commented on SPARK-5124: Thanks for the response. 1. Let's not rely on the property of local actor not passing messages through a socket for local actor speedup. Conceptually, there is no reason to tie local actor implementation to RPC. DAGScheduler's actor used to be a simple queue event loop (before it was turned into an actor for no good reason). We can restore it to that. 2. Have you thought about how the fate sharing stuff would work with alternative RPC implementations? Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
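A minimal sketch (not Spark's code) of the kind of plain queue-driven event loop the first point refers to: purely local events need neither an actor system nor an RPC layer.
{code}
import java.util.concurrent.LinkedBlockingQueue

// Single-threaded event loop: callers post events, one thread drains the queue.
abstract class LocalEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  @volatile private var stopped = false
  private val thread = new Thread(name) {
    override def run(): Unit =
      try {
        while (!stopped) onReceive(queue.take())
      } catch {
        case _: InterruptedException => // stop() was called while waiting
      }
  }
  def start(): Unit = thread.start()
  def stop(): Unit = { stopped = true; thread.interrupt() }
  def post(event: E): Unit = queue.put(event)
  protected def onReceive(event: E): Unit // runs on the single loop thread
}
{code}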
[jira] [Created] (SPARK-5200) Disable web UI in Hive Thriftserver tests
Josh Rosen created SPARK-5200: - Summary: Disable web UI in Hive Thriftserver tests Key: SPARK-5200 URL: https://issues.apache.org/jira/browse/SPARK-5200 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen In our unit tests, we should disable the Spark Web UI when starting the Hive Thriftserver, since port contention during this test has been a cause of test failures on Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
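A minimal sketch of the change being described, assuming the test harness constructs its own SparkConf: with the UI disabled no HTTP port is bound at all, so concurrent Jenkins builds cannot collide on it.
{code}
import org.apache.spark.SparkConf

// spark.ui.enabled is an existing config key; setting it to false skips starting the web UI.
val conf = new SparkConf().set("spark.ui.enabled", "false")
{code}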
[jira] [Created] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range
Ye Xianjin created SPARK-5201: - Summary: ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range Key: SPARK-5201 URL: https://issues.apache.org/jira/browse/SPARK-5201 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Ye Xianjin Fix For: 1.2.1 {code} sc.makeRDD(1 to (Int.MaxValue)).count // result = 0 sc.makeRDD(1 to (Int.MaxValue - 1)).count // result = 2147483646 = Int.MaxValue - 1 sc.makeRDD(1 until (Int.MaxValue)).count// result = 2147483646 = Int.MaxValue - 1 {code} More details on the discussion https://github.com/apache/spark/pull/2874 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
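A hedged sketch of the suspected failure mode and one possible fix: converting the inclusive range to an exclusive one by bumping the end point overflows Int when the end is Int.MaxValue, which yields an empty range and hence a count of 0. Computing slice boundaries with Long arithmetic avoids the overflow; the helper below is illustrative, not the actual patch.
{code}
// Overflow: r.end + 1 wraps to Int.MinValue when r.end == Int.MaxValue,
// so Range(r.start, r.end + 1) becomes empty and the count is 0.
val r = 1 to Int.MaxValue
val overflowedEnd = r.end + 1

// Illustrative fix: compute slice boundaries in Long space, converting back at the end.
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] =
  (0 until numSlices).iterator.map { i =>
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end)
  }
{code}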
[jira] [Commented] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range
[ https://issues.apache.org/jira/browse/SPARK-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273277#comment-14273277 ] Ye Xianjin commented on SPARK-5201: --- I will send a pr for this. ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range -- Key: SPARK-5201 URL: https://issues.apache.org/jira/browse/SPARK-5201 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Ye Xianjin Labels: rdd Fix For: 1.2.1 Original Estimate: 2h Remaining Estimate: 2h {code} sc.makeRDD(1 to (Int.MaxValue)).count // result = 0 sc.makeRDD(1 to (Int.MaxValue - 1)).count // result = 2147483646 = Int.MaxValue - 1 sc.makeRDD(1 until (Int.MaxValue)).count// result = 2147483646 = Int.MaxValue - 1 {code} More details on the discussion https://github.com/apache/spark/pull/2874 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5018) Make MultivariateGaussian public
[ https://issues.apache.org/jira/browse/SPARK-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5018. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3923 [https://github.com/apache/spark/pull/3923] Make MultivariateGaussian public Key: SPARK-5018 URL: https://issues.apache.org/jira/browse/SPARK-5018 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Travis Galoppo Priority: Critical Fix For: 1.3.0 MultivariateGaussian is currently private[ml], but it would be a useful public class. This JIRA will require defining a good public API for distributions. This JIRA will be needed for finalizing the GaussianMixtureModel API, which should expose MultivariateGaussian instances instead of the means and covariances. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
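One possible shape for the public class, shown only to make the API discussion concrete; the final package, names, and implementation are whatever pull request 3923 settled on, not what is written here.
{code}
import org.apache.spark.mllib.linalg.{Matrix, Vector}

// Sketch: a distribution object exposing the density, instead of raw means/covariances.
class MultivariateGaussian(val mu: Vector, val sigma: Matrix) {
  def pdf(x: Vector): Double = math.exp(logpdf(x)) // density at x
  def logpdf(x: Vector): Double = ???              // log-density; omitted in this sketch
}
{code}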
[jira] [Created] (SPARK-5202) HiveContext doesn't support the Variables Substitution
Cheng Hao created SPARK-5202: Summary: HiveContext doesn't support the Variables Substitution Key: SPARK-5202 URL: https://issues.apache.org/jira/browse/SPARK-5202 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution This is a blocking issue for CLI users, as it will impact existing HQL scripts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
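For readers unfamiliar with the Hive feature, a minimal sketch of what variable substitution looks like from a HiveContext (the hc handle and the logs table are made up): today the ${...} reference reaches the parser unresolved instead of being substituted.
{code}
hc.sql("SET hivevar:target_date=2015-01-12")
hc.sql("SELECT * FROM logs WHERE ds = '${hivevar:target_date}'") // should behave as ds = '2015-01-12'
{code}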
[jira] [Commented] (SPARK-5196) Add comment field in StructField
[ https://issues.apache.org/jira/browse/SPARK-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273251#comment-14273251 ] Apache Spark commented on SPARK-5196: - User 'OopsOutOfMemory' has created a pull request for this issue: https://github.com/apache/spark/pull/3999 Add comment field in StructField Key: SPARK-5196 URL: https://issues.apache.org/jira/browse/SPARK-5196 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: shengli Fix For: 1.3.0 StructField should contain name, type, nullable, comment, etc. Add support for a comment field in StructField. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
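Illustrative only: one possible way to carry a comment alongside the existing fields of StructField; the actual signature is whatever PR 3999 settles on, and the import path assumes the 1.3-era package layout.
{code}
import org.apache.spark.sql.types.DataType

// Sketch of the shape, not a redefinition of Spark's class: an optional comment
// next to name/dataType/nullable.
case class StructField(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    comment: Option[String] = None)
{code}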
[jira] [Commented] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries
[ https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273271#comment-14273271 ] Apache Spark commented on SPARK-4908: - User 'baishuo' has created a pull request for this issue: https://github.com/apache/spark/pull/4001 Spark SQL built for Hive 13 fails under concurrent metadata queries --- Key: SPARK-4908 URL: https://issues.apache.org/jira/browse/SPARK-4908 Project: Spark Issue Type: Bug Components: SQL Reporter: David Ross Assignee: Cheng Lian Priority: Blocker Fix For: 1.3.0, 1.2.1 We are on trunk: {{1.3.0-SNAPSHOT}}, as of this commit: https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6 We are using Spark built for Hive 13, using this option: {{-Phive-0.13.1}} In single-threaded mode, normal operations look fine. However, under concurrency, with at least 2 concurrent connections, metadata queries fail. For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} statement when you pass a default schema in the JDBC URL, all fail. {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue. Here is some example code: {code} object main extends App { import java.sql._ import scala.concurrent._ import scala.concurrent.duration._ import scala.concurrent.ExecutionContext.Implicits.global Class.forName("org.apache.hive.jdbc.HiveDriver") val host = "localhost" // update this val url = s"jdbc:hive2://${host}:10511/some_db" // update this val future = Future.traverse(1 to 3) { i => Future { println("Starting: " + i) try { val conn = DriverManager.getConnection(url) } catch { case e: Throwable => e.printStackTrace() println("Failed: " + i) } println("Finishing: " + i) } } Await.result(future, 2.minutes) println("done!") } {code} Here is the output: {code} Starting: 1 Starting: 3 Starting: 2 java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121) at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231) at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451) at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:195) at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:270) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Failed: 3 Finishing: 3 java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled at
org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121) at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231) at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451) at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195) at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:270) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at
[jira] [Commented] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient
[ https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273224#comment-14273224 ] Apache Spark commented on SPARK-5186: - User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/3997 Vector.equals and Vector.hashCode are very inefficient --- Key: SPARK-5186 URL: https://issues.apache.org/jira/browse/SPARK-5186 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Derrick Burns Original Estimate: 0.25h Remaining Estimate: 0.25h The implementations of Vector.equals and Vector.hashCode are correct but slow for SparseVectors that are truly sparse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5200) Disable web UI in Hive Thriftserver tests
[ https://issues.apache.org/jira/browse/SPARK-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273247#comment-14273247 ] Apache Spark commented on SPARK-5200: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/3998 Disable web UI in Hive Thriftserver tests - Key: SPARK-5200 URL: https://issues.apache.org/jira/browse/SPARK-5200 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen Labels: flaky-test In our unit tests, we should disable the Spark Web UI when starting the Hive Thriftserver, since port contention during this test has been a cause of test failures on Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273225#comment-14273225 ] Patrick Wendell commented on SPARK-3561: So if the question is: Is Spark only API or is it an integrated API/execution engine... we've taken a fairly clear stance over the history of the project that it's an integrated engine. I.e. Spark is not something like Pig where it's intended primarily as a user API and we expect there to be different physical execution engines plugged in underneath. In the past we haven't found this prevents Spark from working well in different environments. For instance, with Mesos, on YARN, etc. And for this we've integrated at different layers such as the storage layer and the scheduling layer, where there were well defined API's and integration points in the broader ecosystem. Compared with alternatives Spark is far more flexible in terms of runtime environments. The RDD API is so generic that it's very easy to customize and integrate. For this reason, my feeling with decoupling execution from the rest of Spark is that it would tie our hands architecturally and not add much benefit. I don't see a good reason to make this broader change in the strategy of the project. If there are specific improvements you see for making Spark work well on YARN, then we can definitely look at them. Allow for pluggable execution contexts in Spark --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@Experimental) not exposed to end users of Spark. The trait will define 6 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob * persist * unpersist Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
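For reference, a sketch of the JobExecutionContext trait named in the issue description, with signatures guessed from the corresponding SparkContext methods of that era; the authoritative version is in the attached design doc, not here.
{code}
import scala.reflect.ClassTag
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.InputFormat
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch only: the six delegated operations listed in the description.
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K], valueClass: Class[V], minPartitions: Int): RDD[(K, V)]
  def newAPIHadoopFile[K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K, V]](
      sc: SparkContext, path: String, fClass: Class[F],
      kClass: Class[K], vClass: Class[V], conf: Configuration): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int], allowLocal: Boolean,
      resultHandler: (Int, U) => Unit): Unit
  def persist[T](rdd: RDD[T], newLevel: StorageLevel): RDD[T]
  def unpersist[T](rdd: RDD[T], blocking: Boolean): RDD[T]
}
{code}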
[jira] [Commented] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode
[ https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273244#comment-14273244 ] Timothy Chen commented on SPARK-5095: - [~joshdevins] [~gmaas] indeed capping the cores is actually to fix 4940, and we can use that to address the number of executors. I'm trying not to have just a set of configurations that can achieve both, otherwise it becomes a lot harder to maintain. I'm working on the patch now and I'll add you both on github for review. Support launching multiple mesos executors in coarse grained mesos mode --- Key: SPARK-5095 URL: https://issues.apache.org/jira/browse/SPARK-5095 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to run multiple Spark tasks. However, this becomes a problem when the JVM process launched is larger than an ideal size (30 GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use, and these resources are still under the configured limit. This is also applicable when users want to specify the number of executors to be launched on each node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
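To make the trade-off concrete, a hedged configuration sketch: capping each executor's size lets one large Mesos offer be split into several smaller JVMs. spark.mesos.coarse, spark.cores.max, and spark.executor.memory are existing keys; using spark.executor.cores as the per-executor cap on Mesos is an assumption about the eventual patch, not something decided in this issue.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.coarse", "true")
  .set("spark.cores.max", "96")         // total cores across the cluster
  .set("spark.executor.memory", "30g")  // stay at the GC-friendly heap ceiling noted above
  .set("spark.executor.cores", "8")     // hypothetical per-executor cap -> up to 96/8 = 12 executors
{code}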
[jira] [Commented] (SPARK-5202) HiveContext doesn't support the Variables Substitution
[ https://issues.apache.org/jira/browse/SPARK-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273293#comment-14273293 ] Apache Spark commented on SPARK-5202: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/4003 HiveContext doesn't support the Variables Substitution -- Key: SPARK-5202 URL: https://issues.apache.org/jira/browse/SPARK-5202 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution This is a blocking issue for CLI users, as it will impact existing HQL scripts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range
[ https://issues.apache.org/jira/browse/SPARK-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273289#comment-14273289 ] Apache Spark commented on SPARK-5201: - User 'advancedxy' has created a pull request for this issue: https://github.com/apache/spark/pull/4002 ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range -- Key: SPARK-5201 URL: https://issues.apache.org/jira/browse/SPARK-5201 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Ye Xianjin Labels: rdd Fix For: 1.2.1 Original Estimate: 2h Remaining Estimate: 2h {code} sc.makeRDD(1 to (Int.MaxValue)).count // result = 0 sc.makeRDD(1 to (Int.MaxValue - 1)).count // result = 2147483646 = Int.MaxValue - 1 sc.makeRDD(1 until (Int.MaxValue)).count// result = 2147483646 = Int.MaxValue - 1 {code} More details on the discussion https://github.com/apache/spark/pull/2874 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5202) HiveContext doesn't support the Variables Substitution
[ https://issues.apache.org/jira/browse/SPARK-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-5202: - Description: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution This is a blocking issue for CLI users; it impacts existing HQL scripts written for Hive. was: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution This is a blocking issue for CLI users, as it will impact existing HQL scripts. HiveContext doesn't support the Variables Substitution -- Key: SPARK-5202 URL: https://issues.apache.org/jira/browse/SPARK-5202 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution This is a blocking issue for CLI users; it impacts existing HQL scripts written for Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org