[jira] [Updated] (SPARK-4648) Support Coalesce in Spark SQL.
[ https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-4648: --- Description: Support Coalesce function in Spark SQL. Support type widening in Coalesce function. And replace Coalesce UDF in Spark Hive with local Coalesce function since it is memory efficient and faster. was:Support Coalesce function in Spark SQL Support Coalesce in Spark SQL. -- Key: SPARK-4648 URL: https://issues.apache.org/jira/browse/SPARK-4648 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Support Coalesce function in Spark SQL. Support type widening in Coalesce function. And replace Coalesce UDF in Spark Hive with local Coalesce function since it is memory efficient and faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
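For illustration only (not part of the ticket), a minimal sketch of the requested behaviour against a hypothetical registered table `src` with an integer `key` column: COALESCE returns the first non-null argument, and with type widening mixed numeric arguments resolve to the wider type.
{code}
// Hypothetical usage sketch; 'src' and 'key' are assumed names, not taken from the ticket.
sqlContext.sql("SELECT COALESCE(NULL, key, 0) FROM src").collect()   // first non-null value per row
// With type widening, an INT column mixed with a DOUBLE literal should resolve to DOUBLE.
sqlContext.sql("SELECT COALESCE(key, 1.5) FROM src").collect()
{code}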
[jira] [Updated] (SPARK-4648) Support COALESCE function in Spark SQL and HiveQL
[ https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-4648: --- Summary: Support COALESCE function in Spark SQL and HiveQL (was: Support Coalesce in Spark SQL.) Support COALESCE function in Spark SQL and HiveQL - Key: SPARK-4648 URL: https://issues.apache.org/jira/browse/SPARK-4648 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Support Coalesce function in Spark SQL. Support type widening in Coalesce function. And replace Coalesce UDF in Spark Hive with local Coalesce function since it is memory efficient and faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4658) Code documentation issue in DDL of datasource
Ravindra Pesala created SPARK-4658: -- Summary: Code documentation issue in DDL of datasource Key: SPARK-4658 URL: https://issues.apache.org/jira/browse/SPARK-4658 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Ravindra Pesala Priority: Minor The syntax documented in ddl.scala for creating a datasource table is wrong: {code} /** * CREATE FOREIGN TEMPORARY TABLE avroTable * USING org.apache.spark.sql.avro * OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro") */ {code} but the correct syntax is {code} /** * CREATE TEMPORARY TABLE avroTable * USING org.apache.spark.sql.avro * OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro") */ {code} The wrong syntax is also documented in newParquet.scala: {code} `CREATE TABLE ... USING org.apache.spark.sql.parquet`. {code} but the correct syntax is {code} `CREATE TEMPORARY TABLE ... USING org.apache.spark.sql.parquet`. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4648) Use available Coalesce function in HiveQL instead of using HiveUDF. And support Coalesce in Spark SQL.
Ravindra Pesala created SPARK-4648: -- Summary: Use available Coalesce function in HiveQL instead of using HiveUDF. And support Coalesce in Spark SQL. Key: SPARK-4648 URL: https://issues.apache.org/jira/browse/SPARK-4648 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Currently HiveQL uses the Hive UDF for Coalesce. Hive UDFs are generally memory intensive. Since a Coalesce function is already available in Spark, we can make use of it. Also support the Coalesce function in Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4648) Support Coalesce in Spark SQL.
[ https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-4648: --- Summary: Support Coalesce in Spark SQL. (was: Use available Coalesce function in HiveQL instead of using HiveUDF. And support Coalesce in Spark SQL.) Support Coalesce in Spark SQL. -- Key: SPARK-4648 URL: https://issues.apache.org/jira/browse/SPARK-4648 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Currently HiveQL uses the Hive UDF for Coalesce. Hive UDFs are generally memory intensive. Since a Coalesce function is already available in Spark, we can make use of it. Also support the Coalesce function in Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4648) Support Coalesce in Spark SQL.
[ https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-4648: --- Description: Support Coalesce function in Spark SQL (was: Currently HiveQL uses the Hive UDF for Coalesce. Hive UDFs are generally memory intensive. Since a Coalesce function is already available in Spark, we can make use of it. Also support the Coalesce function in Spark SQL.) Support Coalesce in Spark SQL. -- Key: SPARK-4648 URL: https://issues.apache.org/jira/browse/SPARK-4648 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Support Coalesce function in Spark SQL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4650) Supporting multi column support in count(distinct c1,c2..) in Spark SQL
Ravindra Pesala created SPARK-4650: -- Summary: Supporting multi column support in count(distinct c1,c2..) in Spark SQL Key: SPARK-4650 URL: https://issues.apache.org/jira/browse/SPARK-4650 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Support multiple columns inside count(distinct c1,c2..), which is not working in Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4650) Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-4650: --- Summary: Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL (was: Supporting multi column support in count(distinct c1,c2..) in Spark SQL) Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL --- Key: SPARK-4650 URL: https://issues.apache.org/jira/browse/SPARK-4650 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Support multiple columns inside count(distinct c1,c2..), which is not working in Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
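For reference, a sketch of the query shape this issue asks for, assuming a registered table `records` with columns `key` and `value` (names are illustrative, not taken from the ticket):
{code}
// Distinct count over a combination of columns; this is what fails to plan before the fix.
sqlContext.sql("SELECT COUNT(DISTINCT key, value) FROM records").collect()
{code}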
[jira] [Created] (SPARK-4513) Support relational operator '<=>' in Spark SQL
Ravindra Pesala created SPARK-4513: -- Summary: Support relational operator '<=>' in Spark SQL Key: SPARK-4513 URL: https://issues.apache.org/jira/browse/SPARK-4513 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Ravindra Pesala The relational operator '<=>' is not working in Spark SQL. The same works in Spark HiveQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4513) Support relational operator '<=>' in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-4513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-4513: --- Component/s: SQL Support relational operator '<=>' in Spark SQL -- Key: SPARK-4513 URL: https://issues.apache.org/jira/browse/SPARK-4513 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala The relational operator '<=>' is not working in Spark SQL. The same works in Spark HiveQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
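'<=>' is the null-safe equality operator known from HiveQL: unlike '=', which evaluates to NULL when either operand is NULL, '<=>' returns true when both operands are NULL and false when only one is. A usage sketch with assumed table and column names:
{code}
// Null-safe join condition: rows where both keys are NULL are still matched,
// whereas a plain '=' would drop them because the comparison evaluates to NULL.
sqlContext.sql("SELECT * FROM t1 JOIN t2 ON t1.key <=> t2.key").collect()
{code}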
[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates
[ https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14199596#comment-14199596 ] Ravindra Pesala commented on SPARK-4226: Currently there is no support of subquery expressions as predicates. I am working on it. SparkSQL - Add support for subqueries in predicates --- Key: SPARK-4226 URL: https://issues.apache.org/jira/browse/SPARK-4226 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Environment: Spark 1.2 snapshot Reporter: Terry Siu I have a test table defined in Hive as follows: CREATE TABLE sparkbug ( id INT, event STRING ) STORED AS PARQUET; and insert some sample data with ids 1, 2, 3. In a Spark shell, I then create a HiveContext and then execute the following HQL to test out subquery predicates: val hc = HiveContext(hc) hc.hql(select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3))) I get the following error: java.lang.RuntimeException: Unsupported language features in query: select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3)) TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_SUBQUERY_EXPR TOK_SUBQUERY_OP in TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_FUNCTION in TOK_TABLE_OR_COL customerid 2 3 TOK_TABLE_OR_COL customerid scala.NotImplementedError: No parse rules for ASTNode type: 817, text: TOK_SUBQUERY_EXPR : TOK_SUBQUERY_EXPR TOK_SUBQUERY_OP in TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_FUNCTION in TOK_TABLE_OR_COL customerid 2 3 TOK_TABLE_OR_COL customerid + org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) This thread http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html also brings up lack of subquery support in SparkSQL. It would be nice to have subquery predicate support in a near, future release (1.3, maybe?). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4207) Query which has syntax like 'not like' is not working in Spark SQL
Ravindra Pesala created SPARK-4207: -- Summary: Query which has syntax like 'not like' is not working in Spark SQL Key: SPARK-4207 URL: https://issues.apache.org/jira/browse/SPARK-4207 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Queries which have 'not like' are not working in Spark SQL. The same works in Spark HiveQL. {code} sql("SELECT * FROM records where value not like 'val%'") {code} The above query fails with the below exception {code} Exception in thread "main" java.lang.RuntimeException: [1.39] failure: ``IN'' expected but `like' found SELECT * FROM records where value not like 'val%' ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75) at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:186) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4154) Query does not work if it has not between in Spark SQL and HQL
Ravindra Pesala created SPARK-4154: -- Summary: Query does not work if it has not between in Spark SQL and HQL Key: SPARK-4154 URL: https://issues.apache.org/jira/browse/SPARK-4154 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala A query that contains 'not between' does not work. {code} SELECT * FROM src where key not between 10 and 20 {code} It gives the following error {code} Exception in thread "main" java.lang.RuntimeException: Unsupported language features in query: SELECT * FROM src where key not between 10 and 20 TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME src TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_ALLCOLREF TOK_WHERE TOK_FUNCTION between KW_TRUE TOK_TABLE_OR_COL key 10 20 scala.NotImplementedError: No parse rules for ASTNode type: 256, text: KW_TRUE : KW_TRUE + org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1088) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:251) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
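Until 'not between' itself is parsed, the same predicate can be written out explicitly; a small sketch using the `src` table from the report:
{code}
// Semantically equivalent to "key not between 10 and 20".
sql("SELECT * FROM src where key < 10 OR key > 20").collect()
{code}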
[jira] [Commented] (SPARK-4131) Support Writing data into the filesystem from queries
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188147#comment-14188147 ] Ravindra Pesala commented on SPARK-4131: I will work on this issue. Support Writing data into the filesystem from queries --- Key: SPARK-4131 URL: https://issues.apache.org/jira/browse/SPARK-4131 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: XiaoJing wang Priority: Critical Fix For: 1.3.0 Original Estimate: 0.05h Remaining Estimate: 0.05h Writing data into the filesystem from queries is not supported by SparkSql. e.g.: {code} insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; {code} out: {code} java.lang.RuntimeException: Unsupported language features in query: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME page_views TOK_INSERT TOK_DESTINATION TOK_LOCAL_DIR '/data1/wangxj/sql_spark' TOK_SELECT TOK_SELEXPR TOK_ALLCOLREF {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4120) Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not work in SparkSQL
Ravindra Pesala created SPARK-4120: -- Summary: Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not work in SparkSQL Key: SPARK-4120 URL: https://issues.apache.org/jira/browse/SPARK-4120 Project: Spark Issue Type: Bug Components: SQL Reporter: Ravindra Pesala Fix For: 1.2.0 Queries that join more than two tables do not work. {code} sql("SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key and a.key=c.key") {code} The above query gives the following exception. {code} Exception in thread "main" java.lang.RuntimeException: [1.40] failure: ``UNION'' expected but `,' found SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key and a.key=c.key ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
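Until the comma-separated FROM list is supported, the same query can typically be expressed with explicit JOIN ... ON clauses, which the parser generally accepts. A sketch using the tables from the report:
{code}
// Explicit join syntax for the same three-table query.
sql("SELECT * FROM records1 a JOIN records2 b ON a.key = b.key JOIN records3 c ON a.key = c.key").collect()
{code}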
[jira] [Commented] (SPARK-3814) Bitwise & does not work in Hive
[ https://issues.apache.org/jira/browse/SPARK-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173526#comment-14173526 ] Ravindra Pesala commented on SPARK-3814: Added support for Bitwise AND(&), OR(|), XOR(^), NOT(~) in this PR and I updated the title of this defect. Bitwise & does not work in Hive Key: SPARK-3814 URL: https://issues.apache.org/jira/browse/SPARK-3814 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yana Kadiyska Assignee: Ravindra Pesala Priority: Minor Error: java.lang.RuntimeException: Unsupported language features in query: select (case when bit_field & 1=1 then r_end - r_start else NULL end) from mytable where pkey='0178-2014-07' LIMIT 2 TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME mytable TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_FUNCTION when = TOK_TABLE_OR_COL bit_field 1 1 - TOK_TABLE_OR_COL r_end TOK_TABLE_OR_COL r_start TOK_NULL TOK_WHERE = TOK_TABLE_OR_COL pkey '0178-2014-07' TOK_LIMIT 2 SQLState: null ErrorCode: 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3814) Support for Bitwise AND(&), OR(|), XOR(^), NOT(~) in Spark HQL and SQL
[ https://issues.apache.org/jira/browse/SPARK-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-3814: --- Summary: Support for Bitwise AND(&), OR(|), XOR(^), NOT(~) in Spark HQL and SQL (was: Bitwise & does not work in Hive) Support for Bitwise AND(&), OR(|), XOR(^), NOT(~) in Spark HQL and SQL -- Key: SPARK-3814 URL: https://issues.apache.org/jira/browse/SPARK-3814 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yana Kadiyska Assignee: Ravindra Pesala Priority: Minor Error: java.lang.RuntimeException: Unsupported language features in query: select (case when bit_field & 1=1 then r_end - r_start else NULL end) from mytable where pkey='0178-2014-07' LIMIT 2 TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME mytable TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_FUNCTION when = TOK_TABLE_OR_COL bit_field 1 1 - TOK_TABLE_OR_COL r_end TOK_TABLE_OR_COL r_start TOK_NULL TOK_WHERE = TOK_TABLE_OR_COL pkey '0178-2014-07' TOK_LIMIT 2 SQLState: null ErrorCode: 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
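A sketch of the operator syntax covered by the updated title, reusing the column and table names from the reported query ('&', '|', '^' are binary, '~' is unary):
{code}
// Bitwise operators in a projection and in the originally reported CASE expression.
sqlContext.sql("SELECT bit_field & 1, bit_field | 2, bit_field ^ 4, ~bit_field FROM mytable").collect()
sqlContext.sql("SELECT (CASE WHEN bit_field & 1 = 1 THEN r_end - r_start ELSE NULL END) FROM mytable").collect()
{code}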
[jira] [Commented] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166838#comment-14166838 ] Ravindra Pesala commented on SPARK-3880: There is already some work going on in the direction of adding foreign data sources to Spark SQL: https://github.com/apache/spark/pull/2475. So I guess HBase is also a foreign data source and it should fit into this design. Adding a new project/context for each datasource may be cumbersome to maintain. Can we improve on the current PR to add DDL support? HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Assignee: Yan Attachments: HBaseOnSpark.docx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3823) Spark Hive SQL readColumn is not reset each time for a new query
[ https://issues.apache.org/jira/browse/SPARK-3823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167239#comment-14167239 ] Ravindra Pesala commented on SPARK-3823: It seems this issue is duplicate of https://issues.apache.org/jira/browse/SPARK-3559 Spark Hive SQL readColumn is not reset each time for a new query Key: SPARK-3823 URL: https://issues.apache.org/jira/browse/SPARK-3823 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Alex Liu After a few queries running in the same hiveContext, hive.io.file.readcolumn.ids and hive.io.file.readcolumn.names values are added on by pre-running queries. e.g. running the following querys {code} hql(use sql_integration_ks) val container = hql(select * from double_table as aa JOIN boolean_table as bb on aa.type_id = bb.type_id) container.collect().foreach(println) val container = hql(select * from ascii_table ORDER BY type_id) container.collect().foreach(println) val container = hql(select shippers.shippername, COUNT(orders.orderid) AS numorders FROM orders LEFT JOIN shippers ON orders.shipperid=shippers.shipperid GROUP BY shippername) container.collect().foreach(println) val container = hql(select * from ascii_table where type_id 126) container.collect().length {code} The read column ids for the last query are [2, 0, 3, 1] read column names are : type_id,value,type_id,value,type_id,value,orderid,shipperid,shipper name, shipperid The source code is at https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala#L80 hiveContext has a shared hiveconf which add readColumns for each query. It should be reset each time for a new hive query or remove the duplicate readColumn Ids -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3834) Backticks not correctly handled in subquery aliases
[ https://issues.apache.org/jira/browse/SPARK-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164857#comment-14164857 ] Ravindra Pesala commented on SPARK-3834: Ok [~marmbrus] , I will work on it. Backticks not correctly handled in subquery aliases --- Key: SPARK-3834 URL: https://issues.apache.org/jira/browse/SPARK-3834 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Ravindra Pesala Priority: Blocker [~ravi.pesala] assigning to you since you fixed the last problem here. Let me know if you don't have time to work on this or if you have any questions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3814) Bitwise & does not work in Hive
[ https://issues.apache.org/jira/browse/SPARK-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165790#comment-14165790 ] Ravindra Pesala commented on SPARK-3814: https://github.com/apache/spark/pull/2736 Bitwise & does not work in Hive Key: SPARK-3814 URL: https://issues.apache.org/jira/browse/SPARK-3814 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yana Kadiyska Priority: Minor Error: java.lang.RuntimeException: Unsupported language features in query: select (case when bit_field & 1=1 then r_end - r_start else NULL end) from mytable where pkey='0178-2014-07' LIMIT 2 TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME mytable TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_FUNCTION when = TOK_TABLE_OR_COL bit_field 1 1 - TOK_TABLE_OR_COL r_end TOK_TABLE_OR_COL r_start TOK_NULL TOK_WHERE = TOK_TABLE_OR_COL pkey '0178-2014-07' TOK_LIMIT 2 SQLState: null ErrorCode: 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3834) Backticks not correctly handled in subquery aliases
[ https://issues.apache.org/jira/browse/SPARK-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165800#comment-14165800 ] Ravindra Pesala commented on SPARK-3834: https://github.com/apache/spark/pull/2737 Backticks not correctly handled in subquery aliases --- Key: SPARK-3834 URL: https://issues.apache.org/jira/browse/SPARK-3834 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Ravindra Pesala Priority: Blocker [~ravi.pesala] assigning to you since you fixed the last problem here. Let me know if you don't have time to work on this or if you have any questions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3813) Support case when conditional functions in Spark SQL
Ravindra Pesala created SPARK-3813: -- Summary: Support case when conditional functions in Spark SQL Key: SPARK-3813 URL: https://issues.apache.org/jira/browse/SPARK-3813 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Fix For: 1.2.0 SQL queries which have the following conditional functions are not supported in Spark SQL. {code} CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END {code} The same functions work in Spark HiveQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3813) Support case when conditional functions in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160368#comment-14160368 ] Ravindra Pesala commented on SPARK-3813: The below code gives the exception. {code} import sqlContext._ val rdd = sc.parallelize((1 to 100).map(i => Record(i, s"val_$i"))) rdd.registerTempTable("records") println("Result of SELECT *:") sql("SELECT case key when '93' then 'ravi' else key end FROM records").collect() {code} {code} java.lang.RuntimeException: [1.17] failure: ``UNION'' expected but identifier when found SELECT case key when 93 then 0 else 1 end FROM records ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:74) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:267) {code} Support case when conditional functions in Spark SQL -- Key: SPARK-3813 URL: https://issues.apache.org/jira/browse/SPARK-3813 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Fix For: 1.2.0 SQL queries which have the following conditional functions are not supported in Spark SQL. {code} CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END {code} The same functions work in Spark HiveQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3814) Bitwise & does not work in Hive
[ https://issues.apache.org/jira/browse/SPARK-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161472#comment-14161472 ] Ravindra Pesala commented on SPARK-3814: Currently there is no support for Bitwise & in Spark HiveQl or Spark SQL. I am working on this issue. We also need to support Bitwise |, Bitwise ^ and Bitwise ~. I will add separate jiras for these operations and work on them. Thank you. Bitwise & does not work in Hive Key: SPARK-3814 URL: https://issues.apache.org/jira/browse/SPARK-3814 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yana Kadiyska Priority: Minor Error: java.lang.RuntimeException: Unsupported language features in query: select (case when bit_field & 1=1 then r_end - r_start else NULL end) from mytable where pkey='0178-2014-07' LIMIT 2 TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME mytable TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_FUNCTION when = TOK_TABLE_OR_COL bit_field 1 1 - TOK_TABLE_OR_COL r_end TOK_TABLE_OR_COL r_start TOK_NULL TOK_WHERE = TOK_TABLE_OR_COL pkey '0178-2014-07' TOK_LIMIT 2 SQLState: null ErrorCode: 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3100) Spark RDD partitions are not running in the workers as per locality information given by each partition.
[ https://issues.apache.org/jira/browse/SPARK-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-3100: --- Description: I created a simple custom RDD (SampleRDD.scala)and created 4 splits for 4 workers. When I run this RDD in Spark standalone cluster with 4 workers(even master machine has one worker), it runs all partitions in one node only even though I have given locality preferences in my SampleRDD program. *Sample Code* {code} class SamplePartition(rddId: Int, val idx: Int,val tableSplit:Seq[String]) extends Partition { override def hashCode(): Int = 41 * (41 + rddId) + idx override val index: Int = idx } class SampleRDD[K,V]( sc : SparkContext,keyClass: KeyVal[K,V]) extends RDD[(K,V)](sc, Nil) with Logging { override def getPartitions: Array[Partition] = { val hosts = Array(master,slave1,slave2,slave3) val result = new Array[Partition](4) for (i - 0 until result.length) { result(i) = new SamplePartition(id, i, Array(hosts(i))) } result } override def compute(theSplit: Partition, context: TaskContext) = { val iter = new Iterator[(K,V)] { val split = theSplit.asInstanceOf[SamplePartition] logInfo(Executed task for the split + split.tableSplit) // Register an on-task-completion callback to close the input stream. context.addOnCompleteCallback(() = close()) var havePair = false var finished = false override def hasNext: Boolean = { if (!finished !havePair) { finished = !false havePair = !finished } !finished } override def next(): (K,V) = { if (!hasNext) { throw new java.util.NoSuchElementException(End of stream) } havePair = false val key = new Key() val value = new Value() keyClass.getKey(key, value) } private def close() { try { // reader.close() } catch { case e: Exception = logWarning(Exception in RecordReader.close(), e) } } } iter } override def getPreferredLocations(split: Partition): Seq[String] = { val theSplit = split.asInstanceOf[SamplePartition] val s = theSplit.tableSplit.filter(_ != localhost) logInfo(Host Name : +s(0)) s } } trait KeyVal[K,V] extends Serializable { def getKey(key : Key,value : Value) : (K,V) } class KeyValImpl extends KeyVal[Key,Value] { override def getKey(key : Key,value : Value) = (key,value) } case class Key() case class Value() object SampleRDD { def main(args: Array[String]) : Unit={ val d = SparkContext.jarOfClass(this.getClass) val ar = new Array[String](d.size) var i = 0 d.foreach{ p= ar(i)=p; i = i+1 } val sc = new SparkContext(spark://master:7077, SampleSpark, /opt/spark-1.0.0-rc3/,ar) val rdd = new SampleRDD(sc,new KeyValImpl()); rdd.collect; } } {code} Following is the log it shows. 
INFO 18-08 16:38:33,382 - Executor updated: app-20140818163833-0005/0 is now RUNNING INFO 18-08 16:38:33,382 - Executor updated: app-20140818163833-0005/2 is now RUNNING INFO 18-08 16:38:33,383 - Executor updated: app-20140818163833-0005/1 is now RUNNING INFO 18-08 16:38:33,385 - Executor updated: app-20140818163833-0005/3 is now RUNNING INFO 18-08 16:38:34,976 - Registered executor: Actor akka.tcp://sparkExecutor@master:47563/user/Executor#-398354094 with ID 0 INFO 18-08 16:38:34,984 - Starting task 0.0:0 as TID 0 on executor 0: *master (PROCESS_LOCAL)* INFO 18-08 16:38:34,989 - Serialized task 0.0:0 as 1261 bytes in 3 ms INFO 18-08 16:38:34,992 - Starting task 0.0:1 as TID 1 on executor 0: *master (PROCESS_LOCAL)* INFO 18-08 16:38:34,993 - Serialized task 0.0:1 as 1261 bytes in 0 ms INFO 18-08 16:38:34,993 - Starting task 0.0:2 as TID 2 on executor 0: *master (PROCESS_LOCAL)* INFO 18-08 16:38:34,993 - Serialized task 0.0:2 as 1261 bytes in 0 ms INFO 18-08 16:38:34,994 - Starting task 0.0:3 as TID 3 on executor 0: *master (PROCESS_LOCAL)* INFO 18-08 16:38:34,994 - Serialized task 0.0:3 as 1261 bytes in 0 ms INFO 18-08 16:38:35,174 - Registering block manager master:42125 with 294.4 MB RAM INFO 18-08 16:38:35,296 - Registered executor: Actor akka.tcp://sparkExecutor@slave1:31726/user/Executor#492173410 with ID 2 INFO 18-08 16:38:35,302 - Registered executor: Actor akka.tcp://sparkExecutor@slave2:25769/user/Executor#1762839887 with ID 1 INFO 18-08 16:38:35,317 - Registered executor: Actor akka.tcp://sparkExecutor@slave3:51032/user/Executor#981476000 with ID 3 was: I created a simple custom RDD (SampleRDD.scala)and created 4 splits for 4 workers. When I run this RDD in Spark standalone cluster with 4 workers(even master machine has one worker), it
[jira] [Updated] (SPARK-3100) Spark RDD partitions are not running in the workers as per locality information given by each partition.
[ https://issues.apache.org/jira/browse/SPARK-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-3100: --- Description: I created a simple custom RDD (SampleRDD.scala)and created 4 splits for 4 workers. When I run this RDD in Spark standalone cluster with 4 workers(even master machine has one worker), it runs all partitions in one node only even though I have given locality preferences in my SampleRDD program. *Sample Code* {code} class SamplePartition(rddId: Int, val idx: Int,val tableSplit:Seq[String]) extends Partition { override def hashCode(): Int = 41 * (41 + rddId) + idx override val index: Int = idx } class SampleRDD[K,V]( sc : SparkContext,keyClass: KeyVal[K,V]) extends RDD[(K,V)](sc, Nil) with Logging { override def getPartitions: Array[Partition] = { val hosts = Array(master,slave1,slave2,slave3) val result = new Array[Partition](4) for (i - 0 until result.length) { result(i) = new SamplePartition(id, i, Array(hosts(i))) } result } override def compute(theSplit: Partition, context: TaskContext) = { val iter = new Iterator[(K,V)] { val split = theSplit.asInstanceOf[SamplePartition] logInfo(Executed task for the split + split.tableSplit) // Register an on-task-completion callback to close the input stream. context.addOnCompleteCallback(() = close()) var havePair = false var finished = false override def hasNext: Boolean = { if (!finished !havePair) { finished = !false havePair = !finished } !finished } override def next(): (K,V) = { if (!hasNext) { throw new java.util.NoSuchElementException(End of stream) } havePair = false val key = new Key() val value = new Value() keyClass.getKey(key, value) } private def close() { try { // reader.close() } catch { case e: Exception = logWarning(Exception in RecordReader.close(), e) } } } iter } override def getPreferredLocations(split: Partition): Seq[String] = { val theSplit = split.asInstanceOf[SamplePartition] val s = theSplit.tableSplit.filter(_ != localhost) logInfo(Host Name : +s(0)) s } } trait KeyVal[K,V] extends Serializable { def getKey(key : Key,value : Value) : (K,V) } class KeyValImpl extends KeyVal[Key,Value] { override def getKey(key : Key,value : Value) = (key,value) } case class Key() case class Value() object SampleRDD { def main(args: Array[String]) : Unit={ val d = SparkContext.jarOfClass(this.getClass) val ar = new Array[String](d.size) var i = 0 d.foreach{ p= ar(i)=p; i = i+1 } val sc = new SparkContext(spark://master:7077, SampleSpark, /opt/spark-1.0.0-rc3/,ar) val rdd = new SampleRDD(sc,new KeyValImpl()); rdd.collect; } } {code} Following is the log it shows. 
{code} INFO 18-08 16:38:33,382 - Executor updated: app-20140818163833-0005/0 is now RUNNING INFO 18-08 16:38:33,382 - Executor updated: app-20140818163833-0005/2 is now RUNNING INFO 18-08 16:38:33,383 - Executor updated: app-20140818163833-0005/1 is now RUNNING INFO 18-08 16:38:33,385 - Executor updated: app-20140818163833-0005/3 is now RUNNING INFO 18-08 16:38:34,976 - Registered executor: Actor akka.tcp://sparkExecutor@master:47563/user/Executor#-398354094 with ID 0 INFO 18-08 16:38:34,984 - Starting task 0.0:0 as TID 0 on executor 0: master (PROCESS_LOCAL) INFO 18-08 16:38:34,989 - Serialized task 0.0:0 as 1261 bytes in 3 ms INFO 18-08 16:38:34,992 - Starting task 0.0:1 as TID 1 on executor 0: master (PROCESS_LOCAL) INFO 18-08 16:38:34,993 - Serialized task 0.0:1 as 1261 bytes in 0 ms INFO 18-08 16:38:34,993 - Starting task 0.0:2 as TID 2 on executor 0: master (PROCESS_LOCAL)* INFO 18-08 16:38:34,993 - Serialized task 0.0:2 as 1261 bytes in 0 ms INFO 18-08 16:38:34,994 - Starting task 0.0:3 as TID 3 on executor 0: master (PROCESS_LOCAL) INFO 18-08 16:38:34,994 - Serialized task 0.0:3 as 1261 bytes in 0 ms INFO 18-08 16:38:35,174 - Registering block manager master:42125 with 294.4 MB RAM INFO 18-08 16:38:35,296 - Registered executor: Actor akka.tcp://sparkExecutor@slave1:31726/user/Executor#492173410 with ID 2 INFO 18-08 16:38:35,302 - Registered executor: Actor akka.tcp://sparkExecutor@slave2:25769/user/Executor#1762839887 with ID 1 INFO 18-08 16:38:35,317 - Registered executor: Actor akka.tcp://sparkExecutor@slave3:51032/user/Executor#981476000 with ID 3 {code} Here all the tasks are assigned to master only, even I though I have mentioned the locality preferences was: I created a simple custom RDD (SampleRDD.scala)and created 4 splits for 4
[jira] [Comment Edited] (SPARK-3100) Spark RDD partitions are not running in the workers as per locality information given by each partition.
[ https://issues.apache.org/jira/browse/SPARK-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100664#comment-14100664 ] Ravindra Pesala edited comment on SPARK-3100 at 9/30/14 7:47 AM: - It seems there is an issue in synchronization of task allocation and registering of executors. As per the above log I observed that tasks has been allocated with single registered executor. After these tasks are allocated to registered executor,remaining executors are started registering. So this may be synchronization issue. I have added sleep in my driver for 5 seconds then it started working properly. {code} val sc = new SparkContext(spark://master:7077, SampleSpark, /opt/spark-1.0.0-rc3/,ar) *Thread.sleep(5000)* val rdd = new SampleRDD(sc,new KeyValImpl()); rdd.collect; {code} Now the tasks are assigned to all the nodes {code} INFO 18-08 19:44:36,301 - Registered executor: Actor[akka.tcp://sparkExecutor@master:32457/user/Executor#-966982652] with ID 0 INFO 18-08 19:44:36,505 - Registering block manager master:59964 with 294.4 MB RAM INFO 18-08 19:44:36,578 - Registered executor: Actor[akka.tcp://sparkExecutor@slave1:11653/user/Executor#909834981] with ID 1 INFO 18-08 19:44:36,591 - Registered executor: Actor[akka.tcp://sparkExecutor@slave2:59220/user/Executor#301495226] with ID 3 INFO 18-08 19:44:36,643 - Registered executor: Actor[akka.tcp://sparkExecutor@slave3:11232/user/Executor#-1118376183] with ID 2 INFO 18-08 19:44:36,804 - Registering block manager slave1:14596 with 294.4 MB RAM INFO 18-08 19:44:36,809 - Registering block manager slave2:10418 with 294.4 MB RAM INFO 18-08 19:44:36,871 - Registering block manager slave3:45973 with 294.4 MB RAM INFO 18-08 19:44:39,507 - Starting job: collect at SampleRDD.scala:142 INFO 18-08 19:44:39,520 - Got job 0 (collect at SampleRDD.scala:142) with 4 output partitions (allowLocal=false) INFO 18-08 19:44:39,521 - Final stage: Stage 0(collect at SampleRDD.scala:142) INFO 18-08 19:44:39,521 - Parents of final stage: List() INFO 18-08 19:44:39,526 - Missing parents: List() INFO 18-08 19:44:39,532 - Submitting Stage 0 (SampleRDD[0] at RDD at SampleRDD.scala:28), which has no missing parents INFO 18-08 19:44:39,537 - Host Name : master INFO 18-08 19:44:39,539 - Host Name : slave1 INFO 18-08 19:44:39,539 - Host Name : slave2 INFO 18-08 19:44:39,540 - Host Name : slave3 INFO 18-08 19:44:39,563 - Submitting 4 missing tasks from Stage 0 (SampleRDD[0] at RDD at SampleRDD.scala:28) INFO 18-08 19:44:39,564 - Adding task set 0.0 with 4 tasks INFO 18-08 19:44:39,579 - Starting task 0.0:2 as TID 0 on executor 3: *slave2 (NODE_LOCAL)* INFO 18-08 19:44:39,583 - Serialized task 0.0:2 as 1261 bytes in 2 ms INFO 18-08 19:44:39,585 - Starting task 0.0:0 as TID 1 on executor 0: *master (NODE_LOCAL)* INFO 18-08 19:44:39,585 - Serialized task 0.0:0 as 1261 bytes in 0 ms INFO 18-08 19:44:39,586 - Starting task 0.0:1 as TID 2 on executor 1: *slave1 (NODE_LOCAL)* INFO 18-08 19:44:39,586 - Serialized task 0.0:1 as 1261 bytes in 0 ms INFO 18-08 19:44:39,587 - Starting task 0.0:3 as TID 3 on executor 2: *slave3 (NODE_LOCAL)* INFO 18-08 19:44:39,587 - Serialized task 0.0:3 as 1261 bytes in 0 ms {code} Is it expected behavior? Please comment on it. was (Author: ravipesala): It seems there is an issue in synchronization of task allocation and registering of executors. As per the above log I observed that task allocations are done with single registered executor. After these tasks are started ,remaining executors are started registering. 
So this is synchronization issue. I have added sleep in my driver for 5 seconds then it started working properly. val sc = new SparkContext(spark://master:7077, SampleSpark, /opt/spark-1.0.0-rc3/,ar) *Thread.sleep(5000)* val rdd = new SampleRDD(sc,new KeyValImpl()); rdd.collect; INFO 18-08 19:44:36,301 - Registered executor: Actor[akka.tcp://sparkExecutor@master:32457/user/Executor#-966982652] with ID 0 INFO 18-08 19:44:36,505 - Registering block manager master:59964 with 294.4 MB RAM INFO 18-08 19:44:36,578 - Registered executor: Actor[akka.tcp://sparkExecutor@slave1:11653/user/Executor#909834981] with ID 1 INFO 18-08 19:44:36,591 - Registered executor: Actor[akka.tcp://sparkExecutor@slave2:59220/user/Executor#301495226] with ID 3 INFO 18-08 19:44:36,643 - Registered executor: Actor[akka.tcp://sparkExecutor@slave3:11232/user/Executor#-1118376183] with ID 2 INFO 18-08 19:44:36,804 - Registering block manager slave1:14596 with 294.4 MB RAM INFO 18-08 19:44:36,809 - Registering block manager slave2:10418 with 294.4 MB RAM INFO 18-08 19:44:36,871 - Registering block manager slave3:45973 with 294.4 MB RAM INFO 18-08 19:44:39,507 - Starting job: collect at SampleRDD.scala:142 INFO 18-08 19:44:39,520 -
[jira] [Commented] (SPARK-3708) Backticks aren't handled correctly in aliases
[ https://issues.apache.org/jira/browse/SPARK-3708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152321#comment-14152321 ] Ravindra Pesala commented on SPARK-3708: I guess you are referring to HiveContext here, as there is no support for backticks in SqlContext. I will work on this issue. Thank you. Backticks aren't handled correctly in aliases - Key: SPARK-3708 URL: https://issues.apache.org/jira/browse/SPARK-3708 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Here's a failing test case: {code} sql("SELECT k FROM (SELECT `key` AS `k` FROM src) a") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3654) Implement all extended HiveQL statements/commands with a separate parser combinator
[ https://issues.apache.org/jira/browse/SPARK-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152789#comment-14152789 ] Ravindra Pesala commented on SPARK-3654: https://github.com/apache/spark/pull/2590 Implement all extended HiveQL statements/commands with a separate parser combinator --- Key: SPARK-3654 URL: https://issues.apache.org/jira/browse/SPARK-3654 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Statements and commands like {{SET}}, {{CACHE TABLE}} and {{ADD JAR}} etc. are currently parsed in a quite hacky way, like this: {code} if (sql.trim.toLowerCase.startsWith("cache table")) { sql.trim.toLowerCase.startsWith("cache table") match { ... } } {code} It would be much better to add an extra parser combinator that parses these syntax extensions first, and then falls back to the normal Hive parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
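A minimal sketch of the suggested approach (illustrative names and stand-in result types, not the actual implementation in the PR): try a small combinator parser for the extension statements first, and hand everything else to the regular Hive parser.
{code}
import scala.util.parsing.combinator.RegexParsers

// Stand-in result types for the sketch; the real code would produce Catalyst logical plans.
sealed trait Command
case class CacheTable(name: String) extends Command
case class UncacheTable(name: String) extends Command
case class HiveStatement(sql: String) extends Command // delegated to the normal Hive parser in real code

class ExtensionParser extends RegexParsers {
  private def ident: Parser[String] = """[a-zA-Z_][a-zA-Z0-9_]*""".r

  private def cache: Parser[Command] =
    "(?i)cache".r ~> "(?i)table".r ~> ident ^^ { name => CacheTable(name) }

  private def uncache: Parser[Command] =
    "(?i)uncache".r ~> "(?i)table".r ~> ident ^^ { name => UncacheTable(name) }

  // Try the extension statements first; anything that does not match falls through to Hive.
  def parseCommand(input: String): Command = parseAll(cache | uncache, input) match {
    case Success(cmd, _) => cmd
    case _               => HiveStatement(input)
  }
}

// new ExtensionParser().parseCommand("CACHE TABLE src")   // => CacheTable("src")
// new ExtensionParser().parseCommand("SELECT * FROM src") // => HiveStatement("SELECT * FROM src")
{code}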
[jira] [Commented] (SPARK-3371) Spark SQL: Renaming a function expression with group by gives error
[ https://issues.apache.org/jira/browse/SPARK-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144514#comment-14144514 ] Ravindra Pesala commented on SPARK-3371: It seems like an issue, By default SQl parser creates the aliases to the functions in grouping expressions with generated alias names. So if user gives the alias names to the functions inside projection then it does not match the generated alias name of grouping expression. I am working on this issue, will create the PR today. Spark SQL: Renaming a function expression with group by gives error --- Key: SPARK-3371 URL: https://issues.apache.org/jira/browse/SPARK-3371 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Pei-Lun Lee {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) val rdd = sc.parallelize(List({foo:bar})) sqlContext.jsonRDD(rdd).registerAsTable(t1) sqlContext.registerFunction(len, (s: String) = s.length) sqlContext.sql(select len(foo) as a, count(1) from t1 group by len(foo)).collect() {code} running above code in spark-shell gives the following error {noformat} 14/09/03 17:20:13 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 214) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: foo#0 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:201) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:199) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) {noformat} remove as a in the query causes no error -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3371) Spark SQL: Renaming a function expression with group by gives error
[ https://issues.apache.org/jira/browse/SPARK-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145258#comment-14145258 ] Ravindra Pesala commented on SPARK-3371: https://github.com/apache/spark/pull/2511 Spark SQL: Renaming a function expression with group by gives error --- Key: SPARK-3371 URL: https://issues.apache.org/jira/browse/SPARK-3371 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Pei-Lun Lee {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) val rdd = sc.parallelize(List({foo:bar})) sqlContext.jsonRDD(rdd).registerAsTable(t1) sqlContext.registerFunction(len, (s: String) = s.length) sqlContext.sql(select len(foo) as a, count(1) from t1 group by len(foo)).collect() {code} running above code in spark-shell gives the following error {noformat} 14/09/03 17:20:13 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 214) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: foo#0 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:201) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:199) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) {noformat} remove as a in the query causes no error -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140124#comment-14140124 ] Ravindra Pesala commented on SPARK-3298: I guess we should add an API like *SqlContext.isTableExists(tableName)* to check whether the table already exists or not. Using this API, the user can check for the table's existence and then register the table. The current API *SqlContext.table(tableName)* throws an exception if the table is not present, so we cannot use it for this purpose. Please comment on it. [SQL] registerAsTable / registerTempTable overwrites old tables --- Key: SPARK-3298 URL: https://issues.apache.org/jira/browse/SPARK-3298 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Evan Chan Priority: Minor Labels: newbie At least in Spark 1.0.2, calling registerAsTable("a") when "a" had been registered before does not cause an error. However, there is no way to access the old table, even though it may be cached and taking up space. How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
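A small sketch of the idea: the proposed isTableExists is hypothetical (it does not exist yet); with the current API the check can only be approximated by catching the exception thrown by SQLContext.table.
{code}
// Proposed (hypothetical) usage once such an API exists:
//   if (!sqlContext.isTableExists("a")) rdd.registerTempTable("a")

// Approximation with today's API: table() throws if the name is unknown.
def tableExists(sqlContext: org.apache.spark.sql.SQLContext, name: String): Boolean =
  try { sqlContext.table(name); true } catch { case _: Exception => false }
{code}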
[jira] [Commented] (SPARK-3536) SELECT on empty parquet table throws exception
[ https://issues.apache.org/jira/browse/SPARK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140145#comment-14140145 ] Ravindra Pesala commented on SPARK-3536: Parquet returns null metadata when querying an empty parquet file while calculating splits. So adding a null check and returning empty splits solves the issue. SELECT on empty parquet table throws exception -- Key: SPARK-3536 URL: https://issues.apache.org/jira/browse/SPARK-3536 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Labels: starter Reported by [~matei]. Reproduce as follows: {code} scala> case class Data(i: Int) defined class Data scala> createParquetFile[Data]("testParquet") scala> parquetFile("testParquet").count() 14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due to exception - job: 0 java.lang.NullPointerException at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344) at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
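The guard being described, as a generic self-contained sketch (this is just the pattern, not the actual patch to FilteringParquetRowInputFormat.getSplits):
{code}
import java.util.{ArrayList => JArrayList, List => JList}

// If the Parquet metadata for the table is null (nothing has been written yet),
// report zero splits instead of letting downstream code dereference the null value.
def splitsOrEmpty[T](metadata: AnyRef)(computeSplits: => JList[T]): JList[T] =
  if (metadata == null) new JArrayList[T]() else computeSplits
{code}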
[jira] [Commented] (SPARK-3536) SELECT on empty parquet table throws exception
[ https://issues.apache.org/jira/browse/SPARK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140269#comment-14140269 ] Ravindra Pesala commented on SPARK-3536: [~isaias.barroso] I submitted the PR 4 hours ago, but I am not sure why it is not yet linked to the jira. SELECT on empty parquet table throws exception -- Key: SPARK-3536 URL: https://issues.apache.org/jira/browse/SPARK-3536 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Labels: starter Reported by [~matei]. Reproduce as follows: {code} scala> case class Data(i: Int) defined class Data scala> createParquetFile[Data]("testParquet") scala> parquetFile("testParquet").count() 14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due to exception - job: 0 java.lang.NullPointerException at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344) at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...
[ https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135303#comment-14135303 ] Ravindra Pesala commented on SPARK-2594: [~marmbrus] There is some confusion over eager versus lazy caching for CACHE TABLE AS SELECT. The PR was created with eager caching, but comments from [~liancheng] on the PR mention that SQLContext.cacheTable and CACHE TABLE name are both lazy. So making all three eager also seems acceptable, but this somewhat breaks backward compatibility (or at least breaks existing performance assumptions of the caching functions/statements). Please comment on it. Add CACHE TABLE name AS SELECT ... Key: SPARK-2594 URL: https://issues.apache.org/jira/browse/SPARK-2594 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
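An illustration of the lazy behaviour referred to above; the table name is made up and the semantics follow the comment rather than any finalized design:
{code}
// SQLContext.cacheTable and CACHE TABLE name are lazy: nothing is materialized
// until the first action actually scans the table.
sqlContext.cacheTable("logs")                            // only marks the table as cached
sqlContext.sql("SELECT count(*) FROM logs").collect()    // first scan populates the cache
// An eager CACHE TABLE ... AS SELECT would materialize its result immediately instead.
{code}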
[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...
[ https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117464#comment-14117464 ] Ravindra Pesala commented on SPARK-2594: Thank you Michael. The following are the tasks I am planning to do to support this feature. 1. Change the SqlParser to add support for the CACHE TABLE syntax, and also change HiveQl to support the syntax for Hive. 2. Add a new strategy 'AddCacheTable' to 'SparkStrategies'. 3. In the new 'AddCacheTable' strategy, register the tableName with the plan and cache it under that tableName. Please review and comment. Add CACHE TABLE name AS SELECT ... Key: SPARK-2594 URL: https://issues.apache.org/jira/browse/SPARK-2594 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
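A rough sketch of how the proposed statement could be exercised once the parser change is in; sqlContext, the table names, and the predicate column are assumptions for illustration only:
{code}
// Hypothetical usage of the proposed CACHE TABLE ... AS SELECT syntax.
sqlContext.sql(
  "CACHE TABLE errorLogs AS SELECT * FROM logs WHERE level = 'ERROR'")

// Later queries would read the result registered and cached under errorLogs.
sqlContext.sql("SELECT count(*) FROM errorLogs").collect()
{code}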
[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...
[ https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113342#comment-14113342 ] Ravindra Pesala commented on SPARK-2594: Please assign this to me. Add CACHE TABLE name AS SELECT ... Key: SPARK-2594 URL: https://issues.apache.org/jira/browse/SPARK-2594 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2693) Support for UDAF Hive Aggregates like PERCENTILE
[ https://issues.apache.org/jira/browse/SPARK-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110707#comment-14110707 ] Ravindra Pesala commented on SPARK-2693: UDAF is deprecated in Hive, though there may still be a few functions implemented using this interface. We can support it in Spark for backward compatibility. As you mentioned, supporting UDAF in Spark requires writing a wrapper. *Please assign it to me.* Support for UDAF Hive Aggregates like PERCENTILE Key: SPARK-2693 URL: https://issues.apache.org/jira/browse/SPARK-2693 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust
{code}
SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), year, month, day
FROM raw_data_table
GROUP BY year, month, day

MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an error as shown below.

Exception in thread "main" java.lang.RuntimeException: No handler for udf class org.apache.hadoop.hive.ql.udf.UDAFPercentile
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
{code}
This aggregate extends UDAF, which we don't yet have a wrapper for. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2693) Support for UDAF Hive Aggregates like PERCENTILE
[ https://issues.apache.org/jira/browse/SPARK-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110707#comment-14110707 ] Ravindra Pesala edited comment on SPARK-2693 at 8/26/14 2:05 PM: - UDAF is deprecated in Hive, though there may still be a few functions implemented using this interface. We can support it in Spark for backward compatibility. As you mentioned, supporting UDAF in Spark requires writing a wrapper. Please assign it to me. was (Author: ravipesala): UDAF is deprecated in Hive, though there may still be a few functions implemented using this interface. We can support it in Spark for backward compatibility. As you mentioned, supporting UDAF in Spark requires writing a wrapper. *Please assign it to me.* Support for UDAF Hive Aggregates like PERCENTILE Key: SPARK-2693 URL: https://issues.apache.org/jira/browse/SPARK-2693 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust
{code}
SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), year, month, day
FROM raw_data_table
GROUP BY year, month, day

MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an error as shown below.

Exception in thread "main" java.lang.RuntimeException: No handler for udf class org.apache.hadoop.hive.ql.udf.UDAFPercentile
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
{code}
This aggregate extends UDAF, which we don't yet have a wrapper for. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3100) Spark RDD partitions are not running in the workers as per locality information given by each partition.
Ravindra Pesala created SPARK-3100: -- Summary: Spark RDD partitions are not running in the workers as per locality information given by each partition. Key: SPARK-3100 URL: https://issues.apache.org/jira/browse/SPARK-3100 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Running in Spark Standalone Cluster Reporter: Ravindra Pesala I created a simple custom RDD (SampleRDD.scala) and created 4 splits for 4 workers. When I run this RDD in a Spark standalone cluster with 4 workers (even the master machine has one worker), it runs all partitions on one node only, even though I have given locality preferences in my SampleRDD program. *Sample Code*
{code}
class SamplePartition(rddId: Int, val idx: Int, val tableSplit: Seq[String]) extends Partition {
  override def hashCode(): Int = 41 * (41 + rddId) + idx
  override val index: Int = idx
}

class SampleRDD[K, V](sc: SparkContext, keyClass: KeyVal[K, V]) extends RDD[(K, V)](sc, Nil) with Logging {

  override def getPartitions: Array[Partition] = {
    val hosts = Array("master", "slave1", "slave2", "slave3")
    val result = new Array[Partition](4)
    for (i <- 0 until result.length) {
      result(i) = new SamplePartition(id, i, Array(hosts(i)))
    }
    result
  }

  override def compute(theSplit: Partition, context: TaskContext) = {
    val iter = new Iterator[(K, V)] {
      val split = theSplit.asInstanceOf[SamplePartition]
      logInfo("Executed task for the split " + split.tableSplit)

      // Register an on-task-completion callback to close the input stream.
      context.addOnCompleteCallback(() => close())

      var havePair = false
      var finished = false

      override def hasNext: Boolean = {
        if (!finished && !havePair) {
          finished = !false
          havePair = !finished
        }
        !finished
      }

      override def next(): (K, V) = {
        if (!hasNext) {
          throw new java.util.NoSuchElementException("End of stream")
        }
        havePair = false
        val key = new Key()
        val value = new Value()
        keyClass.getKey(key, value)
      }

      private def close() {
        try {
          // reader.close()
        } catch {
          case e: Exception => logWarning("Exception in RecordReader.close()", e)
        }
      }
    }
    iter
  }

  override def getPreferredLocations(split: Partition): Seq[String] = {
    val theSplit = split.asInstanceOf[SamplePartition]
    val s = theSplit.tableSplit.filter(_ != "localhost")
    logInfo("Host Name : " + s(0))
    s
  }
}

trait KeyVal[K, V] extends Serializable {
  def getKey(key: Key, value: Value): (K, V)
}

class KeyValImpl extends KeyVal[Key, Value] {
  override def getKey(key: Key, value: Value) = (key, value)
}

case class Key()
case class Value()

object SampleRDD {
  def main(args: Array[String]): Unit = {
    val d = SparkContext.jarOfClass(this.getClass)
    val ar = new Array[String](d.size)
    var i = 0
    d.foreach { p => ar(i) = p; i = i + 1 }
    val sc = new SparkContext("spark://master:7077", "SampleSpark", "/opt/spark-1.0.0-rc3/", ar)
    val rdd = new SampleRDD(sc, new KeyValImpl())
    rdd.collect
  }
}
{code}
Following is the log it shows.
INFO 18-08 16:38:33,382 - Executor updated: app-20140818163833-0005/0 is now RUNNING
INFO 18-08 16:38:33,382 - Executor updated: app-20140818163833-0005/2 is now RUNNING
INFO 18-08 16:38:33,383 - Executor updated: app-20140818163833-0005/1 is now RUNNING
INFO 18-08 16:38:33,385 - Executor updated: app-20140818163833-0005/3 is now RUNNING
INFO 18-08 16:38:34,976 - Registered executor: Actor akka.tcp://sparkExecutor@master:47563/user/Executor#-398354094 with ID 0
INFO 18-08 16:38:34,984 - Starting task 0.0:0 as TID 0 on executor 0: *master (PROCESS_LOCAL)*
INFO 18-08 16:38:34,989 - Serialized task 0.0:0 as 1261 bytes in 3 ms
INFO 18-08 16:38:34,992 - Starting task 0.0:1 as TID 1 on executor 0: *master (PROCESS_LOCAL)*
INFO 18-08 16:38:34,993 - Serialized task 0.0:1 as 1261 bytes in 0 ms
INFO 18-08 16:38:34,993 - Starting task 0.0:2 as TID 2 on executor 0: *master (PROCESS_LOCAL)*
INFO 18-08 16:38:34,993 - Serialized task 0.0:2 as 1261 bytes in 0 ms
INFO 18-08 16:38:34,994 - Starting task 0.0:3 as TID 3 on executor 0: *master (PROCESS_LOCAL)*
INFO 18-08 16:38:34,994 - Serialized task 0.0:3 as 1261 bytes in 0 ms
INFO 18-08 16:38:35,174 - Registering block manager master:42125 with 294.4 MB RAM
INFO 18-08 16:38:35,296 - Registered executor: Actor akka.tcp://sparkExecutor@slave1:31726/user/Executor#492173410 with ID 2
INFO 18-08 16:38:35,302 - Registered executor: Actor akka.tcp://sparkExecutor@slave2:25769/user/Executor#1762839887 with ID 1
INFO 18-08 16:38:35,317 -
[jira] [Commented] (SPARK-3100) Spark RDD partitions are not running in the workers as per locality information given by each partition.
[ https://issues.apache.org/jira/browse/SPARK-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100664#comment-14100664 ] Ravindra Pesala commented on SPARK-3100: It seems there is an issue in the synchronization between task allocation and executor registration. As per the above log, I observed that the task allocations are done while only a single executor is registered; the remaining executors register only after these tasks have started. So this is a synchronization issue. I added a 5-second sleep in my driver and then it started working properly.
val sc = new SparkContext("spark://master:7077", "SampleSpark", "/opt/spark-1.0.0-rc3/", ar)
*Thread.sleep(5000)*
val rdd = new SampleRDD(sc, new KeyValImpl())
rdd.collect
INFO 18-08 19:44:36,301 - Registered executor: Actor[akka.tcp://sparkExecutor@master:32457/user/Executor#-966982652] with ID 0
INFO 18-08 19:44:36,505 - Registering block manager master:59964 with 294.4 MB RAM
INFO 18-08 19:44:36,578 - Registered executor: Actor[akka.tcp://sparkExecutor@slave1:11653/user/Executor#909834981] with ID 1
INFO 18-08 19:44:36,591 - Registered executor: Actor[akka.tcp://sparkExecutor@slave2:59220/user/Executor#301495226] with ID 3
INFO 18-08 19:44:36,643 - Registered executor: Actor[akka.tcp://sparkExecutor@slave3:11232/user/Executor#-1118376183] with ID 2
INFO 18-08 19:44:36,804 - Registering block manager slave1:14596 with 294.4 MB RAM
INFO 18-08 19:44:36,809 - Registering block manager slave2:10418 with 294.4 MB RAM
INFO 18-08 19:44:36,871 - Registering block manager slave3:45973 with 294.4 MB RAM
INFO 18-08 19:44:39,507 - Starting job: collect at SampleRDD.scala:142
INFO 18-08 19:44:39,520 - Got job 0 (collect at SampleRDD.scala:142) with 4 output partitions (allowLocal=false)
INFO 18-08 19:44:39,521 - Final stage: Stage 0(collect at SampleRDD.scala:142)
INFO 18-08 19:44:39,521 - Parents of final stage: List()
INFO 18-08 19:44:39,526 - Missing parents: List()
INFO 18-08 19:44:39,532 - Submitting Stage 0 (SampleRDD[0] at RDD at SampleRDD.scala:28), which has no missing parents
INFO 18-08 19:44:39,537 - Host Name : master
INFO 18-08 19:44:39,539 - Host Name : slave1
INFO 18-08 19:44:39,539 - Host Name : slave2
INFO 18-08 19:44:39,540 - Host Name : slave3
INFO 18-08 19:44:39,563 - Submitting 4 missing tasks from Stage 0 (SampleRDD[0] at RDD at SampleRDD.scala:28)
INFO 18-08 19:44:39,564 - Adding task set 0.0 with 4 tasks
INFO 18-08 19:44:39,579 - Starting task 0.0:2 as TID 0 on executor 3: *slave2 (NODE_LOCAL)*
INFO 18-08 19:44:39,583 - Serialized task 0.0:2 as 1261 bytes in 2 ms
INFO 18-08 19:44:39,585 - Starting task 0.0:0 as TID 1 on executor 0: *master (NODE_LOCAL)*
INFO 18-08 19:44:39,585 - Serialized task 0.0:0 as 1261 bytes in 0 ms
INFO 18-08 19:44:39,586 - Starting task 0.0:1 as TID 2 on executor 1: *slave1 (NODE_LOCAL)*
INFO 18-08 19:44:39,586 - Serialized task 0.0:1 as 1261 bytes in 0 ms
INFO 18-08 19:44:39,587 - Starting task 0.0:3 as TID 3 on executor 2: *slave3 (NODE_LOCAL)*
INFO 18-08 19:44:39,587 - Serialized task 0.0:3 as 1261 bytes in 0 ms
Spark RDD partitions are not running in the workers as per locality information given by each partition. Key: SPARK-3100 URL: https://issues.apache.org/jira/browse/SPARK-3100 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Running in Spark Standalone Cluster Reporter: Ravindra Pesala I created a simple custom RDD (SampleRDD.scala) and created 4 splits for 4 workers.
When I run this RDD in a Spark standalone cluster with 4 workers (even the master machine has one worker), it runs all partitions on one node only, even though I have given locality preferences in my SampleRDD program. *Sample Code*
class SamplePartition(rddId: Int, val idx: Int, val tableSplit: Seq[String]) extends Partition {
  override def hashCode(): Int = 41 * (41 + rddId) + idx
  override val index: Int = idx
}

class SampleRDD[K, V](sc: SparkContext, keyClass: KeyVal[K, V]) extends RDD[(K, V)](sc, Nil) with Logging {

  override def getPartitions: Array[Partition] = {
    val hosts = Array("master", "slave1", "slave2", "slave3")
    val result = new Array[Partition](4)
    for (i <- 0 until result.length) {
      result(i) = new SamplePartition(id, i, Array(hosts(i)))
    }
    result
  }

  override def compute(theSplit: Partition, context: TaskContext) = {
    val iter = new Iterator[(K, V)] {
      val split = theSplit.asInstanceOf[SamplePartition]
      logInfo("Executed task for the split " + split.tableSplit)

      // Register an on-task-completion callback to close the input stream.