[jira] [Updated] (SPARK-16517) can't add columns on a table whose column metadata is serialized

2016-07-12 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-16517:
---
Description: 
{code}
// Reproduction (Java, Spark 1.6.x): create a partitioned Parquet table,
// add a column through Hive DDL, then try to append data that contains
// the new column.
setName("abc");
HiveContext hive = getHiveContext();
DataFrame d = hive.createDataFrame(
    getJavaSparkContext().parallelize(
        Arrays.asList(
            RowFactory.create("abc", "abc", 5.0),
            RowFactory.create("abcd", "abcd", 5.0))),
    DataTypes.createStructType(
        Arrays.asList(
            DataTypes.createStructField("card_id", DataTypes.StringType, true),
            DataTypes.createStructField("tag_name", DataTypes.StringType, true),
            DataTypes.createStructField("v", DataTypes.DoubleType, true))));
d.write().partitionBy("v").mode(SaveMode.Overwrite).saveAsTable("abc");

// Add the new column via Hive DDL and refresh the cached metadata.
hive.sql("alter table abc add columns(v2 double)");
hive.refreshTable("abc");
hive.sql("describe abc").show();

// Append rows that carry the new column v2.
DataFrame d2 = hive.createDataFrame(
    getJavaSparkContext().parallelize(
        Arrays.asList(
            RowFactory.create("abc", "abc", 3.0, 4.0),
            RowFactory.create("abcd", "abcd", 3.0, 1.0))),
    new StructType(new StructField[] {
        DataTypes.createStructField("card_id", DataTypes.StringType, true),
        DataTypes.createStructField("tag_name", DataTypes.StringType, true),
        DataTypes.createStructField("v", DataTypes.DoubleType, true),
        DataTypes.createStructField("v2", DataTypes.DoubleType, true) }));
d2.write().partitionBy("v").mode(SaveMode.Append).saveAsTable("abc");
hive.table("abc").show();
{code}
spark.sql.parquet.mergeSchema has been set to "true".
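
For reference, a minimal sketch, assuming the same getHiveContext() helper used in the snippet above, of setting that flag programmatically instead of in spark-defaults.conf; SQLContext.setConf(String, String) is the call used here:
{code}
// Sketch only: programmatic equivalent of spark.sql.parquet.mergeSchema=true.
HiveContext hive = getHiveContext();
hive.setConf("spark.sql.parquet.mergeSchema", "true");
{code}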

The output and the exception thrown by the code are shown below:
{code}

+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
| card_id|   string|       |
|tag_name|   string|       |
|       v|   double|       |
+--------+---------+-------+

2016-07-13 13:40:43,637 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : db=default tbl=abc
2016-07-13 13:40:43,637 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl  ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc
2016-07-13 13:40:43,693 INFO [org.apache.spark.storage.BlockManagerInfo:58] - Removed broadcast_2_piece0 on localhost:50647 in memory (size: 1176.0 B, free: 1125.7 MB)
2016-07-13 13:40:43,700 INFO [org.apache.spark.ContextCleaner:58] - Cleaned accumulator 2
2016-07-13 13:40:43,702 INFO [org.apache.spark.storage.BlockManagerInfo:58] - Removed broadcast_1_piece0 on localhost:50647 in memory (size: 19.4 KB, free: 1125.7 MB)
2016-07-13 13:40:43,702 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : db=default tbl=abc
2016-07-13 13:40:43,703 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl  ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc
Exception in thread "main" java.lang.RuntimeException: Relation[card_id#26,tag_name#27,v#28] ParquetRelation requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:68)
at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249)
at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:58)
at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:57)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 

[jira] [Commented] (SPARK-16516) Support for pushing down filters for decimal and timestamp types in ORC

2016-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374393#comment-15374393
 ] 

Apache Spark commented on SPARK-16516:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14172

> Support for pushing down filters for decimal and timestamp types in ORC
> ---
>
> Key: SPARK-16516
> URL: https://issues.apache.org/jira/browse/SPARK-16516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently filters for {{TimestampType}} and {{DecimalType}} are not being 
> pushed down in the ORC data source although ORC filters support both.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16516) Support for pushing down filters for decimal and timestamp types in ORC

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16516:


Assignee: Apache Spark

> Support for pushing down filters for decimal and timestamp types in ORC
> ---
>
> Key: SPARK-16516
> URL: https://issues.apache.org/jira/browse/SPARK-16516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> Currently filters for {{TimestampType}} and {{DecimalType}} are not being 
> pushed down in the ORC data source although ORC filters support both.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16516) Support for pushing down filters for decimal and timestamp types in ORC

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16516:


Assignee: (was: Apache Spark)

> Support for pushing down filters for decimal and timestamp types in ORC
> ---
>
> Key: SPARK-16516
> URL: https://issues.apache.org/jira/browse/SPARK-16516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently filters for {{TimestampType}} and {{DecimalType}} are not being 
> pushed down in the ORC data source although ORC filters support both.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16517) can't add columns on a table whose column metadata is serialized

2016-07-12 Thread lichenglin (JIRA)
lichenglin created SPARK-16517:
--

 Summary: can't add columns on a table whose column metadata is serialized
 Key: SPARK-16517
 URL: https://issues.apache.org/jira/browse/SPARK-16517
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2
Reporter: lichenglin


{code}
// Reproduction (Java, Spark 1.6.x): create a partitioned Parquet table,
// add a column through Hive DDL, then try to append data that contains
// the new column.
setName("abc");
HiveContext hive = getHiveContext();
DataFrame d = hive.createDataFrame(
    getJavaSparkContext().parallelize(
        Arrays.asList(
            RowFactory.create("abc", "abc", 5.0),
            RowFactory.create("abcd", "abcd", 5.0))),
    DataTypes.createStructType(
        Arrays.asList(
            DataTypes.createStructField("card_id", DataTypes.StringType, true),
            DataTypes.createStructField("tag_name", DataTypes.StringType, true),
            DataTypes.createStructField("v", DataTypes.DoubleType, true))));
d.write().partitionBy("v").mode(SaveMode.Overwrite).saveAsTable("abc");

// Add the new column via Hive DDL and refresh the cached metadata.
hive.sql("alter table abc add columns(v2 double)");
hive.refreshTable("abc");
hive.sql("describe abc").show();

// Append rows that carry the new column v2.
DataFrame d2 = hive.createDataFrame(
    getJavaSparkContext().parallelize(
        Arrays.asList(
            RowFactory.create("abc", "abc", 3.0, 4.0),
            RowFactory.create("abcd", "abcd", 3.0, 1.0))),
    new StructType(new StructField[] {
        DataTypes.createStructField("card_id", DataTypes.StringType, true),
        DataTypes.createStructField("tag_name", DataTypes.StringType, true),
        DataTypes.createStructField("v", DataTypes.DoubleType, true),
        DataTypes.createStructField("v2", DataTypes.DoubleType, true) }));
d2.write().partitionBy("v").mode(SaveMode.Append).saveAsTable("abc");
hive.table("abc").show();
{code}
spark.sql.parquet.mergeSchema has been set to "true".
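
As a hedged aside, the merged Parquet schema can also be inspected by reading the table's files directly with schema merging enabled; the warehouse path below is an assumption, not taken from this report:
{code}
// Sketch only: bypass the metastore relation and let Parquet merge the file schemas.
DataFrame merged = hive.read()
    .option("mergeSchema", "true")
    .parquet("/user/hive/warehouse/abc");   // assumed table location
merged.printSchema();
{code}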

The output and the exception thrown by the code are shown below:
{code}

+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
| card_id|   string|       |
|tag_name|   string|       |
|       v|   double|       |
+--------+---------+-------+

2016-07-13 13:40:43,637 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : db=default tbl=abc
2016-07-13 13:40:43,637 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl  ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc
2016-07-13 13:40:43,693 INFO [org.apache.spark.storage.BlockManagerInfo:58] - Removed broadcast_2_piece0 on localhost:50647 in memory (size: 1176.0 B, free: 1125.7 MB)
2016-07-13 13:40:43,700 INFO [org.apache.spark.ContextCleaner:58] - Cleaned accumulator 2
2016-07-13 13:40:43,702 INFO [org.apache.spark.storage.BlockManagerInfo:58] - Removed broadcast_1_piece0 on localhost:50647 in memory (size: 19.4 KB, free: 1125.7 MB)
2016-07-13 13:40:43,702 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : db=default tbl=abc
2016-07-13 13:40:43,703 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl  ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc
Exception in thread "main" java.lang.RuntimeException: Relation[card_id#26,tag_name#27,v#28] ParquetRelation requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:68)
at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249)
at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:58)
at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:57)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
at 

[jira] [Created] (SPARK-16516) Support for pushing down filters for decimal and timestamp types in ORC

2016-07-12 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-16516:


 Summary: Support for pushing down filters for decimal and 
timestamp types in ORC
 Key: SPARK-16516
 URL: https://issues.apache.org/jira/browse/SPARK-16516
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon


Currently filters for {{TimestampType}} and {{DecimalType}} are not being 
pushed down in the ORC data source although ORC filters support both.
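
As a hedged illustration (the column name, path, and HiveContext-style API below are assumptions, not taken from this issue), this is the kind of predicate that would benefit once pushdown for these types is supported:
{code}
// Illustrative only: a timestamp filter over an ORC source. With pushdown,
// the ORC reader could skip data instead of Spark filtering every row.
DataFrame events = hive.read().orc("/data/events_orc");
long recent = events
    .filter(events.col("event_time").gt(java.sql.Timestamp.valueOf("2016-07-01 00:00:00")))
    .count();
{code}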




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

2016-07-12 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374295#comment-15374295
 ] 

Liwei Lin edited comment on SPARK-16283 at 7/13/16 4:01 AM:


Hive's percentile_approx implementation computes approximate percentile values 
from a histogram (please refer to 
[Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java]
 and 
[Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java]
 for details):
- Hive's percentile_approx's signature is: {{\_FUNC\_(expr, pc, \[nb\])}}
- parameter \[nb\] -- the number of histogram bins to use -- is optionally 
specified by users
- if the number of unique values in the actual dataset is less than or equal 
to this \[nb\], we can expect an exact result; otherwise there are no 
approximation guarantees


Our Dataset's approxQuantile() implementation is not really histogram-based 
(and thus differs from Hive's implementation):
- our Dataset's approxQuantile()'s signature is something like: 
{{\_FUNC\_(expr, pc, relativeError)}}
- parameter relativeError is specified by users and should be in \[0, 1\]; our 
approximation is deterministically bounded by this relativeError -- please refer 
to 
[Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39]
 for details


Since there's no direct deterministic relationship between \[nb\] and 
relativeError, it seems hard to build Hive's percentile_approx on top of our 
Dataset's approxQuantile(). So, [~rxin], [~thunterdb], should we: (a) port 
Hive's implementation into Spark and provide {{\_FUNC\_(expr, pc, \[nb\])}} on 
top of it, or (b) provide {{\_FUNC\_(expr, pc, relativeError)}} directly on top 
of our Dataset's approxQuantile() implementation, which might be 
incompatible with Hive? Could you share some thoughts? Thanks!


was (Author: proflin):
Hive's percentile_approx implementation computes approximate percentile values 
from a histogram (please refer to 
[Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java]
 and 
[Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java]
 for details):
- Hive's percentile_approx's signature is: {{\_FUNC\_(expr, pc, \[nb\])}}
- parameter \[nb\] -- the number of histogram bins to use -- is optionally 
specified by users
- if the number of unique values in the actual dataset is less than or equal 
to this \[nb\], we can expect an exact result; otherwise there are no 
approximation guarantees


Our Dataset's approxQuantile() implementation is not really histogram-based 
(and thus differs from Hive's implementation):
- our Dataset's approxQuantile()'s signature is something like: 
{{\_FUNC\_(expr, pc, relativeError)}}
- parameter relativeError is specified by users and should be in \[0, 1\]; our 
approximation is deterministically bounded by this relativeError -- please refer 
to 
[Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39]
 for details


Since there's no direct deterministic relationship between \[nb\] and 
relativeError, it seems hard to build Hive's percentile_approx on top of our 
Dataset's approxQuantile(). So, [~rxin], [~thunterdb], should we: (a) port 
Hive's implementation into Spark and provide {{\_FUNC\_(expr, pc, \[nb\])}} on 
top of it, or (b) provide {{\_FUNC\_(expr, pc, relativeError)}} directly on top 
of our Dataset's approxQuantile() implementation, which might be 
incompatible with Hive? Thanks!

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-07-12 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374295#comment-15374295
 ] 

Liwei Lin commented on SPARK-16283:
---

Hive's percentile_approx implementation computes approximate percentile values 
from a histogram (please refer to 
[Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java]
 and 
[Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java]
 for details):
- Hive's percentile_approx's signature is: {{\_FUNC\_(expr, pc, \[nb\])}}
- parameter \[nb\] -- the number of histogram bins to use -- is optionally 
specified by users
- if the number of unique values in the actual dataset is less than or equal 
to this \[nb\], we can expect an exact result; otherwise there are no 
approximation guarantees


Our Dataset's approxQuantile() implementation is not really histogram-based 
(and thus differs from Hive's implementation):
- our Dataset's approxQuantile()'s signature is something like: 
{{\_FUNC\_(expr, pc, relativeError)}}
- parameter relativeError is specified by users and should be in \[0, 1\]; our 
approximation is deterministically bounded by this relativeError -- please refer 
to 
[Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39]
 for details


Since there's no direct deterministic relationship between \[nb\] and 
relativeError, it seems hard to build Hive's percentile_approx on top of our 
Dataset's approxQuantile(). So, [~rxin], [~thunterdb], should we: (a) port 
Hive's implementation into Spark and provide {{\_FUNC\_(expr, pc, \[nb\])}} on 
top of it, or (b) provide {{\_FUNC\_(expr, pc, relativeError)}} directly on top 
of our Dataset's approxQuantile() implementation, which might be 
incompatible with Hive? Thanks!
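
For concreteness, a minimal sketch of the approxQuantile() call being contrasted with Hive's signature; the Dataset {{df}} and the column name "v" are placeholders, assuming Spark 2.0's DataFrameStatFunctions:
{code}
// Sketch only: relativeError (0.01 here) plays the role of Hive's [nb] bin
// count, but gives a deterministic error bound on the returned quantiles.
double[] medianAnd95th = df.stat().approxQuantile("v", new double[] { 0.5, 0.95 }, 0.01);
{code}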

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16509) Rename window.partitionBy and window.orderBy

2016-07-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374259#comment-15374259
 ] 

Sun Rui commented on SPARK-16509:
-

Yes, let's rename them to pass the R package check. Maybe windowOrderBy and 
windowPartitionBy?

> Rename window.partitionBy and window.orderBy
> 
>
> Key: SPARK-16509
> URL: https://issues.apache.org/jira/browse/SPARK-16509
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> Right now R CMD check [1] interprets window.partitionBy and window.orderBy as 
> S3 functions defined on the "partitionBy" class or "orderBy" class (similar 
> to, say, summary.lm).
> To avoid confusion I think we should just rename the functions and not use 
> `.` in them?
> cc [~sunrui]
> [1] https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16515) [SPARK][SQL] transformation script got failure for python script

2016-07-12 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374241#comment-15374241
 ] 

Adrian Wang commented on SPARK-16515:
-

The problem is that Spark did not find the right record writer from its conf 
when it has to write records to standard output. So when the Python script 
reads data from standard input, it crashes.
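
As a heavily hedged sketch of what this points at: Hive's standard properties for the transform record reader/writer could be pinned explicitly before running the query. The property names and classes are stock Hive settings, and {{spark}} is an assumed SparkSession; whether this actually avoids the Spark 2.0.0 failure is an assumption, not something confirmed in this thread:
{code}
// Assumption-laden workaround sketch, not a confirmed fix: pin the default
// text record reader/writer used for TRANSFORM/REDUCE scripts.
spark.sql("SET hive.script.recordwriter=org.apache.hadoop.hive.ql.exec.TextRecordWriter");
spark.sql("SET hive.script.recordreader=org.apache.hadoop.hive.ql.exec.TextRecordReader");
{code}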

> [SPARK][SQL] transformation script got failure for python script
> 
>
> Key: SPARK-16515
> URL: https://issues.apache.org/jira/browse/SPARK-16515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Running the SQL below fails in the transformation script (a Python script) 
> with the error message shown below.
> Query SQL:
> {code}
> CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
> SELECT DISTINCT
>   sessionid,
>   wcs_item_sk
> FROM
> (
>   FROM
>   (
> SELECT
>   wcs_user_sk,
>   wcs_item_sk,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams
> WHERE wcs_item_sk IS NOT NULL
> AND   wcs_user_sk IS NOT NULL
> DISTRIBUTE BY wcs_user_sk
> SORT BY
>   wcs_user_sk,
>   tstamp_inSec -- "sessionize" reducer script requires the cluster by uid 
> and sort by tstamp
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wcs_item_sk
>   USING 'python q2-sessionize.py 3600'
>   AS (
> wcs_item_sk BIGINT,
> sessionid STRING)
> ) q02_tmp_sessionize
> CLUSTER BY sessionid
> {code}
> Error Message:
> {code}
> 16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 
> (TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with 
> status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. 
> Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
>   ... 14 more
> 16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 
> (TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess 
> exited with status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
> ) [duplicate 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16515) [SPARK][SQL] transformation script got failure for python script

2016-07-12 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374234#comment-15374234
 ] 

Yi Zhou commented on SPARK-16515:
-

Thank you for your quick attention to this issue.

> [SPARK][SQL] transformation script got failure for python script
> 
>
> Key: SPARK-16515
> URL: https://issues.apache.org/jira/browse/SPARK-16515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Running the SQL below fails in the transformation script (a Python script) 
> with the error message shown below.
> Query SQL:
> {code}
> CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
> SELECT DISTINCT
>   sessionid,
>   wcs_item_sk
> FROM
> (
>   FROM
>   (
> SELECT
>   wcs_user_sk,
>   wcs_item_sk,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams
> WHERE wcs_item_sk IS NOT NULL
> AND   wcs_user_sk IS NOT NULL
> DISTRIBUTE BY wcs_user_sk
> SORT BY
>   wcs_user_sk,
>   tstamp_inSec -- "sessionize" reducer script requires the cluster by uid 
> and sort by tstamp
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wcs_item_sk
>   USING 'python q2-sessionize.py 3600'
>   AS (
> wcs_item_sk BIGINT,
> sessionid STRING)
> ) q02_tmp_sessionize
> CLUSTER BY sessionid
> {code}
> Error Message:
> {code}
> 16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 
> (TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with 
> status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. 
> Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
>   ... 14 more
> 16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 
> (TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess 
> exited with status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
> ) [duplicate 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16515) [SPARK][SQL] transformation script got failure for python script

2016-07-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374214#comment-15374214
 ] 

Hyukjin Kwon commented on SPARK-16515:
--

Oh, I will refer to the PR. Thanks.

> [SPARK][SQL] transformation script got failure for python script
> 
>
> Key: SPARK-16515
> URL: https://issues.apache.org/jira/browse/SPARK-16515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Running the SQL below fails in the transformation script (a Python script) 
> with the error message shown below.
> Query SQL:
> {code}
> CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
> SELECT DISTINCT
>   sessionid,
>   wcs_item_sk
> FROM
> (
>   FROM
>   (
> SELECT
>   wcs_user_sk,
>   wcs_item_sk,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams
> WHERE wcs_item_sk IS NOT NULL
> AND   wcs_user_sk IS NOT NULL
> DISTRIBUTE BY wcs_user_sk
> SORT BY
>   wcs_user_sk,
>   tstamp_inSec -- "sessionize" reducer script requires the cluster by uid 
> and sort by tstamp
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wcs_item_sk
>   USING 'python q2-sessionize.py 3600'
>   AS (
> wcs_item_sk BIGINT,
> sessionid STRING)
> ) q02_tmp_sessionize
> CLUSTER BY sessionid
> {code}
> Error Message:
> {code}
> 16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 
> (TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with 
> status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. 
> Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
>   ... 14 more
> 16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 
> (TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess 
> exited with status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
> ) [duplicate 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16515) [SPARK][SQL] transformation script got failure for python script

2016-07-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374210#comment-15374210
 ] 

Hyukjin Kwon commented on SPARK-16515:
--

Hi, do you mind if I ask you to narrow it down or share some code with sample 
data so that I can reproduce it?
I cannot reproduce it with the information above. It would be nicer if there 
were some code and sample data to test.
BTW, it does not seem to be a Spark issue anyway, judging from the message:

{code}
  File "q2-sessionize.py", line 49, in 
user_sk, tstamp_str, item_sk  = line.strip().split("\t")
ValueError: too many values to unpack
{code}

> [SPARK][SQL] transformation script got failure for python script
> 
>
> Key: SPARK-16515
> URL: https://issues.apache.org/jira/browse/SPARK-16515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Running the SQL below fails in the transformation script (a Python script) 
> with the error message shown below.
> Query SQL:
> {code}
> CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
> SELECT DISTINCT
>   sessionid,
>   wcs_item_sk
> FROM
> (
>   FROM
>   (
> SELECT
>   wcs_user_sk,
>   wcs_item_sk,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams
> WHERE wcs_item_sk IS NOT NULL
> AND   wcs_user_sk IS NOT NULL
> DISTRIBUTE BY wcs_user_sk
> SORT BY
>   wcs_user_sk,
>   tstamp_inSec -- "sessionize" reducer script requires the cluster by uid 
> and sort by tstamp
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wcs_item_sk
>   USING 'python q2-sessionize.py 3600'
>   AS (
> wcs_item_sk BIGINT,
> sessionid STRING)
> ) q02_tmp_sessionize
> CLUSTER BY sessionid
> {code}
> Error Message:
> {code}
> 16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 
> (TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with 
> status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. 
> Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
>   ... 14 more
> 16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 
> (TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess 
> exited with status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
> ) [duplicate 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (SPARK-16515) [SPARK][SQL] transformation script got failure for python script

2016-07-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374210#comment-15374210
 ] 

Hyukjin Kwon edited comment on SPARK-16515 at 7/13/16 2:35 AM:
---

Hi, do you mind if I ask you to narrow it down or share some code with sample 
data so that I can reproduce it?
I cannot reproduce it with the information above. It would be nicer if there 
were some code and sample data to test.


was (Author: hyukjin.kwon):
Hi, do you mind if I ask you to narrow it down or share some code with sample 
data so that I can reproduce it?
I cannot reproduce it with the information above. It would be nicer if there 
were some code and sample data to test.
BTW, it does not seem to be a Spark issue anyway, judging from the message:

{code}
  File "q2-sessionize.py", line 49, in 
user_sk, tstamp_str, item_sk  = line.strip().split("\t")
ValueError: too many values to unpack
{code}

> [SPARK][SQL] transformation script got failure for python script
> 
>
> Key: SPARK-16515
> URL: https://issues.apache.org/jira/browse/SPARK-16515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Running the SQL below fails in the transformation script (a Python script) 
> with the error message shown below.
> Query SQL:
> {code}
> CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
> SELECT DISTINCT
>   sessionid,
>   wcs_item_sk
> FROM
> (
>   FROM
>   (
> SELECT
>   wcs_user_sk,
>   wcs_item_sk,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams
> WHERE wcs_item_sk IS NOT NULL
> AND   wcs_user_sk IS NOT NULL
> DISTRIBUTE BY wcs_user_sk
> SORT BY
>   wcs_user_sk,
>   tstamp_inSec -- "sessionize" reducer script requires the cluster by uid 
> and sort by tstamp
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wcs_item_sk
>   USING 'python q2-sessionize.py 3600'
>   AS (
> wcs_item_sk BIGINT,
> sessionid STRING)
> ) q02_tmp_sessionize
> CLUSTER BY sessionid
> {code}
> Error Message:
> {code}
> 16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 
> (TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with 
> status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. 
> Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
>   ... 14 more
> 16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 
> (TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess 
> exited with status 1. Error: Traceback (most 

[jira] [Assigned] (SPARK-16515) [SPARK][SQL] transformation script got failure for python script

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16515:


Assignee: (was: Apache Spark)

> [SPARK][SQL] transformation script got failure for python script
> 
>
> Key: SPARK-16515
> URL: https://issues.apache.org/jira/browse/SPARK-16515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Running the SQL below fails in the transformation script (a Python script) 
> with the error message shown below.
> Query SQL:
> {code}
> CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
> SELECT DISTINCT
>   sessionid,
>   wcs_item_sk
> FROM
> (
>   FROM
>   (
> SELECT
>   wcs_user_sk,
>   wcs_item_sk,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams
> WHERE wcs_item_sk IS NOT NULL
> AND   wcs_user_sk IS NOT NULL
> DISTRIBUTE BY wcs_user_sk
> SORT BY
>   wcs_user_sk,
>   tstamp_inSec -- "sessionize" reducer script requires the cluster by uid 
> and sort by tstamp
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wcs_item_sk
>   USING 'python q2-sessionize.py 3600'
>   AS (
> wcs_item_sk BIGINT,
> sessionid STRING)
> ) q02_tmp_sessionize
> CLUSTER BY sessionid
> {code}
> Error Message:
> {code}
> 16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 
> (TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with 
> status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. 
> Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
>   ... 14 more
> 16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 
> (TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess 
> exited with status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
> ) [duplicate 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16515) [SPARK][SQL] transformation script got failure for python script

2016-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374207#comment-15374207
 ] 

Apache Spark commented on SPARK-16515:
--

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/14169

> [SPARK][SQL] transformation script got failure for python script
> 
>
> Key: SPARK-16515
> URL: https://issues.apache.org/jira/browse/SPARK-16515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Running the SQL below fails in the transformation script (a Python script) 
> with the error message shown below.
> Query SQL:
> {code}
> CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
> SELECT DISTINCT
>   sessionid,
>   wcs_item_sk
> FROM
> (
>   FROM
>   (
> SELECT
>   wcs_user_sk,
>   wcs_item_sk,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams
> WHERE wcs_item_sk IS NOT NULL
> AND   wcs_user_sk IS NOT NULL
> DISTRIBUTE BY wcs_user_sk
> SORT BY
>   wcs_user_sk,
>   tstamp_inSec -- "sessionize" reducer script requires the cluster by uid 
> and sort by tstamp
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wcs_item_sk
>   USING 'python q2-sessionize.py 3600'
>   AS (
> wcs_item_sk BIGINT,
> sessionid STRING)
> ) q02_tmp_sessionize
> CLUSTER BY sessionid
> {code}
> Error Message:
> {code}
> 16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 
> (TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with 
> status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. 
> Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
>   ... 14 more
> 16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 
> (TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess 
> exited with status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
> ) [duplicate 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16515) [SPARK][SQL] transformation script got failure for python script

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16515:


Assignee: Apache Spark

> [SPARK][SQL] transformation script got failure for python script
> 
>
> Key: SPARK-16515
> URL: https://issues.apache.org/jira/browse/SPARK-16515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Assignee: Apache Spark
>Priority: Critical
>
> Running the SQL below fails in the Python transformation script with the 
> error message shown below.
> Query SQL:
> {code}
> CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
> SELECT DISTINCT
>   sessionid,
>   wcs_item_sk
> FROM
> (
>   FROM
>   (
> SELECT
>   wcs_user_sk,
>   wcs_item_sk,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams
> WHERE wcs_item_sk IS NOT NULL
> AND   wcs_user_sk IS NOT NULL
> DISTRIBUTE BY wcs_user_sk
> SORT BY
>   wcs_user_sk,
>   tstamp_inSec -- "sessionize" reducer script requires the cluster by uid 
> and sort by tstamp
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wcs_item_sk
>   USING 'python q2-sessionize.py 3600'
>   AS (
> wcs_item_sk BIGINT,
> sessionid STRING)
> ) q02_tmp_sessionize
> CLUSTER BY sessionid
> {code}
> Error Message:
> {code}
> 16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 
> (TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with 
> status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. 
> Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
>   ... 14 more
> 16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 
> (TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess 
> exited with status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
> ) [duplicate 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16515) [SPARK][SQL] transformation script got failure for python script

2016-07-12 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-16515:
---

 Summary: [SPARK][SQL] transformation script got failure for python 
script
 Key: SPARK-16515
 URL: https://issues.apache.org/jira/browse/SPARK-16515
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Yi Zhou
Priority: Critical


Running the SQL below fails in the Python transformation script with the error 
message shown below.
Query SQL:
{code}
CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
SELECT DISTINCT
  sessionid,
  wcs_item_sk
FROM
(
  FROM
  (
SELECT
  wcs_user_sk,
  wcs_item_sk,
  (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
FROM web_clickstreams
WHERE wcs_item_sk IS NOT NULL
AND   wcs_user_sk IS NOT NULL
DISTRIBUTE BY wcs_user_sk
SORT BY
  wcs_user_sk,
  tstamp_inSec -- "sessionize" reducer script requires the cluster by uid 
and sort by tstamp
  ) clicksAnWebPageType
  REDUCE
wcs_user_sk,
tstamp_inSec,
wcs_item_sk
  USING 'python q2-sessionize.py 3600'
  AS (
wcs_item_sk BIGINT,
sessionid STRING)
) q02_tmp_sessionize
CLUSTER BY sessionid
{code}

Error Message:
{code}
16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 
(TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with 
status 1. Error: Traceback (most recent call last):
  File "q2-sessionize.py", line 49, in 
user_sk, tstamp_str, item_sk  = line.strip().split("\t")
ValueError: too many values to unpack
at 
org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
at 
org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. 
Error: Traceback (most recent call last):
  File "q2-sessionize.py", line 49, in 
user_sk, tstamp_str, item_sk  = line.strip().split("\t")
ValueError: too many values to unpack
at 
org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
at 
org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
... 14 more

16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 
(TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess 
exited with status 1. Error: Traceback (most recent call last):
  File "q2-sessionize.py", line 49, in 
user_sk, tstamp_str, item_sk  = line.strip().split("\t")
ValueError: too many values to unpack
) [duplicate 1]
{code}
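The traceback shows a strict three-way unpack of each tab-separated input line, so the reducer is receiving at least one line with more than three fields. The sketch below is a hypothetical illustration (the sample line is made up, not taken from the benchmark data) of how that produces exactly this ValueError, plus one defensive way to parse:
{code}
# Hypothetical illustration of the failure mode above: q2-sessionize.py
# unpacks each tab-separated line into exactly three variables, so an extra
# field raises ValueError. The sample line here is made up for illustration.
line = "1234\t86400\t42\tunexpected"   # four fields instead of three

try:
    user_sk, tstamp_str, item_sk = line.strip().split("\t")
except ValueError as e:
    print(e)                           # prints the "too many values to unpack" error

# One defensive variant: keep only the first three fields.
user_sk, tstamp_str, item_sk = line.strip().split("\t")[:3]
{code}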



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16514) RegexExtract and RegexReplace crash on non-nullable input

2016-07-12 Thread Eric Liang (JIRA)
Eric Liang created SPARK-16514:
--

 Summary: RegexExtract and RegexReplace crash on non-nullable input
 Key: SPARK-16514
 URL: https://issues.apache.org/jira/browse/SPARK-16514
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Eric Liang
Priority: Critical


RegexExtract and RegexReplace currently crash on non-nullable input due to the 
use of a hard-coded local variable name (e.g. compilation fails with 
java.lang.Exception: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 85, 
Column 26: Redefinition of local variable "m").
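For context, a minimal PySpark sketch that feeds non-nullable input to both expressions is shown below. It is a hypothetical reproduction: lit() produces non-nullable columns, so the regex expressions take the code path described above, but whether this exact query trips the codegen failure depends on the generated code.
{code}
# Hypothetical reproduction sketch: lit() produces non-nullable columns, so
# regexp_replace/regexp_extract receive non-nullable input as described above.
# Whether this exact query triggers the "Redefinition of local variable m"
# compile failure depends on the generated code; treat it as an illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, regexp_extract, regexp_replace

spark = SparkSession.builder.master("local[1]").appName("regex-nonnull").getOrCreate()

df = spark.range(1).select(
    regexp_replace(lit("100-200"), r"(\d+)", "num").alias("replaced"),
    regexp_extract(lit("100-200"), r"(\d+)-(\d+)", 1).alias("extracted"),
)
df.show()
{code}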



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16514) RegexExtract and RegexReplace crash on non-nullable input

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16514:


Assignee: Apache Spark

> RegexExtract and RegexReplace crash on non-nullable input
> -
>
> Key: SPARK-16514
> URL: https://issues.apache.org/jira/browse/SPARK-16514
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Critical
>
> RegexExtract and RegexReplace currently crash on non-nullable input due to 
> the use of a hard-coded local variable name (e.g. compilation fails with 
> java.lang.Exception: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 85, Column 26: Redefinition of local variable "m").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16514) RegexExtract and RegexReplace crash on non-nullable input

2016-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374200#comment-15374200
 ] 

Apache Spark commented on SPARK-16514:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/14168

> RegexExtract and RegexReplace crash on non-nullable input
> -
>
> Key: SPARK-16514
> URL: https://issues.apache.org/jira/browse/SPARK-16514
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Eric Liang
>Priority: Critical
>
> RegexExtract and RegexReplace currently crash on non-nullable input due to 
> the use of a hard-coded local variable name (e.g. compilation fails with 
> java.lang.Exception: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 85, Column 26: Redefinition of local variable "m").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16514) RegexExtract and RegexReplace crash on non-nullable input

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16514:


Assignee: (was: Apache Spark)

> RegexExtract and RegexReplace crash on non-nullable input
> -
>
> Key: SPARK-16514
> URL: https://issues.apache.org/jira/browse/SPARK-16514
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Eric Liang
>Priority: Critical
>
> RegexExtract and RegexReplace currently crash on non-nullable input due to 
> the use of a hard-coded local variable name (e.g. compilation fails with 
> java.lang.Exception: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 85, Column 26: Redefinition of local variable "m").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException

2016-07-12 Thread Vladimir Ivanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363438#comment-15363438
 ] 

Vladimir Ivanov edited comment on SPARK-16334 at 7/13/16 1:51 AM:
--

Hi, we discovered a problem with the same stack trace in Spark 2.0. In our case 
it is thrown during a DataFrame.rdd.aggregate call. Moreover, it somehow 
depends on the volume of data, because it is not thrown when we change the 
filter criteria accordingly. We used Spark SQL to write these Parquet files and 
did not explicitly specify the WriterVersion option, so I believe whatever 
version is set by default was used.


was (Author: vivanov):
Hi, we discovered problem with the same stacktrace in Spark 2.0. In our case 
it's thrown during DataFrame.rdd call. Moreover it somehow depends on volume of 
data, because it is not thrown when we change filter criteria accordingly. We 
used SparkSQL to write these parquet files and didn't explicitly specify 
WriterVersion option so I believe whatever version is set by default was used.

> [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
> -
>
> Key: SPARK-16334
> URL: https://issues.apache.org/jira/browse/SPARK-16334
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Critical
>  Labels: sql
>
> Query:
> {code}
> select * from blabla where user_id = 415706251
> {code}
> Error:
> {code}
> 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 
> (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934
> at 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Works on 1.6.1
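The stack trace above goes through VectorizedColumnReader, so one way to narrow this down is to re-run the failing query with the vectorized Parquet reader disabled and compare. This is a hedged diagnostic sketch, not a fix from the report; spark.sql.parquet.enableVectorizedReader is an existing Spark 2.0 option, and the table/filter come from the description.
{code}
# Hedged diagnostic sketch (not from the report): rerun the failing query with
# the vectorized Parquet reader disabled to see whether the
# ArrayIndexOutOfBoundsException disappears. Assumes a SparkSession named
# `spark` (as in the pyspark shell) and the table/filter from the description.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("select * from blabla where user_id = 415706251").show()
{code}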



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15382) monotonicallyIncreasingId doesn't work when data is upsampled

2016-07-12 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374160#comment-15374160
 ] 

Takeshi Yamamuro commented on SPARK-15382:
--

okay, thanks :) I'll check this.

> monotonicallyIncreasingId doesn't work when data is upsampled
> -
>
> Key: SPARK-15382
> URL: https://issues.apache.org/jira/browse/SPARK-15382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Mateusz Buśkiewicz
>
> Assigned ids are not unique
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import monotonicallyIncreasingId
> hiveContext.createDataFrame([Row(a=1), Row(a=2)]).sample(True, 
> 10.0).withColumn('id', monotonicallyIncreasingId()).collect()
> {code}
> Output:
> {code}
> [Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792)]
> {code}
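A hedged workaround sketch (not from the report): derive the ids from RDD.zipWithUniqueId(), which numbers the physical output rows and therefore stays unique even after sample(True, ...) duplicates rows. It reuses the hiveContext and sample data from the quoted snippet.
{code}
# Hedged workaround sketch (not from the report): zipWithUniqueId() assigns an
# id per physical output row, so ids stay unique even after sampling with
# replacement duplicates rows. Assumes the same hiveContext as above.
from pyspark.sql import Row

sampled = hiveContext.createDataFrame([Row(a=1), Row(a=2)]).sample(True, 10.0)
with_id = sampled.rdd.zipWithUniqueId().map(lambda x: Row(a=x[0].a, id=x[1]))
hiveContext.createDataFrame(with_id).collect()
{code}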



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16513) Spark executor deadlocks itself in memory management

2016-07-12 Thread Steven Lowenthal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Lowenthal updated SPARK-16513:
-
Attachment: screenshot-1.png

> Spark executor deadlocks itself in memory management
> 
>
> Key: SPARK-16513
> URL: https://issues.apache.org/jira/browse/SPARK-16513
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Steven Lowenthal
> Attachments: screenshot-1.png, sparklog
>
>
> I have a Spark Streaming application that uses stateful RDDs (2 to be 
> exact), but a given job only uses one. The last part of the executor stderr 
> log is enclosed. There is no output in stdout. Three concurrent Spark tasks 
> on the executor are deadlocked as follows:
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1029)
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1009)
> org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:460)
> org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:449)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.storage.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:449)
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:89)
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:69)
> org.apache.spark.memory.UnifiedMemoryManager.acquireStorageMemory(UnifiedMemoryManager.scala:155)
> org.apache.spark.memory.UnifiedMemoryManager.acquireUnrollMemory(UnifiedMemoryManager.scala:162)
> org.apache.spark.storage.MemoryStore.reserveUnrollMemoryForThisTask(MemoryStore.scala:493)
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:291)
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:89)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:379)
> org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:346)
> org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:133)
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:800)
> org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:676)
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:89)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> org.apache.spark.memory.MemoryManager.releaseExecutionMemory(MemoryManager.scala:120)
> org.apache.spark.memory.TaskMemoryManager.releaseExecutionMemory(TaskMemoryManager.java:201)
> org.apache.spark.util.collection.Spillable$class.releaseMemory(Spillable.scala:111)
> org.apache.spark.util.collection.ExternalSorter.releaseMemory(ExternalSorter.scala:89)
> 

[jira] [Updated] (SPARK-16513) Spark executor deadlocks itself in memory management

2016-07-12 Thread Steven Lowenthal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Lowenthal updated SPARK-16513:
-
Summary: Spark executor deadlocks itself in memory management  (was: Spark 
executore deadlocks itself in memory management)

> Spark executor deadlocks itself in memory management
> 
>
> Key: SPARK-16513
> URL: https://issues.apache.org/jira/browse/SPARK-16513
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Steven Lowenthal
> Attachments: sparklog
>
>
> I have a Spark Streaming application that uses stateful RDDs (2 to be 
> exact), but a given job only uses one. The last part of the executor stderr 
> log is enclosed. There is no output in stdout. Three concurrent Spark tasks 
> on the executor are deadlocked as follows:
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1029)
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1009)
> org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:460)
> org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:449)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.storage.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:449)
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:89)
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:69)
> org.apache.spark.memory.UnifiedMemoryManager.acquireStorageMemory(UnifiedMemoryManager.scala:155)
> org.apache.spark.memory.UnifiedMemoryManager.acquireUnrollMemory(UnifiedMemoryManager.scala:162)
> org.apache.spark.storage.MemoryStore.reserveUnrollMemoryForThisTask(MemoryStore.scala:493)
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:291)
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:89)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:379)
> org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:346)
> org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:133)
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:800)
> org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:676)
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:89)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> org.apache.spark.memory.MemoryManager.releaseExecutionMemory(MemoryManager.scala:120)
> org.apache.spark.memory.TaskMemoryManager.releaseExecutionMemory(TaskMemoryManager.java:201)
> org.apache.spark.util.collection.Spillable$class.releaseMemory(Spillable.scala:111)
> 

[jira] [Updated] (SPARK-16513) Spark executor deadlocks itself in memory management

2016-07-12 Thread Steven Lowenthal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Lowenthal updated SPARK-16513:
-
Attachment: sparklog

> Spark executor deadlocks itself in memory management
> 
>
> Key: SPARK-16513
> URL: https://issues.apache.org/jira/browse/SPARK-16513
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Steven Lowenthal
> Attachments: sparklog
>
>
> I have a Spark Streaming application that uses stateful RDDs (2 to be 
> exact), but a given job only uses one. The last part of the executor stderr 
> log is enclosed. There is no output in stdout. Three concurrent Spark tasks 
> on the executor are deadlocked as follows:
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1029)
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1009)
> org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:460)
> org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:449)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.storage.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:449)
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:89)
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:69)
> org.apache.spark.memory.UnifiedMemoryManager.acquireStorageMemory(UnifiedMemoryManager.scala:155)
> org.apache.spark.memory.UnifiedMemoryManager.acquireUnrollMemory(UnifiedMemoryManager.scala:162)
> org.apache.spark.storage.MemoryStore.reserveUnrollMemoryForThisTask(MemoryStore.scala:493)
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:291)
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:89)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:379)
> org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:346)
> org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:133)
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:800)
> org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:676)
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:89)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> org.apache.spark.memory.MemoryManager.releaseExecutionMemory(MemoryManager.scala:120)
> org.apache.spark.memory.TaskMemoryManager.releaseExecutionMemory(TaskMemoryManager.java:201)
> org.apache.spark.util.collection.Spillable$class.releaseMemory(Spillable.scala:111)
> org.apache.spark.util.collection.ExternalSorter.releaseMemory(ExternalSorter.scala:89)
> org.apache.spark.util.collection.ExternalSorter.stop(ExternalSorter.scala:694)
> 

[jira] [Created] (SPARK-16513) Spark executore deadlocks itself in memory management

2016-07-12 Thread Steven Lowenthal (JIRA)
Steven Lowenthal created SPARK-16513:


 Summary: Spark executore deadlocks itself in memory management
 Key: SPARK-16513
 URL: https://issues.apache.org/jira/browse/SPARK-16513
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1
Reporter: Steven Lowenthal


I have a Spark Streaming application that uses stateful RDDs (2 to be exact), 
but a given job only uses one. The last part of the executor stderr log is 
enclosed. There is no output in stdout. Three concurrent Spark tasks on the 
executor are deadlocked as follows:


org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1029)
org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1009)
org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:460)
org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:449)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.storage.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:449)
org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:89)
org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:69)
org.apache.spark.memory.UnifiedMemoryManager.acquireStorageMemory(UnifiedMemoryManager.scala:155)
org.apache.spark.memory.UnifiedMemoryManager.acquireUnrollMemory(UnifiedMemoryManager.scala:162)
org.apache.spark.storage.MemoryStore.reserveUnrollMemoryForThisTask(MemoryStore.scala:493)
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:291)
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:89)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)



org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:379)
org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:346)
org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:133)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:800)
org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:676)
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:89)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)



org.apache.spark.memory.MemoryManager.releaseExecutionMemory(MemoryManager.scala:120)
org.apache.spark.memory.TaskMemoryManager.releaseExecutionMemory(TaskMemoryManager.java:201)
org.apache.spark.util.collection.Spillable$class.releaseMemory(Spillable.scala:111)
org.apache.spark.util.collection.ExternalSorter.releaseMemory(ExternalSorter.scala:89)
org.apache.spark.util.collection.ExternalSorter.stop(ExternalSorter.scala:694)
org.apache.spark.shuffle.sort.SortShuffleWriter.stop(SortShuffleWriter.scala:95)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:89)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

[jira] [Commented] (SPARK-16512) No way to load CSV data without dropping whole rows when some of data is not matched with given schema

2016-07-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374150#comment-15374150
 ] 

Hyukjin Kwon commented on SPARK-16512:
--

I will work on this as soon as https://github.com/databricks/spark-csv/pull/298 
is merged.

> No way to load CSV data without dropping whole rows when some of data is not 
> matched with given schema
> --
>
> Key: SPARK-16512
> URL: https://issues.apache.org/jira/browse/SPARK-16512
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, there is no way to read CSV data without dropping whole rows when 
> some of the data does not match the given schema.
> It seems there are use cases such as the following:
> {code}
> a,b
> 1,c
> {code}
> Here, {{a}} can be dirty data in real use cases.
> But the code below:
> {code}
> val path = "/tmp/test.csv"
> val schema = StructType(
>   StructField("a", IntegerType, nullable = true) ::
>   StructField("b", StringType, nullable = true) :: Nil)
> val df = spark.read
>   .format("csv")
>   .option("mode", "PERMISSIVE")
>   .schema(schema)
>   .load(path)
> df.show()
> {code}
> emits the exception below:
> {code}
> java.lang.NumberFormatException: For input string: "a"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Integer.parseInt(Integer.java:580)
>   at java.lang.Integer.parseInt(Integer.java:615)
>   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
>   at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
> {code}
> With {{DROPMALFORMED}} and {{FAILFAST}}, the whole row is either dropped or 
> the read fails with an exception.
> FYI, this is not the case for JSON because JSON data sources can handle this 
> with {{PERMISSIVE}} mode as below:
> {code}
> val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
> val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
> spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
> {code}
> {code}
> ++
> |   a|
> ++
> |   1|
> |null|
> ++
> {code}
> Please refer to https://github.com/databricks/spark-csv/pull/298



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15382) monotonicallyIncreasingId doesn't work when data is upsampled

2016-07-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374148#comment-15374148
 ] 

Hyukjin Kwon commented on SPARK-15382:
--

I was just looking into this but don't mind if you open a PR :)

> monotonicallyIncreasingId doesn't work when data is upsampled
> -
>
> Key: SPARK-15382
> URL: https://issues.apache.org/jira/browse/SPARK-15382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Mateusz Buśkiewicz
>
> Assigned ids are not unique
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import monotonicallyIncreasingId
> hiveContext.createDataFrame([Row(a=1), Row(a=2)]).sample(True, 
> 10.0).withColumn('id', monotonicallyIncreasingId()).collect()
> {code}
> Output:
> {code}
> [Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9140) Replace TimeTracker by Stopwatch

2016-07-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9140:
-
Target Version/s: 2.1.0

> Replace TimeTracker by Stopwatch
> 
>
> Key: SPARK-9140
> URL: https://issues.apache.org/jira/browse/SPARK-9140
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> We can replace TimeTracker in the tree implementations with Stopwatch. The 
> initial PR could use local stopwatches only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16512) No way to load CSV data without dropping whole rows when some of data is not matched with given schema

2016-07-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-16512:
-
Description: 
Currently, there is no way to read CSV data without dropping whole rows when 
some of the data does not match the given schema.

It seems there are use cases such as the following:

{code}
a,b
1,c
{code}

Here, {{a}} can be dirty data in real use cases.

But the code below:

{code}
val path = "/tmp/test.csv"
val schema = StructType(
  StructField("a", IntegerType, nullable = true) ::
  StructField("b", StringType, nullable = true) :: Nil)
val df = spark.read
  .format("csv")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load(path)
df.show()
{code}

emits the exception below:

{code}
java.lang.NumberFormatException: For input string: "a"
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
{code}

With {{DROPMALFORMED}} and {{FAILFAST}}, the whole row is either dropped or the 
read fails with an exception.

FYI, this is not the case for JSON because JSON data sources can handle this 
with {{PERMISSIVE}} mode as below:

{code}
val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
{code}

{code}
++
|   a|
++
|   1|
|null|
++
{code}

Please refer to https://github.com/databricks/spark-csv/pull/298

  was:
Currently, there is no way to read CSV data without dropping whole rows when 
some of data is not matched with given schema.

It seems there are some usecases as below:

{code}
a,b
1,c
{code}

Here, it seems the {{a}} can be a dirty data.

But codes below:

{code}
val path = testFile(carsFile)
val schema = StructType(
  StructField("a", IntegerType, nullable = true) ::
  StructField("b", StringType, nullable = true) :: Nil)
val df = spark.read
  .format("csv")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load(path)
df.show()
{code}

emits the exception below:

{code}
java.lang.NumberFormatException: For input string: "a"
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
{code}

With {{DROPMALFORM}} and {{FAILFAST}}, it will be dropped or failed with an 
exception.

FYI, this is not the case for JSON because JSON data sources can handle this 
with {{PERMISSIVE}} mode as below:

{code}
val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
{code}

{code}
++
|   a|
++
|   1|
|null|
++
{code}

Please refer https://github.com/databricks/spark-csv/pull/298


> No way to load CSV data without dropping whole rows when some of data is not 
> matched with given schema
> --
>
> Key: SPARK-16512
> URL: https://issues.apache.org/jira/browse/SPARK-16512
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, there is no way to read CSV data without dropping whole rows when 
> some of the data does not match the given schema.
> It seems there are use cases such as the following:
> {code}
> a,b
> 1,c
> {code}
> Here, {{a}} can be dirty data in real use cases.
> But the code below:
> {code}
> val path = "/tmp/test.csv"
> val schema = StructType(
>   StructField("a", IntegerType, nullable = true) ::
>   StructField("b", StringType, nullable = true) :: Nil)
> val df = spark.read
>   .format("csv")
>   .option("mode", "PERMISSIVE")
>   .schema(schema)
>   .load(path)
> df.show()
> {code}
> emits the exception below:
> {code}
> java.lang.NumberFormatException: For input string: "a"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Integer.parseInt(Integer.java:580)
>   at java.lang.Integer.parseInt(Integer.java:615)
>   at 
> 

[jira] [Created] (SPARK-16512) No way to load CSV data without dropping whole rows when some of data is not matched with given schema

2016-07-12 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-16512:


 Summary: No way to load CSV data without dropping whole rows when 
some of data is not matched with given schema
 Key: SPARK-16512
 URL: https://issues.apache.org/jira/browse/SPARK-16512
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Minor


Currently, there is no way to read CSV data without dropping whole rows when 
some of the data does not match the given schema.

It seems there are use cases such as the following:

{code}
a,b
1,c
{code}

Here, it seems {{a}} can be dirty data.

But the code below:

{code}
val path = testFile(carsFile)
val schema = StructType(
  StructField("a", IntegerType, nullable = true) ::
  StructField("b", StringType, nullable = true) :: Nil)
val df = spark.read
  .format("csv")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load(path)
df.show()
{code}

emits the exception below:

{code}
java.lang.NumberFormatException: For input string: "a"
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
{code}

With {{DROPMALFORMED}} and {{FAILFAST}}, the whole row is either dropped or the 
read fails with an exception.

FYI, this is not the case for JSON because JSON data sources can handle this 
with {{PERMISSIVE}} mode as below:

{code}
val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
{code}

{code}
++
|   a|
++
|   1|
|null|
++
{code}

Please refer to https://github.com/databricks/spark-csv/pull/298
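A hedged workaround sketch until CSV gets per-field permissive handling: read every column as a string and cast afterwards, so values that do not parse become null instead of failing the whole row. This is an illustration, not the proposed fix; it assumes a SparkSession named spark, and the path is a placeholder for the CSV file from the description.
{code}
# Hedged workaround sketch (not the proposed fix): read all columns as strings,
# then cast; unparseable values become null instead of failing the whole row.
# Assumes a SparkSession named `spark`; the path is a placeholder for the CSV
# file in the description.
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

raw_schema = StructType([
    StructField("a", StringType(), True),
    StructField("b", StringType(), True),
])

df = (spark.read
      .format("csv")
      .schema(raw_schema)
      .load("/tmp/test.csv")
      .withColumn("a", col("a").cast(IntegerType())))
df.show()
{code}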



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-15581) MLlib 2.1 Roadmap

2016-07-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15581:
--
Comment: was deleted

(was: Yuhao's in relocation process and will have no email/IM access from July 
1st to July 6th.

I’ve returned my laptop to Intel Shanghai and will have no access to Intel 
network during the period. Hope I can be back online on July 6th (PST).

My backup email address: hhb...@gmail.com. Mobile: +86 
13738085700.

Regards,
Yuhao

)

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there exist no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) 
> (SPARK-14706)
> ** Note: The linked JIRAs for this are 

[jira] [Commented] (SPARK-16194) No way to dynamically set env vars on driver in cluster mode

2016-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374115#comment-15374115
 ] 

Apache Spark commented on SPARK-16194:
--

User 'mgummelt' has created a pull request for this issue:
https://github.com/apache/spark/pull/14167

> No way to dynamically set env vars on driver in cluster mode
> 
>
> Key: SPARK-16194
> URL: https://issues.apache.org/jira/browse/SPARK-16194
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>Priority: Minor
>
> I often need to dynamically configure a driver when submitting in cluster 
> mode, but there's currently no way of setting env vars.  {{spark-env.sh}} 
> lets me set env vars, but I have to statically build that into my Spark 
> distribution.  I need a solution for specifying them in {{spark-submit}}.  
> Much like {{spark.executorEnv.[ENV]}}, but for drivers.
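For reference, a small sketch contrasting what exists today with what is being requested; the driver-side setting shown in the comment is hypothetical syntax, not an existing option.
{code}
# Sketch contrasting today's mechanism with the request. spark.executorEnv.*
# is an existing setting; the driver-side form in the comment below is
# HYPOTHETICAL syntax illustrating the feature being asked for.
from pyspark import SparkConf

conf = SparkConf()
conf.set("spark.executorEnv.MY_VAR", "value")  # works today: env var on executors

# Requested (hypothetical, no such setting exists yet):
#   spark-submit --conf spark.driverEnv.MY_VAR=value ...
{code}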



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16194) No way to dynamically set env vars on driver in cluster mode

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16194:


Assignee: Apache Spark

> No way to dynamically set env vars on driver in cluster mode
> 
>
> Key: SPARK-16194
> URL: https://issues.apache.org/jira/browse/SPARK-16194
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>Assignee: Apache Spark
>Priority: Minor
>
> I often need to dynamically configure a driver when submitting in cluster 
> mode, but there's currently no way of setting env vars.  {{spark-env.sh}} 
> lets me set env vars, but I have to statically build that into my spark 
> distribution.  I need a solution for specifying them in {{spark-submit}}.  
> Much like {{spark.executorEnv.[ENV]}}, but for drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16194) No way to dynamically set env vars on driver in cluster mode

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16194:


Assignee: (was: Apache Spark)

> No way to dynamically set env vars on driver in cluster mode
> 
>
> Key: SPARK-16194
> URL: https://issues.apache.org/jira/browse/SPARK-16194
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>Priority: Minor
>
> I often need to dynamically configure a driver when submitting in cluster 
> mode, but there's currently no way of setting env vars.  {{spark-env.sh}} 
> lets me set env vars, but I have to statically build that into my spark 
> distribution.  I need a solution for specifying them in {{spark-submit}}.  
> Much like {{spark.executorEnv.[ENV]}}, but for drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15382) monotonicallyIncreasingId doesn't work when data is upsampled

2016-07-12 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374060#comment-15374060
 ] 

Takeshi Yamamuro commented on SPARK-15382:
--

[~hyukjin.kwon]  Would you like to take this one?

> monotonicallyIncreasingId doesn't work when data is upsampled
> -
>
> Key: SPARK-15382
> URL: https://issues.apache.org/jira/browse/SPARK-15382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Mateusz Buśkiewicz
>
> Assigned ids are not unique
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import monotonicallyIncreasingId
> hiveContext.createDataFrame([Row(a=1), Row(a=2)]).sample(True, 
> 10.0).withColumn('id', monotonicallyIncreasingId()).collect()
> {code}
> Output:
> {code}
> [Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792)]
> {code}
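A possible workaround sketch (in Scala, assuming the 2.0-style {{spark}} session entry point), not the fix for this issue: assign ids with {{RDD.zipWithIndex}} after the upsampling, so duplicated rows still get distinct values.

{code}
// Workaround sketch: derive ids from zipWithIndex instead of
// monotonically_increasing_id after sampling with replacement.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val upsampled = spark.range(2).toDF("a").sample(true, 10.0)
val withId = spark.createDataFrame(
  upsampled.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  StructType(upsampled.schema.fields :+ StructField("id", LongType, nullable = false)))
withId.show()
{code}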



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16511) SparkLauncher should allow setting working directory for spark-submit process

2016-07-12 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374029#comment-15374029
 ] 

Marcelo Vanzin commented on SPARK-16511:


Exposing the ProcessBuilder is fine; I would just like it to be exposed 
consistently across both APIs, not just one of them.

> SparkLauncher should allow setting working directory for spark-submit process
> -
>
> Key: SPARK-16511
> URL: https://issues.apache.org/jira/browse/SPARK-16511
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>
> Setting the directory on the Java ProcessBuilder is the only legitimate way of 
> changing the working directory for a process in Java. With applications like Spark 
> Job Server, if we run multiple drivers in client mode they can conflict with each 
> other. There are probably many more subtle breakages that can be caused by running 
> the driver process in the same directory as the parent launcher.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16284) Implement reflect and java_method SQL function

2016-07-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-16284:

Fix Version/s: 2.1.0

> Implement reflect and java_method SQL function
> --
>
> Key: SPARK-16284
> URL: https://issues.apache.org/jira/browse/SPARK-16284
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16284) Implement reflect and java_method SQL function

2016-07-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16284.
-
   Resolution: Fixed
Fix Version/s: 2.0.1

Issue resolved by pull request 14138
[https://github.com/apache/spark/pull/14138]

> Implement reflect and java_method SQL function
> --
>
> Key: SPARK-16284
> URL: https://issues.apache.org/jira/browse/SPARK-16284
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Peter Lee
> Fix For: 2.0.1
>
>
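An illustrative usage sketch of the two functions added here, based on their Hive-compatible semantics (invoking a static Java method by name via reflection); the exact values returned will differ per run.

{code}
// Illustrative only: reflect/java_method call a static Java method by name.
val df = spark.sql(
  """SELECT reflect('java.util.UUID', 'randomUUID') AS random_id,
    |       java_method('java.util.UUID', 'fromString',
    |                   'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2') AS parsed_id
  """.stripMargin)
df.show(false)
{code}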




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16511) SparkLauncher should allow setting working directory for spark-submit process

2016-07-12 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374024#comment-15374024
 ] 

Robert Kruszewski commented on SPARK-16511:
---

From the comments on the original PR, are we fine exposing the full ProcessBuilder? I 
don't think there was a conclusion. We could just expose the directory. The error 
stream redirection isn't that much of a concern with startApplication, but I 
might be missing something here.

> SparkLauncher should allow setting working directory for spark-submit process
> -
>
> Key: SPARK-16511
> URL: https://issues.apache.org/jira/browse/SPARK-16511
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>
> Setting the directory on the Java ProcessBuilder is the only legitimate way of 
> changing the working directory for a process in Java. With applications like Spark 
> Job Server, if we run multiple drivers in client mode they can conflict with each 
> other. There are probably many more subtle breakages that can be caused by running 
> the driver process in the same directory as the parent launcher.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16511) SparkLauncher should allow setting working directory for spark-submit process

2016-07-12 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-16511:
--
Description: Setting directory on java ProcessBuilder is only legitimate 
way of changing working directory for a process in java. With applications like 
spark job server if we run multiple drivers in client mode they can conflict 
with each other. There's probably a lot more subtle breaks that can cause by 
running driver process in same directory as the parent launcher.  (was: Setting 
directory on java ProcessBuilder is only legitimate of changing working 
directory for a process in java. With applications like spark job server if we 
run multiple drivers in client mode they can conflict with each other. There's 
probably a lot more subtle breaks that can cause by running driver process in 
same directory as the parent launcher.)

> SparkLauncher should allow setting working directory for spark-submit process
> -
>
> Key: SPARK-16511
> URL: https://issues.apache.org/jira/browse/SPARK-16511
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>
> Setting the directory on the Java ProcessBuilder is the only legitimate way of 
> changing the working directory for a process in Java. With applications like Spark 
> Job Server, if we run multiple drivers in client mode they can conflict with each 
> other. There are probably many more subtle breakages that can be caused by running 
> the driver process in the same directory as the parent launcher.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16511) SparkLauncher should allow setting working directory for spark-submit process

2016-07-12 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374007#comment-15374007
 ] 

Marcelo Vanzin commented on SPARK-16511:


The PR for SPARK-14702 would allow this, but it hasn't been updated by its 
submitter in a long time.

> SparkLauncher should allow setting working directory for spark-submit process
> -
>
> Key: SPARK-16511
> URL: https://issues.apache.org/jira/browse/SPARK-16511
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>
> Setting the directory on the Java ProcessBuilder is the only legitimate way of 
> changing the working directory for a process in Java. With applications like Spark 
> Job Server, if we run multiple drivers in client mode they can conflict with each 
> other. There are probably many more subtle breakages that can be caused by running 
> the driver process in the same directory as the parent launcher.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16511) SparkLauncher should allow setting working directory for spark-submit process

2016-07-12 Thread Robert Kruszewski (JIRA)
Robert Kruszewski created SPARK-16511:
-

 Summary: SparkLauncher should allow setting working directory for 
spark-submit process
 Key: SPARK-16511
 URL: https://issues.apache.org/jira/browse/SPARK-16511
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Robert Kruszewski


Setting the directory on the Java ProcessBuilder is the only legitimate way of changing 
the working directory for a process in Java. With applications like Spark Job Server, 
if we run multiple drivers in client mode they can conflict with each other. There 
are probably many more subtle breakages that can be caused by running the driver 
process in the same directory as the parent launcher.
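A sketch of what the requested knob could look like from a caller's perspective. Everything except the commented-out {{directory(...)}} call is the existing {{SparkLauncher}} builder API; {{directory(...)}} is hypothetical (it is precisely the feature being asked for), and the paths/class names are placeholders.

{code}
// Existing SparkLauncher API plus a HYPOTHETICAL directory(...) setter
// mirroring java.lang.ProcessBuilder.directory(File).
import org.apache.spark.launcher.SparkLauncher

val handle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")    // placeholder
  .setMainClass("com.example.Main")      // placeholder
  .setMaster("yarn")
  // .directory(new java.io.File("/tmp/driver-workdir"))  // hypothetical, per this request
  .startApplication()
{code}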



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16503) SparkSession should provide Spark version

2016-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373994#comment-15373994
 ] 

Apache Spark commented on SPARK-16503:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14165

> SparkSession should provide Spark version
> -
>
> Key: SPARK-16503
> URL: https://issues.apache.org/jira/browse/SPARK-16503
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Joseph K. Bradley
>
> SparkContext.version is a useful field.
> SparkSession should provide something similar.
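A small sketch contrasting what works today with the accessor this issue requests; {{spark.version}} below is the proposed addition, not something guaranteed to exist at the time of this report.

{code}
// Works today: go through the underlying SparkContext.
val viaContext: String = spark.sparkContext.version
println(viaContext)

// Requested by this issue (shape of the proposed API, shown commented out):
// val viaSession: String = spark.version
{code}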



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16503) SparkSession should provide Spark version

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16503:


Assignee: Apache Spark

> SparkSession should provide Spark version
> -
>
> Key: SPARK-16503
> URL: https://issues.apache.org/jira/browse/SPARK-16503
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> SparkContext.version is a useful field.
> SparkSession should provide something similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16503) SparkSession should provide Spark version

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16503:


Assignee: (was: Apache Spark)

> SparkSession should provide Spark version
> -
>
> Key: SPARK-16503
> URL: https://issues.apache.org/jira/browse/SPARK-16503
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Joseph K. Bradley
>
> SparkContext.version is a useful field.
> SparkSession should provide something similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16506) Subsequent dataframe join don't work

2016-07-12 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373963#comment-15373963
 ] 

Liwei Lin commented on SPARK-16506:
---

Hi [~timotta], thanks for reporting this. This can be reproduced in 2.0 as 
well. Let me look into it and submit a patch. Thanks.

> Subsequent dataframe join don't work
> ---
>
> Key: SPARK-16506
> URL: https://issues.apache.org/jira/browse/SPARK-16506
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Tiago Albineli Motta
>Priority: Minor
>  Labels: bug, dataframe, error, join, joins, sql
>
> Here is the example code:
> {quote}
>   import sql.implicits._
>   val objs = sc.parallelize(Seq(("1", "um"), ("2", "dois"), ("3", 
> "tres"))).toDF.selectExpr("_1 as id", "_2 as name")
>   
>   val rawj = sc.parallelize(Seq(("1", "2"),  ("1", "3"), ("2", "3"), 
> ("2", "1"))).toDF.selectExpr("_1 as id1", "_2 as id2")
>   
>   val join1 = rawj.join(objs, objs("id") === rawj("id1"))
> .withColumnRenamed("id", "anything")
> 
>   println("works...")
>   val join2a = join1.join(objs, 'id2 === 'id )
>   join2a.show()
>   
>   println("works...")
>   val join2b = objs.join(join1, objs("id") === join1("id2"))
>   join2b.show()
>   
>   println("do not works...")
>   val join2c = join1.join(objs, join1("id2") === objs("id") )
>   join2c.show()
> {quote}
> The first two joins work, but the last one gave me this error:
> {quote}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) id#2 missing from anything#8,name#14,name#3,id1#6,id2#7,id#13 in 
> operator !Join Inner, Some((id2#7 = id#2));
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
> {quote}
> Without the first column rename, the error happens silently since the join 
> comes back empty:
> {quote}
>   import sql.implicits._
>   val objs = sc.parallelize(Seq(("1", "um"), ("2", "dois"), ("3", 
> "tres"))).toDF.selectExpr("_1 as id", "_2 as name")
>   
>   val rawj = sc.parallelize(Seq(("1", "2"),  ("1", "3"), ("2", "3"), 
> ("2", "1"))).toDF.selectExpr("_1 as id1", "_2 as id2")
>   
>   val join1 = rawj.join(objs, objs("id") === rawj("id1"))
>   
>   println("do not works...")
>   val join2c = join1.join(objs, join1("id2") === objs("id") )
>   join2c.show()
> {quote}
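A commonly suggested workaround sketch (not a confirmed fix for this issue): rebuild the right-hand DataFrame so its columns receive fresh attribute ids before the second join, which sidesteps the missing-attribute resolution error seen above. It reuses the {{objs}} / {{join1}} definitions from the report.

{code}
// Workaround sketch, reusing objs and join1 from the report above.
val objsFresh = objs.toDF("id", "name")   // same data, new attribute ids
val join2cFixed = join1.join(objsFresh, join1("id2") === objsFresh("id"))
join2cFixed.show()
{code}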



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16468) Confusing results when describe() used on DataFrame with chr columns

2016-07-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373942#comment-15373942
 ] 

Dongjoon Hyun commented on SPARK-16468:
---

IMHO, we don't need to fix this.

> Confusing results when describe() used on DataFrame with chr columns
> 
>
> Key: SPARK-16468
> URL: https://issues.apache.org/jira/browse/SPARK-16468
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Databricks.com
>Reporter: Neil Dewar
>Priority: Minor
>
> The describe() function returns statistical summaries on numeric columns of a 
> DataFrame.  If the DataFrame contains columns of type chr, only the count, 
> min and max stats are returned.
> When a dataframe contains a mixture of numeric and chr columns, the results 
> become jumbled together.
> Example:
> sdfR <- createDataFrame(sqlContext, ToothGrowth)
> collect(describe(sdfR))
> Results:
>   summary                 len supp                dose
> 1   count                  60   60                  60
> 2    mean              18.816                  1.1667
> 3  stddev   7.649315171887615      0.6288721857330792
> 4     min                 4.2   OJ                 0.5
> 5     max                33.9   VC                 2.0
> There appear to be two problems here:
> (1) The mean and stdev values have not been rounded for the columns where 
> there are valid values
> (2) There is no ability to distinguish that the supp column has no values in 
> mean and stdev rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16468) Confusing results when describe() used on DataFrame with chr columns

2016-07-12 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373914#comment-15373914
 ] 

Shivaram Venkataraman commented on SPARK-16468:
---

[~dongjoon] [~n...@dewar-us.com] Is there something specific that we need to 
fix here? If so, can we update the issue description to describe it better?

> Confusing results when describe() used on DataFrame with chr columns
> 
>
> Key: SPARK-16468
> URL: https://issues.apache.org/jira/browse/SPARK-16468
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Databricks.com
>Reporter: Neil Dewar
>Priority: Minor
>
> The describe() function returns statistical summaries on numeric columns of a 
> DataFrame.  If the DataFrame contains columns of type chr, only the count, 
> min and max stats are returned.
> When a dataframe contains a mixture of numeric and chr columns, the results 
> become jumbled together.
> Example:
> sdfR <- createDataFrame(sqlContext, ToothGrowth)
> collect(describe(sdfR))
> Results:
>   summary                 len supp                dose
> 1   count                  60   60                  60
> 2    mean              18.816                  1.1667
> 3  stddev   7.649315171887615      0.6288721857330792
> 4     min                 4.2   OJ                 0.5
> 5     max                33.9   VC                 2.0
> There appear to be two problems here:
> (1) The mean and stdev values have not been rounded for the columns where 
> there are valid values
> (2) There is no ability to distinguish that the supp column has no values in 
> mean and stdev rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16510) Move SparkR test JAR into Spark, include its source code

2016-07-12 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-16510:
-

 Summary: Move SparkR test JAR into Spark, include its source code
 Key: SPARK-16510
 URL: https://issues.apache.org/jira/browse/SPARK-16510
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman


One of the `NOTE`s in the R CMD check is that we currently include a test JAR 
file in SparkR, which is a binary-only artifact. I think we can take two steps 
to address this:

(a) I think we should include the source code for this in, say, core/src/test/ or 
somewhere similar. As far as I know the JAR file just needs to have a single 
method (see the sketch below).

(b) We should move the JAR file out of the SparkR test support and into some 
other location in Spark. The trouble is that it's then tricky to run the test in 
CRAN mode. We could either disable the test for CRAN or download the JAR 
from an external URL.
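A purely hypothetical illustration of point (a) above: the kind of single-method source that could live under core/src/test/ instead of shipping a prebuilt JAR. The package, object, and method names here are made up.

{code}
// Hypothetical only: names and location are illustrative, not an actual Spark source file.
package org.apache.spark.test

object SparkRTestJarHelper {
  // The single method the SparkR test needs to be able to call.
  def helloWorld(): String = "Hello from the SparkR test JAR"
}
{code}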



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16509) Rename window.partitionBy and window.orderBy

2016-07-12 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-16509:
-

 Summary: Rename window.partitionBy and window.orderBy
 Key: SPARK-16509
 URL: https://issues.apache.org/jira/browse/SPARK-16509
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman


Right now R CMD check [1] interprets window.partitionBy and window.orderBy as 
S3 functions defined on the "partitionBy" class or "orderBy" class (similar to, 
say, summary.lm).

To avoid confusion I think we should just rename the functions and not use `.` 
in them?

cc [~sunrui]

[1] https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16508) Fix documentation warnings found by R CMD check

2016-07-12 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-16508:
-

 Summary: Fix documentation warnings found by R CMD check
 Key: SPARK-16508
 URL: https://issues.apache.org/jira/browse/SPARK-16508
 Project: Spark
  Issue Type: Sub-task
Reporter: Shivaram Venkataraman


A full list of warnings after the fixes in SPARK-16507 is at 
https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16507) Add CRAN checks to SparkR

2016-07-12 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-16507:
--
Assignee: Shivaram Venkataraman

> Add CRAN checks to SparkR 
> --
>
> Key: SPARK-16507
> URL: https://issues.apache.org/jira/browse/SPARK-16507
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>
> One of the steps to publishing SparkR is to pass `R CMD check --as-cran`. 
> We should add a script to do this and fix any errors / warnings we find.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16507) Add CRAN checks to SparkR

2016-07-12 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-16507:
-

 Summary: Add CRAN checks to SparkR 
 Key: SPARK-16507
 URL: https://issues.apache.org/jira/browse/SPARK-16507
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman


One of the steps to publishing SparkR is to pass `R CMD check --as-cran`. 
We should add a script to do this and fix any errors / warnings we find.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15923) Spark Application rest api returns "no such app: "

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15923:


Assignee: (was: Apache Spark)

> Spark Application rest api returns "no such app: "
> -
>
> Key: SPARK-15923
> URL: https://issues.apache.org/jira/browse/SPARK-15923
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Env : secure cluster
> Scenario:
> * Run SparkPi application in yarn-client or yarn-cluster mode
> * After the application finishes, check the Spark HS REST API to get details like 
> jobs / executors, etc. 
> {code}
> http://:18080/api/v1/applications/application_1465778870517_0001/1/executors{code}
>  
> The REST API returns HTTP code 404 and prints "HTTP Data: no such app: 
> application_1465778870517_0001"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15923) Spark Application rest api returns "no such app: "

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15923:


Assignee: Apache Spark

> Spark Application rest api returns "no such app: "
> -
>
> Key: SPARK-15923
> URL: https://issues.apache.org/jira/browse/SPARK-15923
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>Assignee: Apache Spark
>
> Env : secure cluster
> Scenario:
> * Run SparkPi application in yarn-client or yarn-cluster mode
> * After the application finishes, check the Spark HS REST API to get details like 
> jobs / executors, etc. 
> {code}
> http://:18080/api/v1/applications/application_1465778870517_0001/1/executors{code}
>  
> The REST API returns HTTP code 404 and prints "HTTP Data: no such app: 
> application_1465778870517_0001"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15923) Spark Application rest api returns "no such app: "

2016-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373804#comment-15373804
 ] 

Apache Spark commented on SPARK-15923:
--

User 'Sherry302' has created a pull request for this issue:
https://github.com/apache/spark/pull/14163

> Spark Application rest api returns "no such app: "
> -
>
> Key: SPARK-15923
> URL: https://issues.apache.org/jira/browse/SPARK-15923
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Env : secure cluster
> Scenario:
> * Run SparkPi application in yarn-client or yarn-cluster mode
> * After the application finishes, check the Spark HS REST API to get details like 
> jobs / executors, etc. 
> {code}
> http://:18080/api/v1/applications/application_1465778870517_0001/1/executors{code}
>  
> The REST API returns HTTP code 404 and prints "HTTP Data: no such app: 
> application_1465778870517_0001"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-15923) Spark Application rest api returns "no such app: "

2016-07-12 Thread Weiqing Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiqing Yang reopened SPARK-15923:
--

Reopening the JIRA to update monitoring.md.

> Spark Application rest api returns "no such app: "
> -
>
> Key: SPARK-15923
> URL: https://issues.apache.org/jira/browse/SPARK-15923
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Env : secure cluster
> Scenario:
> * Run SparkPi application in yarn-client or yarn-cluster mode
> * After the application finishes, check the Spark HS REST API to get details like 
> jobs / executors, etc. 
> {code}
> http://:18080/api/v1/applications/application_1465778870517_0001/1/executors{code}
>  
> The REST API returns HTTP code 404 and prints "HTTP Data: no such app: 
> application_1465778870517_0001"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15705) Spark won't read ORC schema from metastore for partitioned tables

2016-07-12 Thread Nic Eggert (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373786#comment-15373786
 ] 

Nic Eggert commented on SPARK-15705:


Found a work-around: Set spark.sql.hive.convertMetastoreOrc=false.
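A sketch of applying that work-around from a Spark 2.0 session; the config key is the one cited above, but depending on the deployment it may need to be supplied at submit time (e.g. via {{--conf}}) rather than set after the session is created.

{code}
// Work-around sketch: fall back to the Hive SerDe path instead of Spark's
// native ORC reader, so the metastore column names are used.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.table("default.test").printSchema()
{code}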

> Spark won't read ORC schema from metastore for partitioned tables
> -
>
> Key: SPARK-15705
> URL: https://issues.apache.org/jira/browse/SPARK-15705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: HDP 2.3.4 (Hive 1.2.1, Hadoop 2.7.1)
>Reporter: Nic Eggert
>Priority: Critical
>
> Spark does not seem to read the schema from the Hive metastore for 
> partitioned tables stored as ORC files. It appears to read the schema from 
> the files themselves, which, if they were created with Hive, does not match 
> the metastore schema (at least not before before Hive 2.0, see HIVE-4243). To 
> reproduce:
> In Hive:
> {code}
> hive> create table default.test (id BIGINT, name STRING) partitioned by 
> (state STRING) stored as orc;
> hive> insert into table default.test partition (state="CA") values (1, 
> "mike"), (2, "steve"), (3, "bill");
> {code}
> In Spark
> {code}
> scala> spark.table("default.test").printSchema
> {code}
> Expected result: Spark should preserve the column names that were defined in 
> Hive.
> Actual Result:
> {code}
> root
>  |-- _col0: long (nullable = true)
>  |-- _col1: string (nullable = true)
>  |-- state: string (nullable = true)
> {code}
> Possibly related to SPARK-14959?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16504) UDAF should be typed

2016-07-12 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373769#comment-15373769
 ] 

Vladimir Feinberg commented on SPARK-16504:
---

FWIW, {{merge}} has type {{(MAB, Row): Unit}} instead of {{(MAB, MAB): Unit}}, or 
even more preferably {{(MAB, MAB): MAB}}, for some reason.

> UDAF should be typed
> 
>
> Key: SPARK-16504
> URL: https://issues.apache.org/jira/browse/SPARK-16504
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Vladimir Feinberg
>
> Currently, UDAFs can be implemented by using a generic 
> {{MutableAggregationBuffer}}. This type-less class requires the user to specify 
> the schema.
> If the user wants to create vector output from a UDAF, this requires 
> specifying an output schema with a VectorUDT(), which is only accessible 
> through a DeveloperApi.
> Since we would prefer not to expose VectorUDT, the only option would be to 
> resolve the user's inability to (legally) specify a schema containing a 
> VectorUDT the same way that we would do so for creating dataframes: by type 
> inference, just like createDataFrame does.
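A minimal sketch of the current untyped UDAF API being discussed, to make the {{merge}} signature from the comment above concrete; it implements a trivial double sum and is illustrative only.

{code}
// Untyped UDAF sketch: schemas are declared by hand, and merge takes
// (MutableAggregationBuffer, Row) => Unit, as noted in the comment above.
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class SumUdaf extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
  def evaluate(buffer: Row): Any = buffer.getDouble(0)
}
{code}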



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-972) PySpark's "cannot run multiple SparkContexts at once" message should give source locations

2016-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373766#comment-15373766
 ] 

Apache Spark commented on SPARK-972:


User 'jyotiska' has created a pull request for this issue:
https://github.com/apache/spark/pull/34

> PySpark's "cannot run multiple SparkContexts at once" message should give 
> source locations
> --
>
> Key: SPARK-972
> URL: https://issues.apache.org/jira/browse/SPARK-972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Josh Rosen
>Assignee: Jyotiska NK
>Priority: Minor
> Fix For: 1.0.0
>
>
> It can be difficult to debug PySpark's "Cannot run multiple SparkContexts at 
> once" error message if you're not sure where the first context is being 
> created; it would be helpful if the SparkContext class remembered the 
> line number/location where the active context was created and printed it in 
> the error message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13786) Pyspark ml.tuning support export/import

2016-07-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373715#comment-15373715
 ] 

Joseph K. Bradley commented on SPARK-13786:
---

[~yinxusen] [~yanboliang] Just pinging to let you know I'd like to focus on 
this as soon as 2.0 QA is done.  Thanks for waiting!

> Pyspark ml.tuning support export/import
> ---
>
> Key: SPARK-13786
> URL: https://issues.apache.org/jira/browse/SPARK-13786
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> This should follow whatever implementation is chosen for Pipeline (since 
> these are all meta-algorithms).
> Note this will also require persistence for Evaluators.  Hopefully that can 
> leverage the Java implementations; there is not a real need to make Python 
> Evaluators be MLWritable, as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16505) YARN shuffle service should throw errors when it fails to start

2016-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373654#comment-15373654
 ] 

Apache Spark commented on SPARK-16505:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14162

> YARN shuffle service should throw errors when it fails to start
> ---
>
> Key: SPARK-16505
> URL: https://issues.apache.org/jira/browse/SPARK-16505
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> Right now the YARN shuffle service will swallow errors that happen during 
> startup and just log them:
> {code}
> try {
>   blockHandler = new ExternalShuffleBlockHandler(transportConf, 
> registeredExecutorFile);
> } catch (Exception e) {
>   logger.error("Failed to initialize external shuffle service", e);
> }
> {code}
> This causes two undesirable things to happen:
> - because {{blockHandler}} will remain {{null}} when an error happens, every 
> request to the shuffle service will cause an NPE
> - because the NM is running, containers may be assigned to that host, only to 
> fail to register with the shuffle service.
> Example of the first:
> {noformat}
> 2016-05-25 15:01:12,198  ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
>   at 
> org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77)
>   at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
>   at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
>   at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
>   at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
>   at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> {noformat}
> Example of the second:
> {noformat}
> 16/05/25 15:01:12 INFO storage.BlockManager: Registering executor with local 
> external shuffle service.
> 16/05/25 15:01:12 ERROR client.TransportClient: Failed to send RPC 
> 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: 
> java.nio.channels.ClosedChannelException
> java.nio.channels.ClosedChannelException
> 16/05/25 15:01:12 ERROR storage.BlockManager: Failed to connect to external 
> shuffle server, will retry 2 more times after waiting 5 seconds...
> java.lang.RuntimeException: java.io.IOException: Failed to send RPC 
> 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: 
> java.nio.channels.ClosedChannelException
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:272)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16505) YARN shuffle service should throw errors when it fails to start

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16505:


Assignee: (was: Apache Spark)

> YARN shuffle service should throw errors when it fails to start
> ---
>
> Key: SPARK-16505
> URL: https://issues.apache.org/jira/browse/SPARK-16505
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> Right now the YARN shuffle service will swallow errors that happen during 
> startup and just log them:
> {code}
> try {
>   blockHandler = new ExternalShuffleBlockHandler(transportConf, 
> registeredExecutorFile);
> } catch (Exception e) {
>   logger.error("Failed to initialize external shuffle service", e);
> }
> {code}
> This causes two undesirable things to happen:
> - because {{blockHandler}} will remain {{null}} when an error happens, every 
> request to the shuffle service will cause an NPE
> - because the NM is running, containers may be assigned to that host, only to 
> fail to register with the shuffle service.
> Example of the first:
> {noformat}
> 2016-05-25 15:01:12,198  ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
>   at 
> org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77)
>   at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
>   at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
>   at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
>   at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
>   at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> {noformat}
> Example of the second:
> {noformat}
> 16/05/25 15:01:12 INFO storage.BlockManager: Registering executor with local 
> external shuffle service.
> 16/05/25 15:01:12 ERROR client.TransportClient: Failed to send RPC 
> 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: 
> java.nio.channels.ClosedChannelException
> java.nio.channels.ClosedChannelException
> 16/05/25 15:01:12 ERROR storage.BlockManager: Failed to connect to external 
> shuffle server, will retry 2 more times after waiting 5 seconds...
> java.lang.RuntimeException: java.io.IOException: Failed to send RPC 
> 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: 
> java.nio.channels.ClosedChannelException
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:272)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16505) YARN shuffle service should throw errors when it fails to start

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16505:


Assignee: Apache Spark

> YARN shuffle service should throw errors when it fails to start
> ---
>
> Key: SPARK-16505
> URL: https://issues.apache.org/jira/browse/SPARK-16505
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> Right now the YARN shuffle service will swallow errors that happen during 
> startup and just log them:
> {code}
> try {
>   blockHandler = new ExternalShuffleBlockHandler(transportConf, 
> registeredExecutorFile);
> } catch (Exception e) {
>   logger.error("Failed to initialize external shuffle service", e);
> }
> {code}
> This causes two undesirable things to happen:
> - because {{blockHandler}} will remain {{null}} when an error happens, every 
> request to the shuffle service will cause an NPE
> - because the NM is running, containers may be assigned to that host, only to 
> fail to register with the shuffle service.
> Example of the first:
> {noformat}
> 2016-05-25 15:01:12,198  ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
>   at 
> org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77)
>   at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
>   at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
>   at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
>   at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
>   at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> {noformat}
> Example of the second:
> {noformat}
> 16/05/25 15:01:12 INFO storage.BlockManager: Registering executor with local 
> external shuffle service.
> 16/05/25 15:01:12 ERROR client.TransportClient: Failed to send RPC 
> 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: 
> java.nio.channels.ClosedChannelException
> java.nio.channels.ClosedChannelException
> 16/05/25 15:01:12 ERROR storage.BlockManager: Failed to connect to external 
> shuffle server, will retry 2 more times after waiting 5 seconds...
> java.lang.RuntimeException: java.io.IOException: Failed to send RPC 
> 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: 
> java.nio.channels.ClosedChannelException
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:272)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"

2016-07-12 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren closed SPARK-16437.
---

> SparkR read.df() from parquet got error: SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder"
> 
>
> Key: SPARK-16437
> URL: https://issues.apache.org/jira/browse/SPARK-16437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Xin Ren
>Priority: Minor
>
> build SparkR with command
> {code}
> build/mvn -DskipTests -Psparkr package
> {code}
> start SparkR console
> {code}
> ./bin/sparkR
> {code}
> then get error
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> >
> >
> > library(SparkR)
> >
> > df <- read.df("examples/src/main/resources/users.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> >
> >
> > head(df)
> 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>     name favorite_color favorite_numbers
> 1 Alyssa                    3, 9, 15, 20
> 2    Ben            red             NULL
> {code}
> Reference
> * it seems we need to add a lib from slf4j pointing to an older version: 
> http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder
> * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"

2016-07-12 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren resolved SPARK-16437.
-
Resolution: Not A Problem

> SparkR read.df() from parquet got error: SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder"
> 
>
> Key: SPARK-16437
> URL: https://issues.apache.org/jira/browse/SPARK-16437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Xin Ren
>Priority: Minor
>
> build SparkR with command
> {code}
> build/mvn -DskipTests -Psparkr package
> {code}
> start SparkR console
> {code}
> ./bin/sparkR
> {code}
> then get error
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> >
> >
> > library(SparkR)
> >
> > df <- read.df("examples/src/main/resources/users.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> >
> >
> > head(df)
> 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>     name favorite_color favorite_numbers
> 1 Alyssa                    3, 9, 15, 20
> 2    Ben            red             NULL
> {code}
> Reference
> * it seems we need to add a lib from slf4j pointing to an older version: 
> http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder
> * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16506) Subsequent dataframe join don't work

2016-07-12 Thread Tiago Albineli Motta (JIRA)
Tiago Albineli Motta created SPARK-16506:


 Summary: Subsequent dataframe join don't work
 Key: SPARK-16506
 URL: https://issues.apache.org/jira/browse/SPARK-16506
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2
Reporter: Tiago Albineli Motta
Priority: Minor


Here is the example code:

{quote}
  import sql.implicits._

  val objs = sc.parallelize(Seq(("1", "um"), ("2", "dois"), ("3", 
"tres"))).toDF.selectExpr("_1 as id", "_2 as name")
  
  val rawj = sc.parallelize(Seq(("1", "2"),  ("1", "3"), ("2", "3"), ("2", 
"1"))).toDF.selectExpr("_1 as id1", "_2 as id2")
  
  val join1 = rawj.join(objs, objs("id") === rawj("id1"))
.withColumnRenamed("id", "anything")

  println("works...")
  val join2a = join1.join(objs, 'id2 === 'id )
  join2a.show()
  
  println("works...")
  val join2b = objs.join(join1, objs("id") === join1("id2"))
  join2b.show()
  
  println("do not works...")
  val join2c = join1.join(objs, join1("id2") === objs("id") )
  join2c.show()
{quote}

The first two joins work, but the last one gave me this error:

{quote}
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
attribute(s) id#2 missing from anything#8,name#14,name#3,id1#6,id2#7,id#13 in 
operator !Join Inner, Some((id2#7 = id#2));
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
{quote}

Without the first column rename, the error happens silently since the join 
comes back empty:

{quote}
  import sql.implicits._

  val objs = sc.parallelize(Seq(("1", "um"), ("2", "dois"), ("3", 
"tres"))).toDF.selectExpr("_1 as id", "_2 as name")
  
  val rawj = sc.parallelize(Seq(("1", "2"),  ("1", "3"), ("2", "3"), ("2", 
"1"))).toDF.selectExpr("_1 as id1", "_2 as id2")
  
  val join1 = rawj.join(objs, objs("id") === rawj("id1"))
  
  println("do not works...")
  val join2c = join1.join(objs, join1("id2") === objs("id") )
  join2c.show()
{quote}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16505) YARN shuffle service should throw errors when it fails to start

2016-07-12 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-16505:
--

 Summary: YARN shuffle service should throw errors when it fails to 
start
 Key: SPARK-16505
 URL: https://issues.apache.org/jira/browse/SPARK-16505
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.0.0
Reporter: Marcelo Vanzin


Right now the YARN shuffle service will swallow errors that happen during 
startup and just log them:

{code}
try {
  blockHandler = new ExternalShuffleBlockHandler(transportConf, 
registeredExecutorFile);
} catch (Exception e) {
  logger.error("Failed to initialize external shuffle service", e);
}
{code}

This causes two undesirable things to happen:

- because {{blockHandler}} will remain {{null}} when an error happens, every 
request to the shuffle service will cause an NPE
- because the NM is running, containers may be assigned to that host, only to 
fail to register with the shuffle service.

Example of the first:

{noformat}
2016-05-25 15:01:12,198  ERROR org.apache.spark.network.TransportContext: Error 
while initializing Netty pipeline
java.lang.NullPointerException
at 
org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77)
at 
org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
at 
org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
at 
org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
at 
io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
{noformat}

Example of the second:

{noformat}
16/05/25 15:01:12 INFO storage.BlockManager: Registering executor with local 
external shuffle service.
16/05/25 15:01:12 ERROR client.TransportClient: Failed to send RPC 
5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: 
java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
16/05/25 15:01:12 ERROR storage.BlockManager: Failed to connect to external 
shuffle server, will retry 2 more times after waiting 5 seconds...
java.lang.RuntimeException: java.io.IOException: Failed to send RPC 
5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: 
java.nio.channels.ClosedChannelException
at 
org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
at 
org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:272)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16502) Update deprecated method "ParquetFileReader" from parquet

2016-07-12 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren closed SPARK-16502.
---
Resolution: Invalid

> Update deprecated method "ParquetFileReader" from parquet
> --
>
> Key: SPARK-16502
> URL: https://issues.apache.org/jira/browse/SPARK-16502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xin Ren
>
> While compiling the code, I got the deprecation warnings below. We need to update the 
> method invocation.
> {code}
> /Users/renxin/workspace/spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
> Warning:(140, 19) java: 
> ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.util.List,java.util.List)
>  in org.apache.parquet.hadoop.ParquetFileReader has been deprecated
> Warning:(204, 19) java: 
> ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.util.List,java.util.List)
>  in org.apache.parquet.hadoop.ParquetFileReader has been deprecated
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16048) spark-shell unresponsive after "FetchFailedException: java.lang.UnsupportedOperationException: Unsupported shuffle manager" with YARN and spark.shuffle.service.enabled

2016-07-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16048.

Resolution: Duplicate

This was fixed by SPARK-14731.

> spark-shell unresponsive after "FetchFailedException: 
> java.lang.UnsupportedOperationException: Unsupported shuffle manager" with 
> YARN and spark.shuffle.service.enabled
> ---
>
> Key: SPARK-16048
> URL: https://issues.apache.org/jira/browse/SPARK-16048
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Shell, YARN
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> With Spark on YARN and the external shuffle service enabled, a 
> {{java.lang.UnsupportedOperationException: Unsupported shuffle manager: 
> org.apache.spark.shuffle.sort.SortShuffleManager}} exception makes 
> spark-shell unresponsive.
> {code}
> $ YARN_CONF_DIR=hadoop-conf ./bin/spark-shell --master yarn -c 
> spark.shuffle.service.enabled=true --deploy-mode client -c 
> spark.scheduler.mode=FAIR --num-executors 2
> ...
> Spark context Web UI available at http://192.168.1.9:4040
> Spark context available as 'sc' (master = yarn, app id = 
> application_1466255040841_0002).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
>   /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sc.parallelize(0 to 4, 1).map(n => (n % 2, n)).groupByKey.map(n => { 
> Thread.sleep(5 * 1000); n }).count
> org.apache.spark.SparkException: Job aborted due to stage failure: 
> ResultStage 1 (count at <console>:25) has failed the maximum allowable number 
> of times: 4. Most recent failure reason: 
> org.apache.spark.shuffle.FetchFailedException: 
> java.lang.UnsupportedOperationException: Unsupported shuffle manager: 
> org.apache.spark.shuffle.sort.SortShuffleManager
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:191)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:159)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:107)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> 

[jira] [Updated] (SPARK-16502) Update deprecated method "ParquetFileReader" from parquet

2016-07-12 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-16502:

Description: 
During code compilation, I got the deprecation warnings below. The method 
invocations need to be updated.

{code}
/Users/renxin/workspace/spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
Warning:(140, 19) java: 
ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.util.List,java.util.List)
 in org.apache.parquet.hadoop.ParquetFileReader has been deprecated

Warning:(204, 19) java: 
ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.util.List,java.util.List)
 in org.apache.parquet.hadoop.ParquetFileReader has been deprecated
{code}

  was:
During code compilation, I got the deprecation warnings below. The method 
invocations need to be updated.

{code}
/Users/renxin/workspace/spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
Warning:(140, 19) java: 
ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.util.List,java.util.List)
 in org.apache.parquet.hadoop.ParquetFileReader has been deprecated
Warning:(204, 19) java: 
ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.util.List,java.util.List)
 in org.apache.parquet.hadoop.ParquetFileReader has been deprecated
{code}


> Update deprecated method "ParquetFileReader" from parquet
> --
>
> Key: SPARK-16502
> URL: https://issues.apache.org/jira/browse/SPARK-16502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xin Ren
>
> During code compilation, I got the deprecation warnings below. The method 
> invocations need to be updated.
> {code}
> /Users/renxin/workspace/spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
> Warning:(140, 19) java: 
> ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.util.List,java.util.List)
>  in org.apache.parquet.hadoop.ParquetFileReader has been deprecated
> Warning:(204, 19) java: 
> ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.util.List,java.util.List)
>  in org.apache.parquet.hadoop.ParquetFileReader has been deprecated
> {code}






[jira] [Updated] (SPARK-16502) Update deprecated method "ParquetFileReader" from parquet

2016-07-12 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-16502:

Description: 
During code compilation, I got the deprecation warnings below. The method 
invocations need to be updated.

{code}
/Users/renxin/workspace/spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
Warning:(140, 19) java: 
ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.util.List,java.util.List)
 in org.apache.parquet.hadoop.ParquetFileReader has been deprecated
Warning:(204, 19) java: 
ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.util.List,java.util.List)
 in org.apache.parquet.hadoop.ParquetFileReader has been deprecated
{code}

  was:
During code compilation, I got the deprecation warnings below. The method 
invocations need to be updated.

{code}
/Users/quickmobile/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
Warning:(448, 28) method listType in object ConversionPatterns is deprecated: 
see corresponding Javadoc for more information.
ConversionPatterns.listType(
   ^
Warning:(464, 28) method listType in object ConversionPatterns is deprecated: 
see corresponding Javadoc for more information.
ConversionPatterns.listType(
   ^
{code}


> Update deprecated method "ParquetFileReader" from parquet
> --
>
> Key: SPARK-16502
> URL: https://issues.apache.org/jira/browse/SPARK-16502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xin Ren
>
> During code compilation, I got the deprecation warnings below. The method 
> invocations need to be updated.
> {code}
> /Users/renxin/workspace/spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
> Warning:(140, 19) java: 
> ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.util.List,java.util.List)
>  in org.apache.parquet.hadoop.ParquetFileReader has been deprecated
> Warning:(204, 19) java: 
> ParquetFileReader(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.util.List,java.util.List)
>  in org.apache.parquet.hadoop.ParquetFileReader has been deprecated
> {code}






[jira] [Assigned] (SPARK-16502) Update deprecated method "ParquetFileReader" from parquet

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16502:


Assignee: Apache Spark

> Update deprecated method "ParquetFileReader" from parquet
> --
>
> Key: SPARK-16502
> URL: https://issues.apache.org/jira/browse/SPARK-16502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xin Ren
>Assignee: Apache Spark
>
> During code compilation, I got the deprecation warnings below. The method 
> invocations need to be updated.
> {code}
> /Users/quickmobile/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
> Warning:(448, 28) method listType in object ConversionPatterns is deprecated: 
> see corresponding Javadoc for more information.
> ConversionPatterns.listType(
>^
> Warning:(464, 28) method listType in object ConversionPatterns is deprecated: 
> see corresponding Javadoc for more information.
> ConversionPatterns.listType(
>^
> {code}






[jira] [Commented] (SPARK-16502) Update deprecated method "ParquetFileReader" from parquet

2016-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373591#comment-15373591
 ] 

Apache Spark commented on SPARK-16502:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/14160

> Update deprecated method "ParquetFileReader" from parquet
> --
>
> Key: SPARK-16502
> URL: https://issues.apache.org/jira/browse/SPARK-16502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xin Ren
>
> During code compilation, I got the deprecation warnings below. The method 
> invocations need to be updated.
> {code}
> /Users/quickmobile/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
> Warning:(448, 28) method listType in object ConversionPatterns is deprecated: 
> see corresponding Javadoc for more information.
> ConversionPatterns.listType(
>^
> Warning:(464, 28) method listType in object ConversionPatterns is deprecated: 
> see corresponding Javadoc for more information.
> ConversionPatterns.listType(
>^
> {code}






[jira] [Assigned] (SPARK-16502) Update deprecated method "ParquetFileReader" from parquet

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16502:


Assignee: (was: Apache Spark)

> Update deprecated method "ParquetFileReader" from parquet
> --
>
> Key: SPARK-16502
> URL: https://issues.apache.org/jira/browse/SPARK-16502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xin Ren
>
> During code compilation, I got the deprecation warnings below. The method 
> invocations need to be updated.
> {code}
> /Users/quickmobile/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
> Warning:(448, 28) method listType in object ConversionPatterns is deprecated: 
> see corresponding Javadoc for more information.
> ConversionPatterns.listType(
>^
> Warning:(464, 28) method listType in object ConversionPatterns is deprecated: 
> see corresponding Javadoc for more information.
> ConversionPatterns.listType(
>^
> {code}






[jira] [Resolved] (SPARK-16119) Support "DROP TABLE ... PURGE" if Hive client supports it

2016-07-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16119.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.1.0

> Support "DROP TABLE ... PURGE" if Hive client supports it
> -
>
> Key: SPARK-16119
> URL: https://issues.apache.org/jira/browse/SPARK-16119
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.1.0
>
>
> There's currently code that explicitly disables the "PURGE" flag when 
> dropping a table:
> {code}
> if (ctx.PURGE != null) {
>   throw operationNotAllowed("DROP TABLE ... PURGE", ctx)
> }
> {code}
> That flag is necessary in certain situations where the table data cannot be 
> moved to the trash (which will be tried unless "PURGE" is requested). If the 
> client supports it (Hive >= 0.14.0 according to the Hive docs), we should 
> allow that option to be defined.
> For non-Hive tables, as far as I can understand, "PURGE" is the current 
> behavior of Spark.
> The same limitation currently exists for "ALTER TABLE ... DROP PARTITION", so it 
> should probably also be covered by this change.
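
For illustration, this is the statement the ticket wants to accept once the underlying Hive client supports it; before the change, the parser rejects the PURGE clause with the operationNotAllowed error quoted above (a sketch, assuming a Hive-enabled SparkSession):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("drop-table-purge-sketch")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS tmp_purge_demo (id INT)")

// PURGE deletes the table data immediately instead of moving it to the trash,
// which matters when the trash location is unusable (Hive >= 0.14.0).
spark.sql("DROP TABLE tmp_purge_demo PURGE")
{code}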






[jira] [Created] (SPARK-16504) UDAF should be typed

2016-07-12 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16504:
-

 Summary: UDAF should be typed
 Key: SPARK-16504
 URL: https://issues.apache.org/jira/browse/SPARK-16504
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Vladimir Feinberg


Currently, UDAFs can be implemented by using a generic 
{{MutableAggregationBuffer}}. This type-less class requires the user to specify 
the schema.

If the user wants a UDAF to produce vector output, this requires specifying an 
output schema with a VectorUDT(), which is only accessible through a 
DeveloperApi.

Since we would prefer not to expose VectorUDT, the only option is to resolve the 
user's inability to (legally) specify a schema containing a VectorUDT the same 
way we do when creating DataFrames: by type inference, just like createDataFrame 
does.






[jira] [Created] (SPARK-16503) SparkSession should provide Spark version

2016-07-12 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-16503:
-

 Summary: SparkSession should provide Spark version
 Key: SPARK-16503
 URL: https://issues.apache.org/jira/browse/SPARK-16503
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Joseph K. Bradley


SparkContext.version is a useful field.
SparkSession should provide something similar.
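
A tiny sketch of the requested parity ({{spark.version}} below is the proposed field, mirroring SparkContext.version, not something guaranteed to exist yet):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("version-check").getOrCreate()

// Available today, via the underlying SparkContext:
println(spark.sparkContext.version)   // e.g. "2.0.0-SNAPSHOT"

// Proposed counterpart on SparkSession itself (name assumed here):
println(spark.version)
{code}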






[jira] [Commented] (SPARK-16502) Update deprecated method "ParquetFileReader" from parquet

2016-07-12 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373494#comment-15373494
 ] 

Xin Ren commented on SPARK-16502:
-

I'm working on it.

> Update deprecated method "ParquetFileReader" from parquet
> --
>
> Key: SPARK-16502
> URL: https://issues.apache.org/jira/browse/SPARK-16502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xin Ren
>
> During code compilation, I got the deprecation warnings below. The method 
> invocations need to be updated.
> {code}
> /Users/quickmobile/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
> Warning:(448, 28) method listType in object ConversionPatterns is deprecated: 
> see corresponding Javadoc for more information.
> ConversionPatterns.listType(
>^
> Warning:(464, 28) method listType in object ConversionPatterns is deprecated: 
> see corresponding Javadoc for more information.
> ConversionPatterns.listType(
>^
> {code}






[jira] [Created] (SPARK-16502) Update deprecated method "ParquetFileReader" from parquet

2016-07-12 Thread Xin Ren (JIRA)
Xin Ren created SPARK-16502:
---

 Summary: Update deprecated method "ParquetFileReader" from parquet
 Key: SPARK-16502
 URL: https://issues.apache.org/jira/browse/SPARK-16502
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Xin Ren


During code compilation, I got the deprecation warnings below. The method 
invocations need to be updated.

{code}
/Users/quickmobile/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
Warning:(448, 28) method listType in object ConversionPatterns is deprecated: 
see corresponding Javadoc for more information.
ConversionPatterns.listType(
   ^
Warning:(464, 28) method listType in object ConversionPatterns is deprecated: 
see corresponding Javadoc for more information.
ConversionPatterns.listType(
   ^
{code}






[jira] [Comment Edited] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"

2016-07-12 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15372210#comment-15372210
 ] 

Xin Ren edited comment on SPARK-16437 at 7/12/16 6:04 PM:
--

I worked on this for a couple of days, and I found it's not caused by Spark but by 
the parquet library "parquet-mr/parquet-hadoop".

I've debugged it step by step, and found this error comes from here: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L820

and after digging into "parquet-hadoop", it's most probably because this 
library is missing the slf4j binder:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L231

But it's technically not a bug, since Spark is using the latest versions of slf4j 
and parquet
{code}
<slf4j.version>1.7.16</slf4j.version>
<parquet.version>1.8.1</parquet.version>
{code}
and since 1.6 SLF4J defaults to a no-operation (NOP) logger implementation, 
so it should be OK.



was (Author: iamshrek):
I worked on this for a couple of days, and I found it's not caused by Spark but by 
the parquet library "parquet-mr/parquet-hadoop".

I've debugged it step by step, and found this error comes from here: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L820

and after digging into "parquet-hadoop", it's most probably because this 
library is missing the slf4j binder:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L231

But it's technically not a bug, since Spark is using 
{code}<slf4j.version>1.7.16</slf4j.version>{code}, and since 1.6 SLF4J 
defaults to a no-operation (NOP) logger implementation, so it should be OK.
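
One quick way to confirm the NOP-fallback diagnosis above (a sketch using the plain SLF4J API, run e.g. from spark-shell):

{code}
import org.slf4j.LoggerFactory

// With no StaticLoggerBinder on the classpath, SLF4J 1.6+ falls back to its
// no-op implementation and this prints org.slf4j.helpers.NOPLoggerFactory;
// with a real binder it prints the log4j/logback factory class instead.
println(LoggerFactory.getILoggerFactory.getClass.getName)
{code}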


> SparkR read.df() from parquet got error: SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder"
> 
>
> Key: SPARK-16437
> URL: https://issues.apache.org/jira/browse/SPARK-16437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Xin Ren
>Priority: Minor
>
> build SparkR with command
> {code}
> build/mvn -DskipTests -Psparkr package
> {code}
> start SparkR console
> {code}
> ./bin/sparkR
> {code}
> then get error
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> >
> >
> > library(SparkR)
> >
> > df <- read.df("examples/src/main/resources/users.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> >
> >
> > head(df)
> 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>     name favorite_color favorite_numbers
> 1 Alyssa           <NA>     3, 9, 15, 20
> 2    Ben            red             NULL
> {code}
> References:
> * it seems we need to add an slf4j lib pointing to an older version
> http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder
> * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder






[jira] [Commented] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"

2016-07-12 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373338#comment-15373338
 ] 

Xin Ren commented on SPARK-16437:
-

Hi [~srowen], could you please have a look here?

I think the SLF4J error in this ticket comes from the parquet library 
"parquet-mr/parquet-hadoop"; it is not Spark's problem.

But I still have some very tiny style changes. Should I submit the PR, or just 
ignore it since it is only a 2-line change?
https://github.com/apache/spark/compare/master...keypointt:SPARK-16437?expand=1

Thank you very much :)

> SparkR read.df() from parquet got error: SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder"
> 
>
> Key: SPARK-16437
> URL: https://issues.apache.org/jira/browse/SPARK-16437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Xin Ren
>Priority: Minor
>
> build SparkR with command
> {code}
> build/mvn -DskipTests -Psparkr package
> {code}
> start SparkR console
> {code}
> ./bin/sparkR
> {code}
> then get error
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> >
> >
> > library(SparkR)
> >
> > df <- read.df("examples/src/main/resources/users.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> >
> >
> > head(df)
> 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>     name favorite_color favorite_numbers
> 1 Alyssa           <NA>     3, 9, 15, 20
> 2    Ben            red             NULL
> {code}
> References:
> * it seems we need to add an slf4j lib pointing to an older version
> http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder
> * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder






[jira] [Commented] (SPARK-16490) Python mllib example for chi-squared feature selector

2016-07-12 Thread Ruben Janssen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1537#comment-1537
 ] 

Ruben Janssen commented on SPARK-16490:
---

Hi, I can work on this

> Python mllib example for chi-squared feature selector
> -
>
> Key: SPARK-16490
> URL: https://issues.apache.org/jira/browse/SPARK-16490
> Project: Spark
>  Issue Type: Task
>  Components: MLlib, PySpark
>Reporter: Shuai Lin
>Priority: Minor
>  Labels: starter
>
> There are Java & Scala examples for {{ChiSqSelector}} in mllib, but the 
> corresponding Python example is missing.






[jira] [Resolved] (SPARK-16405) Add metrics and source for external shuffle service

2016-07-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16405.
-
   Resolution: Fixed
 Assignee: YangyangLiu
Fix Version/s: 2.1.0

> Add metrics and source for external shuffle service
> ---
>
> Key: SPARK-16405
> URL: https://issues.apache.org/jira/browse/SPARK-16405
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: YangyangLiu
>Assignee: YangyangLiu
>  Labels: Metrics, Monitoring, features
> Fix For: 2.1.0
>
>
> ExternalShuffleService is essential for Spark. In order to monitor the 
> shuffle service better, we added various metrics to the shuffle service and an 
> ExternalShuffleServiceSource for the metrics system.
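
For context, a Spark metrics Source is just a named Codahale MetricRegistry that the metrics system polls. The sketch below is illustrative only: the class and metric names are placeholders rather than what this patch actually adds, and because Source is Spark-internal the code is assumed to live inside the Spark source tree:

{code}
package org.apache.spark.metrics.source

import com.codahale.metrics.{Counter, MetricRegistry}

// Placeholder names for illustration; not necessarily the metrics added by the patch.
private[spark] class ExternalShuffleServiceSourceSketch extends Source {
  override val sourceName: String = "shuffleService"
  override val metricRegistry: MetricRegistry = new MetricRegistry

  // Counters the shuffle service handlers could increment.
  val registeredExecutors: Counter = metricRegistry.counter("registeredExecutorsCount")
  val openedBlocks: Counter = metricRegistry.counter("openedBlocksCount")
}
{code}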






[jira] [Resolved] (SPARK-16414) Can not get user config when calling SparkHadoopUtil.get.conf in other places, such as DataSourceStrategy

2016-07-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16414.

   Resolution: Fixed
 Assignee: sharkd tu
Fix Version/s: 2.0.0

> Can not get user config when calling SparkHadoopUtil.get.conf in other 
> places, such as DataSourceStrategy
> -
>
> Key: SPARK-16414
> URL: https://issues.apache.org/jira/browse/SPARK-16414
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: sharkd tu
>Assignee: sharkd tu
>  Labels: easyfix
> Fix For: 2.0.0
>
>
> Fixes the bug "Can not get user config when calling SparkHadoopUtil.get.conf 
> in other places".
> The `SparkHadoopUtil` singleton was instantiated before `ApplicationMaster`, 
> so the `sparkConf` and `conf` in the `SparkHadoopUtil` singleton didn't 
> include the user's configuration. 
> See https://github.com/apache/spark/pull/14088
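
A small sketch of the reported symptom (the fs.custom.key setting is made up for illustration): a Hadoop property passed as spark.hadoop.* shows up in the SparkContext's Hadoop configuration but can be missing from the configuration held by the SparkHadoopUtil singleton, because the singleton was created before the user's configuration was merged in:

{code}
// Submitted with, e.g.:  spark-submit --conf spark.hadoop.fs.custom.key=custom-value ...
import org.apache.spark.SparkContext
import org.apache.spark.deploy.SparkHadoopUtil

def compareConfs(sc: SparkContext): Unit = {
  // Built from the live SparkConf, so it contains the user's spark.hadoop.* entries.
  val fromContext = sc.hadoopConfiguration.get("fs.custom.key")
  // Built when the SparkHadoopUtil singleton was instantiated, which (per this
  // bug) happened before the user configuration existed, so this can be null.
  val fromSingleton = SparkHadoopUtil.get.conf.get("fs.custom.key")
  println(s"SparkContext: $fromContext, SparkHadoopUtil: $fromSingleton")
}
{code}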






[jira] [Updated] (SPARK-15752) Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators

2016-07-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15752:

Fix Version/s: 2.1.0

> Optimize metadata only query that has an aggregate whose children are 
> deterministic project or filter operators
> ---
>
> Key: SPARK-15752
> URL: https://issues.apache.org/jira/browse/SPARK-15752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>Assignee: Lianhui Wang
> Fix For: 2.1.0
>
>
> When a query only uses metadata (for example, the partition key), it can return 
> results based on metadata without scanning files. Hive did this in HIVE-1003.
> Design document:
> https://docs.google.com/document/d/1Bmi4-PkTaBQ0HVaGjIqa3eA12toKX52QaiUyhb6WQiM/edit?usp=sharing
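
For illustration, this is the kind of query the optimization targets (table and column names are made up): the result depends only on partition metadata, so listing partitions is enough and no data files need to be scanned.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS logs (msg STRING) PARTITIONED BY (dt STRING)")

// Both answers are derivable from the partition keys alone.
spark.sql("SELECT DISTINCT dt FROM logs").show()
spark.sql("SELECT max(dt) FROM logs").show()
{code}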






[jira] [Resolved] (SPARK-16489) Test harness to prevent expression code generation from reusing variable names

2016-07-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16489.
-
  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0  (was: 2.1.0)

> Test harness to prevent expression code generation from reusing variable names
> --
>
> Key: SPARK-16489
> URL: https://issues.apache.org/jira/browse/SPARK-16489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> In code generation, it is incorrect for an expression to reuse variable names 
> across different instances of itself. As an example, SPARK-16488 reports a 
> bug in which the pmod expression reuses the variable name "r".
> This patch updates the ExpressionEvalHelper test harness to always project two 
> instances of the same expression, which will help us catch variable-reuse 
> problems in expression unit tests.






[jira] [Resolved] (SPARK-16491) Crc32 should use different variable names (not "checksum")

2016-07-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16491.
-
   Resolution: Fixed
 Assignee: Reynold Xin
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/14146

> Crc32 should use different variable names (not "checksum")
> --
>
> Key: SPARK-16491
> URL: https://issues.apache.org/jira/browse/SPARK-16491
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>







[jira] [Assigned] (SPARK-13547) Add SQL query in web UI's SQL Tab

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13547:


Assignee: Apache Spark

> Add SQL query in web UI's SQL Tab
> -
>
> Key: SPARK-13547
> URL: https://issues.apache.org/jira/browse/SPARK-13547
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> It would be nice to have the SQL query in the SQL tab






[jira] [Commented] (SPARK-13547) Add SQL query in web UI's SQL Tab

2016-07-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373235#comment-15373235
 ] 

Apache Spark commented on SPARK-13547:
--

User 'nblintao' has created a pull request for this issue:
https://github.com/apache/spark/pull/14158

> Add SQL query in web UI's SQL Tab
> -
>
> Key: SPARK-13547
> URL: https://issues.apache.org/jira/browse/SPARK-13547
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> It would be nice to have the SQL query in the SQL tab






[jira] [Assigned] (SPARK-13547) Add SQL query in web UI's SQL Tab

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13547:


Assignee: (was: Apache Spark)

> Add SQL query in web UI's SQL Tab
> -
>
> Key: SPARK-13547
> URL: https://issues.apache.org/jira/browse/SPARK-13547
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> It would be nice to have the SQL query in the SQL tab






[jira] [Resolved] (SPARK-15752) Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators

2016-07-12 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-15752.
---
Resolution: Fixed
  Assignee: Lianhui Wang

> Optimize metadata only query that has an aggregate whose children are 
> deterministic project or filter operators
> ---
>
> Key: SPARK-15752
> URL: https://issues.apache.org/jira/browse/SPARK-15752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>Assignee: Lianhui Wang
>
> When a query only uses metadata (for example, the partition key), it can return 
> results based on metadata without scanning files. Hive did this in HIVE-1003.
> Design document:
> https://docs.google.com/document/d/1Bmi4-PkTaBQ0HVaGjIqa3eA12toKX52QaiUyhb6WQiM/edit?usp=sharing






[jira] [Updated] (SPARK-16500) Add LBFGS training non-convergence warning for all ML algorithms

2016-07-12 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-16500:
---
Component/s: Optimizer

> Add LBFGS training non-convergence warning for all ML algorithms
> --
>
> Key: SPARK-16500
> URL: https://issues.apache.org/jira/browse/SPARK-16500
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, Optimizer
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> This is an extension task for SPARK-16470.






[jira] [Created] (SPARK-16501) spark.mesos.secret exposed on UI and command line

2016-07-12 Thread Eric Daniel (JIRA)
Eric Daniel created SPARK-16501:
---

 Summary: spark.mesos.secret exposed on UI and command line
 Key: SPARK-16501
 URL: https://issues.apache.org/jira/browse/SPARK-16501
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit, Web UI
Affects Versions: 1.6.2
Reporter: Eric Daniel


There are two related problems with spark.mesos.secret:

1) The web UI shows its value in the "environment" tab.
2) Passing it as a command-line option to spark-submit (or creating a 
SparkContext from Python, which has the effect of launching spark-submit) exposes 
it to "ps".

I'll be happy to submit a patch, but I could use some advice first.

The first problem is easy enough: just don't show that value in the UI.

For the second problem, I'm not sure what the best solution is. A 
"spark.mesos.secret-file" parameter would let the user store the secret in a 
non-world-readable file. Alternatively, the Mesos secret could be obtained from 
the environment, which other users don't have access to. Either solution would 
work in client mode, but I don't know whether they're workable in cluster mode.
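
A hedged sketch of the "keep it off the command line" idea: resolve the secret from an environment variable or a non-world-readable file and set it on the SparkConf programmatically (SPARK_MESOS_SECRET and the file path are made-up names; spark.mesos.secret is the real key). This avoids the "ps" exposure, but on its own it does not hide the value from the web UI's environment tab.

{code}
import scala.io.Source
import scala.util.Try

import org.apache.spark.SparkConf

// Made-up lookup locations for illustration.
val secret: Option[String] =
  sys.env.get("SPARK_MESOS_SECRET")
    .orElse(Try(Source.fromFile("/etc/spark/mesos.secret").mkString.trim).toOption)

val conf = new SparkConf()
secret.foreach(s => conf.set("spark.mesos.secret", s))
{code}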







[jira] [Commented] (SPARK-16500) Add LBFGS training non-convergence warning for all ML algorithms

2016-07-12 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373062#comment-15373062
 ] 

Weichen Xu commented on SPARK-16500:


OK, I'll keep it in mind for future tasks. Thanks!

> Add LBFGS training non-convergence warning for all ML algorithms
> --
>
> Key: SPARK-16500
> URL: https://issues.apache.org/jira/browse/SPARK-16500
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> This is an extension task for SPARK-16470.






[jira] [Assigned] (SPARK-16500) Add LBFGS training non-convergence warning for all ML algorithms

2016-07-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16500:


Assignee: Apache Spark

> Add LBFGS training non-convergence warning for all ML algorithms
> --
>
> Key: SPARK-16500
> URL: https://issues.apache.org/jira/browse/SPARK-16500
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> This is an extension task for SPARK-16470.





