[jira] [Updated] (SPARK-19040) I just want to report a bug

2016-12-30 Thread xiaowei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiaowei updated SPARK-19040:

Labels: patch  (was: )

> I just want to report a bug
> ---
>
> Key: SPARK-19040
> URL: https://issues.apache.org/jira/browse/SPARK-19040
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: xiaowei
>  Labels: patch
> Fix For: 2.1.0
>
>







[jira] [Created] (SPARK-19040) I just want to report a bug

2016-12-30 Thread xiaowei (JIRA)
xiaowei created SPARK-19040:
---

 Summary: I just want to report a bug
 Key: SPARK-19040
 URL: https://issues.apache.org/jira/browse/SPARK-19040
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: xiaowei
 Fix For: 2.1.0









[jira] [Comment Edited] (SPARK-19037) Run count(distinct x) from sub query found some errors

2016-12-30 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15788833#comment-15788833
 ] 

J.P Feng edited comment on SPARK-19037 at 12/31/16 3:31 AM:


Yes, my Spark version is 2.1, downloaded from the official site. Thanks for your 
reply. I will try the patch.


was (Author: snodawn):
yes, my spark version is version 2.1 downloaded from official site. Thanks for 
your reply. I will try the patch.

> Run count(distinct x) from sub query found some errors
> --
>
> Key: SPARK-19037
> URL: https://issues.apache.org/jira/browse/SPARK-19037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: spark 2.1.0, scala 2.11 
>Reporter: J.P Feng
>  Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) from a 
> subquery, some errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run this in hive-server2, I get the correct result.
> If I just execute select count(name) from (select * from mytest limit 10) as 
> a, I also get the right result.
> Besides, I found the same errors when I use distinct() and groupBy() with a 
> subquery.
> I think there may be some bugs when doing key-reduce jobs with a subquery.
> I will add the errors in a new comment.
> I also tested dropDuplicates in spark-shell:
> 1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show
> it throws some exceptions
> 2. spark.table("mytest").dropDuplicates("name").show
> it returns the right result
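
For quick reference, the failing and working calls described above can be collected into one spark-shell session. This is a hedged reproduction sketch that assumes a catalog table named mytest with a name column, as in the report:

{code}
// Hedged reproduction sketch (assumes a catalog table `mytest` with a `name` column).
// Reported to fail on 2.1.0: DISTINCT aggregate over a LIMIT subquery.
spark.sql("select count(distinct name) from (select * from mytest limit 10) as a").show()
// Reported to work: same subquery without DISTINCT.
spark.sql("select count(name) from (select * from mytest limit 10) as a").show()
// Also reported to fail: dropDuplicates on top of the LIMIT subquery.
spark.sql("select * from mytest limit 10").dropDuplicates("name").show()
// Reported to work: dropDuplicates directly on the table.
spark.table("mytest").dropDuplicates("name").show()
{code}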






[jira] [Commented] (SPARK-19037) Run count(distinct x) from sub query found some errors

2016-12-30 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15788833#comment-15788833
 ] 

J.P Feng commented on SPARK-19037:
--

Yes, my Spark version is version 2.1, downloaded from the official site. Thanks for 
your reply. I will try the patch.

> Run count(distinct x) from sub query found some errors
> --
>
> Key: SPARK-19037
> URL: https://issues.apache.org/jira/browse/SPARK-19037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: spark 2.1.0, scala 2.11 
>Reporter: J.P Feng
>  Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) from a 
> subquery, some errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run this in hive-server2, I get the correct result.
> If I just execute select count(name) from (select * from mytest limit 10) as 
> a, I also get the right result.
> Besides, I found the same errors when I use distinct() and groupBy() with a 
> subquery.
> I think there may be some bugs when doing key-reduce jobs with a subquery.
> I will add the errors in a new comment.
> I also tested dropDuplicates in spark-shell:
> 1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show
> it throws some exceptions
> 2. spark.table("mytest").dropDuplicates("name").show
> it returns the right result






[jira] [Created] (SPARK-19039) UDF ClosureCleaner bug when UDF, col applied in paste mode in REPL

2016-12-30 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-19039:
-

 Summary: UDF ClosureCleaner bug when UDF, col applied in paste 
mode in REPL
 Key: SPARK-19039
 URL: https://issues.apache.org/jira/browse/SPARK-19039
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 1.6.3
Reporter: Joseph K. Bradley


When I try this:
* Define a UDF
* Apply the UDF to get a Column
* Use the Column in a DataFrame

I see weird behavior in the spark-shell when using paste mode.

To reproduce this, paste this into the spark-shell:
{code}
import org.apache.spark.sql.functions._
val df = spark.createDataFrame(Seq(
  ("hi", 1),
  ("there", 2),
  ("the", 3),
  ("end", 4)
)).toDF("a", "b")

val myNumbers = Set(1,2,3)
val tmpUDF = udf { (n: Int) => myNumbers.contains(n) }

val rowHasMyNumber = tmpUDF($"b")
df.where(rowHasMyNumber).show()
{code}
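
As a point of comparison, here is a hedged sketch of a way to express the same filter without a UDF at all, using Column.isin on the df and myNumbers defined above; it sidesteps the ClosureCleaner rather than fixing it:

{code}
// Hedged alternative sketch: no UDF and no captured closure, so the ClosureCleaner
// is not involved. Uses the same `df` and `myNumbers` as the paste-mode snippet above.
df.where($"b".isin(myNumbers.toSeq: _*)).show()
{code}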






[jira] [Commented] (SPARK-18404) RPC call from executor to driver blocks when getting map output locations (Netty Only)

2016-12-30 Thread Jeffrey Shmain (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15788734#comment-15788734
 ] 

Jeffrey Shmain commented on SPARK-18404:


Yes, but with Akka the response is in milliseconds (a few seconds at most). With the 
Netty RPC it is sometimes 20-40 seconds, which is quite noticeable in something like 
a query engine where a response is expected in under 10 seconds.




> RPC call from executor to driver blocks when getting map output locations 
> (Netty Only)
> --
>
> Key: SPARK-18404
> URL: https://issues.apache.org/jira/browse/SPARK-18404
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jeffrey Shmain
>
> I compared an identical application run on Spark 1.5 and Spark 1.6 and noticed 
> that jobs became slower. After looking closer, I found that 75% of tasks 
> finished in the same time or better, and 25% had significant delays (unrelated 
> to data skew and GC).
> After more debugging I noticed that the executors block for a few seconds 
> (sometimes 25) on this call:
> https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L199
> logInfo("Doing the fetch; tracker endpoint = " + trackerEndpoint)
> // This try-finally prevents hangs due to timeouts:
> try {
>   val fetchedBytes = askTracker[Array[Byte]](GetMapOutputStatuses(shuffleId))
>   fetchedStatuses = MapOutputTracker.deserializeMapStatuses(fetchedBytes)
>   logInfo("Got the output locations")
> So the regression seems to be related to changing the default RPC 
> implementation from Akka to Netty.
> This was an application working with RDDs, submitting 10 concurrent queries 
> at a time.
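
To make the workload concrete, here is a hedged sketch (not the reporter's application) of the kind of pattern described: several shuffle jobs submitted concurrently from one driver, so many executors ask the driver for map output locations at roughly the same time:

{code}
// Hedged sketch of the described workload, not the original application: ten RDD
// shuffle jobs submitted concurrently from one spark-shell driver. Each reduce
// stage fetches map output locations from the driver via GetMapOutputStatuses.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val data = sc.parallelize(1 to 1000000, 200).map(i => (i % 100, i))
val jobs = (1 to 10).map { _ =>
  Future { data.reduceByKey(_ + _).count() }
}
jobs.foreach(job => Await.result(job, 10.minutes))
{code}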






[jira] [Updated] (SPARK-19038) Can't find keytab file when using Hive catalog

2016-12-30 Thread Peter Parente (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Parente updated SPARK-19038:
--
Description: 
h2. Stack Trace

{noformat}
Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 sdf = sql.createDataFrame(df)

/opt/spark2/python/pyspark/sql/context.py in createDataFrame(self, data, 
schema, samplingRatio, verifySchema)
307 Py4JJavaError: ...
308 """
--> 309 return self.sparkSession.createDataFrame(data, schema, 
samplingRatio, verifySchema)
310 
311 @since(1.3)

/opt/spark2/python/pyspark/sql/session.py in createDataFrame(self, data, 
schema, samplingRatio, verifySchema)
524 rdd, schema = self._createFromLocal(map(prepare, data), 
schema)
525 jrdd = 
self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
--> 526 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), 
schema.json())
527 df = DataFrame(jdf, self._wrapped)
528 df._schema = schema

/opt/spark2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in 
__call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/opt/spark2/python/pyspark/sql/utils.py in deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/opt/spark2/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o47.applySchemaToPythonRDD.
: org.apache.spark.SparkException: Keytab file: 
.keytab-f0b9b814-460e-4fa8-8e7d-029186b696c4 specified in spark.yarn.keytab 
does not exist
at 
org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:113)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at 
org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at 
org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at 
org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
at 
org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at 
org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at 
org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:666)
at 
org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:656)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)

[jira] [Updated] (SPARK-19038) Can't find keytab file when using Hive catalog

2016-12-30 Thread Peter Parente (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Parente updated SPARK-19038:
--
Environment: Hadoop / YARN 2.6, pyspark, yarn-client mode  (was: Hadoop / 
YARN 2.6, pyspark)

> Can't find keytab file when using Hive catalog
> --
>
> Key: SPARK-19038
> URL: https://issues.apache.org/jira/browse/SPARK-19038
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.2
> Environment: Hadoop / YARN 2.6, pyspark, yarn-client mode
>Reporter: Peter Parente
>  Labels: hive, kerberos, pyspark
>
> h2. Stack Trace
> {noformat}
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
> > 1 sdf = sql.createDataFrame(df)
> /opt/spark2/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio, verifySchema)
> 307 Py4JJavaError: ...
> 308 """
> --> 309 return self.sparkSession.createDataFrame(data, schema, 
> samplingRatio, verifySchema)
> 310 
> 311 @since(1.3)
> /opt/spark2/python/pyspark/sql/session.py in createDataFrame(self, data, 
> schema, samplingRatio, verifySchema)
> 524 rdd, schema = self._createFromLocal(map(prepare, data), 
> schema)
> 525 jrdd = 
> self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
> --> 526 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), 
> schema.json())
> 527 df = DataFrame(jdf, self._wrapped)
> 528 df._schema = schema
> /opt/spark2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134 
>1135 for temp_arg in temp_args:
> /opt/spark2/python/pyspark/sql/utils.py in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /opt/spark2/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o47.applySchemaToPythonRDD.
> : org.apache.spark.SparkException: Keytab file: 
> .keytab-f0b9b814-460e-4fa8-8e7d-029186b696c4 specified in spark.yarn.keytab 
> does not exist
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:113)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
>   at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
>   at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
>   at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
>   at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
>   at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
>   at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
>   at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
>   at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
>   at 
> org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
>   at 
> org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
>   at 
> org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at 
> org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:666)
>   at 
> org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:656)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0

[jira] [Created] (SPARK-19038) Can't find keytab file when using Hive catalog

2016-12-30 Thread Peter Parente (JIRA)
Peter Parente created SPARK-19038:
-

 Summary: Can't find keytab file when using Hive catalog
 Key: SPARK-19038
 URL: https://issues.apache.org/jira/browse/SPARK-19038
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.0.2
 Environment: Hadoop / YARN 2.6, pyspark
Reporter: Peter Parente


h2. Stack Trace

{noformat}
Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 sdf = sql.createDataFrame(df)

/opt/spark2/python/pyspark/sql/context.py in createDataFrame(self, data, 
schema, samplingRatio, verifySchema)
307 Py4JJavaError: ...
308 """
--> 309 return self.sparkSession.createDataFrame(data, schema, 
samplingRatio, verifySchema)
310 
311 @since(1.3)

/opt/spark2/python/pyspark/sql/session.py in createDataFrame(self, data, 
schema, samplingRatio, verifySchema)
524 rdd, schema = self._createFromLocal(map(prepare, data), 
schema)
525 jrdd = 
self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
--> 526 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), 
schema.json())
527 df = DataFrame(jdf, self._wrapped)
528 df._schema = schema

/opt/spark2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in 
__call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/opt/spark2/python/pyspark/sql/utils.py in deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/opt/spark2/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o47.applySchemaToPythonRDD.
: org.apache.spark.SparkException: Keytab file: 
.keytab-f0b9b814-460e-4fa8-8e7d-029186b696c4 specified in spark.yarn.keytab 
does not exist
at 
org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:113)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at 
org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at 
org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at 
org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
at 
org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at 
org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at 
org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:666)
at 
org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:656)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway

[jira] [Updated] (SPARK-19038) Can't find keytab file when using Hive catalog

2016-12-30 Thread Peter Parente (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Parente updated SPARK-19038:
--
Description: 
h2. Stack Trace

{noformat}
Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 sdf = sql.createDataFrame(df)

/opt/spark2/python/pyspark/sql/context.py in createDataFrame(self, data, 
schema, samplingRatio, verifySchema)
307 Py4JJavaError: ...
308 """
--> 309 return self.sparkSession.createDataFrame(data, schema, 
samplingRatio, verifySchema)
310 
311 @since(1.3)

/opt/spark2/python/pyspark/sql/session.py in createDataFrame(self, data, 
schema, samplingRatio, verifySchema)
524 rdd, schema = self._createFromLocal(map(prepare, data), 
schema)
525 jrdd = 
self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
--> 526 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), 
schema.json())
527 df = DataFrame(jdf, self._wrapped)
528 df._schema = schema

/opt/spark2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in 
__call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/opt/spark2/python/pyspark/sql/utils.py in deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/opt/spark2/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o47.applySchemaToPythonRDD.
: org.apache.spark.SparkException: Keytab file: 
.keytab-f0b9b814-460e-4fa8-8e7d-029186b696c4 specified in spark.yarn.keytab 
does not exist
at 
org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:113)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at 
org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at 
org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at 
org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
at 
org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at 
org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at 
org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:666)
at 
org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:656)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)

[jira] [Resolved] (SPARK-19016) Document scalable partition handling feature in the programming guide

2016-12-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-19016.

   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16424
[https://github.com/apache/spark/pull/16424]

> Document scalable partition handling feature in the programming guide
> -
>
> Key: SPARK-19016
> URL: https://issues.apache.org/jira/browse/SPARK-19016
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> Currently, we only mention this in the migration guide. We should also document 
> it in the programming guide.






[jira] [Assigned] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees in mllib.

2016-12-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14975:


Assignee: Apache Spark

> Predicted Probability per training instance for Gradient Boosted Trees in 
> mllib. 
> -
>
> Key: SPARK-14975
> URL: https://issues.apache.org/jira/browse/SPARK-14975
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Partha Talukder
>Assignee: Apache Spark
>Priority: Minor
>  Labels: mllib
>
> This function is available for Logistic Regression, SVM, etc. 
> (model.setThreshold()) but not for GBT. Compare the "gbm" package in R, 
> where we can specify the distribution and get predicted probabilities or 
> classes. I understand that this algorithm works for both classification and 
> regression. Is there any way to get predicted probabilities from GBT, or to 
> provide thresholds to the model?
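
Until such an API exists, one hedged workaround sketch for binary classification with mllib's GradientBoostedTreesModel is to combine the per-tree outputs manually. The logistic transform below assumes the model was trained with LogLoss; this is an assumption-labelled workaround, not an official API:

{code}
// Hedged workaround sketch, not an official API: approximate P(label = 1) for a
// binary-classification GradientBoostedTreesModel trained with LogLoss (assumption)
// by summing the weighted tree outputs and applying the logistic transform.
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

def predictProbability(model: GradientBoostedTreesModel, features: Vector): Double = {
  // Raw margin: weighted sum of the individual regression-tree outputs.
  val margin = model.trees.zip(model.treeWeights)
    .map { case (tree, weight) => tree.predict(features) * weight }
    .sum
  1.0 / (1.0 + math.exp(-2.0 * margin)) // LogLoss link; threshold at 0.5 for a class
}
{code}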






[jira] [Commented] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees in mllib.

2016-12-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15788293#comment-15788293
 ] 

Apache Spark commented on SPARK-14975:
--

User 'imatiach-msft' has created a pull request for this issue:
https://github.com/apache/spark/pull/16441

> Predicted Probability per training instance for Gradient Boosted Trees in 
> mllib. 
> -
>
> Key: SPARK-14975
> URL: https://issues.apache.org/jira/browse/SPARK-14975
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Partha Talukder
>Priority: Minor
>  Labels: mllib
>
> This function is available for Logistic Regression, SVM, etc. 
> (model.setThreshold()) but not for GBT. Compare the "gbm" package in R, 
> where we can specify the distribution and get predicted probabilities or 
> classes. I understand that this algorithm works for both classification and 
> regression. Is there any way to get predicted probabilities from GBT, or to 
> provide thresholds to the model?






[jira] [Assigned] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees in mllib.

2016-12-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14975:


Assignee: (was: Apache Spark)

> Predicted Probability per training instance for Gradient Boosted Trees in 
> mllib. 
> -
>
> Key: SPARK-14975
> URL: https://issues.apache.org/jira/browse/SPARK-14975
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Partha Talukder
>Priority: Minor
>  Labels: mllib
>
> This function is available for Logistic Regression, SVM, etc. 
> (model.setThreshold()) but not for GBT. Compare the "gbm" package in R, 
> where we can specify the distribution and get predicted probabilities or 
> classes. I understand that this algorithm works for both classification and 
> regression. Is there any way to get predicted probabilities from GBT, or to 
> provide thresholds to the model?






[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-12-30 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15788152#comment-15788152
 ] 

Shixiong Zhu commented on SPARK-17463:
--

[~sunil.rangwani] Could you provide the code that uses CollectionAccumulator?

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)

[jira] [Resolved] (SPARK-18123) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable the case sensitivity issue

2016-12-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18123.
-
   Resolution: Resolved
Fix Version/s: 2.2.0

> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable  the case 
> sensitivity issue
> --
>
> Key: SPARK-18123
> URL: https://issues.apache.org/jira/browse/SPARK-18123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Paul Wu
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> Blindly quoting every field name for inserts is the issue (lines 110-119, 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala):
> /**
>  * Returns a PreparedStatement that inserts a row into table via conn.
>  */
> def insertStatement(conn: Connection, table: String, rddSchema: StructType, 
> dialect: JdbcDialect): PreparedStatement = {
>   val columns = rddSchema.fields.map(x => 
> dialect.quoteIdentifier(x.name)).mkString(",")
>   val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
>   val sql = s"INSERT INTO $table ($columns) VALUES ($placeholders)"
>   conn.prepareStatement(sql)
> }
> This code causes the following issue (it does not happen in 1.6.x):
> I have an issue with the saveTable method in Spark 2.0/2.0.1. I tried to save a 
> dataset to an Oracle database, but the field names must be uppercase to 
> succeed. This is not expected behavior: if only the table names were quoted, 
> this utility would respect case sensitivity. The code below throws the 
> exception: Caused by: java.sql.SQLSyntaxErrorException: ORA-00904: 
> "DATETIME_gmt": invalid identifier.
> String detailSQL = "select CAST('2016-09-25 17:00:00' AS TIMESTAMP) 
> DATETIME_gmt, '1' NODEB";
> hc.sql("set spark.sql.caseSensitive=false");
> Dataset ds = hc.sql(detailSQL);
> ds.show();
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(ds, url, 
> detailTable, p);
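
Until the quoting behaviour changes, one hedged workaround sketch for the Oracle case above is to upper-case the column names before saving, so the quoted identifiers match Oracle's default upper-case folding; ds, url, detailTable and p are the names from the report, reused here in Scala:

{code}
// Hedged workaround sketch: rename columns to upper case so quoted identifiers such
// as "DATETIME_GMT" resolve in Oracle. `ds`, `url`, `detailTable` and `p` are the
// Dataset, JDBC URL, table name and connection properties from the report above.
val upperCased = ds.toDF(ds.columns.map(_.toUpperCase): _*)
upperCased.write.mode("append").jdbc(url, detailTable, p)
{code}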






[jira] [Updated] (SPARK-18123) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable the case sensitivity issue

2016-12-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18123:

Assignee: Dongjoon Hyun

> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable  the case 
> sensitivity issue
> --
>
> Key: SPARK-18123
> URL: https://issues.apache.org/jira/browse/SPARK-18123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Paul Wu
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> Blindly quoting every field name for inserts is the issue (lines 110-119, 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala):
> /**
>  * Returns a PreparedStatement that inserts a row into table via conn.
>  */
> def insertStatement(conn: Connection, table: String, rddSchema: StructType, 
> dialect: JdbcDialect): PreparedStatement = {
>   val columns = rddSchema.fields.map(x => 
> dialect.quoteIdentifier(x.name)).mkString(",")
>   val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
>   val sql = s"INSERT INTO $table ($columns) VALUES ($placeholders)"
>   conn.prepareStatement(sql)
> }
> This code causes the following issue (it does not happen in 1.6.x):
> I have an issue with the saveTable method in Spark 2.0/2.0.1. I tried to save a 
> dataset to an Oracle database, but the field names must be uppercase to 
> succeed. This is not expected behavior: if only the table names were quoted, 
> this utility would respect case sensitivity. The code below throws the 
> exception: Caused by: java.sql.SQLSyntaxErrorException: ORA-00904: 
> "DATETIME_gmt": invalid identifier.
> String detailSQL = "select CAST('2016-09-25 17:00:00' AS TIMESTAMP) 
> DATETIME_gmt, '1' NODEB";
> hc.sql("set spark.sql.caseSensitive=false");
> Dataset ds = hc.sql(detailSQL);
> ds.show();
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(ds, url, 
> detailTable, p);






[jira] [Resolved] (SPARK-19037) Run count(distinct x) from sub query found some errors

2016-12-30 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19037.
---
Resolution: Duplicate

> Run count(distinct x) from sub query found some errors
> --
>
> Key: SPARK-19037
> URL: https://issues.apache.org/jira/browse/SPARK-19037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: spark 2.1.0, scala 2.11 
>Reporter: J.P Feng
>  Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) from a 
> subquery, some errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run this in hive-server2, I get the correct result.
> If I just execute select count(name) from (select * from mytest limit 10) as 
> a, I also get the right result.
> Besides, I found the same errors when I use distinct() and groupBy() with a 
> subquery.
> I think there may be some bugs when doing key-reduce jobs with a subquery.
> I will add the errors in a new comment.
> I also tested dropDuplicates in spark-shell:
> 1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show
> it throws some exceptions
> 2. spark.table("mytest").dropDuplicates("name").show
> it returns the right result






[jira] [Commented] (SPARK-19037) Run count(distinct x) from sub query found some errors

2016-12-30 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15788001#comment-15788001
 ] 

Herman van Hovell commented on SPARK-19037:
---

NVM - this fix will be in the 2.1.1 release.

> Run count(distinct x) from sub query found some errors
> --
>
> Key: SPARK-19037
> URL: https://issues.apache.org/jira/browse/SPARK-19037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: spark 2.1.0, scala 2.11 
>Reporter: J.P Feng
>  Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) from a 
> subquery, some errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run this in hive-server2, I get the correct result.
> If I just execute select count(name) from (select * from mytest limit 10) as 
> a, I also get the right result.
> Besides, I found the same errors when I use distinct() and groupBy() with a 
> subquery.
> I think there may be some bugs when doing key-reduce jobs with a subquery.
> I will add the errors in a new comment.
> I also tested dropDuplicates in spark-shell:
> 1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show
> it throws some exceptions
> 2. spark.table("mytest").dropDuplicates("name").show
> it returns the right result






[jira] [Commented] (SPARK-19037) Run count(distinct x) from sub query found some errors

2016-12-30 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787995#comment-15787995
 ] 

Herman van Hovell commented on SPARK-19037:
---

This should be fixed in 2.1. See: 
https://github.com/apache/spark/commit/021952d5808715d0b9d6c716f8b67cd550f7982e

Are you using the official 2.1 release?

> Run count(distinct x) from sub query found some errors
> --
>
> Key: SPARK-19037
> URL: https://issues.apache.org/jira/browse/SPARK-19037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: spark 2.1.0, scala 2.11 
>Reporter: J.P Feng
>  Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) from a 
> subquery, some errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run this in hive-server2, I get the correct result.
> If I just execute select count(name) from (select * from mytest limit 10) as 
> a, I also get the right result.
> Besides, I found the same errors when I use distinct() and groupBy() with a 
> subquery.
> I think there may be some bugs when doing key-reduce jobs with a subquery.
> I will add the errors in a new comment.
> I also tested dropDuplicates in spark-shell:
> 1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show
> it throws some exceptions
> 2. spark.table("mytest").dropDuplicates("name").show
> it returns the right result






[jira] [Commented] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees in mllib.

2016-12-30 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787910#comment-15787910
 ] 

Ilya Matiach commented on SPARK-14975:
--

I can take a look at this issue. It looks like GBT needs to be fixed in both ml 
and mllib to output probabilities.

> Predicted Probability per training instance for Gradient Boosted Trees in 
> mllib. 
> -
>
> Key: SPARK-14975
> URL: https://issues.apache.org/jira/browse/SPARK-14975
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Partha Talukder
>Priority: Minor
>  Labels: mllib
>
> This function is available for Logistic Regression, SVM, etc. 
> (model.setThreshold()) but not for GBT. Compare the "gbm" package in R, 
> where we can specify the distribution and get predicted probabilities or 
> classes. I understand that this algorithm works for both classification and 
> regression. Is there any way to get predicted probabilities from GBT, or to 
> provide thresholds to the model?






[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-12-30 Thread Sunil Rangwani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787809#comment-15787809
 ] 

Sunil Rangwani commented on SPARK-17463:


Hi,
I get the same error with Spark 2.0.2 on EMR 5.2.0.
I am using a CollectionAccumulator and I get the same stack trace. It is hard to 
tell from the stack trace what the root cause is.
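
In case it helps triage, here is a minimal hedged sketch of the kind of CollectionAccumulator usage involved; the names and logic are assumptions, not the actual application code:

{code}
// Hedged sketch, not the reporter's code: tasks append to a CollectionAccumulator
// while the executor heartbeat thread serializes the accumulator's backing list,
// which is where the ConcurrentModificationException in the trace below is thrown.
val badRows = spark.sparkContext.collectionAccumulator[String]("badRows")

spark.sparkContext.parallelize(1 to 1000000, 100).foreach { i =>
  if (i % 1000 == 0) badRows.add(s"suspicious value: $i")
}
{code}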

{noformat}
16/12/30 13:49:11 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(1,[Lscala.Tuple2;@63f6e6d6,BlockManagerId(1, , 
45163))] in 3 attempts
org.apache.spark.SparkException: Exception thrown in awaitResult
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList.writeObject(ArrayList.java:766)
at sun.reflect.GeneratedMethodAccessor55.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
at 
java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java

[jira] [Commented] (SPARK-18781) Allow MatrixFactorizationModel.predict to skip user/product approximation count

2016-12-30 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787786#comment-15787786
 ] 

Liang-Chi Hsieh commented on SPARK-18781:
-

[~eyal] Do you have an estimate of the time cost of the approximate count? Since it 
is just an approximation, I think it should not take a big portion of the whole 
time of predict.

If the batches are large, as [~sowen] said, this overhead is trivial compared 
with the prediction. And I agree that this is what the predict method is meant for.

Even if the batches are small, I don't think the approximation will largely 
increase the time of running predict. I can't imagine that approximating the 
count of the given usersProducts costs much more time than the join + ddot in 
the prediction.


> Allow MatrixFactorizationModel.predict to skip user/product approximation 
> count
> ---
>
> Key: SPARK-18781
> URL: https://issues.apache.org/jira/browse/SPARK-18781
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Eyal Allweil
>Priority: Minor
>
> When 
> [MatrixFactorizationModel.predict|https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.html#predict(org.apache.spark.rdd.RDD)]
>  is used, it first calculates an approximation count of the users and 
> products in order to determine the most efficient way to proceed. In many 
> cases, the answer to this question is fixed (typically there are more users 
> than products by an order of magnitude) and this check is unnecessary. Adding 
> a parameter to this predict method to allow choosing the implementation (and 
> skipping the check) would be nice.
> It would be especially nice in development cycles when you are repeatedly 
> tweaking your model and the pairs you're predicting for, and this 
> approximate count represents a meaningful portion of the time you wait for 
> results.
> I can provide a pull request with this ability added that preserves the 
> existing behavior.
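
For reference, a hedged sketch of the bulk-predict call being discussed; the approximate count described in the ticket runs inside predict() before the join, and the names used here are illustrative:

{code}
// Hedged usage sketch: bulk prediction over explicit (user, product) pairs. The
// approximate count discussed above happens inside predict() before the join with
// the user/product factor RDDs.
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

def scorePairs(model: MatrixFactorizationModel,
               usersProducts: RDD[(Int, Int)]): RDD[Rating] =
  model.predict(usersProducts)
{code}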






[jira] [Commented] (SPARK-19037) Run count(distinct x) from sub query found some errors

2016-12-30 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787733#comment-15787733
 ] 

J.P Feng commented on SPARK-19037:
--

Error logs when running dropDuplicates with a sub-query in spark-shell:

scala> spark.sql("select * from mytest limit 10").dropDuplicates("name").show
120.073: [GC [PSYoungGen: 233234K->12801K(282112K)] 378713K->165495K(624128K), 
1.8045200 secs] [Times: user=6.52 sys=7.43, real=1.80 secs] 
[Stage 0:> (0 + 8) / 
16]124.182: [GC [PSYoungGen: 227841K->45026K(279552K)] 
380535K->202214K(621568K), 0.9970190 secs] [Times: user=2.87 sys=4.96, 
real=1.00 secs] 
[Stage 0:>(0 + 16) / 
16]16/12/30 21:58:21 ERROR (Executor): Exception in task 0.0 in stage 1.0 (TID 
16)
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/12/30 21:58:21 WARN (TaskSetManager): Lost task 0.0 in stage 1.0 (TID 16, 
localhost, executor driver): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

> Run count(distinct x) from sub query found some errors
> --
>
> Key: SPARK-19037
> URL: https://issues.apache.org/jira/browse/SPARK-19037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: spark 2.1.0, scala 2.11 
>Reporter: J.P Feng
>  Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) over a 
> subquery, errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run the same query in HiveServer2, I get the correct result.
> If I just execute select count(name) from (select * from mytest limit 10) as 
> a, I can al

[jira] [Comment Edited] (SPARK-19037) Run count(distinct x) from sub query found some errors

2016-12-30 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787713#comment-15787713
 ] 

J.P Feng edited comment on SPARK-19037 at 12/30/16 2:02 PM:


Error logs when running the query in spark-sql:

java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey1$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/12/30 21:49:36 WARN (TaskSetManager): Lost task 0.0 in stage 2.0 (TID 17, 
localhost, executor driver): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey1$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

16/12/30 21:49:36 ERROR (TaskSetManager): Task 0 in stage 2.0 failed 1 times; 
aborting job
16/12/30 21:49:36 INFO (TaskSchedulerImpl): Removed TaskSet 2.0, whose tasks 
have all completed, from pool default
16/12/30 21:49:36 INFO (TaskSchedulerImpl): Cancelling stage 2
16/12/30 21:49:36 INFO (DAGScheduler): ResultStage 2 (processCmd at 
CliDriver.java:376) failed in 0.221 s due to Job aborted due to stage failure: 
Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
2.0 (TID 17, localhost, executor driver): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey1$(Unknown
 Source)
at 
org.apache.spark.sql.cata

[jira] [Updated] (SPARK-19037) Run count(distinct x) from sub query found some errors

2016-12-30 Thread J.P Feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J.P Feng updated SPARK-19037:
-
Description: 
When I use spark-shell or spark-sql to execute count(distinct name) over a 
subquery, errors occur:

select count(distinct name) from (select * from mytest limit 10) as a

If I run the same query in HiveServer2, I get the correct result.

If I just execute select count(name) from (select * from mytest limit 10) as a, 
I also get the right result.

Besides, I see the same errors when I use distinct() or groupBy() with a 
subquery.

I think there may be a bug in key-reduce jobs over subqueries.

I will add the errors in a new comment.


I also tested dropDuplicates in spark-shell:

1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show

It throws exceptions.

2. spark.table("mytest").dropDuplicates("name").show

It returns the right result.
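
A compact spark-shell sketch of the behaviour described above, assuming a Hive table named mytest with a string column name already exists (both names come from this report):

{code}
// Works in the report: plain count over the sub-query.
spark.sql("select count(name) from (select * from mytest limit 10) as a").show()

// Fails on Spark 2.1.0 with a NullPointerException in generated code (see the
// stack traces attached in the comments):
spark.sql("select count(distinct name) from (select * from mytest limit 10) as a").show()

// DataFrame equivalents of the two tests above: the first throws, the second works.
spark.sql("select * from mytest limit 10").dropDuplicates("name").show()
spark.table("mytest").dropDuplicates("name").show()
{code}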



  was:
When I use spark-shell or spark-sql to execute count(distinct name) over a 
subquery, errors occur:

select count(distinct name) from (select * from mytest limit 10) as a

If I run the same query in HiveServer2, I get the correct result.

If I just execute select count(name) from (select * from mytest limit 10) as a, 
I also get the right result.

Besides, I see the same errors when I use max(), distinct(), or groupBy() with 
a subquery.

I think there may be a bug in key-reduce jobs over subqueries.

I will add the errors in a new comment.


I also tested dropDuplicates in spark-shell:

1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show

It throws exceptions.

2. spark.table("mytest").dropDuplicates("name").show

It returns the right result.




> Run count(distinct x) from sub query found some errors
> --
>
> Key: SPARK-19037
> URL: https://issues.apache.org/jira/browse/SPARK-19037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: spark 2.1.0, scala 2.11 
>Reporter: J.P Feng
>  Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) over a 
> subquery, errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run the same query in HiveServer2, I get the correct result.
> If I just execute select count(name) from (select * from mytest limit 10) as 
> a, I also get the right result.
> Besides, I see the same errors when I use distinct() or groupBy() with a 
> subquery.
> I think there may be a bug in key-reduce jobs over subqueries.
> I will add the errors in a new comment.
> I also tested dropDuplicates in spark-shell:
> 1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show
> It throws exceptions.
> 2. spark.table("mytest").dropDuplicates("name").show
> It returns the right result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19037) Run count(distinct x) from sub query found some errors

2016-12-30 Thread J.P Feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J.P Feng updated SPARK-19037:
-
Summary: Run count(distinct x) from sub query found some errors  (was: Run 
count(distinct name) from sub query found some errors)

> Run count(distinct x) from sub query found some errors
> --
>
> Key: SPARK-19037
> URL: https://issues.apache.org/jira/browse/SPARK-19037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: spark 2.1.0, scala 2.11 
>Reporter: J.P Feng
>  Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) over a 
> subquery, errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run the same query in HiveServer2, I get the correct result.
> If I just execute select count(name) from (select * from mytest limit 10) as 
> a, I also get the right result.
> Besides, I see the same errors when I use max(), distinct(), or groupBy() 
> with a subquery.
> I think there may be a bug in key-reduce jobs over subqueries.
> I will add the errors in a new comment.
> I also tested dropDuplicates in spark-shell:
> 1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show
> It throws exceptions.
> 2. spark.table("mytest").dropDuplicates("name").show
> It returns the right result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19037) Run count(distinct name) from sub query found some errors

2016-12-30 Thread J.P Feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J.P Feng updated SPARK-19037:
-
Description: 
When I use spark-shell or spark-sql to execute count(distinct name) over a 
subquery, errors occur:

select count(distinct name) from (select * from mytest limit 10) as a

If I run the same query in HiveServer2, I get the correct result.

If I just execute select count(name) from (select * from mytest limit 10) as a, 
I also get the right result.

Besides, I see the same errors when I use max(), distinct(), or groupBy() with 
a subquery.

I think there may be a bug in key-reduce jobs over subqueries.

I will add the errors in a new comment.


I also tested dropDuplicates in spark-shell:

1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show

It throws exceptions.

2. spark.table("mytest").dropDuplicates("name").show

It returns the right result.



  was:
When I use spark-shell or spark-sql to execute count(distinct name) over a 
subquery, errors occur:

select count(distinct name) from (select * from mytest limit 10) as a

If I run the same query in HiveServer2, I get the correct result.

If I just execute select count(name) from (select * from mytest limit 10) as a, 
I also get the right result.

Besides, I see the same errors when I use max(), distinct(), or groupBy() with 
a subquery.

I think there may be a bug in key-reduce jobs over subqueries.

I will add the errors in a new comment.


> Run count(distinct name) from sub query found some errors
> -
>
> Key: SPARK-19037
> URL: https://issues.apache.org/jira/browse/SPARK-19037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: spark 2.1.0, scala 2.11 
>Reporter: J.P Feng
>  Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) over a 
> subquery, errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run the same query in HiveServer2, I get the correct result.
> If I just execute select count(name) from (select * from mytest limit 10) as 
> a, I also get the right result.
> Besides, I see the same errors when I use max(), distinct(), or groupBy() 
> with a subquery.
> I think there may be a bug in key-reduce jobs over subqueries.
> I will add the errors in a new comment.
> I also tested dropDuplicates in spark-shell:
> 1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show
> It throws exceptions.
> 2. spark.table("mytest").dropDuplicates("name").show
> It returns the right result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19037) Run count(distinct name) from sub query found some errors

2016-12-30 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787713#comment-15787713
 ] 

J.P Feng edited comment on SPARK-19037 at 12/30/16 1:54 PM:


Error logs:

java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey1$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/12/30 21:49:36 WARN (TaskSetManager): Lost task 0.0 in stage 2.0 (TID 17, 
localhost, executor driver): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey1$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

16/12/30 21:49:36 ERROR (TaskSetManager): Task 0 in stage 2.0 failed 1 times; 
aborting job
16/12/30 21:49:36 INFO (TaskSchedulerImpl): Removed TaskSet 2.0, whose tasks 
have all completed, from pool default
16/12/30 21:49:36 INFO (TaskSchedulerImpl): Cancelling stage 2
16/12/30 21:49:36 INFO (DAGScheduler): ResultStage 2 (processCmd at 
CliDriver.java:376) failed in 0.221 s due to Job aborted due to stage failure: 
Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
2.0 (TID 17, localhost, executor driver): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey1$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedCla

[jira] [Created] (SPARK-19037) Run count(distinct name) from sub query found some errors

2016-12-30 Thread J.P Feng (JIRA)
J.P Feng created SPARK-19037:


 Summary: Run count(distinct name) from sub query found some errors
 Key: SPARK-19037
 URL: https://issues.apache.org/jira/browse/SPARK-19037
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL
Affects Versions: 2.1.0
 Environment: spark 2.1.0, scala 2.11 
Reporter: J.P Feng


When I use spark-shell or spark-sql to execute count(distinct name) over a 
subquery, errors occur:

select count(distinct name) from (select * from mytest limit 10) as a

If I run the same query in HiveServer2, I get the correct result.

If I just execute select count(name) from (select * from mytest limit 10) as a, 
I also get the right result.

Besides, I see the same errors when I use max(), distinct(), or groupBy() with 
a subquery.

I think there may be a bug in key-reduce jobs over subqueries.

I will add the errors in a new comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19037) Run count(distinct name) from sub query found some errors

2016-12-30 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787714#comment-15787714
 ] 

J.P Feng commented on SPARK-19037:
--

Error logs:

> Run count(distinct name) from sub query found some errors
> -
>
> Key: SPARK-19037
> URL: https://issues.apache.org/jira/browse/SPARK-19037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: spark 2.1.0, scala 2.11 
>Reporter: J.P Feng
>  Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) over a 
> subquery, errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run the same query in HiveServer2, I get the correct result.
> If I just execute select count(name) from (select * from mytest limit 10) as 
> a, I also get the right result.
> Besides, I see the same errors when I use max(), distinct(), or groupBy() 
> with a subquery.
> I think there may be a bug in key-reduce jobs over subqueries.
> I will add the errors in a new comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19037) Run count(distinct name) from sub query found some errors

2016-12-30 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787713#comment-15787713
 ] 

J.P Feng commented on SPARK-19037:
--

Error logs:

> Run count(distinct name) from sub query found some errors
> -
>
> Key: SPARK-19037
> URL: https://issues.apache.org/jira/browse/SPARK-19037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: spark 2.1.0, scala 2.11 
>Reporter: J.P Feng
>  Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) over a 
> subquery, errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run the same query in HiveServer2, I get the correct result.
> If I just execute select count(name) from (select * from mytest limit 10) as 
> a, I also get the right result.
> Besides, I see the same errors when I use max(), distinct(), or groupBy() 
> with a subquery.
> I think there may be a bug in key-reduce jobs over subqueries.
> I will add the errors in a new comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19037) Run count(distinct name) from sub query found some errors

2016-12-30 Thread J.P Feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J.P Feng updated SPARK-19037:
-
Comment: was deleted

(was: errors logs:)

> Run count(distinct name) from sub query found some errors
> -
>
> Key: SPARK-19037
> URL: https://issues.apache.org/jira/browse/SPARK-19037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: spark 2.1.0, scala 2.11 
>Reporter: J.P Feng
>  Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) over a 
> subquery, errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run the same query in HiveServer2, I get the correct result.
> If I just execute select count(name) from (select * from mytest limit 10) as 
> a, I also get the right result.
> Besides, I see the same errors when I use max(), distinct(), or groupBy() 
> with a subquery.
> I think there may be a bug in key-reduce jobs over subqueries.
> I will add the errors in a new comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787622#comment-15787622
 ] 

Dongjoon Hyun commented on SPARK-18857:
---

Hi, [~vishalagrwal].
I agree with you. This is an important problem.
At least I have made a PR as a first attempt. In any case, I hope this will be 
resolved soon.

> SparkSQL ThriftServer hangs while extracting huge data volumes in incremental 
> collect mode
> --
>
> Key: SPARK-18857
> URL: https://issues.apache.org/jira/browse/SPARK-18857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: vishal agrawal
> Attachments: GC-spark-1.6.3, GC-spark-2.0.2
>
>
> We are trying to run a SQL query on our Spark cluster and extract around 200 
> million records through the SparkSQL ThriftServer interface. The query works 
> fine on Spark 1.6.3; however, on Spark 2.0.2 the Thrift server hangs after 
> fetching data from a few partitions (we are using incremental collect mode 
> with 400 partitions). Per the documentation, the maximum memory taken up by 
> the Thrift server should be what is required by the biggest data partition. 
> But we observed that the Thrift server does not release the old partitions' 
> memory when GC occurs, even though it has moved on to fetching the next 
> partition's data, which is not the case with 1.6.3.
> On further investigation we found that SparkExecuteStatementOperation.scala 
> was modified for "[SPARK-16563][SQL] fix spark sql thrift server FetchResults 
> bug", and the result set iterator was duplicated to keep a reference to the 
> first set:
> +  val (itra, itrb) = iter.duplicate
> +  iterHeader = itra
> +  iter = itrb
> We suspect that this is why the memory is not cleared on GC. To confirm, we 
> created an iterator in our test class and fetched the data once without 
> duplicating and a second time with a duplicate. In the first case it ran fine 
> and fetched the entire data set, while in the second the driver hung after 
> fetching data from a few partitions.
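
A small, self-contained Scala sketch of the suspected mechanism (plain scala.collection code, not Spark's): Iterator.duplicate buffers every element consumed by one copy but not yet by the other, so holding on to the un-consumed header iterator keeps all fetched rows reachable. Names and sizes here are illustrative only.

{code}
object DuplicateRetentionSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for the result-set rows.
    val rows: Iterator[Array[Byte]] = Iterator.tabulate(100000)(_ => new Array[Byte](1024))

    val (header, data) = rows.duplicate   // analogous to iterHeader / iter above

    // Consume only the `data` copy, as the fetch loop does.
    var consumed = 0L
    while (data.hasNext) { data.next(); consumed += 1 }

    // `header` has not advanced, so the shared buffer behind duplicate still
    // references every row produced above; none of it is eligible for GC until
    // `header` is drained or dropped.
    println(s"consumed $consumed rows; header still pending: ${header.hasNext}")
  }
}
{code}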



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18857:


Assignee: (was: Apache Spark)

> SparkSQL ThriftServer hangs while extracting huge data volumes in incremental 
> collect mode
> --
>
> Key: SPARK-18857
> URL: https://issues.apache.org/jira/browse/SPARK-18857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: vishal agrawal
> Attachments: GC-spark-1.6.3, GC-spark-2.0.2
>
>
> We are trying to run a SQL query on our Spark cluster and extract around 200 
> million records through the SparkSQL ThriftServer interface. The query works 
> fine on Spark 1.6.3; however, on Spark 2.0.2 the Thrift server hangs after 
> fetching data from a few partitions (we are using incremental collect mode 
> with 400 partitions). Per the documentation, the maximum memory taken up by 
> the Thrift server should be what is required by the biggest data partition. 
> But we observed that the Thrift server does not release the old partitions' 
> memory when GC occurs, even though it has moved on to fetching the next 
> partition's data, which is not the case with 1.6.3.
> On further investigation we found that SparkExecuteStatementOperation.scala 
> was modified for "[SPARK-16563][SQL] fix spark sql thrift server FetchResults 
> bug", and the result set iterator was duplicated to keep a reference to the 
> first set:
> +  val (itra, itrb) = iter.duplicate
> +  iterHeader = itra
> +  iter = itrb
> We suspect that this is why the memory is not cleared on GC. To confirm, we 
> created an iterator in our test class and fetched the data once without 
> duplicating and a second time with a duplicate. In the first case it ran fine 
> and fetched the entire data set, while in the second the driver hung after 
> fetching data from a few partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787618#comment-15787618
 ] 

Apache Spark commented on SPARK-18857:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16440

> SparkSQL ThriftServer hangs while extracting huge data volumes in incremental 
> collect mode
> --
>
> Key: SPARK-18857
> URL: https://issues.apache.org/jira/browse/SPARK-18857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: vishal agrawal
> Attachments: GC-spark-1.6.3, GC-spark-2.0.2
>
>
> We are trying to run a SQL query on our Spark cluster and extract around 200 
> million records through the SparkSQL ThriftServer interface. The query works 
> fine on Spark 1.6.3; however, on Spark 2.0.2 the Thrift server hangs after 
> fetching data from a few partitions (we are using incremental collect mode 
> with 400 partitions). Per the documentation, the maximum memory taken up by 
> the Thrift server should be what is required by the biggest data partition. 
> But we observed that the Thrift server does not release the old partitions' 
> memory when GC occurs, even though it has moved on to fetching the next 
> partition's data, which is not the case with 1.6.3.
> On further investigation we found that SparkExecuteStatementOperation.scala 
> was modified for "[SPARK-16563][SQL] fix spark sql thrift server FetchResults 
> bug", and the result set iterator was duplicated to keep a reference to the 
> first set:
> +  val (itra, itrb) = iter.duplicate
> +  iterHeader = itra
> +  iter = itrb
> We suspect that this is why the memory is not cleared on GC. To confirm, we 
> created an iterator in our test class and fetched the data once without 
> duplicating and a second time with a duplicate. In the first case it ran fine 
> and fetched the entire data set, while in the second the driver hung after 
> fetching data from a few partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18857:


Assignee: Apache Spark

> SparkSQL ThriftServer hangs while extracting huge data volumes in incremental 
> collect mode
> --
>
> Key: SPARK-18857
> URL: https://issues.apache.org/jira/browse/SPARK-18857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: vishal agrawal
>Assignee: Apache Spark
> Attachments: GC-spark-1.6.3, GC-spark-2.0.2
>
>
> We are trying to run a SQL query on our Spark cluster and extract around 200 
> million records through the SparkSQL ThriftServer interface. The query works 
> fine on Spark 1.6.3; however, on Spark 2.0.2 the Thrift server hangs after 
> fetching data from a few partitions (we are using incremental collect mode 
> with 400 partitions). Per the documentation, the maximum memory taken up by 
> the Thrift server should be what is required by the biggest data partition. 
> But we observed that the Thrift server does not release the old partitions' 
> memory when GC occurs, even though it has moved on to fetching the next 
> partition's data, which is not the case with 1.6.3.
> On further investigation we found that SparkExecuteStatementOperation.scala 
> was modified for "[SPARK-16563][SQL] fix spark sql thrift server FetchResults 
> bug", and the result set iterator was duplicated to keep a reference to the 
> first set:
> +  val (itra, itrb) = iter.duplicate
> +  iterHeader = itra
> +  iter = itrb
> We suspect that this is why the memory is not cleared on GC. To confirm, we 
> created an iterator in our test class and fetched the data once without 
> duplicating and a second time with a duplicate. In the first case it ran fine 
> and fetched the entire data set, while in the second the driver hung after 
> fetching data from a few partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18808) ml.KMeansModel.transform is very inefficient

2016-12-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-18808:
-

Assignee: Sean Owen

> ml.KMeansModel.transform is very inefficient
> 
>
> Key: SPARK-18808
> URL: https://issues.apache.org/jira/browse/SPARK-18808
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Michel Lemay
>Assignee: Sean Owen
> Fix For: 2.2.0
>
>
> The function ml.KMeansModel.transform calls the parentModel.predict(features) 
> method on each row, which in turn normalizes all clusterCenters via 
> mllib.KMeansModel.clusterCentersWithNorm every time!
> This is a serious waste of resources!  In my profiling, 
> clusterCentersWithNorm represents 99% of the sampling!
> This should have been implemented with a broadcast variable, as is done in 
> other functions like computeCost.
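
A hedged sketch of the broadcast approach the reporter suggests, written against the public ml.KMeansModel API (clusterCenters) rather than Spark's internals; the "features" input column name and the helper name addPredictions are assumptions for illustration, not the actual fix that was merged.

{code}
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf

def addPredictions(spark: SparkSession, model: KMeansModel, df: DataFrame): DataFrame = {
  // Ship the cluster centers to the executors once instead of re-deriving
  // per-row state for every prediction.
  val centers = spark.sparkContext.broadcast(model.clusterCenters)

  val predictUdf = udf { features: Vector =>
    // Nearest center by squared Euclidean distance.
    var best = 0
    var bestDist = Double.MaxValue
    var i = 0
    while (i < centers.value.length) {
      val d = Vectors.sqdist(features, centers.value(i))
      if (d < bestDist) { bestDist = d; best = i }
      i += 1
    }
    best
  }
  df.withColumn("prediction", predictUdf(df("features")))
}
{code}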



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18808) ml.KMeansModel.transform is very inefficient

2016-12-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18808.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16328
[https://github.com/apache/spark/pull/16328]

> ml.KMeansModel.transform is very inefficient
> 
>
> Key: SPARK-18808
> URL: https://issues.apache.org/jira/browse/SPARK-18808
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Michel Lemay
> Fix For: 2.2.0
>
>
> The function ml.KMeansModel.transform calls the parentModel.predict(features) 
> method on each row, which in turn normalizes all clusterCenters via 
> mllib.KMeansModel.clusterCentersWithNorm every time!
> This is a serious waste of resources!  In my profiling, 
> clusterCentersWithNorm represents 99% of the sampling!
> This should have been implemented with a broadcast variable, as is done in 
> other functions like computeCost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18808) ml.KMeansModel.transform is very inefficient

2016-12-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18808:
--
Priority: Minor  (was: Major)

> ml.KMeansModel.transform is very inefficient
> 
>
> Key: SPARK-18808
> URL: https://issues.apache.org/jira/browse/SPARK-18808
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Michel Lemay
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.2.0
>
>
> The function ml.KMeansModel.transform calls the parentModel.predict(features) 
> method on each row, which in turn normalizes all clusterCenters via 
> mllib.KMeansModel.clusterCentersWithNorm every time!
> This is a serious waste of resources!  In my profiling, 
> clusterCentersWithNorm represents 99% of the sampling!
> This should have been implemented with a broadcast variable, as is done in 
> other functions like computeCost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787428#comment-15787428
 ] 

Dongjoon Hyun commented on SPARK-18857:
---

Thank you for testing and sharing that information!

> SparkSQL ThriftServer hangs while extracting huge data volumes in incremental 
> collect mode
> --
>
> Key: SPARK-18857
> URL: https://issues.apache.org/jira/browse/SPARK-18857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: vishal agrawal
> Attachments: GC-spark-1.6.3, GC-spark-2.0.2
>
>
> We are trying to run a SQL query on our Spark cluster and extract around 200 
> million records through the SparkSQL ThriftServer interface. The query works 
> fine on Spark 1.6.3; however, on Spark 2.0.2 the Thrift server hangs after 
> fetching data from a few partitions (we are using incremental collect mode 
> with 400 partitions). Per the documentation, the maximum memory taken up by 
> the Thrift server should be what is required by the biggest data partition. 
> But we observed that the Thrift server does not release the old partitions' 
> memory when GC occurs, even though it has moved on to fetching the next 
> partition's data, which is not the case with 1.6.3.
> On further investigation we found that SparkExecuteStatementOperation.scala 
> was modified for "[SPARK-16563][SQL] fix spark sql thrift server FetchResults 
> bug", and the result set iterator was duplicated to keep a reference to the 
> first set:
> +  val (itra, itrb) = iter.duplicate
> +  iterHeader = itra
> +  iter = itrb
> We suspect that this is why the memory is not cleared on GC. To confirm, we 
> created an iterator in our test class and fetched the data once without 
> duplicating and a second time with a duplicate. In the first case it ran fine 
> and fetched the entire data set, while in the second the driver hung after 
> fetching data from a few partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19034) Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2

2016-12-30 Thread Sanjay Dasgupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787426#comment-15787426
 ] 

Sanjay Dasgupta commented on SPARK-19034:
-

Yes, the SPARK_HOME was the issue.

Apologies for the confusion.

> Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2
> 
>
> Key: SPARK-19034
> URL: https://issues.apache.org/jira/browse/SPARK-19034
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Sanjay Dasgupta
>  Labels: distribution, download
>
> Download packages on 'https://spark.apache.org/downloads.html' have the right 
> name ( spark-2.1.0-bin-...) but contain the release 2.0.2 software



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19032) Non-deterministic results using aggregation first across multiple workers

2016-12-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19032.
---
Resolution: Not A Problem

Agree, I don't see a reason to expect that sort order is preserved by groupBy.
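
For readers hitting the same behaviour, a hedged spark-shell sketch (Spark 2.x syntax) of a deterministic way to get the highest-probability product per account, using max over a struct instead of relying on sort order surviving groupBy; column names follow the example in the description below.

{code}
import spark.implicits._
import org.apache.spark.sql.functions.{max, struct}

val df = Seq(
  ("a", "prod1", 0.6), ("a", "prod2", 0.4), ("a", "prod2", 0.4)
).toDF("account", "product", "probability")

// max over a struct orders by the first field (probability), then the second,
// so the result does not depend on partitioning or row order.
val top = df
  .groupBy($"account")
  .agg(max(struct($"probability", $"product")).as("top"))
  .select($"account", $"top.product".as("product"), $"top.probability".as("probability"))

top.show()
{code}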

> Non-deterministic results using aggregation first across multiple workers
> -
>
> Key: SPARK-19032
> URL: https://issues.apache.org/jira/browse/SPARK-19032
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.1
> Environment: Standalone Spark 1.6.1 cluster on EC2 with 2 worker 
> nodes, one executor each.
>Reporter: Harry Weppner
>
> We've come across a situation where results aggregated using {{first}} on a 
> sorted df are non-deterministic. The explain plan suggests a plausible 
> explanation, but it raises questions about the usefulness of these 
> aggregation functions in a Spark cluster.
> Here's a minimal example to reproduce:
> {code}
> val df = 
> sc.parallelize(Seq(("a","prod1",0.6),("a","prod2",0.4),("a","prod2",0.4),("a","prod2",0.4),("a","prod2",0.4))).toDF("account","product","probability")
> var p = 
> df.sort($"probability".desc).groupBy($"account").agg(first($"product"),first($"probability")).show();
> +---+++
> |account|first(product)()|first(probability)()|
> +---+++
> |  a|   prod1| 0.6|
> +---+++
> p: Unit = ()
> // Repeat and notice that result will occasionally be different
> +---+++
> |account|first(product)()|first(probability)()|
> +---+++
> |  a|   prod2| 0.4|
> +---+++
> p: Unit = ()
> scala> 
> df.sort($"probability".desc).groupBy($"account").agg(first($"product"),first($"probability")).explain(true);
> == Parsed Logical Plan ==
> 'Aggregate ['account], 
> [unresolvedalias('account),(first('product)(),mode=Complete,isDistinct=false) 
> AS 
> first(product)()#523,(first('probability)(),mode=Complete,isDistinct=false) 
> AS first(probability)()#524]
> +- Sort [probability#5 DESC], true
>+- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5]
>   +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at 
> rddToDataFrameHolder at :27
> == Analyzed Logical Plan ==
> account: string, first(product)(): string, first(probability)(): double
> Aggregate [account#3], 
> [account#3,(first(product#4)(),mode=Complete,isDistinct=false) AS 
> first(product)()#523,(first(probability#5)(),mode=Complete,isDistinct=false) 
> AS first(probability)()#524]
> +- Sort [probability#5 DESC], true
>+- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5]
>   +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at 
> rddToDataFrameHolder at :27
> == Optimized Logical Plan ==
> Aggregate [account#3], 
> [account#3,(first(product#4)(),mode=Complete,isDistinct=false) AS 
> first(product)()#523,(first(probability#5)(),mode=Complete,isDistinct=false) 
> AS first(probability)()#524]
> +- Sort [probability#5 DESC], true
>+- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5]
>   +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at 
> rddToDataFrameHolder at :27
> == Physical Plan ==
> SortBasedAggregate(key=[account#3], 
> functions=[(first(product#4)(),mode=Final,isDistinct=false),(first(probability#5)(),mode=Final,isDistinct=false)],
>  output=[account#3,first(product)()#523,first(probability)()#524])
> +- ConvertToSafe
>+- Sort [account#3 ASC], false, 0
>   +- TungstenExchange hashpartitioning(account#3,200), None
>  +- ConvertToUnsafe
> +- SortBasedAggregate(key=[account#3], 
> functions=[(first(product#4)(),mode=Partial,isDistinct=false),(first(probability#5)(),mode=Partial,isDistinct=false)],
>  output=[account#3,first#532,valueSet#533,first#534,valueSet#535])
>+- ConvertToSafe
>   +- Sort [account#3 ASC], false, 0
>  +- Sort [probability#5 DESC], true, 0
> +- ConvertToUnsafe
>+- Exchange rangepartitioning(probability#5 
> DESC,200), None
>   +- ConvertToSafe
>  +- Project [_1#0 AS account#3,_2#1 AS 
> product#4,_3#2 AS probability#5]
> +- Scan ExistingRDD[_1#0,_2#1,_3#2]
> {code}
> My working hypothesis is that after {{TungstenExchange hashpartitioning}} the 
>  _global_ sort order on {{probability}} is lost, leading to non-deterministic 
> results.
> If this hypothesis is valid, then how useful are aggregation functions s

[jira] [Commented] (SPARK-19035) rand() function in case when cause failed

2016-12-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787416#comment-15787416
 ] 

Sean Owen commented on SPARK-19035:
---

The difference in behavior does sound like at least a cosmetic bug, but it may 
not actually work in either case, and that could be correct. Why not group on b 
here? Grouping on a different non-deterministic value doesn't sound reasonable 
anyway.
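
A hedged sketch of that suggestion, assuming the yuanfeng1_a table from the report exists: evaluate the rand()-based expression once in a sub-query and group on the resulting column, so the outer GROUP BY no longer references a second, independently seeded rand() expression (the two different seeds are visible in the analyzer output quoted below).

{code}
spark.sql("""
  SELECT b, count(1) AS cnt
  FROM (
    SELECT CASE WHEN a = 1 THEN '1'
                ELSE concat(a, cast(rand() AS string)) END AS b
    FROM yuanfeng1_a
  ) t
  GROUP BY b
""").show()
{code}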

> rand() function in case when cause failed
> -
>
> Key: SPARK-19035
> URL: https://issues.apache.org/jira/browse/SPARK-19035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Feng Yuan
>
> *In this case:*
>select 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end b,count(1) 
>from 
>yuanfeng1_a 
>group by 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end;
> *Throw error:*
> Error in query: expression 'yuanfeng1_a.`a`' is neither present in the group 
> by, nor is it an aggregate function. Add to group by or wrap in first() (or 
> first_value) if you don't care which value you get.;;
> Aggregate [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE 
> concat(cast(a#2075 as string), cast(rand(519367429988179997) as string)) 
> END], [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE concat(cast(a#2075 
> as string), cast(rand(8090243936131101651) as string)) END AS b#2074]
> +- MetastoreRelation default, yuanfeng1_a
> select case when a=1 then 1 else rand() end b,count(1) from yuanfeng1_a group 
> by case when a=1 then rand() end also produces this error.
> *Notice*:
> If rand() is replaced with 1, it works.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19035) rand() function in case when cause failed

2016-12-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19035:
--
Comment: was deleted

(was: The core issue is why :
select case when a=1 then 1 else *1* end b,count(1) from yuanfeng1_a group by 
case when a=1 then *1* end work
but
select case when a=1 then 1 else *rand()* end b,count(1) from yuanfeng1_a group 
by case when a=1 then *rand()* end does not work.

This usage is correct,but rand() make trouble.)

> rand() function in case when cause failed
> -
>
> Key: SPARK-19035
> URL: https://issues.apache.org/jira/browse/SPARK-19035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Feng Yuan
>
> *In this case:*
>select 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end b,count(1) 
>from 
>yuanfeng1_a 
>group by 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end;
> *Throw error:*
> Error in query: expression 'yuanfeng1_a.`a`' is neither present in the group 
> by, nor is it an aggregate function. Add to group by or wrap in first() (or 
> first_value) if you don't care which value you get.;;
> Aggregate [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE 
> concat(cast(a#2075 as string), cast(rand(519367429988179997) as string)) 
> END], [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE concat(cast(a#2075 
> as string), cast(rand(8090243936131101651) as string)) END AS b#2074]
> +- MetastoreRelation default, yuanfeng1_a
> select case when a=1 then 1 else rand() end b,count(1) from yuanfeng1_a group 
> by case when a=1 then rand() end also produces this error.
> *Notice*:
> If rand() is replaced with 1, it works.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19034) Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2

2016-12-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787410#comment-15787410
 ] 

Dongjoon Hyun commented on SPARK-19034:
---

Hi, [~sanjay.dasgu...@gmail.com].

I tried the binary you mentioned. For me, it works correctly like the 
following. Could you check your `SPARK_HOME` environment variable?

{code}
spark-2.1.0-bin-hadoop2.4$ SPARK_HOME=$PWD bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
16/12/30 02:24:22 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/12/30 02:24:26 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
Spark context Web UI available at http://192.168.1.78:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1483093463544).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
  /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102)
Type in expressions to have them evaluated.
Type :help for more information.
{code}

> Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2
> 
>
> Key: SPARK-19034
> URL: https://issues.apache.org/jira/browse/SPARK-19034
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Sanjay Dasgupta
>  Labels: distribution, download
>
> Download packages on 'https://spark.apache.org/downloads.html' have the right 
> name ( spark-2.1.0-bin-...) but contain the release 2.0.2 software



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19034) Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2

2016-12-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787411#comment-15787411
 ] 

Dongjoon Hyun commented on SPARK-19034:
---

+1

> Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2
> 
>
> Key: SPARK-19034
> URL: https://issues.apache.org/jira/browse/SPARK-19034
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Sanjay Dasgupta
>  Labels: distribution, download
>
> Download packages on 'https://spark.apache.org/downloads.html' have the right 
> name ( spark-2.1.0-bin-...) but contain the release 2.0.2 software



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19035) rand() function in case when cause failed

2016-12-30 Thread Feng Yuan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Yuan updated SPARK-19035:
--
Comment: was deleted

(was: The core issue is why :
select case when a=1 then 1 else *1* end b,count(1) from yuanfeng1_a group by 
case when a=1 then *1* end work
but
select case when a=1 then 1 else *rand()* end b,count(1) from yuanfeng1_a group 
by case when a=1 then *rand()* end does not work.

This usage is correct,but rand() make trouble.)

> rand() function in case when cause failed
> -
>
> Key: SPARK-19035
> URL: https://issues.apache.org/jira/browse/SPARK-19035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Feng Yuan
>
> *In this case:*
>select 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end b,count(1) 
>from 
>yuanfeng1_a 
>group by 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end;
> *Throw error:*
> Error in query: expression 'yuanfeng1_a.`a`' is neither present in the group 
> by, nor is it an aggregate function. Add to group by or wrap in first() (or 
> first_value) if you don't care which value you get.;;
> Aggregate [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE 
> concat(cast(a#2075 as string), cast(rand(519367429988179997) as string)) 
> END], [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE concat(cast(a#2075 
> as string), cast(rand(8090243936131101651) as string)) END AS b#2074]
> +- MetastoreRelation default, yuanfeng1_a
> select case when a=1 then 1 else rand() end b,count(1) from yuanfeng1_a group 
> by case when a=1 then rand() end also produces this error.
> *Notice*:
> If rand() is replaced with 1, it works.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19034) Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2

2016-12-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787406#comment-15787406
 ] 

Sean Owen commented on SPARK-19034:
---

I don't see that:

{code}
srowen@instance-1:~/spark-2.1.0-bin-hadoop2.4$ ./bin/spark-shell 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
16/12/30 10:22:08 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/12/30 10:22:16 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
16/12/30 10:22:17 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException
16/12/30 10:22:17 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
Spark context Web UI available at http://10.240.0.2:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1483093329719).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
  /_/
 
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_111)
Type in expressions to have them evaluated.
Type :help for more information.
{code}

Maybe you're running a different spark-shell inadvertently?

> Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2
> 
>
> Key: SPARK-19034
> URL: https://issues.apache.org/jira/browse/SPARK-19034
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Sanjay Dasgupta
>  Labels: distribution, download
>
> Download packages on 'https://spark.apache.org/downloads.html' have the right 
> name ( spark-2.1.0-bin-...) but contain the release 2.0.2 software



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19035) rand() function in case when cause failed

2016-12-30 Thread Feng Yuan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787403#comment-15787403
 ] 

Feng Yuan commented on SPARK-19035:
---

The core issue is why
select case when a=1 then 1 else *1* end b, count(1) from yuanfeng1_a group by case when a=1 then *1* end
works, but
select case when a=1 then 1 else *rand()* end b, count(1) from yuanfeng1_a group by case when a=1 then *rand()* end
does not.

This usage is correct, but rand() makes trouble.

> rand() function in case when cause failed
> -
>
> Key: SPARK-19035
> URL: https://issues.apache.org/jira/browse/SPARK-19035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Feng Yuan
>
> *In this case:*
>select 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end b,count(1) 
>from 
>yuanfeng1_a 
>group by 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end;
> *Throw error:*
> Error in query: expression 'yuanfeng1_a.`a`' is neither present in the group 
> by, nor is it an aggregate function. Add to group by or wrap in first() (or 
> first_value) if you don't care which value you get.;;
> Aggregate [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE 
> concat(cast(a#2075 as string), cast(rand(519367429988179997) as string)) 
> END], [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE concat(cast(a#2075 
> as string), cast(rand(8090243936131101651) as string)) END AS b#2074]
> +- MetastoreRelation default, yuanfeng1_a
> select case when a=1 then 1 else rand() end b,count(1) from yuanfeng1_a group 
> by case when a=1 then rand() end also output this
> *Notice*:
> If replace rand() as 1,it work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18737) Serialization setting "spark.serializer" ignored in Spark 2.x

2016-12-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18737.
---
Resolution: Duplicate

Good find, that looks like the same issue. It looks like the resolution was 
indeed to disable the Kryo auto-pick for streaming from the Java API. It may 
still be that just registering some more classes works around it too.
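
For completeness, a minimal sketch of that class-registration workaround, assuming the application builds its own SparkConf; the registered classes below are placeholders for whatever record types actually flow through the stream:

{code}
import org.apache.spark.SparkConf

// Sketch of the workaround: register the payload classes explicitly so Kryo
// never has to emit or resolve an unregistered class ID for them.
val conf = new SparkConf()
  .setAppName("kryo-registration-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true") // fail fast instead of failing mid-job
  .registerKryoClasses(Array(
    classOf[Array[Byte]],   // placeholders: replace with the application's own types
    classOf[Array[String]]
  ))
{code}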

> Serialization setting "spark.serializer" ignored in Spark 2.x
> -
>
> Key: SPARK-18737
> URL: https://issues.apache.org/jira/browse/SPARK-18737
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Dr. Michael Menzel
>
> The following exception occurs although the JavaSerializer has been activated:
> 16/11/22 10:49:24 INFO TaskSetManager: Starting task 0.0 in stage 9.0 (TID 
> 77, ip-10-121-14-147.eu-central-1.compute.internal, partition 1, RACK_LOCAL, 
> 5621 bytes)
> 16/11/22 10:49:24 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching 
> task 77 on executor id: 2 hostname: 
> ip-10-121-14-147.eu-central-1.compute.internal.
> 16/11/22 10:49:24 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory 
> on ip-10-121-14-147.eu-central-1.compute.internal:45059 (size: 879.0 B, free: 
> 410.4 MB)
> 16/11/22 10:49:24 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 77, 
> ip-10-121-14-147.eu-central-1.compute.internal): 
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 13994
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at org.apache.spark.util.NextIterator.to(NextIterator.scala:21)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at org.apache.spark.util.NextIterator.toBuffer(NextIterator.scala:21)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at org.apache.spark.util.NextIterator.toArray(NextIterator.scala:21)
> at 
> org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:927)
> at 
> org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:927)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> The code runs perfectly with Spark 1.6.0. Since we moved to 2.0.0 and now 
> 2.0.1, we see the Kyro deserialization exception and over time the Spark 
> streaming job stops processing since too many tasks failed.
> Our action was to use conf.set("spark.serializer", 
> "org.apache.spark.serializer.JavaSerializer") and to disable Kryo class 
> registration with conf.set("spark.kryo.registrationRequired", false). We hope 
> to identify the root cause of the exception. 
> However, setting the serializer to JavaSerializer is oviously ignored by the 
> Spark-internals. Despite the setting we still see the exception printed in 
> the log and tasks fail. The occurence seems to be non-deterministic, but to 
> become more frequent over time.
> Several questions we could not answer during our troubleshooting:
> 1. How can the debug log for Kryo be enabled? -- We tried following the 
> minilog documentation, but no output can be found.

[jira] [Reopened] (SPARK-18737) Serialization setting "spark.serializer" ignored in Spark 2.x

2016-12-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-18737:
---

> Serialization setting "spark.serializer" ignored in Spark 2.x
> -
>
> Key: SPARK-18737
> URL: https://issues.apache.org/jira/browse/SPARK-18737
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Dr. Michael Menzel
>
> The following exception occurs although the JavaSerializer has been activated:
> 16/11/22 10:49:24 INFO TaskSetManager: Starting task 0.0 in stage 9.0 (TID 
> 77, ip-10-121-14-147.eu-central-1.compute.internal, partition 1, RACK_LOCAL, 
> 5621 bytes)
> 16/11/22 10:49:24 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching 
> task 77 on executor id: 2 hostname: 
> ip-10-121-14-147.eu-central-1.compute.internal.
> 16/11/22 10:49:24 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory 
> on ip-10-121-14-147.eu-central-1.compute.internal:45059 (size: 879.0 B, free: 
> 410.4 MB)
> 16/11/22 10:49:24 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 77, 
> ip-10-121-14-147.eu-central-1.compute.internal): 
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 13994
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at org.apache.spark.util.NextIterator.to(NextIterator.scala:21)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at org.apache.spark.util.NextIterator.toBuffer(NextIterator.scala:21)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at org.apache.spark.util.NextIterator.toArray(NextIterator.scala:21)
> at 
> org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:927)
> at 
> org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:927)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> The code runs perfectly with Spark 1.6.0. Since we moved to 2.0.0 and now 
> 2.0.1, we see the Kyro deserialization exception and over time the Spark 
> streaming job stops processing since too many tasks failed.
> Our action was to use conf.set("spark.serializer", 
> "org.apache.spark.serializer.JavaSerializer") and to disable Kryo class 
> registration with conf.set("spark.kryo.registrationRequired", false). We hope 
> to identify the root cause of the exception. 
> However, setting the serializer to JavaSerializer is oviously ignored by the 
> Spark-internals. Despite the setting we still see the exception printed in 
> the log and tasks fail. The occurence seems to be non-deterministic, but to 
> become more frequent over time.
> Several questions we could not answer during our troubleshooting:
> 1. How can the debug log for Kryo be enabled? -- We tried following the 
> minilog documentation, but no output can be found.
> 2. Is the serializer setting effective for Spark internal serializations? How 
> can the JavaSerialize be forced on internal serializations for worker to 
> driver communication?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-19034) Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2

2016-12-30 Thread Sanjay Dasgupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787390#comment-15787390
 ] 

Sanjay Dasgupta commented on SPARK-19034:
-

The "Direct download" link to the "Pre-built for Hadoop 2.4" package is the 
following:

http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.4.tgz

When I run the "spark-shell" from this package, it clearly announces itself as 
"version 2.0.2". Running "spark.version" in the REPL also produces "res0: 
String = 2.0.2".

> Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2
> 
>
> Key: SPARK-19034
> URL: https://issues.apache.org/jira/browse/SPARK-19034
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Sanjay Dasgupta
>  Labels: distribution, download
>
> Download packages on 'https://spark.apache.org/downloads.html' have the right 
> name ( spark-2.1.0-bin-...) but contain the release 2.0.2 software



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18933) Different log output between Terminal screen and stderr file

2016-12-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787331#comment-15787331
 ] 

Sean Owen commented on SPARK-18933:
---

stdout and stderr exist for all processes. You can redirect them as you like.

> Different log output between Terminal screen and stderr file
> 
>
> Key: SPARK-18933
> URL: https://issues.apache.org/jira/browse/SPARK-18933
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Documentation, Web UI
>Affects Versions: 1.6.3
> Environment: Yarn mode and standalone mode
>Reporter: Sean Wong
>   Original Estimate: 612h
>  Remaining Estimate: 612h
>
> First of all, I use the default log4j.properties in the Spark conf/
> But I found that the log output(e.g., INFO) is different between Terminal 
> screen and stderr File. Some INFO logs exist in both of them. Some INFO logs 
> exist in either of them. Why this happens? Is it supposed that the output 
> logs are same between the terminal screen and stderr file? 
> Then I did a Test. I modified the source code in SparkContext.scala and add 
> one line log code "logInfo("This is textFile")" in the textFile function. 
> However, after running an application, I found the log "This is textFile" 
> shown in the terminal screen. no such log in the stderr file. I am not sure 
> if this is a bug. So, hope you can solve this question. Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator

2016-12-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787325#comment-15787325
 ] 

Sean Owen commented on SPARK-18986:
---

(Priority really doesn't mean anything. Feel free to help review the PR.)

> ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its 
> iterator
> -
>
> Key: SPARK-18986
> URL: https://issues.apache.org/jira/browse/SPARK-18986
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> {{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check if an 
> iterator is not null in the map. However, the assertion is only true after 
> the map is asked for iterator. Before it, if another memory consumer asks 
> more memory than currently available, {{ExternalAppendOnlyMap.forceSpill}} is 
> also be called too. In this case, we will see failure like this:
> {code}
> [info]   java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:156)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
> [info]   at 
> org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnly
> MapSuite.scala:294)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19034) Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2

2016-12-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19034.
---
Resolution: Invalid

I checked several of the 2.1.0 artifacts and they appear to contain the correct 
2.1.0 release. If you're going to file a bug, you need to say which exact file 
you downloaded and why you think it's not the right version.

> Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2
> 
>
> Key: SPARK-19034
> URL: https://issues.apache.org/jira/browse/SPARK-19034
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
> Environment: All
>Reporter: Sanjay Dasgupta
>  Labels: distribution, download
>
> Download packages on 'https://spark.apache.org/downloads.html' have the right 
> name ( spark-2.1.0-bin-...) but contain the release 2.0.2 software



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18738) Some Spark SQL queries has poor performance on HDFS Erasure Coding feature when enabling dynamic allocation.

2016-12-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18738.
---
Resolution: Not A Problem

I am going to close this as not a problem unless there's more info here ... 
clearly the problem is data locality being lower in the EC case. You can still 
tune Spark to not mind locality and possibly use all the executors, but 2/3 of 
the tasks are having to read data remotely then, which is its own overhead. 
It's not clear why only 1/3 of your machines appears to have data available for 
local execution. This also makes me wonder if you literally have 3x fewer data 
nodes in the EC case.
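
A minimal sketch of the locality tuning mentioned above; the keys are standard Spark settings, but the values are illustrative only:

{code}
import org.apache.spark.SparkConf

// Sketch: stop the scheduler from waiting for node-local blocks and keep idle
// executors around longer, so dynamic allocation does not shrink the job down to
// the nodes that happen to hold EC data locally.
val conf = new SparkConf()
  .set("spark.locality.wait", "0s")                            // schedule tasks without waiting for locality
  .set("spark.dynamicAllocation.executorIdleTimeout", "300s")  // default is 60s
{code}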

> Some Spark SQL queries has poor performance on HDFS Erasure Coding feature 
> when enabling dynamic allocation.
> 
>
> Key: SPARK-18738
> URL: https://issues.apache.org/jira/browse/SPARK-18738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Lifeng Wang
>
> We run TPCx-BB with Spark SQL engine on local cluster using Spark 2.0.3 trunk 
> and Hadoop 3.0 alpha 2 trunk. We run Spark SQL queries with same data size on 
> both Erasure Coding and 3-replication.  The test results show that some 
> queries has much worse performance on EC compared to 3-replication. After 
> initial investigations, we found spark starts one third executors to execute 
> queries on EC compared to 3-replication. 
> We use query 30 as example, our cluster can totally launch 108 executors. 
> When we run the query from 3-replication database, spark will start all 108 
> executors to execute the query.  When we run the query from Erasure Coding 
> database, spark will launch 108 executors and kill 72 executors due to 
> they’re idle, at last there are only 36 executors to execute the query which 
> leads to poor performance.
> This issue only happens when we enable dynamic allocations mechanism. When we 
> disable the dynamic allocations, Spark SQL query on EC has the similar 
> performance with on 3-replication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18781) Allow MatrixFactorizationModel.predict to skip user/product approximation count

2016-12-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787309#comment-15787309
 ] 

Sean Owen commented on SPARK-18781:
---

I take it back; I see why remembering the result probably doesn't help anything.
[~viirya], what do you think of the idea of a new method that maybe takes an 
optional boolean "partition by users" to let the caller avoid this overhead?
I think the RDD predict method is meant for large batches where this overhead 
is trivial, so I'm still not sure about complicating the API for this. This is 
never going to be a real-time method.
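
A purely hypothetical sketch of what such an overload could look like; no method with this signature exists in MLlib, it only illustrates the "partition by users" flag being discussed:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

// Hypothetical signature only; it illustrates the proposal, not existing API.
// The flag would let a caller who already knows the data shape skip the
// approximate user/product count that predict() currently performs.
trait PredictWithPartitionHint {
  def predict(usersProducts: RDD[(Int, Int)], partitionByUsers: Boolean): RDD[Rating]
}
{code}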

> Allow MatrixFactorizationModel.predict to skip user/product approximation 
> count
> ---
>
> Key: SPARK-18781
> URL: https://issues.apache.org/jira/browse/SPARK-18781
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Eyal Allweil
>Priority: Minor
>
> When 
> [MatrixFactorizationModel.predict|https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.html#predict(org.apache.spark.rdd.RDD)]
>  is used, it first calculates an approximation count of the users and 
> products in order to determine the most efficient way to proceed. In many 
> cases, the answer to this question is fixed (typically there are more users 
> than products by an order of magnitude) and this check is unnecessary. Adding 
> a parameter to this predict method to allow choosing the implementation (and 
> skipping the check) would be nice.
> It would be especially nice in development cycles when you are repeatedly 
> tweaking your model and which pairs you're predicting for and this 
> approximate count represents a meaningful portion of the time you wait for 
> results.
> I can provide a pull request with this ability added that preserves the 
> existing behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19035) rand() function in case when cause failed

2016-12-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787298#comment-15787298
 ] 

Sean Owen commented on SPARK-19035:
---

I'm not sure that's supposed to work. You are using a in the select, but it 
isn't present in the group-by expression. That's what the error says, right? 
Why not just group on b?
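
For reference, a minimal sketch of that approach, reusing the table and column names from the report: compute the rand()-based expression once in a subquery, then group on the resulting column so the grouping and output expressions are literally the same.

{code}
// Sketch: group on the derived column b itself; rand() is evaluated once per row
// inside the subquery, so the SELECT and GROUP BY see the same value.
spark.sql("""
  SELECT b, count(1)
  FROM (
    SELECT CASE WHEN a = 1 THEN '1'
                ELSE concat(a, cast(rand() AS string)) END AS b
    FROM yuanfeng1_a
  ) t
  GROUP BY b
""").show()
{code}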

> rand() function in case when cause failed
> -
>
> Key: SPARK-19035
> URL: https://issues.apache.org/jira/browse/SPARK-19035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Feng Yuan
>
> *In this case:*
>select 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end b,count(1) 
>from 
>yuanfeng1_a 
>group by 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end;
> *Throw error:*
> Error in query: expression 'yuanfeng1_a.`a`' is neither present in the group 
> by, nor is it an aggregate function. Add to group by or wrap in first() (or 
> first_value) if you don't care which value you get.;;
> Aggregate [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE 
> concat(cast(a#2075 as string), cast(rand(519367429988179997) as string)) 
> END], [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE concat(cast(a#2075 
> as string), cast(rand(8090243936131101651) as string)) END AS b#2074]
> +- MetastoreRelation default, yuanfeng1_a
> select case when a=1 then 1 else rand() end b,count(1) from yuanfeng1_a group 
> by case when a=1 then rand() end also output this
> *Notice*:
> If replace rand() as 1,it work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19036) Merging dealyed micro batches

2016-12-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787290#comment-15787290
 ] 

Sean Owen commented on SPARK-19036:
---

This completely overlooks that merging batches means delaying any processing. I 
think that kind of defeats the point of streaming with interval X. If you can 
tolerate higher latency, you can just increase the interval X to get overall 
better throughput.
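
A minimal sketch of that alternative, assuming the DStream API described in the report; the 30-second interval is illustrative:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: a longer batch interval already coalesces more records per batch,
// trading latency for throughput without any merge logic.
val conf = new SparkConf().setAppName("longer-interval-sketch")
val ssc = new StreamingContext(conf, Seconds(30))
{code}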

> Merging dealyed micro batches
> -
>
> Key: SPARK-19036
> URL: https://issues.apache.org/jira/browse/SPARK-19036
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Sungho Ham
>Priority: Minor
>
> Efficiency of parallel execution get worsen when data is not evenly 
> distributed by both time and message. Theses skews make streaming batch 
> delayed despite sufficient resources. 
> Merging small-sized delayed batches could help increase efficiency of micro 
> batch.  
> Here is an example.
> ||batch_time ||   messages ||
> |4 |1   <--  current time |
> |3 |1   |
> |2 | 1  |
> |1 |1000  <-- processing |
> After long-running batch (t=1),  three batches has only one message. These 
> batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 
> to t=4 into one RDD, Spark can process them 3 times faster.
> If processing time of each message is highly skewed also, not only utilizing 
> parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG 
> message and ten small messages. Then, merging three RDDs still could 
> considerably improve efficiency.
> ||batch_time  ||  messages ||   
> | 4 |  1+10   <--  current time |
> | 3 |  1+10   |
> | 2 |  1+10|
> | 1 |  1000  <-- processing  |
> There could be two parameters to describe merging behavior.
> - delay_time_limit:  when to start merge 
> - merge_record_limit: when to stop merge



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.

2016-12-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787284#comment-15787284
 ] 

Sean Owen commented on SPARK-18930:
---

You say the result of the select is correct. If you want a different 
projection, you just change your SELECT, right?
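
A minimal sketch of that reordered SELECT, based on the statements in the report: with dynamic partitioning the partition column has to come last, so the aggregate is projected first and day last (cast to match the declared column type).

{code}
// Sketch: a dynamic-partition INSERT expects data columns first and the partition
// column (day) last, matching the column order num, day of the target table.
spark.sql("""
  INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
  SELECT cast(count(*) AS string) AS num, day
  FROM hss.session
  WHERE year = 2016 AND month = 4
  GROUP BY day
""")
{code}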

> Inserting in partitioned table - partitioned field should be last in select 
> statement. 
> ---
>
> Key: SPARK-18930
> URL: https://issues.apache.org/jira/browse/SPARK-18930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> CREATE TABLE temp.test_partitioning_4 (
>   num string
>  ) 
> PARTITIONED BY (
>   day string)
>   stored as parquet
> INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
> select day, count(*) as num from 
> hss.session where year=2016 and month=4 
> group by day
> Resulted schema on HDFS: /temp.db/test_partitioning_3/day=62456298, 
> emp.db/test_partitioning_3/day=69094345
> As you can imagine these numbers are num of records. But! When I do select * 
> from  temp.test_partitioning_4 data is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19036) Merging dealyed micro batches

2016-12-30 Thread Sungho Ham (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungho Ham updated SPARK-19036:
---
Summary: Merging dealyed micro batches  (was: Merging dealyed micro batches 
for parallelism )

> Merging dealyed micro batches
> -
>
> Key: SPARK-19036
> URL: https://issues.apache.org/jira/browse/SPARK-19036
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Sungho Ham
>Priority: Minor
>
> Efficiency of parallel execution get worsen when data is not evenly 
> distributed by both time and message. Theses skews make streaming batch 
> delayed despite sufficient resources. 
> Merging small-sized delayed batches could help increase efficiency of micro 
> batch.  
> Here is an example.
> ||batch_time ||   messages ||
> |4 |1   <--  current time |
> |3 |1   |
> |2 | 1  |
> |1 |1000  <-- processing |
> After long-running batch (t=1),  three batches has only one message. These 
> batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 
> to t=4 into one RDD, Spark can process them 3 times faster.
> If processing time of each message is highly skewed also, not only utilizing 
> parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG 
> message and ten small messages. Then, merging three RDDs still could 
> considerably improve efficiency.
> ||batch_time  ||  messages ||   
> | 4 |  1+10   <--  current time |
> | 3 |  1+10   |
> | 2 |  1+10|
> | 1 |  1000  <-- processing  |
> There could be two parameters to describe merging behavior.
> - delay_time_limit:  when to start merge 
> - merge_record_limit: when to stop merge



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19036) Merging dealyed micro batches for parallelism

2016-12-30 Thread Sungho Ham (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungho Ham updated SPARK-19036:
---
Component/s: Structured Streaming

> Merging dealyed micro batches for parallelism 
> --
>
> Key: SPARK-19036
> URL: https://issues.apache.org/jira/browse/SPARK-19036
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Sungho Ham
>Priority: Minor
>
> Efficiency of parallel execution get worsen when data is not evenly 
> distributed by both time and message. Theses skews make streaming batch 
> delayed despite sufficient resources. 
> Merging small-sized delayed batches could help increase efficiency of micro 
> batch.  
> Here is an example.
> ||batch_time ||   messages ||
> |4 |1   <--  current time |
> |3 |1   |
> |2 | 1  |
> |1 |1000  <-- processing |
> After long-running batch (t=1),  three batches has only one message. These 
> batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 
> to t=4 into one RDD, Spark can process them 3 times faster.
> If processing time of each message is highly skewed also, not only utilizing 
> parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG 
> message and ten small messages. Then, merging three RDDs still could 
> considerably improve efficiency.
> ||batch_time  ||  messages ||   
> | 4 |  1+10   <--  current time |
> | 3 |  1+10   |
> | 2 |  1+10|
> | 1 |  1000  <-- processing  |
> There could be two parameters to describe merging behavior.
> - delay_time_limit:  when to start merge 
> - merge_record_limit: when to stop merge



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19036) Merging dealyed micro batches for parallelism

2016-12-30 Thread Sungho Ham (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungho Ham updated SPARK-19036:
---
Description: 
Efficiency of parallel execution get worsen when data is not evenly distributed 
by both time and message. Theses skews make streaming batch delayed despite 
sufficient resources. 

Merging small-sized delayed batches could help increase efficiency of micro 
batch.  

Here is an example.

||batch_time ||   messages ||
|4 |1   <--  current time |
|3 |1   |
|2 | 1  |
|1 |1000  <-- processing |

After long-running batch (t=1),  three batches has only one message. These 
batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 to 
t=4 into one RDD, Spark can process them 3 times faster.

If processing time of each message is highly skewed also, not only utilizing 
parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG message 
and ten small messages. Then, merging three RDDs still could considerably 
improve efficiency.

||batch_time  ||  messages ||   
| 4 |  1+10   <--  current time |
| 3 |  1+10   |
| 2 |  1+10|
| 1 |  1000  <-- processing  |

There could be two parameters to describe merging behavior.
- delay_time_limit:  when to start merge 
- merge_record_limit: when to stop merge



  was:
Efficiency of parallel execution get worsen when data is not evenly distributed 
by both time and message. Theses skews make streaming batch delayed despite 
sufficient resources. 

Merging small-sized delayed batches could help increase efficiency of micro 
batch.  

Here is an example.

||batch_time ||   messages ||
|4 |1   <--  current time |
|3 |1   |
|2 | 1  |
|1 |1000  <-- processing |

After long-running batch (t=1),  three batches has only one message. These 
batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 to 
t=4 into one RDD, Spark can process them 3 times faster.

If processing time of each message is highly skewed also, not only utilizing 
parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG message 
and ten small messages. Then, merging three RDDs still could considerably 
improve efficiency.

||batch_time  ||  messages ||   
| 4 |  1+10   <--  current time |
| 3 |  1+10   |
| 2 |  1+10|
| 1 |  1000  <-- processing  |

There could two parameters to describe merging behavior.
- delay_time_limit:  when to start merge 
- merge_record_limit: when to stop merge




> Merging dealyed micro batches for parallelism 
> --
>
> Key: SPARK-19036
> URL: https://issues.apache.org/jira/browse/SPARK-19036
> Project: Spark
>  Issue Type: New Feature
>Reporter: Sungho Ham
>Priority: Minor
>
> Efficiency of parallel execution get worsen when data is not evenly 
> distributed by both time and message. Theses skews make streaming batch 
> delayed despite sufficient resources. 
> Merging small-sized delayed batches could help increase efficiency of micro 
> batch.  
> Here is an example.
> ||batch_time ||   messages ||
> |4 |1   <--  current time |
> |3 |1   |
> |2 | 1  |
> |1 |1000  <-- processing |
> After long-running batch (t=1),  three batches has only one message. These 
> batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 
> to t=4 into one RDD, Spark can process them 3 times faster.
> If processing time of each message is highly skewed also, not only utilizing 
> parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG 
> message and ten small messages. Then, merging three RDDs still could 
> considerably improve efficiency.
> ||batch_time  ||  messages ||   
> | 4 |  1+10   <--  current time |
> | 3 |  1+10   |
> | 2 |  1+10|
> | 1 |  1000  <-- processing  |
> There could be two parameters to describe merging behavior.
> - delay_time_limit:  when to start merge 
> - merge_record_limit: when to stop merge



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19036) Merging dealyed micro batches for parallelism

2016-12-30 Thread Sungho Ham (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungho Ham updated SPARK-19036:
---
Description: 
Efficiency of parallel execution get worsen when data is not evenly distributed 
by both time and message. Theses skews make streaming batch delayed despite 
sufficient resources. 

Merging small-sized delayed batches could help increase efficiency of micro 
batch.  

Here is an example.

||batch_time ||   messages ||
|4 |1   <--  current time |
|3 |1   |
|2 | 1  |
|1 |1000  <-- processing |

After long-running batch (t=1),  three batches has only one message. These 
batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 to 
t=4 into one RDD, Spark can process them 3 times faster.

If processing time of each message is highly skewed also, not only utilizing 
parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG message 
and ten small messages. Then, merging three RDDs still could considerably 
improve efficiency.

||batch_time  ||  messages ||   
| 4 |  1+10   <--  current time |
| 3 |  1+10   |
| 2 |  1+10|
| 1 |  1000  <-- processing  |

There could two parameters to describe merging behavior.
- delay_time_limit:  when to start merge 
- merge_record_limit: when to stop merge



  was:
Efficiency of parallel execution get worsen when data is not evenly distributed 
by both time and message. Theses skews make streaming batch delayed despite 
sufficient resources. 

Merging small-sized delayed batches could help increase efficiency of micro 
batch.  

Here is an example.

batch_timemessages 
-
4  1   <--  current time
3  1 
2  1 
1  1000  <-- processing 

After long-running batch (t=1),  three batches has only one message. These 
batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 to 
t=4 into one RDD, Spark can process them 3 times faster.

If processing time of each message is highly skewed also, not only utilizing 
parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG message 
and ten small messages. Then, merging three RDDs still could considerably 
improve efficiency.

batch_timemessages 
-
4  1+10   <--  current time
3  1+10   
2  1+10
1  1000  <-- processing 

There could two parameters to describe merging behavior.
- delay_time_limit:  when to start merge 
- merge_record_limit: when to stop merge




> Merging dealyed micro batches for parallelism 
> --
>
> Key: SPARK-19036
> URL: https://issues.apache.org/jira/browse/SPARK-19036
> Project: Spark
>  Issue Type: New Feature
>Reporter: Sungho Ham
>Priority: Minor
>
> Efficiency of parallel execution get worsen when data is not evenly 
> distributed by both time and message. Theses skews make streaming batch 
> delayed despite sufficient resources. 
> Merging small-sized delayed batches could help increase efficiency of micro 
> batch.  
> Here is an example.
> ||batch_time ||   messages ||
> |4 |1   <--  current time |
> |3 |1   |
> |2 | 1  |
> |1 |1000  <-- processing |
> After long-running batch (t=1),  three batches has only one message. These 
> batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 
> to t=4 into one RDD, Spark can process them 3 times faster.
> If processing time of each message is highly skewed also, not only utilizing 
> parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG 
> message and ten small messages. Then, merging three RDDs still could 
> considerably improve efficiency.
> ||batch_time  ||  messages ||   
> | 4 |  1+10   <--  current time |
> | 3 |  1+10   |
> | 2 |  1+10|
> | 1 |  1000  <-- processing  |
> There could two parameters to describe merging behavior.
> - delay_time_limit:  when to start merge 
> - merge_record_limit: when to stop merge



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19036) Merging dealyed micro batches for parallelism

2016-12-30 Thread Sungho Ham (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungho Ham updated SPARK-19036:
---
Description: 
Efficiency of parallel execution get worsen when data is not evenly distributed 
by both time and message. Theses skews make streaming batch delayed despite 
sufficient resources. 

Merging small-sized delayed batches could help increase efficiency of micro 
batch.  

Here is an example.

batch_timemessages 
-
4  1   <--  current time
3  1 
2  1 
1  1000  <-- processing 

After long-running batch (t=1),  three batches has only one message. These 
batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 to 
t=4 into one RDD, Spark can process them 3 times faster.

If processing time of each message is highly skewed also, not only utilizing 
parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG message 
and ten small messages. Then, merging three RDDs still could considerably 
improve efficiency.

batch_timemessages 
-
4  1+10   <--  current time
3  1+10   
2  1+10
1  1000  <-- processing 

There could two parameters to describe merging behavior.
- delay_time_limit:  when to start merge 
- merge_record_limit: when to stop merge



  was:
Efficiency of parallel execution get worsen when data is not evenly distributed 
by both time and message. Theses skews make streaming batch delayed despite 
sufficient resources. 

Merging small-sized delayed batches could help increase efficiency of micro 
batch.  

Here is an example.

tmessages 
-
41   <--  current time
31 
21 
11000  <-- processing 

After long-running batch (t=1),  three batches has only one message. These 
batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 to 
t=4 into one RDD, Spark can process them 3 times faster.

If processing time of each message is highly skewed also, not only utilizing 
parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG message 
and ten small messages. Then, merging three RDDs still could considerably 
improve efficiency.

tmessages 
-
41+10   <--  current time
31+10   
21+10
11000  <-- processing 

There could two parameters to describe merging behavior.
- delay_time_limit:  when to start merge 
- merge_record_limit: when to stop merge




> Merging dealyed micro batches for parallelism 
> --
>
> Key: SPARK-19036
> URL: https://issues.apache.org/jira/browse/SPARK-19036
> Project: Spark
>  Issue Type: New Feature
>Reporter: Sungho Ham
>Priority: Minor
>
> Efficiency of parallel execution get worsen when data is not evenly 
> distributed by both time and message. Theses skews make streaming batch 
> delayed despite sufficient resources. 
> Merging small-sized delayed batches could help increase efficiency of micro 
> batch.  
> Here is an example.
> batch_timemessages 
> -
> 4  1   <--  current time
> 3  1 
> 2  1 
> 1  1000  <-- processing 
> After long-running batch (t=1),  three batches has only one message. These 
> batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 
> to t=4 into one RDD, Spark can process them 3 times faster.
> If processing time of each message is highly skewed also, not only utilizing 
> parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG 
> message and ten small messages. Then, merging three RDDs still could 
> considerably improve efficiency.
> batch_timemessages 
> -
> 4  1+10   <--  current time
> 3  1+10   
> 2  1+10
> 1  1000  <-- processing 
> There could two parameters to describe merging behavior.
> - delay_time_limit:  when to start merge 
> - merge_record_limit: when to stop merge



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19036) Merging dealyed micro batches for parallelism

2016-12-30 Thread Sungho Ham (JIRA)
Sungho Ham created SPARK-19036:
--

 Summary: Merging dealyed micro batches for parallelism 
 Key: SPARK-19036
 URL: https://issues.apache.org/jira/browse/SPARK-19036
 Project: Spark
  Issue Type: New Feature
Reporter: Sungho Ham
Priority: Minor


Efficiency of parallel execution get worsen when data is not evenly distributed 
by both time and message. Theses skews make streaming batch delayed despite 
sufficient resources. 

Merging small-sized delayed batches could help increase efficiency of micro 
batch.  

Here is an example.

tmessages 
-
41   <--  current time
31 
21 
11000  <-- processing 

After long-running batch (t=1),  three batches has only one message. These 
batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 to 
t=4 into one RDD, Spark can process them 3 times faster.

If processing time of each message is highly skewed also, not only utilizing 
parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG message 
and ten small messages. Then, merging three RDDs still could considerably 
improve efficiency.

tmessages 
-
41+10   <--  current time
31+10   
21+10
11000  <-- processing 

There could two parameters to describe merging behavior.
- delay_time_limit:  when to start merge 
- merge_record_limit: when to stop merge





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical

2016-12-30 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787194#comment-15787194
 ] 

Jork Zijlstra commented on SPARK-19012:
---

[~hvanhovell] It's already working for me; I was already prefixing the 
tableOrViewName. 
I thought you needed an example of how a developer's mind works in (mis)using 
other people's code.

It's nice to see that it's been resolved in just 2 days. 
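
A minimal sketch of that prefixing workaround, reusing sparkSession and path from the snippet quoted below; rawId stands in for any identifier that may start with a digit:

{code}
// Sketch: make sure the view name starts with a letter; rawId is a placeholder
// for whatever (possibly numeric-leading) identifier the application generates.
val rawId = "1468079114"
val viewName = s"view_$rawId"
sparkSession.read.orc(path).createOrReplaceTempView(viewName)

// Backticks also let SQL refer to unusual identifiers explicitly.
sparkSession.sql(s"SELECT * FROM `$viewName` LIMIT 10")
{code}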



> CreateOrReplaceTempView throws 
> org.apache.spark.sql.catalyst.parser.ParseException when viewName first char 
> is numerical
> 
>
> Key: SPARK-19012
> URL: https://issues.apache.org/jira/browse/SPARK-19012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.0.2, 2.1.0
>Reporter: Jork Zijlstra
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> Using a viewName where the the fist char is a numerical value on 
> dataframe.createOrReplaceTempView(viewName: String) causes:
> {code}
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', 
> 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 
> 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 
> 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)
> == SQL ==
> 1
> {code}
> {code}
> val tableOrViewName = "1" //fails
> val tableOrViewName = "a" //works
> sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19035) rand() function in case when cause failed

2016-12-30 Thread Feng Yuan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Yuan updated SPARK-19035:
--
Summary: rand() function in case when cause failed  (was: rand() function 
in case when cause will failed)

> rand() function in case when cause failed
> -
>
> Key: SPARK-19035
> URL: https://issues.apache.org/jira/browse/SPARK-19035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Feng Yuan
>
> *In this case:*
>select 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end b,count(1) 
>from 
>yuanfeng1_a 
>group by 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end;
> *Throw error:*
> Error in query: expression 'yuanfeng1_a.`a`' is neither present in the group 
> by, nor is it an aggregate function. Add to group by or wrap in first() (or 
> first_value) if you don't care which value you get.;;
> Aggregate [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE 
> concat(cast(a#2075 as string), cast(rand(519367429988179997) as string)) 
> END], [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE concat(cast(a#2075 
> as string), cast(rand(8090243936131101651) as string)) END AS b#2074]
> +- MetastoreRelation default, yuanfeng1_a
> select case when a=1 then 1 else rand() end b,count(1) from yuanfeng1_a group 
> by case when a=1 then rand() end also output this
> *Notice*:
> If replace rand() as 1,it work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org