[jira] [Commented] (SPARK-21560) Add hold mode for the LiveListenerBus

2017-07-28 Thread Li Yuanjian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106030#comment-16106030
 ] 

Li Yuanjian commented on SPARK-21560:
-

Something went wrong with the sync between the PR and JIRA; the PR corresponding 
to this issue is https://github.com/apache/spark/pull/18760

> Add hold mode for the LiveListenerBus
> -
>
> Key: SPARK-21560
> URL: https://issues.apache.org/jira/browse/SPARK-21560
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Li Yuanjian
>
> As noted in the comments on SPARK-18838, we face the same problem of critical 
> events being dropped while the event queue is full. 
> There's no doubt that improving the performance of the processing thread is 
> important, whether through multithreading or other approaches such as 
> SPARK-20776, but we may still need a hold strategy for when the event queue 
> is full, resuming once some room has been released. Whether the hold strategy 
> is enabled, and the empty ratio at which to resume, should both be configurable.
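
For illustration only, here is a minimal Scala sketch of the proposed hold strategy. 
It is not Spark's actual LiveListenerBus code; {{capacity}}, {{holdEnabled}} and 
{{emptyRatio}} are hypothetical stand-ins for the configuration options described above.

{code}
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical sketch of the proposed "hold" strategy -- not Spark's actual
// LiveListenerBus implementation. `holdEnabled` and `emptyRatio` stand in for
// the configuration options described in the issue.
class HoldingEventQueue[E](capacity: Int, holdEnabled: Boolean, emptyRatio: Double) {
  private val queue = new LinkedBlockingQueue[E](capacity)

  /** Returns false only if the event had to be dropped. */
  def post(event: E): Boolean = {
    if (queue.offer(event)) {
      true                               // fast path: there was room in the queue
    } else if (!holdEnabled) {
      false                              // current behaviour: drop the event
    } else {
      // Hold mode: wait until the processing thread has freed the configured
      // fraction of the queue, then enqueue (blocking if it fills up again).
      while (queue.remainingCapacity() < (capacity * emptyRatio).toInt) {
        Thread.sleep(10)
      }
      queue.put(event)
      true
    }
  }

  def take(): E = queue.take()
}
{code}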



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins

2017-07-28 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106029#comment-16106029
 ] 

Dongjoon Hyun commented on SPARK-21573:
---

Thank you for pinging me, [~hyukjin.kwon].
+1 for the idea.

> Tests failing with run-tests.py SyntaxError occasionally in Jenkins
> ---
>
> Key: SPARK-21573
> URL: https://issues.apache.org/jira/browse/SPARK-21573
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It looks like the default {{python}} on the path in a few places, such as 
> {{./dev/run-tests}}, is Python 2.6 in Jenkins, and it fails to execute 
> {{run-tests.py}}:
> {code}
> python2.6 run-tests.py
>   File "run-tests.py", line 124
> {m: set(m.dependencies).intersection(modules_to_test) for m in 
> modules_to_test}, sort=True)
> ^
> SyntaxError: invalid syntax
> {code}
> It looks like there are quite a few places in {{run-tests.py}} and the related 
> Python scripts that would need fixing to support Python 2.6.
> We might just try to set Python 2.7, if available, in the few other scripts 
> that run this.
> Please also see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html






[jira] [Commented] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins

2017-07-28 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106024#comment-16106024
 ] 

holdenk commented on SPARK-21573:
-

Yes, we did drop Python 2.6 support. We should change the script to use 
python2.7 explicitly and, while we are thinking about it, probably also consider 
throwing an exception when PySpark is launched with Python 2.6, so it's clear 
why it's failing.

> Tests failing with run-tests.py SyntaxError occasionally in Jenkins
> ---
>
> Key: SPARK-21573
> URL: https://issues.apache.org/jira/browse/SPARK-21573
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It looks like the default {{python}} on the path in a few places, such as 
> {{./dev/run-tests}}, is Python 2.6 in Jenkins, and it fails to execute 
> {{run-tests.py}}:
> {code}
> python2.6 run-tests.py
>   File "run-tests.py", line 124
> {m: set(m.dependencies).intersection(modules_to_test) for m in 
> modules_to_test}, sort=True)
> ^
> SyntaxError: invalid syntax
> {code}
> It looks like there are quite a few places in {{run-tests.py}} and the related 
> Python scripts that would need fixing to support Python 2.6.
> We might just try to set Python 2.7, if available, in the few other scripts 
> that run this.
> Please also see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html






[jira] [Updated] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins

2017-07-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21573:
-
Description: 
It looks like the default {{python}} on the path in a few places, such as 
{{./dev/run-tests}}, is Python 2.6 in Jenkins, and it fails to execute 
{{run-tests.py}}:

{code}
python2.6 run-tests.py
  File "run-tests.py", line 124
{m: set(m.dependencies).intersection(modules_to_test) for m in 
modules_to_test}, sort=True)
^
SyntaxError: invalid syntax
{code}

It looks like there are quite a few places in {{run-tests.py}} and the related 
Python scripts that would need fixing to support Python 2.6.
We might just try to set Python 2.7, if available, in the few other scripts 
that run this.

Please also see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html

  was:
{code}
python2.6 run-tests.py
  File "run-tests.py", line 124
{m: set(m.dependencies).intersection(modules_to_test) for m in 
modules_to_test}, sort=True)
^
SyntaxError: invalid syntax
{code}

It looks like {{run-tests.py}} fails to execute with Python 2.6. It looks like 
there are quite a few places in {{run-tests.py}} that would need fixing to 
support Python 2.6.

We might just try to set Python 2.7, if available, in the few other scripts 
that run this.

Please also see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html


> Tests failing with run-tests.py SyntaxError occasionally in Jenkins
> ---
>
> Key: SPARK-21573
> URL: https://issues.apache.org/jira/browse/SPARK-21573
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It looks like the default {{python}} on the path in a few places, such as 
> {{./dev/run-tests}}, is Python 2.6 in Jenkins, and it fails to execute 
> {{run-tests.py}}:
> {code}
> python2.6 run-tests.py
>   File "run-tests.py", line 124
> {m: set(m.dependencies).intersection(modules_to_test) for m in 
> modules_to_test}, sort=True)
> ^
> SyntaxError: invalid syntax
> {code}
> It looks like there are quite a few places in {{run-tests.py}} and the related 
> Python scripts that would need fixing to support Python 2.6.
> We might just try to set Python 2.7, if available, in the few other scripts 
> that run this.
> Please also see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html






[jira] [Updated] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins

2017-07-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21573:
-
Description: 
{code}
python2.6 run-tests.py
  File "run-tests.py", line 124
{m: set(m.dependencies).intersection(modules_to_test) for m in 
modules_to_test}, sort=True)
^
SyntaxError: invalid syntax
{code}

It looks like {{run-tests.py}} fails to execute with Python 2.6. It looks like 
there are quite a few places in {{run-tests.py}} that would need fixing to 
support Python 2.6.

We might just try to set Python 2.7, if available, in the few other scripts 
that run this.

Please also see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html

  was:
{code}
python2.6 run-tests.py
  File "run-tests.py", line 124
{m: set(m.dependencies).intersection(modules_to_test) for m in 
modules_to_test}, sort=True)
^
SyntaxError: invalid syntax
{code}

It looks like {{run-tests.py}} fails to execute with Python 2.6. It looks like 
there are quite a few places in {{run-tests.py}} that would need fixing to 
support Python 2.6.

We might just try to set Python 2.7 in this script if available.

Please also see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html


> Tests failing with run-tests.py SyntaxError occasionally in Jenkins
> ---
>
> Key: SPARK-21573
> URL: https://issues.apache.org/jira/browse/SPARK-21573
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> python2.6 run-tests.py
>   File "run-tests.py", line 124
> {m: set(m.dependencies).intersection(modules_to_test) for m in 
> modules_to_test}, sort=True)
> ^
> SyntaxError: invalid syntax
> {code}
> It looks like {{run-tests.py}} fails to execute with Python 2.6. It looks like 
> there are quite a few places in {{run-tests.py}} that would need fixing to 
> support Python 2.6.
> We might just try to set Python 2.7, if available, in the few other scripts 
> that run this.
> Please also see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html






[jira] [Commented] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins

2017-07-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106021#comment-16106021
 ] 

Hyukjin Kwon commented on SPARK-21573:
--

I think we officially dropped Python 2.6 in 
https://issues.apache.org/jira/browse/SPARK-12661; however, I was thinking an 
option would be to fix the script to use python2.7, if available, if it is 
currently hard to fix this in Jenkins.

cc [~srowen], [~joshrosen], [~holdenk] and [~dongjoon].

> Tests failing with run-tests.py SyntaxError occasionally in Jenkins
> ---
>
> Key: SPARK-21573
> URL: https://issues.apache.org/jira/browse/SPARK-21573
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> python2.6 run-tests.py
>   File "run-tests.py", line 124
> {m: set(m.dependencies).intersection(modules_to_test) for m in 
> modules_to_test}, sort=True)
> ^
> SyntaxError: invalid syntax
> {code}
> It looks like {{run-tests.py}} fails to execute with Python 2.6. It looks like 
> there are quite a few places in {{run-tests.py}} that would need fixing to 
> support Python 2.6.
> We might just try to set Python 2.7 in this script if available.
> Please also see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html






[jira] [Updated] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins

2017-07-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21573:
-
Summary: Tests failing with run-tests.py SyntaxError occasionally in 
Jenkins  (was: Tests failing with run-tests.py SyntaxError occasionally)

> Tests failing with run-tests.py SyntaxError occasionally in Jenkins
> ---
>
> Key: SPARK-21573
> URL: https://issues.apache.org/jira/browse/SPARK-21573
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> python2.6 run-tests.py
>   File "run-tests.py", line 124
> {m: set(m.dependencies).intersection(modules_to_test) for m in 
> modules_to_test}, sort=True)
> ^
> SyntaxError: invalid syntax
> {code}
> It looks like {{run-tests.py}} fails to execute with Python 2.6. It looks like 
> there are quite a few places in {{run-tests.py}} that would need fixing to 
> support Python 2.6.
> We might just try to set Python 2.7 in this script if available.
> Please also see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html






[jira] [Updated] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally

2017-07-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21573:
-
Summary: Tests failing with run-tests.py SyntaxError occasionally  (was: 
run-tests script fails with Python 2.6)

> Tests failing with run-tests.py SyntaxError occasionally
> 
>
> Key: SPARK-21573
> URL: https://issues.apache.org/jira/browse/SPARK-21573
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> python2.6 run-tests.py
>   File "run-tests.py", line 124
> {m: set(m.dependencies).intersection(modules_to_test) for m in 
> modules_to_test}, sort=True)
> ^
> SyntaxError: invalid syntax
> {code}
> It looks like {{run-tests.py}} fails to execute with Python 2.6. It looks like 
> there are quite a few places in {{run-tests.py}} that would need fixing to 
> support Python 2.6.
> We might just try to set Python 2.7 in this script if available.
> Please also see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html






[jira] [Assigned] (SPARK-20090) Add StructType.fieldNames to Python API

2017-07-28 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-20090:
---

Assignee: Hyukjin Kwon

> Add StructType.fieldNames to Python API
> ---
>
> Key: SPARK-20090
> URL: https://issues.apache.org/jira/browse/SPARK-20090
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would 
> be nice if the Python {{StructType}} did as well.






[jira] [Created] (SPARK-21573) run-tests script fails with Python 2.6

2017-07-28 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-21573:


 Summary: run-tests script fails with Python 2.6
 Key: SPARK-21573
 URL: https://issues.apache.org/jira/browse/SPARK-21573
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon
Priority: Minor


{code}
python2.6 run-tests.py
  File "run-tests.py", line 124
{m: set(m.dependencies).intersection(modules_to_test) for m in 
modules_to_test}, sort=True)
^
SyntaxError: invalid syntax
{code}

It looks like {{run-tests.py}} fails to execute with Python 2.6. It looks like 
there are quite a few places in {{run-tests.py}} that would need fixing to 
support Python 2.6.

We might just try to set Python 2.7 in this script if available.

Please also see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html






[jira] [Resolved] (SPARK-20090) Add StructType.fieldNames to Python API

2017-07-28 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-20090.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18618
[https://github.com/apache/spark/pull/18618]

> Add StructType.fieldNames to Python API
> ---
>
> Key: SPARK-20090
> URL: https://issues.apache.org/jira/browse/SPARK-20090
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would 
> be nice if the Python {{StructType}} did as well.
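
For reference, a short Scala snippet (illustrative schema only) showing the existing 
{{fieldNames}} method that the Python {{StructType}} now mirrors:

{code}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// StructType.fieldNames on the Scala side returns the field names as an Array[String].
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType)))

schema.fieldNames.foreach(println)   // prints "id" then "name"
{code}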






[jira] [Commented] (SPARK-21533) "configure(...)" method not called when using Hive Generic UDFs

2017-07-28 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106000#comment-16106000
 ] 

Takeshi Yamamuro commented on SPARK-21533:
--

I feel supporting this function makes little sense because Spark does not use 
`MapredContext` internally. But this is somewhat error-prone, so we should 
probably print a warning message or something if the function is found in Hive 
UDFs.

> "configure(...)" method not called when using Hive Generic UDFs
> ---
>
> Key: SPARK-21533
> URL: https://issues.apache.org/jira/browse/SPARK-21533
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
>Reporter: Dean Gurvitz
>Priority: Minor
>
> Using Spark 2.1.1 and Java API, when executing a Hive Generic UDF through the 
> Spark SQL API, the configure() method in it is not called prior to the 
> initialize/evaluate methods as expected. 
> The method configure receives a MapredContext object. It is possible to 
> construct a version of such an object adjusted to Spark, and therefore 
> configure should be called to enable a smooth execution of all Hive Generic 
> UDFs.






[jira] [Created] (SPARK-21572) Add description on how to exit the spark-shell in the welcome message

2017-07-28 Thread Donghui Xu (JIRA)
Donghui Xu created SPARK-21572:
--

 Summary: Add description on how to exit the spark-shell in the 
welcome message
 Key: SPARK-21572
 URL: https://issues.apache.org/jira/browse/SPARK-21572
 Project: Spark
  Issue Type: Improvement
  Components: Spark Shell
Affects Versions: 2.2.0
Reporter: Donghui Xu
Priority: Trivial


Users of the spark-shell may not know how to exit it. So we should add a 
description of how to exit to the welcome message.






[jira] [Commented] (SPARK-19490) Hive partition columns are case-sensitive

2017-07-28 Thread cen yuhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105979#comment-16105979
 ] 

cen yuhai commented on SPARK-19490:
---

I forgot the fixing PR; maybe that PR will not be backported to 2.1.

> Hive partition columns are case-sensitive
> -
>
> Key: SPARK-19490
> URL: https://issues.apache.org/jira/browse/SPARK-19490
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> The real partition columns are lower case (year, month, day).
> {code}
> Caused by: java.lang.RuntimeException: Expected only partition pruning 
> predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2472)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85)
>   at 
> org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:213)
>   at 
> org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:261)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:117)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> {code}
> Use these sql can reproduce this bug:
> CREATE TABLE partition_test (key Int) partitioned by (date string)
> SELECT * FROM partition_test where DATE = '20170101'






[jira] [Created] (SPARK-21571) Spark history server leaves incomplete or unreadable history files around forever.

2017-07-28 Thread Eric Vandenberg (JIRA)
Eric Vandenberg created SPARK-21571:
---

 Summary: Spark history server leaves incomplete or unreadable 
history files around forever.
 Key: SPARK-21571
 URL: https://issues.apache.org/jira/browse/SPARK-21571
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.2.0
Reporter: Eric Vandenberg
Priority: Minor


We have noticed that history server logs are sometimes never cleaned up.  The 
current history server logic *ONLY* cleans up history files if they are 
completed, since in general it doesn't make sense to clean up in-progress 
history files (after all, the job is presumably still running).  Note that 
in-progress history files would generally not be targeted for clean-up anyway, 
assuming they regularly flush logs and the file system accurately updates the 
history log's last-modified time/size; while this is likely, it is not 
guaranteed behavior.

As a consequence of the current clean up logic and a combination of unclean 
shutdowns, various file system bugs, earlier spark bugs, etc. we have 
accumulated thousands of these dead history files associated with long since 
gone jobs.

For example (with spark.history.fs.cleaner.maxAge=14d):

-rw-rw   3 xx   ooo  14382 2016-09-13 15:40 /user/hadoop/xx/spark/logs/qq1974_ppp-8812_11058600195_dev4384_-53982.zstandard
-rw-rw   3  ooo   5933 2016-11-01 20:16 /user/hadoop/xx/spark/logs/qq2016_ppp-8812_12650700673_dev5365_-65313.lz4
-rw-rw   3 yyy  ooo  0 2017-01-19 11:59 /user/hadoop/xx/spark/logs/0057_326_m-57863.lz4.inprogress
-rw-rw   3 xooo  0 2017-01-19 14:17 /user/hadoop/xx/spark/logs/0063_688_m-33246.lz4.inprogress
-rw-rw   3 yyy  ooo  0 2017-01-20 10:56 /user/hadoop/xx/spark/logs/1030_326_m-45195.lz4.inprogress
-rw-rw   3  ooo  11955 2017-01-20 17:55 /user/hadoop/xx/spark/logs/1314_54_kk-64671.lz4.inprogress
-rw-rw   3  ooo  11958 2017-01-20 17:55 /user/hadoop/xx/spark/logs/1315_1667_kk-58968.lz4.inprogress
-rw-rw   3  ooo  11960 2017-01-20 17:55 /user/hadoop/xx/spark/logs/1316_54_kk-48058.lz4.inprogress

Based on the current logic, clean-up candidates are skipped in several cases:
1. if a file has 0 bytes, it is completely ignored
2. if a file is in progress, it is completely ignored
3. if a file is complete but not parseable, or its appID can't be extracted, it 
is completely ignored.

To address this edge case and provide a way to clean out orphaned history files 
I propose a new configuration option:

spark.history.fs.cleaner.aggressive={true, false}, default is false.

If true, the history server will more aggressively garbage collect history 
files in cases (1), (2) and (3).  Since the default is false, existing 
customers won't be affected unless they explicitly opt-in.  If customers have 
similar leaking garbage over time they have the option of aggressively cleaning 
up in such cases.  Also note that aggressive clean up may not be appropriate 
for some customers if they have long running jobs that exceed the 
cleaner.maxAge time frame and/or have buggy file systems.

I would like to get feedback on whether this seems like a reasonable solution.
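
To make the proposal concrete, here is a rough Scala sketch of how the flag could 
widen the set of clean-up candidates. It is illustrative only; {{HistoryFile}}, its 
fields, and {{cleanupCandidates}} are hypothetical stand-ins rather than the actual 
history provider internals.

{code}
// Illustrative sketch only -- not the actual history server code.
case class HistoryFile(path: String, sizeBytes: Long, lastModifiedMs: Long,
                       inProgress: Boolean, parseable: Boolean)

def cleanupCandidates(files: Seq[HistoryFile], maxAgeMs: Long, now: Long,
                      aggressive: Boolean): Seq[HistoryFile] = {
  files.filter { f =>
    val expired = now - f.lastModifiedMs > maxAgeMs
    // Cases (1), (2) and (3) above: zero-byte, in-progress, or unparseable files.
    val normallySkipped = f.sizeBytes == 0 || f.inProgress || !f.parseable
    if (aggressive) expired                  // aggressive mode also collects these
    else expired && !normallySkipped         // current behaviour: leave them alone
  }
}
{code}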







[jira] [Commented] (SPARK-18535) Redact sensitive information from Spark logs and UI

2017-07-28 Thread Diogo Munaro Vieira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105955#comment-16105955
 ] 

Diogo Munaro Vieira commented on SPARK-18535:
-

I did a merge request for version 2.1.2: 
https://github.com/apache/spark/pull/18765

> Redact sensitive information from Spark logs and UI
> ---
>
> Key: SPARK-18535
> URL: https://issues.apache.org/jira/browse/SPARK-18535
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 2.1.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
> Attachments: redacted.png
>
>
> A Spark user may have to provide sensitive information for a Spark 
> configuration property, or source an environment variable in the 
> executor or driver environment that contains sensitive information. A good 
> example of this would be when reading/writing data from/to S3 using Spark. 
> The S3 secret and S3 access key can be placed in a [hadoop credential 
> provider|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html].
>  However, one still needs to provide the password for the credential provider 
> to Spark, which is typically supplied as an environment variable to the 
> driver and executor environments. This environment variable shows up in logs, 
> and may also show up in the UI.
> 1. For logs, it shows up in a few places:
>   1A. Event logs under {{SparkListenerEnvironmentUpdate}} event.
>   1B. YARN logs, when printing the executor launch context.
> 2. For UI, it would show up in the _Environment_ tab, but it is redacted if 
> it contains the words "password" or "secret" in it. And, these magic words 
> are 
> [hardcoded|https://github.com/apache/spark/blob/a2d464770cd183daa7d727bf377bde9c21e29e6a/core/src/main/scala/org/apache/spark/ui/env/EnvironmentPage.scala#L30]
>  and hence not customizable.
> This JIRA is to track the work to make sure sensitive information is redacted 
> from all logs and UIs in Spark, while still being passed on to all relevant 
> places it needs to get passed on to.






[jira] [Commented] (SPARK-19720) Redact sensitive information from SparkSubmit console output

2017-07-28 Thread Diogo Munaro Vieira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105952#comment-16105952
 ] 

Diogo Munaro Vieira commented on SPARK-19720:
-

I did a merge request for this compatibility feature on version 2.1.2: 
https://github.com/apache/spark/pull/18765

> Redact sensitive information from SparkSubmit console output
> 
>
> Key: SPARK-19720
> URL: https://issues.apache.org/jira/browse/SPARK-19720
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 took care of redacting sensitive information from Spark event 
> logs and UI. However, it intentionally didn't bother redacting the same 
> sensitive information from SparkSubmit's console output because it was on the 
> client's machine, which already had the sensitive information on disk (in 
> spark-defaults.conf) or on terminal (spark-submit command line).
> However, it seems now that it's better to redact information from 
> SparkSubmit's console output as well because orchestration software like 
> Oozie usually expose SparkSubmit's console output via a UI. To make matters 
> worse, Oozie, in particular, always sets the {{--verbose}} flag on 
> SparkSubmit invocation, making the sensitive information readily available in 
> its UI (see 
> [code|https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java#L248]
>  here).
> This is a JIRA for tracking redaction of sensitive information from 
> SparkSubmit's console output.






[jira] [Commented] (SPARK-20433) Security issue with jackson-databind

2017-07-28 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105951#comment-16105951
 ] 

Andrew Ash commented on SPARK-20433:


As I wrote in that PR, it's 2.6.7.1 of jackson-databind that has the fix, and 
the Jackson project did not publish a corresponding 2.6.7.1 of the other 
components of Jackson.

This affects Spark because a known vulnerable library is on the classpath at 
runtime. So you can only guarantee that Spark isn't vulnerable by removing the 
vulnerable code from the runtime classpath.

Anyways a Jackson bump to a fixed version will likely be picked up by Apache 
Spark the next time Jackson is upgraded so I trust this will get fixed 
eventually regardless of whether Apache takes the hotfix version now or a 
regular release in the future.

> Security issue with jackson-databind
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>  Labels: security
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find a way to get onto a patched version of 
> jackson-databind for the Spark 2.2.0 release.






[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2017-07-28 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105919#comment-16105919
 ] 

Liang-Chi Hsieh commented on SPARK-21274:
-

[~Tagar] I've tried the query on PostgreSQL, the answer of [1, 2, 2] 
intersect_all [1, 2] is [1, 2]. So I think it's correct?

How do we know we need to change the tables when rewriting the intersect query?

> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following outer join:
> {code}
> SELECT a, b, c
> FROM tab1 t1
>   LEFT OUTER JOIN tab2 t2
>   ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
> WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register this second query as a temp view under the name "*t1_except_t2_df*"; 
> it can also be used to compute INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following anti-join using the t1_except_t2_df we defined 
> above:
> {code}
> SELECT a, b, c
> FROM tab1 t1
> WHERE NOT EXISTS (
>   SELECT 1
>   FROM t1_except_t2_df e
>   WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
> )
> {code}
> So the suggestion is just to use the above query rewrites to implement both the 
> EXCEPT ALL and INTERSECT ALL SQL set operations.






[jira] [Commented] (SPARK-21570) File __spark_libs__XXX.zip does not exist on networked file system w/ yarn

2017-07-28 Thread Albert Chu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105900#comment-16105900
 ] 

Albert Chu commented on SPARK-21570:


Oh, and because it will likely be asked and may be relevant: in this test 
setup, HDFS is not used at all.

{noformat}
<property>
  <name>fs.defaultFS</name>
  <value>file:///</value>
</property>
{noformat}

All temp dirs, staging dirs, etc. are configured to appropriate locations in 
/tmp or somewhere in the networked file system.

> File __spark_libs__XXX.zip does not exist on networked file system w/ yarn
> --
>
> Key: SPARK-21570
> URL: https://issues.apache.org/jira/browse/SPARK-21570
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Albert Chu
>
> I have a set of scripts that run Spark with data in a networked file system.  
> One of my unit tests to make sure things don't break between Spark releases 
> is to simply run a word count (via org.apache.spark.examples.JavaWordCount) 
> on a file in the networked file system.  This test broke with Spark 2.2.0 
> when I use yarn to launch the job (using the spark standalone scheduler 
> things still work).  I'm currently using Hadoop 2.7.0.  I get the following 
> error:
> {noformat}
> Diagnostics: File 
> file:/p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
>  does not exist
> java.io.FileNotFoundException: File 
> file:/p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> While debugging, I sat and watched the directory and did see that 
> /p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
>  does show up at some point.
> Wondering if it's possible something racy was introduced.  Nothing in the 
> Spark 2.2.0 release notes suggests any type of configuration change that 
> needs to be done.
> Thanks






[jira] [Created] (SPARK-21570) File __spark_libs__XXX.zip does not exist on networked file system w/ yarn

2017-07-28 Thread Albert Chu (JIRA)
Albert Chu created SPARK-21570:
--

 Summary: File __spark_libs__XXX.zip does not exist on networked 
file system w/ yarn
 Key: SPARK-21570
 URL: https://issues.apache.org/jira/browse/SPARK-21570
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0
Reporter: Albert Chu


I have a set of scripts that run Spark with data in a networked file system.  
One of my unit tests to make sure things don't break between Spark releases is 
to simply run a word count (via org.apache.spark.examples.JavaWordCount) on a 
file in the networked file system.  This test broke with Spark 2.2.0 when I use 
yarn to launch the job (using the spark standalone scheduler things still 
work).  I'm currently using Hadoop 2.7.0.  I get the following error:

{noformat}
Diagnostics: File 
file:/p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
 does not exist
java.io.FileNotFoundException: File 
file:/p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
 does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

While debugging, I sat and watched the directory and did see that 
/p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
 does show up at some point.

Wondering if it's possible something racy was introduced.  Nothing in the Spark 
2.2.0 release notes suggests any type of configuration change that needs to be 
done.

Thanks








[jira] [Created] (SPARK-21569) Internal Spark class needs to be kryo-registered

2017-07-28 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-21569:
-

 Summary: Internal Spark class needs to be kryo-registered
 Key: SPARK-21569
 URL: https://issues.apache.org/jira/browse/SPARK-21569
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Ryan Williams


[Full repro here|https://github.com/ryan-williams/spark-bugs/tree/hf]

As of 2.2.0, {{saveAsNewAPIHadoopFile}} jobs fail (when 
{{spark.kryo.registrationRequired=true}}) with:

{code}
java.lang.IllegalArgumentException: Class is not registered: 
org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage
Note: To register this class use: 
kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class);
at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:458)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:488)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:593)
at 
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

This internal Spark class should be kryo-registered by Spark by default.

This was not a problem in 2.1.1.
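
Until that happens, one possible user-side workaround (a sketch only, not an 
official recipe) is to register the class named in the error message explicitly; 
since the class lives in an internal package, it is looked up by name here:

{code}
import org.apache.spark.SparkConf

// Hedged workaround sketch: explicitly register the internal class reported in
// the error message until Spark registers it by default.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(
    Class.forName("org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage")))
{code}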






[jira] [Commented] (SPARK-20433) Security issue with jackson-databind

2017-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105792#comment-16105792
 ] 

Sean Owen commented on SPARK-20433:
---

You updated to 2.6.7 but indicated above that's still vulnerable. Does it 
contain the fix?

Also how does this affect Spark?

> Security issue with jackson-databind
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>  Labels: security
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find a way to get onto a patched version of 
> jackson-databind for the Spark 2.2.0 release.






[jira] [Comment Edited] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-07-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105630#comment-16105630
 ] 

Mridul Muralidharan edited comment on SPARK-21549 at 7/28/17 10:16 PM:
---

This affects both mapred ("mapred.output.dir") and mapreduce 
("mapreduce.output.fileoutputformat.outputdir") based OutputFormats which do 
not set the referenced properties, and is an incompatibility introduced in 
Spark 2.2.

The workaround is to explicitly set the property to a dummy value (one that is 
valid and writable by the user - say /tmp).

+CC [~WeiqingYang] 




was (Author: mridulm80):
This affects both mapred ("mapred.output.dir") and mapreduce 
("mapreduce.output.fileoutputformat.outputdir") based OutputFormats which do 
not set the referenced properties, and is an incompatibility introduced in 
Spark 2.2.

The workaround is to explicitly set the property to a dummy value (one that is 
valid and writable by the user).



> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete the job correctly in the case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use the standard 
> Hadoop property *mapreduce.output.fileoutputformat.outputdir*.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In those cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.
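
For reference, a minimal sketch of the workaround described in the comment above; 
the dummy path and app name are illustrative, and the master is assumed to come 
from spark-submit:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Workaround sketch only: give the commit protocol a valid, writable dummy
// output directory so it can build its staging paths, even though the custom
// OutputFormat never writes there.
val sc = new SparkContext(new SparkConf().setAppName("custom-output-format-job"))
sc.hadoopConfiguration.set("mapreduce.output.fileoutputformat.outputdir", "/tmp/spark-dummy-output")
sc.hadoopConfiguration.set("mapred.output.dir", "/tmp/spark-dummy-output")
{code}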






[jira] [Comment Edited] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-07-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105630#comment-16105630
 ] 

Mridul Muralidharan edited comment on SPARK-21549 at 7/28/17 10:14 PM:
---

This affects both mapred ("mapred.output.dir") and mapreduce 
("mapreduce.output.fileoutputformat.outputdir") based OutputFormats which do 
not set the referenced properties, and is an incompatibility introduced in 
Spark 2.2.

The workaround is to explicitly set the property to a dummy value (one that is 
valid and writable by the user).




was (Author: mridulm80):

This affects both mapred ("mapred.output.dir") and mapreduce 
("mapreduce.output.fileoutputformat.outputdir") based OutputFormats which do 
not set the referenced properties, and is an incompatibility introduced in 
Spark 2.2.

The workaround is to explicitly set the property to a dummy value.



> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete the job correctly in the case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use the standard 
> Hadoop property *mapreduce.output.fileoutputformat.outputdir*.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In those cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.






[jira] [Commented] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars

2017-07-28 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105772#comment-16105772
 ] 

Andrew Ash commented on SPARK-21563:


And for reference, I added this additional logging to assist in debugging: 
https://github.com/palantir/spark/pull/238

> Race condition when serializing TaskDescriptions and adding jars
> 
>
> Key: SPARK-21563
> URL: https://issues.apache.org/jira/browse/SPARK-21563
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>
> cc [~robert3005]
> I was seeing this exception during some running Spark jobs:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR 
> org.apache.spark.rpc.netty.Inbox - Ignoring error
> java.io.EOFException: null
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readUTF(DataInputStream.java:609)
> at java.io.DataInputStream.readUTF(DataInputStream.java:564)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
> at scala.collection.immutable.Range.foreach(Range.scala:160)
> at 
> org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> After some debugging, we determined that this is due to a race condition in 
> task serde.  cc [~irashid] [~kayousterhout] who last touched that code in 
> SPARK-19796
> The race is between adding additional jars to the SparkContext and 
> serializing the TaskDescription.
> Consider this sequence of events:
> - TaskSetManager creates a TaskDescription using a reference to the 
> SparkContext's jars: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506
> - TaskDescription starts serializing, and begins writing jars: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84
> - the size of the jar map is written out: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63
> - _on another thread_: the application adds a jar to the SparkContext's jars 
> list
> - then the entries in the jars list are serialized out: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64
> The problem now is that the jars list is serialized as having N entries, but 
> actually N+1 entries follow that count!
> This causes task deserialization to fail in the executor, with the stacktrace 
> above.
> The same issue also likely exists for files, though I haven't observed that 
> and our application does not stress that codepath the same way it did for jar 
> additions.
> One fix here is that TaskSetManager could make an immutable copy of the jars 
> list that it passes into the TaskDescription constructor, so that list 
> doesn't change mid-serialization.
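A minimal, self-contained sketch of that proposed fix (illustrative only, not the actual Spark patch): take an immutable snapshot of the mutable jars map before it is handed to the TaskDescription, so the entry count written first always matches the entries written afterwards.

{code}
import scala.collection.mutable

// Hedged illustration of the fix: snapshot a concurrently-mutated map into an
// immutable one before serialization.
val addedJars = mutable.HashMap("spark://host:1234/jars/app.jar" -> 1L) // stands in for sc.addedJars

// TaskSetManager would take this copy before calling new TaskDescription(...)
val jarsSnapshot: Map[String, Long] = addedJars.toMap

addedJars += ("spark://host:1234/jars/late.jar" -> 2L) // analogue of a concurrent sc.addJar

// The snapshot is unaffected, so serializing it cannot race with addJar.
assert(jarsSnapshot.size == 1)
{code}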



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20433) Security issue with jackson-databind

2017-07-28 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105751#comment-16105751
 ] 

Andrew Ash commented on SPARK-20433:


Here's the patch I put in my fork of Spark: 
https://github.com/palantir/spark/pull/241

It addresses CVE-2017-7525 -- http://www.securityfocus.com/bid/99623

> Security issue with jackson-databind
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>  Labels: security
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find a way to get onto a patched version of 
> jackson-databind for the Spark 2.2.0 release.
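For builds that cannot wait for an upgraded Spark, a minimal sbt sketch of forcing a patched jackson-databind onto the application classpath; 2.7.9.1 is one of the fixed versions named above, and its binary compatibility with the Jackson APIs Spark 2.x uses is an assumption to verify, not a guarantee.

{code}
// build.sbt -- hedged sketch: pin jackson-databind to a release containing the
// CVE-2017-7525 fix. Verify compatibility with Spark's own Jackson usage.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.7.9.1"
{code}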



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21568) ConsoleProgressBar should only be enabled in shells

2017-07-28 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-21568:
--

 Summary: ConsoleProgressBar should only be enabled in shells
 Key: SPARK-21568
 URL: https://issues.apache.org/jira/browse/SPARK-21568
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Marcelo Vanzin
Priority: Minor


This is the current logic that enables the progress bar:

{code}
_progressBar =
  if (_conf.getBoolean("spark.ui.showConsoleProgress", true) && 
!log.isInfoEnabled) {
Some(new ConsoleProgressBar(this))
  } else {
None
  }
{code}

That is based on the logging level; it just happens to align with the default 
configuration for shells (WARN) and normal apps (INFO).

But if someone changes the default logging config for their app, this may 
break; they may silence logs by setting the default level to WARN or ERROR, and 
a normal application will see a lot of log spam from the progress bar (which is 
especially bad when output is redirected to a file, as is usually done when 
running in cluster mode).

While it's possible to disable the progress bar separately, this behavior is 
not really expected.
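Until the enablement logic is decoupled from the log level, a minimal sketch of opting out explicitly in a non-shell application, using the configuration key shown in the snippet above (the application name is hypothetical):

{code}
import org.apache.spark.sql.SparkSession

// Hedged sketch: turn the console progress bar off explicitly instead of
// relying on the log level, so changing log4j defaults cannot re-enable it.
val spark = SparkSession.builder()
  .appName("my-batch-app")                          // hypothetical app name
  .config("spark.ui.showConsoleProgress", "false")  // key from the snippet above
  .getOrCreate()
{code}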



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19490) Hive partition columns are case-sensitive

2017-07-28 Thread Taklon Stephen Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105734#comment-16105734
 ] 

Taklon Stephen Wu commented on SPARK-19490:
---

https://github.com/apache/spark/pull/16832 is still open, and cenyuhai@ didn't 
point me to the direct commit of the `fixed PR`. Can we reopen this JIRA, or at 
least let me know if this is still an issue?

> Hive partition columns are case-sensitive
> -
>
> Key: SPARK-19490
> URL: https://issues.apache.org/jira/browse/SPARK-19490
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> The real partitions columns are lower case (year, month, day)
> {code}
> Caused by: java.lang.RuntimeException: Expected only partition pruning 
> predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2472)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85)
>   at 
> org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:213)
>   at 
> org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:261)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:117)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> {code}
> These SQL statements can reproduce this bug:
> CREATE TABLE partition_test (key Int) partitioned by (date string)
> SELECT * FROM partition_test where DATE = '20170101'
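A minimal Scala sketch of running that reproduction through a Hive-enabled session; the SQL is taken verbatim from the description above, and the session setup is an assumption about the reporter's environment:

{code}
import org.apache.spark.sql.SparkSession

// Hedged sketch of the reproduction: the uppercase DATE in the predicate should
// resolve against the lowercase partition column, but triggers the error above.
val spark = SparkSession.builder()
  .appName("spark-19490-repro")   // hypothetical
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE partition_test (key Int) partitioned by (date string)")
spark.sql("SELECT * FROM partition_test where DATE = '20170101'").show()
{code}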



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21567) Dataset with Tuple of type alias throws error

2017-07-28 Thread Tomasz Bartczak (JIRA)
Tomasz Bartczak created SPARK-21567:
---

 Summary: Dataset with Tuple of type alias throws error
 Key: SPARK-21567
 URL: https://issues.apache.org/jira/browse/SPARK-21567
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.1
 Environment: verified for spark 2.1.1 and 2.2.0 in sbt build
Reporter: Tomasz Bartczak


When a map returns a value that is a tuple containing another tuple defined 
as a type alias, we receive an error.

minimal reproducible case:

having a structure like this:
{code}
object C {
  type TwoInt = (Int,Int)
  def tupleTypeAlias: TwoInt = (1,1)
}
{code}

when I do:
{code}
Seq(1).toDS().map(_ => ("",C.tupleTypeAlias))
{code}


I get exception:
{code}
type T1 is not a class
scala.ScalaReflectionException: type T1 is not a class
at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275)
at 
scala.reflect.internal.Symbols$SymbolContextApiImpl.asClass(Symbols.scala:84)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:682)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:84)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:614)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at 
org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
{code}

In Spark 2.1.1 the last exception was 'head of an empty list'.
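One possible workaround to try until encoder derivation handles type aliases, assuming a hand-built tuple encoder sidesteps the reflection on the alias (untested here; spark.implicits._ is assumed to be in scope, as in the snippet above):

{code}
import org.apache.spark.sql.Encoders

// Hedged sketch: construct the encoder explicitly so ScalaReflection never has
// to derive it from the aliased type.
val enc = Encoders.tuple(
  Encoders.STRING,
  Encoders.tuple(Encoders.scalaInt, Encoders.scalaInt))

val ds = Seq(1).toDS().map(_ => ("", C.tupleTypeAlias))(enc)
{code}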



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21566) Python method for summary

2017-07-28 Thread Andrew Ray (JIRA)
Andrew Ray created SPARK-21566:
--

 Summary: Python method for summary
 Key: SPARK-21566
 URL: https://issues.apache.org/jira/browse/SPARK-21566
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.3.0
Reporter: Andrew Ray


Add a Python method for the Dataset summary API that was added in SPARK-21100.
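For context, a minimal sketch of the Scala Dataset.summary API from SPARK-21100 that the new Python method would mirror; the example column is hypothetical and an existing SparkSession named spark is assumed.

{code}
import org.apache.spark.sql.functions.rand

// Hedged sketch of the Scala API (Spark 2.3.0) the PySpark wrapper would expose.
val df = spark.range(100).withColumn("value", rand())
df.summary().show()                                     // default statistics
df.summary("count", "min", "25%", "75%", "max").show()  // selected statistics
{code}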



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-07-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105630#comment-16105630
 ] 

Mridul Muralidharan commented on SPARK-21549:
-


This affects both mapred ("mapred.output.dir") and mapreduce 
("mapreduce.output.fileoutputformat.outputdir") based OutputFormats which do 
not set the referenced properties, and it is an incompatibility introduced in 
Spark 2.2.

The workaround is to explicitly set the property to a dummy value, as in the sketch below.
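A minimal sketch of that workaround for a pair RDD written with the new (mapreduce) API; rdd and sc are assumed to exist and to already be configured for the custom OutputFormat, and the dummy path is arbitrary:

{code}
import org.apache.hadoop.conf.Configuration

// Hedged sketch: give the property a harmless dummy value so the commit protocol
// can build a Path even though the custom OutputFormat never uses it.
val jobConf = new Configuration(sc.hadoopConfiguration)
jobConf.set("mapreduce.output.fileoutputformat.outputdir", "/tmp/spark-dummy-output")
// For old-API (mapred) OutputFormats the corresponding key is "mapred.output.dir".

rdd.saveAsNewAPIHadoopDataset(jobConf)
{code}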



> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete jobs correctly in the case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use the standard 
> Hadoop property *mapreduce.output.fileoutputformat.outputdir*.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In such cases, when the job completes, the following exception is thrown
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:135)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all jobs which use OutputFormats that don't write data 
> into HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21565) aggregate query fails with watermark on eventTime but works with watermark on timestamp column generated by current_timestamp

2017-07-28 Thread Amit Assudani (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Assudani updated SPARK-21565:
--
Description: 
*Short Description:*

An aggregation query fails with eventTime as the watermark column, while it 
works with a newTimeStamp column generated by running SQL with current_timestamp.

*Exception:*

Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:204)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:172)
at 
org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70)
at 
org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

*Code to replicate:*

package test

import java.nio.file.{Files, Path, Paths}
import java.text.SimpleDateFormat

import org.apache.spark.sql.types._
import org.apache.spark.sql.{SparkSession}

import scala.collection.JavaConverters._

object Test1 {

  def main(args: Array[String]) {

val sparkSession = SparkSession
  .builder()
  .master("local[*]")
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

    val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val checkpointPath = "target/cp1"
val newEventsPath = Paths.get("target/newEvents/").toAbsolutePath
delete(newEventsPath)
delete(Paths.get(checkpointPath).toAbsolutePath)
Files.createDirectories(newEventsPath)


val dfNewEvents= newEvents(sparkSession)
dfNewEvents.createOrReplaceTempView("dfNewEvents")

//The below works - Start
//val dfNewEvents2 = sparkSession.sql("select *,current_timestamp as 
newTimeStamp from dfNewEvents ").withWatermark("newTimeStamp","2 seconds")
//dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
//val groupEvents = sparkSession.sql("select symbol,newTimeStamp, 
count(price) as count1 from dfNewEvents2 group by symbol,newTimeStamp")
// End


//The below doesn't work - Start
val dfNewEvents2 = sparkSession.sql("select * from dfNewEvents 
").withWatermark("eventTime","2 seconds")
 dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
  val groupEvents = sparkSession.sql("select symbol,eventTime, count(price) 
as count1 from dfNewEvents2 group by symbol,eventTime")
// - End


val query1 = groupEvents.writeStream
  .outputMode("append")
.format("console")
  .option("checkpointLocation", checkpointPath)
  .start("./myop")

val newEventFile1=newEventsPath.resolve("eventNew1.json")
Files.write(newEventFile1, List(
  """{"symbol": 
"GOOG","price":100,"eventTime":"2017-07-25T16:00:00.000-04:00"}""",
  """{"symbol": 
"GOOG","price":200,"eventTime":"2017-07-25T16:00:00.000-04:00"}"""
).toIterable.asJava)
query1.processAllAvailable()

sparkSession.streams.awaitAnyTermination(1)

  }

  private def newEvents(sparkSession: SparkSession) = {
val newEvents = Paths.get("target/newEvents/").toAbsolutePath
delete(newEvents)
Files.createDirectories(newEvents)

val dfNewEvents = 
sparkSession.readStream.schema(eventsSchema).json(newEvents.toString)//.withWatermark("eventTime","2
 seconds")
dfNewEvents
  }

  private val eventsSchema = StructType(List(
StructField("symbol", StringType, true),
StructField("price", DoubleType, true),
StructField("eventTime", TimestampType, false)
  ))

  private def delete(dir: Path) = {
if(Files.exists(dir)) {
  Files.walk(dir).iterator().asScala.toList
.map(p => p.toFile)
.sortWith((o1, o2) => o1.compareTo(o2) > 0)
.foreach(_.delete)
}
  }

}




  was:
Aggregation query fails with eventTime as watermark column while works with 
newTimeStamp column generated by running SQL with current_timestamp,

Exception:

Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at 

[jira] [Created] (SPARK-21565) aggregate query fails with watermark on eventTime but works with watermark on timestamp column generated by current_timestamp

2017-07-28 Thread Amit Assudani (JIRA)
Amit Assudani created SPARK-21565:
-

 Summary: aggregate query fails with watermark on eventTime but 
works with watermark on timestamp column generated by current_timestamp
 Key: SPARK-21565
 URL: https://issues.apache.org/jira/browse/SPARK-21565
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Amit Assudani


An aggregation query fails with eventTime as the watermark column, while it 
works with a newTimeStamp column generated by running SQL with current_timestamp.

Exception:

Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:204)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:172)
at 
org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70)
at 
org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

Code to replicate:

package test

import java.nio.file.{Files, Path, Paths}
import java.text.SimpleDateFormat

import org.apache.spark.sql.types._
import org.apache.spark.sql.{SparkSession}

import scala.collection.JavaConverters._

object Test1 {

  def main(args: Array[String]) {

val sparkSession = SparkSession
  .builder()
  .master("local[*]")
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

    val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val checkpointPath = "target/cp1"
val newEventsPath = Paths.get("target/newEvents/").toAbsolutePath
delete(newEventsPath)
delete(Paths.get(checkpointPath).toAbsolutePath)
Files.createDirectories(newEventsPath)


val dfNewEvents= newEvents(sparkSession)
dfNewEvents.createOrReplaceTempView("dfNewEvents")

//The below works - Start
//val dfNewEvents2 = sparkSession.sql("select *,current_timestamp as 
newTimeStamp from dfNewEvents ").withWatermark("newTimeStamp","2 seconds")
//dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
//val groupEvents = sparkSession.sql("select symbol,newTimeStamp, 
count(price) as count1 from dfNewEvents2 group by symbol,newTimeStamp")
// End


//The below doesn't work - Start
val dfNewEvents2 = sparkSession.sql("select * from dfNewEvents 
").withWatermark("eventTime","2 seconds")
 dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
  val groupEvents = sparkSession.sql("select symbol,eventTime, count(price) 
as count1 from dfNewEvents2 group by symbol,eventTime")
// - End


val query1 = groupEvents.writeStream
  .outputMode("append")
.format("console")
  .option("checkpointLocation", checkpointPath)
  .start("./myop")

val newEventFile1=newEventsPath.resolve("eventNew1.json")
Files.write(newEventFile1, List(
  """{"symbol": 
"GOOG","price":100,"eventTime":"2017-07-25T16:00:00.000-04:00"}""",
  """{"symbol": 
"GOOG","price":200,"eventTime":"2017-07-25T16:00:00.000-04:00"}"""
).toIterable.asJava)
query1.processAllAvailable()

sparkSession.streams.awaitAnyTermination(1)

  }

  private def newEvents(sparkSession: SparkSession) = {
val newEvents = Paths.get("target/newEvents/").toAbsolutePath
delete(newEvents)
Files.createDirectories(newEvents)

val dfNewEvents = 
sparkSession.readStream.schema(eventsSchema).json(newEvents.toString)//.withWatermark("eventTime","2
 seconds")
dfNewEvents
  }

  private val eventsSchema = StructType(List(
StructField("symbol", StringType, true),
StructField("price", DoubleType, true),
StructField("eventTime", TimestampType, false)
  ))

  private def delete(dir: Path) = {
if(Files.exists(dir)) {
  Files.walk(dir).iterator().asScala.toList
.map(p => p.toFile)
.sortWith((o1, o2) => o1.compareTo(o2) > 0)
.foreach(_.delete)
}
  }

}






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (SPARK-21564) TaskDescription decoding failure should fail the task

2017-07-28 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-21564:
---
Description: 
cc [~robert3005]

I was seeing an issue where Spark was throwing this exception:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

For details on the cause of that exception, see SPARK-21563

We've since changed the application and have a proposed fix in Spark at the 
ticket above, but it was troubling that decoding the TaskDescription wasn't 
failing the tasks.  So the Spark job ended up hanging and making no progress, 
permanently stuck, because the driver thinks the task is running but the thread 
has died in the executor.

We should make a change around 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
 so that when that decode throws an exception, the task is marked as failed.

cc [~kayousterhout] [~irashid]

  was:
I was seeing an issue where Spark was throwing this exception:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

For details on the cause of that exception, see SPARK-21563

We've since changed the application and have a proposed fix in Spark at the 
ticket above, but it was troubling that decoding the TaskDescription wasn't 
failing the tasks.  So the Spark job ended up hanging and making no progress, 
permanently stuck, because the driver thinks the task is running but the thread 
has died in the executor.

We should make a change around 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
 so that when that decode throws an exception, the task is marked as failed.

cc [~kayousterhout] [~irashid]


> TaskDescription decoding failure should fail the task
> -
>
> Key: SPARK-21564
> URL: https://issues.apache.org/jira/browse/SPARK-21564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>
> cc [~robert3005]
> I was seeing an issue where Spark was throwing this exception:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR 
> org.apache.spark.rpc.netty.Inbox - 

[jira] [Updated] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars

2017-07-28 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-21563:
---
Description: 
cc [~robert3005]

I was seeing this exception during some running Spark jobs:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

After some debugging, we determined that this is due to a race condition in 
task serde.  cc [~irashid] [~kayousterhout] who last touched that code in 
SPARK-19796

The race is between adding additional jars to the SparkContext and serializing 
the TaskDescription.

Consider this sequence of events:

- TaskSetManager creates a TaskDescription using a reference to the 
SparkContext's jars: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506
- TaskDescription starts serializing, and begins writing jars: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84
- the size of the jar map is written out: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63
- _on another thread_: the application adds a jar to the SparkContext's jars 
list
- then the entries in the jars list are serialized out: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64

The problem now is that the jars list is serialized as having N entries, but 
actually N+1 entries follow that count!

This causes task deserialization to fail in the executor, with the stacktrace 
above.

The same issue also likely exists for files, though I haven't observed that and 
our application does not stress that codepath the same way it did for jar 
additions.

One fix here is that TaskSetManager could make an immutable copy of the jars 
list that it passes into the TaskDescription constructor, so that list doesn't 
change mid-serialization.

  was:
cc [~robert3005]

I was seeing this exception during some running Spark jobs:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

After some debugging, we determined that this is due to a race condition in 
task serde introduced in SPARK-19796.  cc [~irashid] [~kayousterhout]

The race is between adding additional jars to the SparkContext and serializing 
the 

[jira] [Created] (SPARK-21564) TaskDescription decoding failure should fail the task

2017-07-28 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21564:
--

 Summary: TaskDescription decoding failure should fail the task
 Key: SPARK-21564
 URL: https://issues.apache.org/jira/browse/SPARK-21564
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Andrew Ash


I was seeing an issue where Spark was throwing this exception:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

For details on the cause of that exception, see SPARK-21563

We've since changed the application and have a proposed fix in Spark at the 
ticket above, but it was troubling that decoding the TaskDescription wasn't 
failing the tasks.  So the Spark job ended up hanging and making no progress, 
permanently stuck, because the driver thinks the task is running but the thread 
has died in the executor.

We should make a change around 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
 so that when that decode throws an exception, the task is marked as failed.

cc [~kayousterhout] [~irashid]
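A rough sketch of the kind of guard meant here, around the LaunchTask handling in the file linked above. It is illustrative only: a complete fix would report a task failure to the driver, and the task ID needed for that is itself part of the payload that failed to decode, so this sketch falls back to exiting the executor rather than hanging silently.

{code}
import scala.util.control.NonFatal

// Hedged sketch, not the actual patch: fail loudly when decode blows up instead
// of letting the RPC dispatcher swallow the error and leave the task "running".
case LaunchTask(data) =>
  if (executor == null) {
    exitExecutor(1, "Received LaunchTask command but executor was null")
  } else {
    try {
      val taskDesc = TaskDescription.decode(data.value)
      executor.launchTask(this, taskDesc)
    } catch {
      case NonFatal(e) =>
        exitExecutor(1, "Unable to decode TaskDescription", e)
    }
  }
{code}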



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars

2017-07-28 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21563:
--

 Summary: Race condition when serializing TaskDescriptions and 
adding jars
 Key: SPARK-21563
 URL: https://issues.apache.org/jira/browse/SPARK-21563
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Andrew Ash


cc [~robert3005]

I was seeing this exception during some running Spark jobs:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

After some debugging, we determined that this is due to a race condition in 
task serde introduced in SPARK-19796.  cc [~irashid] [~kayousterhout]

The race is between adding additional jars to the SparkContext and serializing 
the TaskDescription.

Consider this sequence of events:

- TaskSetManager creates a TaskDescription using a reference to the 
SparkContext's jars: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506
- TaskDescription starts serializing, and begins writing jars: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84
- the size of the jar map is written out: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63
- _on another thread_: the application adds a jar to the SparkContext's jars 
list
- then the entries in the jars list are serialized out: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64

The problem now is that the jars list is serialized as having N entries, but 
actually N+1 entries follow that count!

This causes task deserialization to fail in the executor, with the stacktrace 
above.

The same issue also likely exists for files, though I haven't observed that and 
our application does not stress that codepath the same way it did for jar 
additions.

One fix here is that TaskSetManager could make an immutable copy of the jars 
list that it passes into the TaskDescription constructor, so that list doesn't 
change mid-serialization.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-07-28 Thread James Conner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105528#comment-16105528
 ] 

James Conner commented on SPARK-18016:
--

Thank you for letting me know, Kazuaki!  Please let me know if you need any 
debug or crash information.

The shape of the data that I am using is as follows:
* 1   x StringType (ID)
* 1   x VectorType (features)
* 2656 x DoubleType (SCORE, and feature_{1..2655})

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>

[jira] [Commented] (SPARK-19720) Redact sensitive information from SparkSubmit console output

2017-07-28 Thread Diogo Munaro Vieira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105514#comment-16105514
 ] 

Diogo Munaro Vieira commented on SPARK-19720:
-

Yes, but it's a major security bug as described here. Shouldn't it be ported 
to 2.1.2?

> Redact sensitive information from SparkSubmit console output
> 
>
> Key: SPARK-19720
> URL: https://issues.apache.org/jira/browse/SPARK-19720
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 took care of redacting sensitive information from Spark event 
> logs and UI. However, it intentionally didn't bother redacting the same 
> sensitive information from SparkSubmit's console output because it was on the 
> client's machine, which already had the sensitive information on disk (in 
> spark-defaults.conf) or on terminal (spark-submit command line).
> However, it seems now that it's better to redact information from 
> SparkSubmit's console output as well because orchestration software like 
> Oozie usually expose SparkSubmit's console output via a UI. To make matters 
> worse, Oozie, in particular, always sets the {{--verbose}} flag on 
> SparkSubmit invocation, making the sensitive information readily available in 
> its UI (see 
> [code|https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java#L248]
>  here).
> This is a JIRA for tracking redaction of sensitive information from 
> SparkSubmit's console output.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105505#comment-16105505
 ] 

Wei Chen commented on SPARK-21562:
--

Thanks for the comments, I will close this one.




> Spark may request extra containers if the rpc between YARN and spark is too 
> fast
> 
>
> Key: SPARK-21562
> URL: https://issues.apache.org/jira/browse/SPARK-21562
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Wei Chen
>  Labels: YARN
>
> hi guys,
> I found an interesting problem when Spark tries to request containers from 
> YARN. 
> Here is the case:
> In YarnAllocator.scala
> 1. This function requests containers from YARN only if there are executors 
> that have not yet been requested. 
> {code:java}def updateResourceRequests(): Unit = {
> val pendingAllocate = getPendingAllocate
> val numPendingAllocate = pendingAllocate.size
> val missing = targetNumExecutors - numPendingAllocate - 
> numExecutorsRunning
>   
> if (missing > 0) {
>  ..
> }
>   .
> }
> {code}
> 2. After the requested containers are allocated (granted through RPC), it 
> updates the pending queues
>   
> {code:java}
> private def matchContainerToRequest(
>   allocatedContainer: Container,
>   location: String,
>   containersToUse: ArrayBuffer[Container],
>   remaining: ArrayBuffer[Container]): Unit = {
>   .
>  
>amClient.removeContainerRequest(containerRequest) //update pending queues
>
>.
> }
> {code}
> 3. After the allocated containers are launched, it updates the running 
> queue
> {code:java}
> private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
> Unit = {
> for (container <- containersToUse) {
>  
> launcherPool.execute(new Runnable {
> override def run(): Unit = {
>   try {
> new ExecutorRunnable(
>   Some(container),
>   conf,
>   sparkConf,
>   driverUrl,
>   executorId,
>   executorHostname,
>   executorMemory,
>   executorCores,
>   appAttemptId.getApplicationId.toString,
>   securityMgr,
>   localResources
> ).run()
> logInfo(s"has launched $containerId")
> updateInternalState()   //update running queues
>  
>   
> } 
> }{code}
> However, in step 3 a thread is launched that first runs ExecutorRunnable and 
> only then updates the running queue. We found it can take almost 1 second 
> before the function that updates the running queue (updateInternalState()) is 
> called. So there is an inconsistent window here: the pending queue has been 
> updated but the running queue has not, because the launching thread has not 
> reached updateInternalState() yet. If an RPC call to amClient.allocate() 
> happens during this inconsistent interval, more executors than 
> targetNumExecutors would be requested.
> {noformat}
> Here is an example:
> Initial:
> targetNumExecutors  numPendingAllocate numExecutorsRunning
> 1  00
> After first RPC call to amClient.allocate:
> targetNumExecutors  numPendingAllocate numExecutorsRunning
> 1  1 0
> After the first allocated container is granted by YARN
> targetNumExecutors  numPendingAllocate numExecutorsRunning
> 1  0(is removed in step 2)  0
> =>if there is a RPC call here to amClient.allocate(), then more 
> containers are requested,
> however this situation is caused by the inconsistent state.
> After the container is launched in step 3
> targetNumExecutors  numPendingAllocate numExecutorsRunning
> 1   01
> {noformat}
> ===
> I found this problem because I am changing requestType to test some features 
> on YARN's opportunistic containers (e.g., allocation takes 100ms), which is 
> much faster than guaranteed containers (e.g., allocation takes almost 1s).
> I am not sure if my understanding is correct.
> I'd appreciate anyone's help on this issue (please correct me if I have 
> misunderstood).
> Wei



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


[jira] [Commented] (SPARK-19938) java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field

2017-07-28 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105449#comment-16105449
 ] 

Ryan Blue commented on SPARK-19938:
---

[~snavatski], I just hit this problem also and found out what causes it. 
There's a [similar issue with Java serialization on a SO 
question|https://stackoverflow.com/questions/9110677/readresolve-not-working-an-instance-of-guavas-serializedform-appears/18647941]
 that I found helpful.

The cause of this is one of two problems during deserialization:

# The classloader can't find the class of objects in the list
# The classloader used by Java deserialization differs from the one that loaded 
the class of objects in the list

These cases end up causing the deserialization code to take a path where 
{{readResolve}} isn't called on the list's {{SerializationProxy}}. When the 
list is set on the object that contains it, the type doesn't match and you get 
this exception.

To fix this problem, check the following things (a small configuration sketch follows the list):

* Make sure Jars loaded on the driver are in the executor's classpath
* Make sure Jars provided by Spark aren't included in your application (to 
avoid loading with different classloaders).
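
For the first point, a minimal configuration sketch; the jar path below is a 
placeholder for an application assembly that contains your classes and their 
dependencies but leaves Spark's own jars out:

{code:scala}
// Minimal sketch, assuming "/path/to/app-only-assembly.jar" is a placeholder for
// an assembly that bundles the application classes but treats Spark as "provided".
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("classloader-check")
  .config("spark.jars", "/path/to/app-only-assembly.jar") // shipped to every executor
  .getOrCreate()
{code}

The same effect can be achieved with --jars on spark-submit; either way, the 
classes inside the serialized list must be loadable by the classloader the 
executors use for tasks.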


> java.lang.ClassCastException: cannot assign instance of 
> scala.collection.immutable.List$SerializationProxy to field
> ---
>
> Key: SPARK-19938
> URL: https://issues.apache.org/jira/browse/SPARK-19938
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.0.2
>Reporter: srinivas thallam
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105439#comment-16105439
 ] 

Paul Wu commented on SPARK-17614:
-

Oh, sorry. I thought I could use a query here as I do with other RDBMSs. 
Things become more complicated for this Cassandra case now that I think more about 
it. I'll accept your comment. 

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset<Row> jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List<Row> rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting
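
Since the 2.1.0 fix, the schema probe is overridable per dialect via 
JdbcDialect.getSchemaQuery. A minimal sketch of a Cassandra-specific dialect; the 
jdbc:cassandra: URL prefix and the LIMIT 1 probe are assumptions about the wrapper 
driver, not taken from this ticket:

{code:scala}
// Sketch of a custom dialect, assuming the cassandra-jdbc-wrapper uses URLs that
// start with "jdbc:cassandra:" and that a "LIMIT 1" probe is acceptable for
// resolving the schema (both are assumptions).
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

object CassandraJdbcDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:cassandra:")

  // Spark only needs the ResultSet metadata from this query, so a cheap CQL-legal
  // query replaces the default "SELECT * FROM $table WHERE 1=0" probe.
  override def getSchemaQuery(table: String): String =
    s"SELECT * FROM $table LIMIT 1"
}

// Register the dialect before calling sparkSession.read().jdbc(...)
JdbcDialects.registerDialect(CassandraJdbcDialect)
{code}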



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105431#comment-16105431
 ] 

Sean Owen commented on SPARK-17614:
---

Well, it's unrelated to this issue, so this isn't the place. And you seem to be 
reporting syntax that Cassandra doesn't support, which isn't a Spark issue. 

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19720) Redact sensitive information from SparkSubmit console output

2017-07-28 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105427#comment-16105427
 ] 

Mark Grover commented on SPARK-19720:
-

I wasn't planning on it. One could argue that this could be backported to 
branch-2.1, given that it's a rather simple change. However, 2.2 brought in some 
changes that were long overdue - dropping support for Java 7, Hadoop 2.5, etc. - and 
even if we got this change backported, you wouldn't be able to make use of the 
goodness down the road if you didn't upgrade to Hadoop 2.6, Java 8, etc. So my 
recommendation here would be to brave the new world of Hadoop 2.6.

> Redact sensitive information from SparkSubmit console output
> 
>
> Key: SPARK-19720
> URL: https://issues.apache.org/jira/browse/SPARK-19720
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 took care of redacting sensitive information from Spark event 
> logs and UI. However, it intentionally didn't bother redacting the same 
> sensitive information from SparkSubmit's console output because it was on the 
> client's machine, which already had the sensitive information on disk (in 
> spark-defaults.conf) or on terminal (spark-submit command line).
> However, it seems now that it's better to redact information from 
> SparkSubmit's console output as well because orchestration software like 
> Oozie usually expose SparkSubmit's console output via a UI. To make matters 
> worse, Oozie, in particular, always sets the {{--verbose}} flag on 
> SparkSubmit invocation, making the sensitive information readily available in 
> its UI (see 
> [code|https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java#L248]
>  here).
> This is a JIRA for tracking redaction of sensitive information from 
> SparkSubmit's console output.
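
For reference, the redaction added by SPARK-18535 is driven by the 
spark.redaction.regex setting, whose default pattern matches configuration keys 
containing "secret" or "password". A small illustrative sketch with a hypothetical 
extended pattern:

{code:scala}
// Illustrative sketch only: values of configuration *keys* matching
// spark.redaction.regex are redacted; the pattern below merely extends the
// default "(?i)secret|password" and is hypothetical.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.hadoop.fs.s3a.secret.key", "not-a-real-secret") // would be redacted
  .set("spark.redaction.regex", "(?i)secret|password|token")
{code}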



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21546) dropDuplicates with watermark yields RuntimeException due to binding failure

2017-07-28 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105426#comment-16105426
 ] 

Shixiong Zhu commented on SPARK-21546:
--

Yeah, good catch. The watermark column should be one of the dropDuplicates 
columns. Otherwise, it never evicts state.
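
A minimal sketch of that point, assuming a streaming DataFrame with the value and 
eventtime (timestamp) columns from the report:

{code:scala}
// Keep the watermark column in the dropDuplicates key so that state older than
// the watermark can be evicted. `df` is assumed to be a streaming DataFrame with
// `value` and `eventtime` columns, as in the report.
import org.apache.spark.sql.DataFrame

def dedupe(df: DataFrame): DataFrame =
  df.withWatermark("eventtime", "30 seconds")
    .dropDuplicates("value", "eventtime")
{code}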

> dropDuplicates with watermark yields RuntimeException due to binding failure
> 
>
> Key: SPARK-21546
> URL: https://issues.apache.org/jira/browse/SPARK-21546
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>
> With today's master...
> The following streaming query with watermark and {{dropDuplicates}} yields 
> {{RuntimeException}} due to failure in binding.
> {code}
> val topic1 = spark.
>   readStream.
>   format("kafka").
>   option("subscribe", "topic1").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   option("startingoffsets", "earliest").
>   load
> val records = topic1.
>   withColumn("eventtime", 'timestamp).  // <-- just to put the right name 
> given the purpose
>   withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // 
> <-- use the renamed eventtime column
>   dropDuplicates("value").  // dropDuplicates will use watermark
> // only when eventTime column exists
>   // include the watermark column => internal design leak?
>   select('key cast "string", 'value cast "string", 'eventtime).
>   as[(String, String, java.sql.Timestamp)]
> scala> records.explain
> == Physical Plan ==
> *Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS 
> value#170, eventtime#157-T3ms]
> +- StreamingDeduplicate [value#1], 
> StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0),
>  0
>+- Exchange hashpartitioning(value#1, 200)
>   +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds
>  +- *Project [key#0, value#1, timestamp#5 AS eventtime#157]
> +- StreamingRelation kafka, [key#0, value#1, topic#2, 
> partition#3, offset#4L, timestamp#5, timestampType#6]
> import org.apache.spark.sql.streaming.{OutputMode, Trigger}
> val sq = records.
>   writeStream.
>   format("console").
>   option("truncate", false).
>   trigger(Trigger.ProcessingTime("10 seconds")).
>   queryName("from-kafka-topic1-to-console").
>   outputMode(OutputMode.Update).
>   start
> {code}
> {code}
> ---
> Batch: 0
> ---
> 17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID 
> 438)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: eventtime#157-T3ms
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370)
>   at 
> 

[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105425#comment-16105425
 ] 

Paul Wu commented on SPARK-17614:
-

So should I create a new issue? Or is this not an issue to you?

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21523) Fix bug of strong wolfe linesearch `init` parameter lose effectiveness

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21523:
--
Target Version/s: 2.2.1
Priority: Critical  (was: Minor)

I think this is fairly critical actually -- would like to get this into a 2.2.1 
release.

> Fix bug of strong wolfe linesearch `init` parameter lose effectiveness
> --
>
> Key: SPARK-21523
> URL: https://issues.apache.org/jira/browse/SPARK-21523
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Critical
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We need to merge this breeze bugfix into Spark because it influences a series of 
> algorithms in MLlib which use LBFGS.
> https://github.com/scalanlp/breeze/pull/651



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105402#comment-16105402
 ] 

Paul Wu edited comment on SPARK-17614 at 7/28/17 6:14 PM:
--

The fix does not support the syntax  like this:

{{.jdbc(JDBC_URL, "(select * from emp where empid>2)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}


was (Author: zwu@gmail.com):
The fix does not support the syntax  like this:

{{.jdbc(JDBC_URL, "(select * from emp)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   

[jira] [Resolved] (SPARK-21561) spark-streaming-kafka-010 DSteam is not pulling anything from Kafka

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21561.
---
Resolution: Invalid

This isn't a place to ask for input on your code -- you'd have to show a 
reproducible bug here that you've narrowed down

> spark-streaming-kafka-010 DSteam is not pulling anything from Kafka
> ---
>
> Key: SPARK-21561
> URL: https://issues.apache.org/jira/browse/SPARK-21561
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
>Reporter: Vlad Badelita
>  Labels: kafka-0.10, spark-streaming
>
> I am trying to use spark-streaming-kafka-0.10 to pull messages from a kafka 
> topic(broker version 0.10). I have checked that messages are being produced 
> and used a KafkaConsumer to pull them successfully. Now, when I try to use 
> the spark streaming api, I am not getting anything. If I just use 
> KafkaUtils.createRDD and specify some offset ranges manually it works. But 
> when, I try to use createDirectStream, all the rdds are empty and when I 
> check the partition offsets it simply reports that all partitions are 0. Here 
> is what I tried:
> {code:scala}
>  val sparkConf = new SparkConf().setAppName("kafkastream")
>  val ssc = new StreamingContext(sparkConf, Seconds(3))
>  val topics = Array("my_topic")
>  val kafkaParams = Map[String, Object](
>"bootstrap.servers" -> "hostname:6667"
>"key.deserializer" -> classOf[StringDeserializer],
>"value.deserializer" -> classOf[StringDeserializer],
>"group.id" -> "my_group",
>"auto.offset.reset" -> "earliest",
>"enable.auto.commit" -> (true: java.lang.Boolean)
>  )
>  val stream = KafkaUtils.createDirectStream[String, String](
>ssc,
>PreferConsistent,
>Subscribe[String, String](topics, kafkaParams)
>  )
>  stream.foreachRDD { rdd =>
>val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>rdd.foreachPartition { iter =>
>  val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
>  println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
>}
>val rddCount = rdd.count()
>println("rdd count: ", rddCount)
>// stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
>  }
>  ssc.start()
>  ssc.awaitTermination()
> {code}
> All partitions show offset ranges from 0 to 0 and all rdds are empty. I would 
> like it to start from the beginning of a partition but also pick up 
> everything that is being produced to it.
> I have also tried using spark-streaming-kafka-0.8 and it does work. I think 
> it is a 0.10 issue because everything else works fine. Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17614.
---
Resolution: Fixed

[~zwu@gmail.com] please don't reopen this

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-17614.
-

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105402#comment-16105402
 ] 

Paul Wu edited comment on SPARK-17614 at 7/28/17 6:09 PM:
--

The fix does not support the syntax  like this:

{{.jdbc(JDBC_URL, "(select * from emp)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}


was (Author: zwu@gmail.com):
The fix does not support the syntax on the syntax like this:

{{.jdbc(JDBC_URL, "(select * from emp)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   

[jira] [Reopened] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Paul Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Wu reopened SPARK-17614:
-

The fix does not support syntax like this:

{{.jdbc(JDBC_URL, "(select * from emp)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105402#comment-16105402
 ] 

Paul Wu edited comment on SPARK-17614 at 7/28/17 6:08 PM:
--

The fix does not support the syntax on the syntax like this:

{{.jdbc(JDBC_URL, "(select * from emp)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}


was (Author: zwu@gmail.com):
The fix does not support the syntax on the syntax like this:

{{.jdbc(JDBC_URL, "(select * from emp)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> 

[jira] [Resolved] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21562.

Resolution: Duplicate

> Spark may request extra containers if the rpc between YARN and spark is too 
> fast
> 
>
> Key: SPARK-21562
> URL: https://issues.apache.org/jira/browse/SPARK-21562
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Wei Chen
>  Labels: YARN
>
> hi guys,
> I found an interesting problem when Spark tries to request containers from 
> YARN. 
> Here is the case:
> In YarnAllocator.scala
> 1. This function requests containers from YARN only if there are executors that 
> have not been requested yet. 
> {code:java}def updateResourceRequests(): Unit = {
> val pendingAllocate = getPendingAllocate
> val numPendingAllocate = pendingAllocate.size
> val missing = targetNumExecutors - numPendingAllocate - 
> numExecutorsRunning
>   
> if (missing > 0) {
>  ..
> }
>   .
> }
> {code}
> 2. After the requested containers are allocated(granted through RPC), then it 
> will update the pending queues
>   
> {code:java}
> private def matchContainerToRequest(
>   allocatedContainer: Container,
>   location: String,
>   containersToUse: ArrayBuffer[Container],
>   remaining: ArrayBuffer[Container]): Unit = {
>   .
>  
>amClient.removeContainerRequest(containerRequest) //update pending queues
>
>.
> }
> {code}
> 3. After the allocated containers are launched, it will update the running 
> queue
> {code:java}
> private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
> Unit = {
> for (container <- containersToUse) {
>  
> launcherPool.execute(new Runnable {
> override def run(): Unit = {
>   try {
> new ExecutorRunnable(
>   Some(container),
>   conf,
>   sparkConf,
>   driverUrl,
>   executorId,
>   executorHostname,
>   executorMemory,
>   executorCores,
>   appAttemptId.getApplicationId.toString,
>   securityMgr,
>   localResources
> ).run()
> logInfo(s"has launched $containerId")
> updateInternalState()   //update running queues
>  
>   
> } 
> }{code}
> However, step 3 launches a thread that first runs ExecutorRunnable and only then 
> updates the running queue. We found it can take almost 1 second before the 
> running-queue update (updateInternalState()) is called. This leaves an 
> inconsistent window: the pending queue has already been updated, but the running 
> queue has not, because the launching thread has not reached 
> updateInternalState() yet. If an RPC call to amClient.allocate() happens inside 
> this window, more executors than targetNumExecutors would be requested.
> {noformat}
> Here is an example:
> Initial:
> targetNumExecutors  numPendingAllocate        numExecutorsRunning
> 1                   0                         0
> After the first RPC call to amClient.allocate:
> targetNumExecutors  numPendingAllocate        numExecutorsRunning
> 1                   1                         0
> After the first allocated container is granted by YARN:
> targetNumExecutors  numPendingAllocate        numExecutorsRunning
> 1                   0 (removed in step 2)     0
> => if there is an RPC call to amClient.allocate() here, then more 
> containers are requested;
> however this situation is caused by the inconsistent state.
> After the container is launched in step 3:
> targetNumExecutors  numPendingAllocate        numExecutorsRunning
> 1                   0                         1
> {noformat}
> ===
> I found this problem because I am changing requestType to test some features 
> on YARN's opportunistic containers (e.g., allocation takes ~100 ms), which are 
> much faster than guaranteed containers (e.g., allocation takes almost 1 s).
> I am not sure whether my understanding is correct.
> I would appreciate anyone's help with this issue (please correct me if I have 
> misunderstood anything).
> Wei



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
hi guys,

I found an interesting problem when Spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if there are executors that 
have not been requested yet. 

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{code}



However, step 3 launches a thread that first runs ExecutorRunnable and only then 
updates the running queue. We found it can take almost 1 second before the 
running-queue update (updateInternalState()) is called. This leaves an 
inconsistent window: the pending queue has already been updated, but the running 
queue has not, because the launching thread has not reached updateInternalState() 
yet. If an RPC call to amClient.allocate() happens inside this window, more 
executors than targetNumExecutors would be requested.


{noformat}
Here is an example:
Initial:
targetNumExecutors  numPendingAllocate        numExecutorsRunning
1                   0                         0

After the first RPC call to amClient.allocate:
targetNumExecutors  numPendingAllocate        numExecutorsRunning
1                   1                         0

After the first allocated container is granted by YARN:
targetNumExecutors  numPendingAllocate        numExecutorsRunning
1                   0 (removed in step 2)     0

=> if there is an RPC call to amClient.allocate() here, then more containers 
are requested; however, this situation is caused by the inconsistent state.

After the container is launched in step 3:
targetNumExecutors  numPendingAllocate        numExecutorsRunning
1                   0                         1


{noformat}
===
I found this problem because I am changing requestType to test some features on 
YARN's opportunistic containers (e.g., allocation takes ~100 ms), which are much 
faster than guaranteed containers (e.g., allocation takes almost 1 s).


I am not sure whether my understanding is correct.
I would appreciate anyone's help with this issue (please correct me if I have 
misunderstood anything).


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are 

[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
hi guys,

I found an interesting problem when Spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if there are executors that 
have not been requested yet. 

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{code}



However, step 3 launches a thread that first runs ExecutorRunnable and only then 
updates the running queue. We found it can take almost 1 second before the 
running-queue update (updateInternalState()) is called. This leaves an 
inconsistent window: the pending queue has already been updated, but the running 
queue has not, because the launching thread has not reached updateInternalState() 
yet. If an RPC call to amClient.allocate() happens inside this window, more 
executors than targetNumExecutors would be requested.


{noformat}
Here is an example:
Initial:
targetNumExecutors  numPendingAllocate        numExecutorsRunning
1                   0                         0

After the first RPC call to amClient.allocate:
targetNumExecutors  numPendingAllocate        numExecutorsRunning
1                   1                         0

After the first allocated container is granted by YARN:
targetNumExecutors  numPendingAllocate        numExecutorsRunning
1                   0 (removed in step 2)     0

=> if there is an RPC call to amClient.allocate() here, then more containers 
are requested; however, this situation is caused by the inconsistent state.

After the container is launched in step 3:
targetNumExecutors  numPendingAllocate        numExecutorsRunning
1                   0                         1


{noformat}
===
I found this problem because I am testing the feature on YARN's opportunistic 
containers (e.g., allocation takes ~100 ms), which are much faster than guaranteed 
containers (e.g., allocation takes almost 1 s).


I am not sure whether my understanding is correct.
I would appreciate anyone's help with this issue (please correct me if I have 
misunderstood anything).


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the 

[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
hi guys,

I found an interesting problem when Spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if there are executors that 
have not been requested yet. 

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{code}



However, in step 3 it launches a thread that first launches the ExecutorRunnable and
only then updates the running queue. We found it can take almost 1 second before the
function that updates the running queue, updateInternalState(), is called. So there is
an inconsistent window here: the pending queue has been updated, but the running queue
has not, because the launching thread has not yet reached updateInternalState(). If
there is an RPC call to amClient.allocate() during this window, more executors than
targetNumExecutors will be requested.


{noformat}
Here is an example:

Initial:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         0

After first RPC call to amClient.allocate:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    1                         0

After the first allocated container is granted by YARN
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0 (is removed in step 2)  0

=> if there is a RPC call here to amClient.allocate(), then more containers
are requested, however this situation is caused by the inconsistent state.

After the container is launched in step 3
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         1
{noformat}
===
I found this problem because I am testing the feature on YARN's opportunistic
containers (allocation takes ~100ms), which is much faster than guaranteed
containers (allocation takes almost 1s).


I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have misunderstood).


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will 

[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
hi guys,

I found an interesting problem when Spark tries to request containers from YARN.
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if there are executors that
have not yet been requested.

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{code}



However, in step 3 it launches a thread that first launches the ExecutorRunnable and
only then updates the running queue. We found it can take almost 1 second before the
function that updates the running queue, updateInternalState(), is called. So there is
an inconsistent window here: the pending queue has been updated, but the running queue
has not, because the launching thread has not yet reached updateInternalState(). If
there is an RPC call to amClient.allocate() during this window, more executors than
targetNumExecutors will be requested.


{noformat}
Here is an example:

Initial:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         0

After first RPC call to amClient.allocate:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    1                         0

After the first allocated container is granted by YARN
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0 (is removed in step 2)  0

=> if there is a RPC call here to amClient.allocate(), then more containers
are requested, however this situation is caused by the inconsistent state.

After the container is launched in step 3
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         1
{noformat}
===
I found this problem because I am testing the feature on YARN's opportunistic
containers (allocation takes ~100ms), which is much faster than guaranteed
containers (allocation takes almost 1s).


I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have misunderstood).


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated 

[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
hi guys,

I found an interesting problem when Spark tries to request containers from YARN.
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if there are executors that
have not yet been requested.

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{code}



However, in step 3 it launches a thread that first launches the ExecutorRunnable and
only then updates the running queue. We found it can take almost 1 second before the
function that updates the running queue, updateInternalState(), is called. So there is
an inconsistent window here: the pending queue has been updated, but the running queue
has not, because the launching thread has not yet reached updateInternalState(). If
there is an RPC call to amClient.allocate() during this window, more executors than
targetNumExecutors will be requested.


{noformat}
Here is an example:

Initial:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         0

After first RPC call to amClient.allocate:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    1                         0

After the first allocated container is granted by YARN
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0 (is removed in step 2)  0

=> if there is a RPC call here to amClient.allocate(), then more containers
are requested, however this situation is caused by the inconsistent state.

After the container is launched in step 3
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         1
{noformat}
===
I found this problem because I am testing the feature on YARN's opportunistic
containers (allocation takes ~100ms), which is much faster than guaranteed
containers (allocation takes almost 1s).


I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have misunderstood).


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After 

[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
hi guys,

I found an interesting problem when Spark tries to request containers from YARN.
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if there are executors that
have not yet been requested.

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{code}



However, in step 3 it launches a thread that first launches the ExecutorRunnable and
only then updates the running queue. We found it can take almost 1 second before the
function that updates the running queue, updateInternalState(), is called. So there is
an inconsistent window here: the pending queue has been updated, but the running queue
has not, because the launching thread has not yet reached updateInternalState(). If
there is an RPC call to amClient.allocate() during this window, more executors than
targetNumExecutors will be requested.


{noformat}
Here is an example:

Initial:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         0

After first RPC call to amClient.allocate:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    1                         0

After the first allocated container is granted by YARN
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0 (is removed in step 2)  0

=> if there is a RPC call here to amClient.allocate(), then more containers
are requested, however this situation is caused by the inconsistent state.

After the container is launched in step 3
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         1

*no* further _formatting_ is done here
{noformat}
===
I found this problem because I am testing the feature on YARN's opportunistic
containers (allocation takes ~100ms), which is much faster than guaranteed
containers (allocation takes almost 1s).


I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have misunderstood).


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   

[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
hi guys,

I found an interesting problem when Spark tries to request containers from YARN.
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if there are executors that
have not yet been requested.

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{code}



However, in step 3 it launches a thread that first launches the ExecutorRunnable and
only then updates the running queue. We found it can take almost 1 second before the
function that updates the running queue, updateInternalState(), is called. So there is
an inconsistent window here: the pending queue has been updated, but the running queue
has not, because the launching thread has not yet reached updateInternalState(). If
there is an RPC call to amClient.allocate() during this window, more executors than
targetNumExecutors will be requested.

Here is an example:

Initial:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         0

After first RPC call to amClient.allocate:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    1                         0

After the first allocated container is granted by YARN
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0 (is removed in step 2)  0

=> if there is a RPC call here to amClient.allocate(), then more containers
are requested, however this situation is caused by the inconsistent state.

After the container is launched in step 3
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         1

===
I found this problem because I am testing the feature on YARN's opportunistic
containers (allocation takes ~100ms), which is much faster than guaranteed
containers (allocation takes almost 1s).


I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have misunderstood).


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 

{color:red}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}{color}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{color:red}private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{color}

[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
hi guys,

I found an interesting problem when Spark tries to request containers from YARN.
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if there are executors that
have not yet been requested.

{color:red}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}{color}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{color:red}private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{color}

3. After the allocated containers are launched, it will update the running queue
{color:red}private def runAllocatedContainers(containersToUse: 
ArrayBuffer[Container]): Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{color}



However, in step 3 it launches a thread that first launches the ExecutorRunnable and
only then updates the running queue. We found it can take almost 1 second before the
function that updates the running queue, updateInternalState(), is called. So there is
an inconsistent window here: the pending queue has been updated, but the running queue
has not, because the launching thread has not yet reached updateInternalState(). If
there is an RPC call to amClient.allocate() during this window, more executors than
targetNumExecutors will be requested.

Here is an example:

Initial:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         0

After first RPC call to amClient.allocate:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    1                         0

After the first allocated container is granted by YARN
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0 (is removed in step 2)  0

=> if there is a RPC call here to amClient.allocate(), then more containers
are requested, however this situation is caused by the inconsistent state.

After the container is launched in step 3
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         1

===
I found this problem because I am testing the feature on YARN's opportunistic
containers (allocation takes ~100ms), which is much faster than guaranteed
containers (allocation takes almost 1s).


I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have misunderstood).


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 
{color:red}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}{color}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{color:red}private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{color}


[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
hi guys,

I found an interesting problem when Spark tries to request containers from YARN.
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if there are executors that
have not yet been requested.
{color:red}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}{color}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{color:red}private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{color}

3. After the allocated containers are launched, it will update the running queue
{color:red}private def runAllocatedContainers(containersToUse: 
ArrayBuffer[Container]): Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  } 


}{color}



However, in step 3 it launches a thread that first launches the ExecutorRunnable and
only then updates the running queue. We found it can take almost 1 second before the
function that updates the running queue, updateInternalState(), is called. So there is
an inconsistent window here: the pending queue has been updated, but the running queue
has not, because the launching thread has not yet reached updateInternalState(). If
there is an RPC call to amClient.allocate() during this window, more executors than
targetNumExecutors will be requested.

Here is an example:

Initial:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         0

After first RPC call to amClient.allocate:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    1                         0

After the first allocated container is granted by YARN
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0 (is removed in step 2)  0

=> if there is a RPC call here to amClient.allocate(), then more containers
are requested, however this situation is caused by the inconsistent state.

After the container is launched in step 3
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         1

===
I found this problem because I am testing the feature on YARN's opportunistic
containers (allocation takes ~100ms), which is much faster than guaranteed
containers (allocation takes almost 1s).


I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have misunderstood).


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 
{color:red}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}{color}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{color:red}private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 amClient.removeContainerRequest(containerRequest) //update pending queues
  .
}
{color}

3. After the 

[jira] [Created] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)
Wei Chen created SPARK-21562:


 Summary: Spark may request extra containers if the rpc between 
YARN and spark is too fast
 Key: SPARK-21562
 URL: https://issues.apache.org/jira/browse/SPARK-21562
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0
Reporter: Wei Chen


hi guys,

I found an interesting problem when Spark tries to request containers from YARN.
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if there are executors that
have not yet been requested.
{color:red}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}{color}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{color:red}private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 amClient.removeContainerRequest(containerRequest) //update pending queues
  .
}
{color}

3. After the allocated containers are launched, it will update the running queue
{color:red}private def runAllocatedContainers(containersToUse: 
ArrayBuffer[Container]): Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  } 


}{color}



However, in step 3 it launches a thread that first launches the ExecutorRunnable and
only then updates the running queue. We found it can take almost 1 second before the
function that updates the running queue, updateInternalState(), is called. So there is
an inconsistent window here: the pending queue has been updated, but the running queue
has not, because the launching thread has not yet reached updateInternalState(). If
there is an RPC call to amClient.allocate() during this window, more executors than
targetNumExecutors will be requested.

Here is an example:

Initial:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         0

After first RPC call to amClient.allocate:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    1                         0

After the first allocated container is granted by YARN
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0 (is removed in step 2)  0

=> if there is a RPC call here to amClient.allocate(), then more containers
are requested, however this situation is caused by the inconsistent state.

After the container is launched in step 3
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         1

===
I found this problem because I am testing the feature on YARN's opportunistic
containers (allocation takes ~100ms), which is much faster than guaranteed
containers (allocation takes almost 1s).


I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have misunderstood).


Wei





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21561) spark-streaming-kafka-010 DSteam is not pulling anything from Kafka

2017-07-28 Thread Vlad Badelita (JIRA)
Vlad Badelita created SPARK-21561:
-

 Summary: spark-streaming-kafka-010 DSteam is not pulling anything 
from Kafka
 Key: SPARK-21561
 URL: https://issues.apache.org/jira/browse/SPARK-21561
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.1.1
Reporter: Vlad Badelita


I am trying to use spark-streaming-kafka-0.10 to pull messages from a Kafka
topic (broker version 0.10). I have checked that messages are being produced and
used a KafkaConsumer to pull them successfully. Now, when I try to use the
Spark Streaming API, I am not getting anything. If I just use
KafkaUtils.createRDD and specify some offset ranges manually, it works. But
when I try to use createDirectStream, all the RDDs are empty, and when I check
the partition offsets it simply reports that all partitions are at 0. Here is what
I tried:

{code:scala}
 // Imports needed for this snippet to compile as-is
 import org.apache.kafka.common.serialization.StringDeserializer
 import org.apache.spark.{SparkConf, TaskContext}
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming.kafka010.KafkaUtils
 import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
 import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
 import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

 val sparkConf = new SparkConf().setAppName("kafkastream")
 val ssc = new StreamingContext(sparkConf, Seconds(3))
 val topics = Array("my_topic")

 val kafkaParams = Map[String, Object](
   "bootstrap.servers" -> "hostname:6667",   // a comma was missing here in the original
   "key.deserializer" -> classOf[StringDeserializer],
   "value.deserializer" -> classOf[StringDeserializer],
   "group.id" -> "my_group",
   "auto.offset.reset" -> "earliest",
   "enable.auto.commit" -> (true: java.lang.Boolean)
 )

 val stream = KafkaUtils.createDirectStream[String, String](
   ssc,
   PreferConsistent,
   Subscribe[String, String](topics, kafkaParams)
 )

 stream.foreachRDD { rdd =>
   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd.foreachPartition { iter =>
 val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
 println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
   }

   val rddCount = rdd.count()
   println("rdd count: ", rddCount)

   // stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
 }

 ssc.start()
 ssc.awaitTermination()
{code}

All partitions show offset ranges from 0 to 0 and all rdds are empty. I would 
like it to start from the beginning of a partition but also pick up everything 
that is being produced to it.

I have also tried using spark-streaming-kafka-0.8 and it does work. I think it 
is a 0.10 issue because everything else works fine. Thank you!
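
As a sanity check, a plain batch read over a fixed offset range with
KafkaUtils.createRDD can confirm whether records are reachable outside the DStream
path. This is only a sketch: the broker address, topic, and deserializers are reused
from the snippet above, while the offset range 0..100 and the group id are
assumptions that must match your topic:

{code:scala}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}
import scala.collection.JavaConverters._

val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-check"))

// createRDD takes a java.util.Map of consumer parameters
val batchParams = Map[String, Object](
  "bootstrap.servers" -> "hostname:6667",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my_group_batch_check"   // hypothetical, distinct from the streaming group
).asJava

// Read a fixed slice of partition 0; offsets 0..100 must actually exist in the topic
val ranges = Array(OffsetRange("my_topic", 0, 0L, 100L))
val rdd = KafkaUtils.createRDD[String, String](
  sc, batchParams, ranges, LocationStrategies.PreferConsistent)
println(s"records read in batch mode: ${rdd.count()}")
{code}

If this returns records while the direct stream keeps reporting offset 0, the problem
is more likely in how the streaming consumer group resolves its starting offsets than
in the brokers themselves.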



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs

2017-07-28 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105164#comment-16105164
 ] 

Thomas Graves commented on SPARK-17321:
---

Can you clarify? As stated above, you should not be using
nodemanager.local-dirs. If you are, you should look at reconfiguring YARN to
use the proper NM recovery dirs; see
https://issues.apache.org/jira/browse/SPARK-14963

If you aren't using NM recovery, then yes, we should fix this so Spark doesn't
use the backup DB at all.



> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.2, 2.0.0
>Reporter: yunjiong zhao
>
> We run spark on yarn, after enabled spark dynamic allocation, we notice some 
> spark application failed randomly due to YarnShuffleService.
> From log I found
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
> at 
> org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77)
> at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote} 
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use another good
> disk if the first one is broken?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2017-07-28 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105127#comment-16105127
 ] 

Ruslan Dautkhanov edited comment on SPARK-21274 at 7/28/17 3:47 PM:


[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the 
queries. 

I don't think {noformat}[1, 2]{noformat} is the correct behavior for the first 
query.

EXCEPT ALL returns all records from the *first* table which are not present in 
the second table, leaving the duplicates as is.

If you believe it should be "1,2", then it's easy to fix by just changing tab1 
to tab2 in the second query.

Or, the other way around, the original queries would return just "1,2" if you swap
the two datasets:
{noformat}
[1, 2]
for [1, 2] intersect_all [1, 2, 2]
{noformat}
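
For concreteness, a small Spark/Scala sketch of the join-based EXCEPT ALL rewrite from
the issue description, using made-up tables that mirror the example above. Here
"left_anti" plays the role of the LEFT OUTER JOIN ... IS NULL form (modulo NULL
handling); whether this multiplicity behavior matches standard EXCEPT ALL is exactly
the question being discussed:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("except-all-sketch").getOrCreate()
import spark.implicits._

// tab1 contains a duplicate row; tab2 contains none of tab1's rows
val tab1 = Seq((1, 1, 1), (2, 2, 2), (2, 2, 2)).toDF("a", "b", "c")
val tab2 = Seq((3, 3, 3)).toDF("a", "b", "c")

// Keep every tab1 row, duplicates included, whose (a, b, c) never appears in tab2
val exceptAll = tab1.join(tab2, Seq("a", "b", "c"), "left_anti")
exceptAll.show()   // both copies of (2, 2, 2) survive, i.e. [1, 2, 2] on column a
{code}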


was (Author: tagar):
[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the 
queries. 

I don't think {noformat}[1, 2]{noformat} is the correct behavior for the first 
query.

EXCEPT ALL which returns all records from the *first* table which are not 
present in the second table, leaving the duplicates as is.

If you believe it should be "1,2", then it's easy to fix by just changing tab1 
to tab2 in the second query.

Or other way around, original queries would return 
{noformat}
[1, 2]
for [1, 2] intersect_all [1, 2, 2]
{noformat}

> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROMtab1 t1
>  LEFT OUTER JOIN 
> tab2 t2
>  ON (
> (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>  )
> WHERE
> COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name 
> that can be also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined 
> above:
> {code}
> SELECT a,b,c
> FROMtab1 t1
> WHERE 
>NOT EXISTS
>(SELECT 1
> FROMt1_except_t2_df e
> WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>)
> {code}
> So the suggestion is just to use above query rewrites to implement both 
> EXCEPT ALL and INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2017-07-28 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105127#comment-16105127
 ] 

Ruslan Dautkhanov edited comment on SPARK-21274 at 7/28/17 3:46 PM:


[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the 
queries. 

I don't think {noformat}[1, 2]{noformat} is the correct behavior for the first 
query.

EXCEPT ALL which returns all records from the *first* table which are not 
present in the second table, leaving the duplicates as is.

If you believe it should be "1,2", then it's easy to fix by just changing tab1 
to tab2 in the second query.

Or other way around, original queries would return 
{noformat}
[1, 2]
for [1, 2] intersect_all [1, 2, 2]
{noformat}


was (Author: tagar):
[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the 
queries. 

I don't think [1, 2] is the correct behavior for the first query.
EXCEPT ALL which returns all records from the *first* table which are not 
present in the second table, leaving the duplicates as is.



> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROMtab1 t1
>  LEFT OUTER JOIN 
> tab2 t2
>  ON (
> (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>  )
> WHERE
> COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name 
> that can be also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined 
> above:
> {code}
> SELECT a,b,c
> FROMtab1 t1
> WHERE 
>NOT EXISTS
>(SELECT 1
> FROMt1_except_t2_df e
> WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>)
> {code}
> So the suggestion is just to use above query rewrites to implement both 
> EXCEPT ALL and INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext

2017-07-28 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105133#comment-16105133
 ] 

Thomas Graves commented on SPARK-21541:
---

It was merged in
https://github.com/apache/spark/commit/69ab0e4bddccb461f960fcb48a390a1517e504dd
but I guess the PR link didn't pick it up.

I missed that the title wasn't quite right ([Spark-21541]), so perhaps JIRA didn't
pick it up properly.

> Spark Logs show incorrect job status for a job that does not create 
> SparkContext
> 
>
> Key: SPARK-21541
> URL: https://issues.apache.org/jira/browse/SPARK-21541
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you run a spark job without creating the SparkSession or SparkContext, the 
> spark job logs says it succeeded but yarn says it fails and retries 3 times. 
> Also, since, Application Master unregisters with Resource Manager and exits 
> successfully, it deletes the spark staging directory, so when yarn makes 
> subsequent retries, it fails to find the staging directory and thus, the 
> retries fail.
> *Steps:*
> 1. For example, run a pyspark job without creating SparkSession or 
> SparkContext. 
> *Example:*
> import sys
> from random import random
> from operator import add
> from pyspark import SparkContext
> if __name__ == "__main__":
>   print("hello world")
> 2. Spark will mark it as FAILED. Go to the UI and check the container logs.
> 3. You will see the following information in the logs:
> spark:
> 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> But yarn logs will show:
> 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO 
> attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State 
> change from FINAL_SAVING to FAILED



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2017-07-28 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105127#comment-16105127
 ] 

Ruslan Dautkhanov commented on SPARK-21274:
---

[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the 
queries. 

I don't think [1, 2] is the correct behavior for the first query.
EXCEPT ALL returns all records from the *first* table which are not
present in the second table, leaving the duplicates as is.



> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROMtab1 t1
>  LEFT OUTER JOIN 
> tab2 t2
>  ON (
> (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>  )
> WHERE
> COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name 
> that can be also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined 
> above:
> {code}
> SELECT a,b,c
> FROMtab1 t1
> WHERE 
>NOT EXISTS
>(SELECT 1
> FROMt1_except_t2_df e
> WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>)
> {code}
> So the suggestion is just to use above query rewrites to implement both 
> EXCEPT ALL and INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext

2017-07-28 Thread Parth Gandhi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105116#comment-16105116
 ] 

Parth Gandhi commented on SPARK-21541:
--

The change has been merged. Thank you.

> Spark Logs show incorrect job status for a job that does not create 
> SparkContext
> 
>
> Key: SPARK-21541
> URL: https://issues.apache.org/jira/browse/SPARK-21541
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you run a spark job without creating the SparkSession or SparkContext, the 
> spark job logs says it succeeded but yarn says it fails and retries 3 times. 
> Also, since, Application Master unregisters with Resource Manager and exits 
> successfully, it deletes the spark staging directory, so when yarn makes 
> subsequent retries, it fails to find the staging directory and thus, the 
> retries fail.
> *Steps:*
> 1. For example, run a pyspark job without creating SparkSession or 
> SparkContext. 
> *Example:*
> import sys
> from random import random
> from operator import add
> from pyspark import SparkContext
> if __name__ == "__main__":
>   print("hello world")
> 2. Spark will mark it as FAILED. Go to the UI and check the container logs.
> 3. You will see the following information in the logs:
> spark:
> 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> But yarn logs will show:
> 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO 
> attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State 
> change from FINAL_SAVING to FAILED



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21549:
--
Priority: Major  (was: Blocker)

> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [commiting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In those cases, when the job completes, the following exception is thrown
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-07-28 Thread Sergey Zhemzhitsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-21549:
---
Priority: Blocker  (was: Critical)

> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>Priority: Blocker
>
> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [commiting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In those cases, when the job completes, the following exception is thrown
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all jobs using OutputFormats that don't write data 
> into HDFS-compatible file systems are broken.
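A minimal PySpark sketch of the scenario described above (untested; it uses {{org.apache.hadoop.mapreduce.lib.output.NullOutputFormat}} purely as an example of an OutputFormat that never needs *mapreduce.output.fileoutputformat.outputdir*). On the affected version the tasks should run, but the commit/abort path is expected to fail with "Can not create a Path from a null string":

{code}
from pyspark import SparkContext

sc = SparkContext(appName="SPARK-21549-repro-sketch")

# A pair RDD of simple key/value records.
rdd = sc.parallelize([(1, "a"), (2, "b")])

# NullOutputFormat discards its output, so no output directory is configured.
conf = {
    "mapreduce.job.outputformat.class":
        "org.apache.hadoop.mapreduce.lib.output.NullOutputFormat",
    "mapreduce.job.output.key.class": "org.apache.hadoop.io.IntWritable",
    "mapreduce.job.output.value.class": "org.apache.hadoop.io.Text",
}

# Expected on Spark 2.2.0: IllegalArgumentException from the commit protocol,
# because it unconditionally builds a Path from the (null) output directory.
rdd.saveAsNewAPIHadoopDataset(conf=conf)
{code}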



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-07-28 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105095#comment-16105095
 ] 

Li Jin commented on SPARK-21190:


I think use case 2 of what [~rxin] proposed originally is a good API to 
enable first. I think it can be a bit better if the input of the user function is 
not a {{pandas.DataFrame}} but a {{pandas.Series}}, to match Spark columns. I.e., 
instead of:

{code}
@spark_udf(some way to describe the return schema)
def my_func(input):
  """ Some user-defined function.
 
  :param input: A Pandas DataFrame with two columns, a and b.
  :return: :class: A numpy array
  """
  return input[a] + input[b]
 
df = spark.range(1000).selectExpr("id a", "id / 2 b")
df.withColumn("c", my_func(df.a, df.b))
{code}

I think this is better:
{code}
@spark_udf(some way to describe the return schema)
def my_func(a, b):
  """ Some user-defined function.
 
  :param input: Two Pandas Series, a and b
  :return: :class: A Pandas Series
  """
  return a + b
 
df = spark.range(1000).selectExpr("id a", "id / 2 b")
df.withColumn("c", my_func(df.a, df.b))
{code}

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   Input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = 

[jira] [Commented] (SPARK-21555) GROUP BY don't work with expressions with NVL and nested objects

2017-07-28 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105092#comment-16105092
 ] 

Liang-Chi Hsieh commented on SPARK-21555:
-

The sync between PR and JIRA still seems broken. I have already submitted a PR for 
this issue at https://github.com/apache/spark/pull/18761.

> GROUP BY don't work with expressions with NVL and nested objects
> 
>
> Key: SPARK-21555
> URL: https://issues.apache.org/jira/browse/SPARK-21555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Vitaly Gerasimov
>
> {code}
> spark.read.json(spark.createDataset("""{"foo":{"foo1":"value"}}""" :: 
> Nil)).createOrReplaceTempView("test")
> spark.sql("select nvl(foo.foo1, \"value\"), count(*) from test group by 
> nvl(foo.foo1, \"value\")")
> {code}
> returns exception:
> {code}
> org.apache.spark.sql.AnalysisException: expression 'test.`foo`' is neither 
> present in the group by, nor is it an aggregate function. Add to group by or 
> wrap in first() (or first_value) if you don't care which value you get.;;
> Aggregate [nvl(foo#4.foo1 AS foo1#8, value)], [nvl(foo#4.foo1 AS foo1#9, 
> value) AS nvl(test.`foo`.`foo1` AS `foo1`, 'value')#11, count(1) AS 
> count(1)#12L]
> +- SubqueryAlias test
>+- LogicalRDD [foo#4]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:247)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$9.apply(CheckAnalysis.scala:280)
>   

[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-07-28 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105072#comment-16105072
 ] 

Li Jin commented on SPARK-21190:


[~cloud_fan], thanks for pointing out `ArrowColumnVector`. [~bryanc], I think 
#18659 could serve as a basis for future udf work. My work with #18732 has some 
overlap with #18659 but I can work with [~bryanc] to merge. 

[~cloud_fan] and [~rxin], do you have chance to think more about the API?

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   Input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  
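On point 3 above (null handling), a small illustration only, not a statement about how Spark would actually do the conversion: pandas has no separate null marker for numeric data, so a vectorized UDF would most likely see SQL NULLs as {{NaN}} (or {{None}} for object columns) and would need a convention for mapping them back:

{code}
import pandas as pd

# Hypothetical input a vectorized UDF might receive for an integer column
# containing NULLs: pandas upcasts to float64 and uses NaN for the nulls.
a = pd.Series([1, 2, None])   # -> [1.0, 2.0, NaN], dtype float64
b = pd.Series([10, None, 30])

result = a + b                # NaN propagates: [11.0, NaN, NaN]

# One possible convention: NaN/None in the returned Series maps back to NULL.
print(result.isnull().tolist())   # [False, True, True]
{code}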



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Comment Edited] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-07-28 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105072#comment-16105072
 ] 

Li Jin edited comment on SPARK-21190 at 7/28/17 3:06 PM:
-

[~cloud_fan], thanks for pointing out `ArrowColumnVector`. [~bryanc], I think 
#18659 could serve as a basis for future udf work. My work with #18732 has some 
overlap with #18659 but I can work with [~bryanc] to merge. 

[~cloud_fan] and [~rxin], have you got the chance to think more about the API?


was (Author: icexelloss):
[~cloud_fan], thanks for pointing out `ArrowColumnVector`. [~bryanc], I think 
#18659 could serve as a basis for future udf work. My work with #18732 has some 
overlap with #18659 but I can work with [~bryanc] to merge. 

[~cloud_fan] and [~rxin], do you have chance to think more about the API?

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   Input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a 

[jira] [Commented] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext

2017-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105049#comment-16105049
 ] 

Sean Owen commented on SPARK-21541:
---

Was this change merged? I don't think it was 
https://github.com/apache/spark/pull/18741

> Spark Logs show incorrect job status for a job that does not create 
> SparkContext
> 
>
> Key: SPARK-21541
> URL: https://issues.apache.org/jira/browse/SPARK-21541
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you run a Spark job without creating the SparkSession or SparkContext, the 
> Spark job logs say it succeeded, but YARN says it failed and retries 3 times. 
> Also, since the Application Master unregisters with the Resource Manager and exits 
> successfully, it deletes the Spark staging directory, so when YARN makes 
> subsequent retries, it fails to find the staging directory and thus the 
> retries fail.
> *Steps:*
> 1. For example, run a pyspark job without creating SparkSession or 
> SparkContext. 
> *Example:*
> import sys
> from random import random
> from operator import add
> from pyspark import SparkContext
> if __name__ == "__main__":
>   print("hello world")
> 2. Spark will mark it as FAILED. Go to the UI and check the container logs.
> 3. You will see the following information in the logs:
> spark:
> 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> But yarn logs will show:
> 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO 
> attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State 
> change from FINAL_SAVING to FAILED
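For reference, the mismatch described above should only arise when no context is ever created; a job that does create one (minimal sketch below, untested) goes through the normal SparkContext shutdown and status-reporting path:

{code}
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("hello").getOrCreate()
    print("hello world")
    spark.stop()
{code}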



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext

2017-07-28 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-21541.
---
   Resolution: Fixed
 Assignee: Parth Gandhi
Fix Version/s: 2.3.0

> Spark Logs show incorrect job status for a job that does not create 
> SparkContext
> 
>
> Key: SPARK-21541
> URL: https://issues.apache.org/jira/browse/SPARK-21541
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you run a Spark job without creating the SparkSession or SparkContext, the 
> Spark job logs say it succeeded, but YARN says it failed and retries 3 times. 
> Also, since the Application Master unregisters with the Resource Manager and exits 
> successfully, it deletes the Spark staging directory, so when YARN makes 
> subsequent retries, it fails to find the staging directory and thus the 
> retries fail.
> *Steps:*
> 1. For example, run a pyspark job without creating SparkSession or 
> SparkContext. 
> *Example:*
> import sys
> from random import random
> from operator import add
> from pyspark import SparkContext
> if __name__ == "__main__":
>   print("hello world")
> 2. Spark will mark it as FAILED. Go to the UI and check the container logs.
> 3. You will see the following information in the logs:
> spark:
> 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> But yarn logs will show:
> 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO 
> attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State 
> change from FINAL_SAVING to FAILED



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21553) Add the description of the default value of master parameter in the spark-shell

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21553:
-

Assignee: Donghui Xu

> Add the description of the default value of master parameter in the 
> spark-shell
> ---
>
> Key: SPARK-21553
> URL: https://issues.apache.org/jira/browse/SPARK-21553
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Donghui Xu
>Assignee: Donghui Xu
>Priority: Minor
> Fix For: 2.3.0
>
>
> When I type spark-shell --help, I find that the default value description for 
> the master parameter is missing. The user does not know what the default 
> value is when the master parameter is not included, so we need to add the 
> master parameter default description to the help information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21553) Add the description of the default value of master parameter in the spark-shell

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21553:
--
Priority: Trivial  (was: Minor)

> Add the description of the default value of master parameter in the 
> spark-shell
> ---
>
> Key: SPARK-21553
> URL: https://issues.apache.org/jira/browse/SPARK-21553
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Donghui Xu
>Assignee: Donghui Xu
>Priority: Trivial
> Fix For: 2.3.0
>
>
> When I type spark-shell --help, I find that the default value description for 
> the master parameter is missing. The user does not know what the default 
> value is when the master parameter is not included, so we need to add the 
> master parameter default description to the help information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21553) Add the description of the default value of master parameter in the spark-shell

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21553.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18755
[https://github.com/apache/spark/pull/18755]

> Add the description of the default value of master parameter in the 
> spark-shell
> ---
>
> Key: SPARK-21553
> URL: https://issues.apache.org/jira/browse/SPARK-21553
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Donghui Xu
>Priority: Minor
> Fix For: 2.3.0
>
>
> When I type spark-shell --help, I find that the default value description for 
> the master parameter is missing. The user does not know what the default 
> value is when the master parameter is not included, so we need to add the 
> master parameter default description to the help information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2017-07-28 Thread SOMASUNDARAM SUDALAIMUTHU (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SOMASUNDARAM SUDALAIMUTHU updated SPARK-14927:
--
Comment: was deleted

(was: Is this fixed in 2.0 version ?)

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use the suggestions in the answers but couldn't make them work in 
> Spark 1.6.1.
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates the table with empty partitions.
> Any help to move this forward is appreciated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2017-07-28 Thread SOMASUNDARAM SUDALAIMUTHU (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105004#comment-16105004
 ] 

SOMASUNDARAM SUDALAIMUTHU commented on SPARK-14927:
---

Is this fixed in 2.0 version ?

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use the suggestions in the answers but couldn't make them work in 
> Spark 1.6.1.
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates the table with empty partitions.
> Any help to move this forward is appreciated.
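For what it's worth, a commonly suggested workaround for the behaviour described above (a sketch only, untested, assuming Spark 2.x with a Hive-enabled session) is to create the partitioned Hive table explicitly and then insert into it, instead of relying on {{saveAsTable}}:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("CREATE DATABASE IF NOT EXISTS tmp")
spark.sql("""
    CREATE TABLE IF NOT EXISTS tmp.partitiontest1 (val STRING)
    PARTITIONED BY (year INT)
    STORED AS PARQUET
""")

df = spark.createDataFrame([(2012, "a")], ["year", "val"])

# insertInto matches columns by position and expects partition columns last.
df.select("val", "year").write.insertInto("tmp.partitiontest1")

spark.sql("SHOW PARTITIONS tmp.partitiontest1").show()
{code}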



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21560) Add hold mode for the LiveListenerBus

2017-07-28 Thread Li Yuanjian (JIRA)
Li Yuanjian created SPARK-21560:
---

 Summary: Add hold mode for the LiveListenerBus
 Key: SPARK-21560
 URL: https://issues.apache.org/jira/browse/SPARK-21560
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0, 2.1.0
Reporter: Li Yuanjian


As the comments in SPARK-18838, we also face the same problem about critical 
events dropped while the event queue is full. 
There's no doubt that improving the performance of the processing thread is 
important, whether the solution is multithreading or any others like 
SPARK-20776, but maybe we still need the hold strategy when the event queue is 
full, and restart after some room released. The hold strategy open or not and 
the empty rate should both configurable.
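To make the proposal a bit more concrete, here is a plain-Python illustration (deliberately not Spark code, and only a sketch of the idea) of the difference between dropping events on overflow and holding producers until a configurable fraction of the queue has been freed:

{code}
import queue
import threading

class HoldingEventQueue(object):
    """Illustration only: a bounded queue with an optional 'hold' mode."""

    def __init__(self, capacity=10000, hold=True, empty_rate=0.2):
        self._q = queue.Queue(maxsize=capacity)
        self._capacity = capacity
        self._hold = hold
        self._empty_rate = empty_rate      # fraction of free slots needed to resume
        self._resume = threading.Event()
        self._resume.set()

    def post(self, event):
        if not self._hold:
            try:
                self._q.put_nowait(event)  # current behaviour: drop on overflow
            except queue.Full:
                pass                       # event silently dropped
            return
        self._resume.wait()                # hold the producer while the queue drains
        self._q.put(event)                 # blocks if the queue is still full
        if self._q.full():
            self._resume.clear()

    def take(self):
        event = self._q.get()
        free = self._capacity - self._q.qsize()
        if free >= self._capacity * self._empty_rate:
            self._resume.set()             # enough room released: resume producers
        return event
{code}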



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-07-28 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104918#comment-16104918
 ] 

Wenchen Fan commented on SPARK-21067:
-

We have many tests for CREATE TABLE inside Spark SQL, so I think this issue is 
thrift-server specific.

However, I'm not familiar with the thrift-server code. cc [~rxin], do you know 
who the maintainer is?

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021), which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
> at 

[jira] [Updated] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier

2017-07-28 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21306:

Fix Version/s: (was: 2.1.2)
   (was: 2.0.3)

> OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
> 
>
> Key: SPARK-21306
> URL: https://issues.apache.org/jira/browse/SPARK-21306
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Cathal Garvey
>Assignee: Yan Facai (颜发才)
>Priority: Critical
>  Labels: classification, ml
> Fix For: 2.2.1, 2.3.0
>
>
> Hi folks, thanks for Spark! :)
> I've been learning to use `ml` and `mllib`, and I've hit a blocking issue 
> while trying to use `ml.classification.OneVsRest` with 
> `ml.classification.LogisticRegression`. Basically, [here in the 
> code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320],
>  only two columns are being extracted and fed to the underlying classifiers. 
> However, with some configurations, more than two columns are required.
> Specifically: I want to do multiclass learning with Logistic Regression, on a 
> very imbalanced dataset. In my dataset, I have lots of imbalances, so I was 
> planning to use weights. I set a column, `"weight"`, as the inverse frequency 
> of each field, and I configured my `LogisticRegression` class to use this 
> column, then put it in a `OneVsRest` wrapper.
> However, `OneVsRest` strips all but two columns out of a dataset before 
> training, so I get an error from within `LogisticRegression` that it can't 
> find the `"weight"` column.
> It would be nice to have this fixed! I can see a few ways, but a very 
> conservative fix would be to include a parameter in `OneVsRest.fit` for 
> additional columns to `select` before passing to the underlying model.
> Thanks!
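A minimal PySpark sketch of the reported limitation (untested; the data and column names are made up for illustration, and it assumes the affected 2.1.x behaviour in which {{OneVsRest}} keeps only the label and features columns before fitting):

{code}
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression, OneVsRest

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.0), 4.0),
     (1.0, Vectors.dense(1.0, 0.0), 1.0),
     (2.0, Vectors.dense(1.0, 1.0), 2.0)],
    ["label", "features", "weight"])

# The underlying classifier is told to use the "weight" column...
lr = LogisticRegression(weightCol="weight")
ovr = OneVsRest(classifier=lr)

# ...but on the affected versions this fails, because OneVsRest drops every
# column except label/features before handing the data to LogisticRegression.
model = ovr.fit(df)
{code}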



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20919) Simplificaiton of CachedKafkaConsumer.

2017-07-28 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-20919:

Description: Using an object pool instead of a cache for recycling objects 
in Kafka consumer cache.  (was: On the lines of SPARK-19968, guava cache can be 
used to simplify the code in CachedKafkaConsumer as well. With an additional 
feature of automatic cleanup of a consumer unused for a configurable time.)

> Simplificaiton of CachedKafkaConsumer.
> --
>
> Key: SPARK-20919
> URL: https://issues.apache.org/jira/browse/SPARK-20919
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Prashant Sharma
>
> Using an object pool instead of a cache for recycling objects in Kafka 
> consumer cache.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20919) Simplificaiton of CachedKafkaConsumer.

2017-07-28 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-20919:

Summary: Simplificaiton of CachedKafkaConsumer.  (was: Simplificaiton of 
CachedKafkaConsumer using guava cache.)

> Simplificaiton of CachedKafkaConsumer.
> --
>
> Key: SPARK-20919
> URL: https://issues.apache.org/jira/browse/SPARK-20919
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Prashant Sharma
>
> On the lines of SPARK-19968, guava cache can be used to simplify the code in 
> CachedKafkaConsumer as well. With an additional feature of automatic cleanup 
> of a consumer unused for a configurable time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21559) Remove Mesos fine-grained mode

2017-07-28 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-21559:

Description: 
After discussing this with people from Mesosphere, we agreed that it is time to 
remove fine-grained mode. Plans are to improve cluster mode to cover any 
benefits that may have existed when using fine-grained mode.
 [~susanxhuynh]
Previous status of this can be found here:
https://issues.apache.org/jira/browse/SPARK-11857



  was:
After discussing this with people from Mesosphere we agreed that it is time to 
remove fine grain mode. Plans are to improve cluster mode to cover any benefits 
may existed when using fine grain mode.
 [~susanxhuynh]
Previous status of this can be found here:
https://issues.apache.org/jira/browse/SPARK-11857




> Remove Mesos fine-grained mode
> --
>
> Key: SPARK-21559
> URL: https://issues.apache.org/jira/browse/SPARK-21559
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>
> After discussing this with people from Mesosphere, we agreed that it is time 
> to remove fine-grained mode. Plans are to improve cluster mode to cover any 
> benefits that may have existed when using fine-grained mode.
>  [~susanxhuynh]
> Previous status of this can be found here:
> https://issues.apache.org/jira/browse/SPARK-11857



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21559) Remove Mesos fine-grained mode

2017-07-28 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-21559:

Summary: Remove Mesos fine-grained mode  (was: Remove Mesos Fine-grain mode)

> Remove Mesos fine-grained mode
> --
>
> Key: SPARK-21559
> URL: https://issues.apache.org/jira/browse/SPARK-21559
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>
> After discussing this with people from Mesosphere we agreed that it is time 
> to remove fine grain mode. Plans are to improve cluster mode to cover any 
> benefits may existed when using fine grain mode.
>  [~susanxhuynh]
> Previous status of this can be found here:
> https://issues.apache.org/jira/browse/SPARK-11857



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21559) Remove Mesos Fine-grain mode

2017-07-28 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-21559:
---

 Summary: Remove Mesos Fine-grain mode
 Key: SPARK-21559
 URL: https://issues.apache.org/jira/browse/SPARK-21559
 Project: Spark
  Issue Type: Task
  Components: Mesos
Affects Versions: 2.2.0
Reporter: Stavros Kontopoulos


After discussing this with people from Mesosphere we agreed that it is time to 
remove fine grain mode. Plans are to improve cluster mode to cover any benefits 
may existed when using fine grain mode.
 [~susanxhuynh]
Previous status of this can be found here:
https://issues.apache.org/jira/browse/SPARK-11857





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21553) Add the description of the default value of master parameter in the spark-shell

2017-07-28 Thread Donghui Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Donghui Xu updated SPARK-21553:
---
Summary: Add the description of the default value of master parameter in 
the spark-shell  (was: Added the description of the default value of master 
parameter in the spark-shell)

> Add the description of the default value of master parameter in the 
> spark-shell
> ---
>
> Key: SPARK-21553
> URL: https://issues.apache.org/jira/browse/SPARK-21553
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Donghui Xu
>Priority: Minor
>
> When I type spark-shell --help, I find that the default value description for 
> the master parameter is missing. The user does not know what the default 
> value is when the master parameter is not included, so we need to add the 
> master parameter default description to the help information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21479) Outer join filter pushdown in null supplying table when condition is on one of the joined columns

2017-07-28 Thread Abhijit Bhole (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104708#comment-16104708
 ] 

Abhijit Bhole commented on SPARK-21479:
---

So here is the actual use case - 

{code:java}
spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([{ "x" : 'c1', "a": 1, "b" : 2}, { "x" : 'c2', "a": 
3, "b" : 4}])
df2 = spark.createDataFrame([{ "x" : 'c1', "a": 1, "c" : 5}, { "x" : 'c1', "a": 
3, "c" : 6}, { "x" : 'c2', "a": 5, "c" : 8}])

df1.join(df2, ['x', 'a'], 'right_outer').where("b = 2").explain()

df1.join(df2, ['x', 'a'], 'right_outer').where("b = 2").show()

print 

df1 = spark.createDataFrame([{ "x" : 'c1', "a": 1, "b" : 2}, { "x" : 'c2', "a": 
3, "b" : 4}])
df2 = spark.createDataFrame([{ "x" : 'c1', "a": 1, "c" : 5}, { "x" : 'c1', "a": 
3, "c" : 6}, { "x" : 'c2', "a": 5, "c" : 8}])


df1.join(df2, ['x', 'a'], 'right_outer').where("x = 'c1'").explain()

df1.join(df2, ['x', 'a'], 'right_outer').where("x = 'c1'").show()
{code}

Output - 

{code:java}
== Physical Plan ==
*Project [x#458, a#456L, b#450L, c#457L]
+- *SortMergeJoin [x#451, a#449L], [x#458, a#456L], Inner
   :- *Sort [x#451 ASC NULLS FIRST, a#449L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(x#451, a#449L, 4)
   : +- *Filter (((isnotnull(b#450L) && (b#450L = 2)) && isnotnull(x#451)) 
&& isnotnull(a#449L))
   :+- Scan ExistingRDD[a#449L,b#450L,x#451]
   +- *Sort [x#458 ASC NULLS FIRST, a#456L ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(x#458, a#456L, 4)
 +- *Filter (isnotnull(x#458) && isnotnull(a#456L))
+- Scan ExistingRDD[a#456L,c#457L,x#458]
+---+---+---+---+
|  x|  a|  b|  c|
+---+---+---+---+
| c1|  1|  2|  5|
+---+---+---+---+


== Physical Plan ==
*Project [x#490, a#488L, b#482L, c#489L]
+- SortMergeJoin [x#483, a#481L], [x#490, a#488L], RightOuter
   :- *Sort [x#483 ASC NULLS FIRST, a#481L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(x#483, a#481L, 4)
   : +- Scan ExistingRDD[a#481L,b#482L,x#483]
   +- *Sort [x#490 ASC NULLS FIRST, a#488L ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(x#490, a#488L, 4)
 +- *Filter (isnotnull(x#490) && (x#490 = c1))
+- Scan ExistingRDD[a#488L,c#489L,x#490]
+---+---++---+
|  x|  a|   b|  c|
+---+---++---+
| c1|  1|   2|  5|
| c1|  3|null|  6|
+---+---++---+
{code}

As you can see, the filter on the 'x' column does not get pushed down. In our case, 'x' 
is a company id in a multi-tenant system, and it is extremely important to push 
this filter to both dataframes, or else it fetches the entire data for both 
tables.
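Until the optimizer can push such a predicate to the null-supplying side itself, a workaround sketch for the multi-tenant case above (reusing the {{df1}}/{{df2}} from the snippet above) is to apply the tenant filter to both sides explicitly before the outer join; for an equality filter on a join key this should give the same result as filtering afterwards:

{code}
df1_c1 = df1.where("x = 'c1'")
df2_c1 = df2.where("x = 'c1'")

# Same rows as filtering after the join, but now both scans are pruned.
df1_c1.join(df2_c1, ['x', 'a'], 'right_outer').show()
{code}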


> Outer join filter pushdown in null supplying table when condition is on one 
> of the joined columns
> -
>
> Key: SPARK-21479
> URL: https://issues.apache.org/jira/browse/SPARK-21479
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Abhijit Bhole
>
> Here are two different query plans - 
> {code:java}
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 
> 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("b = 2").explain()
> == Physical Plan ==
> *Project [a#16299L, b#16295L, c#16300L]
> +- *SortMergeJoin [a#16294L], [a#16299L], Inner
>:- *Sort [a#16294L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(a#16294L, 4)
>: +- *Filter ((isnotnull(b#16295L) && (b#16295L = 2)) && 
> isnotnull(a#16294L))
>:+- Scan ExistingRDD[a#16294L,b#16295L]
>+- *Sort [a#16299L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(a#16299L, 4)
>  +- *Filter isnotnull(a#16299L)
> +- Scan ExistingRDD[a#16299L,c#16300L]
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 
> 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("a = 1").explain()
> == Physical Plan ==
> *Project [a#16314L, b#16310L, c#16315L]
> +- SortMergeJoin [a#16309L], [a#16314L], RightOuter
>:- *Sort [a#16309L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(a#16309L, 4)
>: +- Scan ExistingRDD[a#16309L,b#16310L]
>+- *Sort [a#16314L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(a#16314L, 4)
>  +- *Filter (isnotnull(a#16314L) && (a#16314L = 1))
> +- Scan ExistingRDD[a#16314L,c#16315L]
> {code}
> If the condition on b can be pushed down to df1, then why not the condition on a?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: 

[jira] [Created] (SPARK-21558) Kinesis lease failover time should be increased or made configurable

2017-07-28 Thread JIRA
Clément MATHIEU created SPARK-21558:
---

 Summary: Kinesis lease failover time should be increased or made 
configurable
 Key: SPARK-21558
 URL: https://issues.apache.org/jira/browse/SPARK-21558
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.0.2
Reporter: Clément MATHIEU


I have a Spark Streaming application reading from a Kinesis stream which 
exhibits serious shard lease fickleness. The root cause has been identified as 
the KCL default failover time being too low for our typical JVM pause times:

#  KinesisClientLibConfiguration#DEFAULT_FAILOVER_TIME_MILLIS is 10 seconds, 
meaning that if a worker does not renew a lease within 10s, other workers will 
steal it
# spark-streaming-kinesis-asl uses the default KCL failover time and does not allow 
it to be configured
# Executors' JVM logs show frequent 10+ second pauses

While we could spend some time fine-tuning the GC configuration to reduce pause 
times, I am wondering if 10 seconds is not too low. Typical Spark executors 
have very large heaps, and the GCs available in HotSpot are not great at ensuring 
low and deterministic pause times. One might also want to use ParallelGC. 

What do you think about:

# Increasing the failover time (it might hurt applications with low-latency 
requirements)
# Making it configurable




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21554) Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster

2017-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104640#comment-16104640
 ] 

Sean Owen commented on SPARK-21554:
---

The error here doesn't show the actual error. 

> Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: 
> XXX' when run on yarn cluster
> --
>
> Key: SPARK-21554
> URL: https://issues.apache.org/jira/browse/SPARK-21554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1
> Environment: We are deploying pyspark scripts on EMR 5.7
>Reporter: Subhod Lagade
>
> Traceback (most recent call last):
>   File "Test.py", line 7, in <module>
> hc = HiveContext(sc)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/context.py",
>  line 514, in __init__
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/session.py",
>  line 179, in getOrCreate
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1133, in __call__
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/utils.py",
>  line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionState':"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21557) Debug issues for SparkML(scala.Predef$.any2ArrowAssoc)

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21557.
---
  Resolution: Invalid
   Fix Version/s: (was: 2.1.2)
Target Version/s:   (was: 2.2.0)

JIRA isn't for questions - stackoverflow maybe. 

> Debug issues for SparkML(scala.Predef$.any2ArrowAssoc)
> --
>
> Key: SPARK-21557
> URL: https://issues.apache.org/jira/browse/SPARK-21557
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: prabir bhowmick
>Priority: Critical
>
> Hi Team,
> Can you please look at the error below, which I get when running the program 
> below with the Maven configuration shown? Kindly tell me which versions I 
> should use. I am running this program from Eclipse Neon.
> Error at runtime:
> Exception in thread "main" java.lang.NoSuchMethodError: 
> scala.Predef$.any2ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at 
> org.apache.spark.sql.SparkSession$Builder.config(SparkSession.scala:750)
>   at 
> org.apache.spark.sql.SparkSession$Builder.appName(SparkSession.scala:741)
>   at com.MLTest.JavaPCAExample.main(JavaPCAExample.java:20)
> Java Class:-
> package com.MLTest;
> import org.apache.spark.sql.SparkSession;
> import java.util.Arrays;
> import java.util.List;
> import org.apache.spark.ml.feature.PCA;
> import org.apache.spark.ml.feature.PCAModel;
> import org.apache.spark.ml.linalg.VectorUDT;
> import org.apache.spark.ml.linalg.Vectors;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class JavaPCAExample {
>   public static void main(String[] args) {
>   SparkSession spark = 
> SparkSession.builder().appName("JavaPCAExample3")
>   .config("spark.some.config.option", 
> "some-value").getOrCreate();
>   List<Row> data = Arrays.asList(
>   RowFactory.create(Vectors.sparse(5, new int[] { 
> 1, 3 }, new double[] { 1.0, 7.0 })),
>   RowFactory.create(Vectors.dense(2.0, 0.0, 3.0, 
> 4.0, 5.0)),
>   RowFactory.create(Vectors.dense(4.0, 0.0, 0.0, 
> 6.0, 7.0)));
>   StructType schema = new StructType(
>   new StructField[] { new StructField("features", 
> new VectorUDT(), false, Metadata.empty()), });
>   Dataset<Row> df = spark.createDataFrame(data, schema);
>   PCAModel pca = new 
> PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(3).fit(df);
>   Dataset<Row> result = pca.transform(df).select("pcaFeatures");
>   result.show(false);
>   spark.stop();
>   }
> }
> pom.xml:-
> <project xmlns="http://maven.apache.org/POM/4.0.0"
>   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
>   <modelVersion>4.0.0</modelVersion>
>   <groupId>SparkMLTest</groupId>
>   <artifactId>SparkMLTest</artifactId>
>   <version>0.0.1-SNAPSHOT</version>
>   <build>
>     <sourceDirectory>src</sourceDirectory>
>     <plugins>
>       <plugin>
>         <artifactId>maven-compiler-plugin</artifactId>
>         <version>3.5.1</version>
>         <configuration>
>           <source>1.8</source>
>           <target>1.8</target>
>         </configuration>
>       </plugin>
>     </plugins>
>   </build>
>   <dependencies>
>     <dependency>
>       <groupId>org.apache.spark</groupId>
>       <artifactId>spark-core_2.10</artifactId>
>       <version>2.2.0</version>
>     </dependency>
>     <dependency>
>       <groupId>org.apache.spark</groupId>
>       <artifactId>spark-streaming_2.10</artifactId>
>       <version>2.1.1</version>
>     </dependency>
>     <dependency>
>       <groupId>org.apache.spark</groupId>
>       <artifactId>spark-mllib_2.10</artifactId>
>       <version>2.1.1</version>
>       <scope>provided</scope>
>     </dependency>
>     <dependency>
>       <groupId>org.apache.spark</groupId>
>       <artifactId>spark-sql_2.10</artifactId>
>       <version>2.1.1</version>
>     </dependency>
>     <dependency>
>       <groupId>org.scala-lang</groupId>
>       <artifactId>scala-library</artifactId>
>       <version>2.13.0-M1</version>
>     </dependency>
>     <dependency>
>       <groupId>org.apache.parquet</groupId>
>       <artifactId>parquet-hadoop-bundle</artifactId>
>       <version>1.8.1</version>
>     </dependency>
>   </dependencies>
> </project>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21554) Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster

2017-07-28 Thread Subhod Lagade (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104634#comment-16104634
 ] 

Subhod Lagade commented on SPARK-21554:
---

Thanks for the quick reply, @Hyukjin Kwon. We have Spark 2.1.1 installed on an 
EMR 5.7 cluster.
- When I try to submit a pyspark job from any of the nodes, we get the above 
error.

Deploy command: spark-submit --master yarn --deploy-mode cluster 
spark_installed_dir\examples\src\main\python\sql\basic.py
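
As a side note (not a fix for the instantiation error itself): the reported 
traceback shows HiveContext(sc) being created, while in Spark 2.x the usual 
entry point is a SparkSession built with Hive support enabled. A minimal sketch, 
shown in Scala and assuming a Hive-enabled Spark build with hive-site.xml on the 
cluster (the PySpark builder exposes the same enableHiveSupport() call):

{code}
// Minimal sketch, assuming a Hive-enabled Spark 2.x build with hive-site.xml
// available on the cluster; it replaces the HiveContext(sc) pattern.
import org.apache.spark.sql.SparkSession

object HiveSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveSmokeTest")
      .enableHiveSupport()            // also fails if Hive classes/config are missing
      .getOrCreate()

    spark.sql("SHOW TABLES").show()   // quick check that the Hive catalog is reachable
    spark.stop()
  }
}
{code}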

> Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: 
> XXX' when run on yarn cluster
> --
>
> Key: SPARK-21554
> URL: https://issues.apache.org/jira/browse/SPARK-21554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1
> Environment: We are deploying pyspark scripts on EMR 5.7
>Reporter: Subhod Lagade
>
> Traceback (most recent call last):
>   File "Test.py", line 7, in 
> hc = HiveContext(sc)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/context.py",
>  line 514, in __init__
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/session.py",
>  line 179, in getOrCreate
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1133, in __call__
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/utils.py",
>  line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionState':"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


