[jira] [Commented] (SPARK-26439) Introduce WorkerOffer reservation mechanism for Barrier TaskSet

2019-01-07 Thread wuyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736687#comment-16736687
 ] 

wuyi commented on SPARK-26439:
--

[~srowen] ok.

> Introduce WorkerOffer reservation mechanism for Barrier TaskSet
> ---
>
> Key: SPARK-26439
> URL: https://issues.apache.org/jira/browse/SPARK-26439
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
>  Labels: performance
>
> Currently, a barrier TaskSet has a hard requirement that its tasks can only be
> launched in a single resourceOffers round with enough slots (or sufficient
> resources), and even with enough slots this cannot be guaranteed, due to task
> locality delay scheduling. So it is very likely that a barrier TaskSet obtains
> a chunk of sufficient resources after all the trouble, but then gives it up
> just because one of its pending tasks cannot be scheduled. Furthermore, this
> causes severe resource competition between TaskSets and jobs, and introduces
> unclear semantics for DynamicAllocation.
> This JIRA tries to introduce a WorkerOffer reservation mechanism for barrier
> TaskSets, which allows a barrier TaskSet to reserve WorkerOffers in each
> resourceOffers round and to launch all of its tasks at once, as soon as it has
> accumulated sufficient resources. In this way, we relax the resource
> requirement for the barrier TaskSet. To avoid the deadlock that may be
> introduced by several barrier TaskSets holding their reserved WorkerOffers for
> a long time, we'll ask barrier TaskSets to forcibly release part of their
> reserved WorkerOffers on demand. So it is highly likely that each barrier
> TaskSet will eventually be launched.
> To integrate with DynamicAllocation, the most workable approach I can imagine
> is to add new events, e.g. ExecutorReservedEvent and ExecutorReleasedEvent,
> which behave like a busy executor with running tasks or an idle executor
> without running tasks, respectively. Thus, ExecutorAllocationManager would not
> let an executor go if it knows there are reserved resources on that
> executor.
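
To make the idea concrete, here is a minimal, hypothetical sketch of the bookkeeping the description implies (the class and method names below are illustrative only, not Spark APIs): reserve offers across resourceOffers rounds, launch once enough slots have accumulated, and release part of the reservation on demand to avoid deadlock between barrier TaskSets.

{code:scala}
import scala.collection.mutable

// Mirrors Spark's WorkerOffer(executorId, host, cores), redefined here so the
// sketch is self-contained.
case class WorkerOffer(executorId: String, host: String, cores: Int)

// Hypothetical per-barrier-TaskSet bookkeeping.
class BarrierReservation(requiredSlots: Int, cpusPerTask: Int = 1) {
  private val reserved = mutable.ArrayBuffer.empty[WorkerOffer]

  // Reserve an offer from the current resourceOffers round; returns true once
  // the accumulated offers cover every task of the barrier stage.
  def reserve(offer: WorkerOffer): Boolean = {
    reserved += offer
    reserved.map(_.cores / cpusPerTask).sum >= requiredSlots
  }

  // Forcibly give back the first n reserved offers, e.g. when another barrier
  // TaskSet would otherwise be starved.
  def releaseOnDemand(n: Int): Seq[WorkerOffer] = {
    val released = reserved.take(n).toList
    reserved.remove(0, math.min(n, reserved.size))
    released
  }
}
{code}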



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26567) Should we align CSV query results with hive text query results: an int field, if the input value is 1.0, hive text query results is 1, CSV query results is null

2019-01-07 Thread eaton (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736685#comment-16736685
 ] 

eaton commented on SPARK-26567:
---

Ok. [~hyukjin.kwon]

> Should we align CSV query results with hive text query results: an int field, 
> if the input value is 1.0, hive text query results is 1, CSV query results is 
> null
> 
>
> Key: SPARK-26567
> URL: https://issues.apache.org/jira/browse/SPARK-26567
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Priority: Minor
>
> If we want to be consistent, we can modify the makeConverter function in
> UnivocityParser, but the performance may get worse. The modified code is as
> follows:
>  
> {code:java}
> def makeConverter(
> name: String,
> dataType: DataType,
> nullable: Boolean = true,
> options: CSVOptions): ValueConverter = dataType match {
>   case _: ByteType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)
>   case _: ShortType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)
>   case _: IntegerType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
>  
> {code}
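
The proposed change boils down to how a string is narrowed to an integral type. A standalone sketch of the two behaviours (plain Scala, outside UnivocityParser):

{code:scala}
// Current, strict behaviour: "1.0" is not a valid Int literal, so parsing
// fails and the CSV reader ends up returning null for the field.
def strictToInt(datum: String): Int = datum.toInt                 // "1.0" -> NumberFormatException

// Proposed, Hive-text-like behaviour: go through Double first, then truncate.
def lenientToInt(datum: String): Int = datum.toDouble.intValue()  // "1.0" -> 1
{code}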






[jira] [Updated] (SPARK-26567) Should we align CSV query results with hive text query results: an int field, if the input value is 1.0, hive text query results is 1, CSV query results is null

2019-01-07 Thread eaton (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

eaton updated SPARK-26567:
--
Description: 
If we want to be consistent, we can modify the makeConverter function in
UnivocityParser, but the performance may get worse. The modified code is as
follows:

 
{code:java}
def makeConverter(
name: String,
dataType: DataType,
nullable: Boolean = true,
options: CSVOptions): ValueConverter = dataType match {
  case _: ByteType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)

  case _: ShortType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)

  case _: IntegerType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
 
{code}

  was:
If we want to be consistent, we can modify the makeConverter function in
UnivocityParser, but the performance may get worse. The modified code is as
follows:

 
{code:java}
def makeConverter(
name: String,
dataType: DataType,
nullable: Boolean = true,
options: CSVOptions): ValueConverter = dataType match {
  case _: ByteType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)

  case _: ShortType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)

  case _: IntegerType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())

  case _: LongType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toLong)
{code}
 


> Should we align CSV query results with hive text query results: an int field, 
> if the input value is 1.0, hive text query results is 1, CSV query results is 
> null
> 
>
> Key: SPARK-26567
> URL: https://issues.apache.org/jira/browse/SPARK-26567
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Priority: Minor
>
> If we want to be consistent, we can modify the makeConverter function in
> UnivocityParser, but the performance may get worse. The modified code is as
> follows:
>  
> {code:java}
> def makeConverter(
> name: String,
> dataType: DataType,
> nullable: Boolean = true,
> options: CSVOptions): ValueConverter = dataType match {
>   case _: ByteType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)
>   case _: ShortType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)
>   case _: IntegerType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
>  
> {code}






[jira] [Commented] (SPARK-26567) Should we align CSV query results with hive text query results: an int field, if the input value is 1.0, hive text query results is 1, CSV query results is null

2019-01-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736683#comment-16736683
 ] 

Hyukjin Kwon commented on SPARK-26567:
--

I don't think we should necessarily follow Hive. The current behaviour already
makes sense. Let's not fix this.

> Should we align CSV query results with hive text query results: an int field, 
> if the input value is 1.0, hive text query results is 1, CSV query results is 
> null
> 
>
> Key: SPARK-26567
> URL: https://issues.apache.org/jira/browse/SPARK-26567
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Priority: Minor
>
> If we want to be consistent, we can modify the makeConverter function in
> UnivocityParser, but the performance may get worse. The modified code is as
> follows:
>  
> {code:java}
> def makeConverter(
> name: String,
> dataType: DataType,
> nullable: Boolean = true,
> options: CSVOptions): ValueConverter = dataType match {
>   case _: ByteType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)
>   case _: ShortType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)
>   case _: IntegerType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
>   case _: LongType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toLong)
> {code}
>  






[jira] [Resolved] (SPARK-26567) Should we align CSV query results with hive text query results: an int field, if the input value is 1.0, hive text query results is 1, CSV query results is null

2019-01-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26567.
--
Resolution: Won't Fix

> Should we align CSV query results with hive text query results: an int field, 
> if the input value is 1.0, hive text query results is 1, CSV query results is 
> null
> 
>
> Key: SPARK-26567
> URL: https://issues.apache.org/jira/browse/SPARK-26567
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Priority: Minor
>
> If we want to be consistent, we can modify the makeConverter function in
> UnivocityParser, but the performance may get worse. The modified code is as
> follows:
>  
> {code:java}
> def makeConverter(
> name: String,
> dataType: DataType,
> nullable: Boolean = true,
> options: CSVOptions): ValueConverter = dataType match {
>   case _: ByteType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)
>   case _: ShortType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)
>   case _: IntegerType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
>   case _: LongType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toLong)
> {code}
>  






[jira] [Updated] (SPARK-26567) Should we align CSV query results with hive text query results: an int field, if the input value is 1.0, hive text query results is 1, CSV query results is null

2019-01-07 Thread eaton (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

eaton updated SPARK-26567:
--
Description: 
If we want to be consistent, we can modify the makeConverter function in
UnivocityParser, but the performance may get worse. The modified code is as
follows:

 
{code:java}
def makeConverter(
name: String,
dataType: DataType,
nullable: Boolean = true,
options: CSVOptions): ValueConverter = dataType match {
  case _: ByteType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)

  case _: ShortType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)

  case _: IntegerType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())

  case _: LongType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toLong)
{code}
 

  was:
If we want to be consistent, we can modify the makeConverter function in
UnivocityParser, but the performance may get worse. The modified code is as
follows:
{code:java}
def makeConverter(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    options: CSVOptions): ValueConverter = dataType match {
  case _: ByteType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)
  case _: ShortType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)
  case _: IntegerType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
  case _: LongType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toLong)
{code}
 


> Should we align CSV query results with hive text query results: an int field, 
> if the input value is 1.0, hive text query results is 1, CSV query results is 
> null
> 
>
> Key: SPARK-26567
> URL: https://issues.apache.org/jira/browse/SPARK-26567
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Priority: Minor
>
> If we want to be consistent, we can modify the makeConverter function in
> UnivocityParser, but the performance may get worse. The modified code is as
> follows:
>  
> {code:java}
> def makeConverter(
> name: String,
> dataType: DataType,
> nullable: Boolean = true,
> options: CSVOptions): ValueConverter = dataType match {
>   case _: ByteType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)
>   case _: ShortType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)
>   case _: IntegerType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
>   case _: LongType => (d: String) =>
> nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toLong)
> {code}
>  






[jira] [Updated] (SPARK-26567) Should we align CSV query results with hive text query results: an int field, if the input value is 1.0, hive text query results is 1, CSV query results is null

2019-01-07 Thread eaton (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

eaton updated SPARK-26567:
--
Description: 
If we want to be consistent, we can modify the makeConverter function in
UnivocityParser, but the performance may get worse. The modified code is as
follows:
{code:java}
def makeConverter(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    options: CSVOptions): ValueConverter = dataType match {
  case _: ByteType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)
  case _: ShortType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)
  case _: IntegerType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
  case _: LongType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toLong)
{code}
 

  was:
If we want to be consistent, we can modify the makeConverter function in
UnivocityParser, but the performance may get worse. The modified code is as
follows:
{code:java}
def makeConverter(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    options: CSVOptions): ValueConverter = dataType match {
  case _: ByteType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)
  case _: ShortType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)
  case _: IntegerType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
  case _: LongType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toLong)
{code}


> Should we align CSV query results with hive text query results: an int field, 
> if the input value is 1.0, hive text query results is 1, CSV query results is 
> null
> 
>
> Key: SPARK-26567
> URL: https://issues.apache.org/jira/browse/SPARK-26567
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Priority: Minor
>
> If we want to be consistent, we can modify the makeConverter function in
> UnivocityParser, but the performance may get worse. The modified code is as
> follows:
> {code:java}
> def makeConverter(
>     name: String,
>     dataType: DataType,
>     nullable: Boolean = true,
>     options: CSVOptions): ValueConverter = dataType match {
>   case _: ByteType => (d: String) =>
>     nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)
>   case _: ShortType => (d: String) =>
>     nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)
>   case _: IntegerType => (d: String) =>
>     nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
>   case _: LongType => (d: String) =>
>     nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toLong)
> {code}
>  






[jira] [Updated] (SPARK-26567) Should we align CSV query results with hive text query results: an int field, if the input value is 1.0, hive text query results is 1, CSV query results is null

2019-01-07 Thread eaton (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

eaton updated SPARK-26567:
--
Description: 
If we want to be consistent, we can modify the makeConverter function in
UnivocityParser, but the performance may get worse. The modified code is as
follows:
{code:java}
def makeConverter(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    options: CSVOptions): ValueConverter = dataType match {
  case _: ByteType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)
  case _: ShortType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)
  case _: IntegerType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
  case _: LongType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toLong)
{code}
 

  was:
If we want to be consistent, we can modify the makeConverter function in
UnivocityParser, but the performance may get worse. The modified code is as
follows:
{code:java}
def makeConverter(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    options: CSVOptions): ValueConverter = dataType match {
  case _: ByteType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)
  case _: ShortType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)
  case _: IntegerType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
  case _: LongType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toLong)
{code}
 


> Should we align CSV query results with hive text query results: an int field, 
> if the input value is 1.0, hive text query results is 1, CSV query results is 
> null
> 
>
> Key: SPARK-26567
> URL: https://issues.apache.org/jira/browse/SPARK-26567
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Priority: Minor
>
> If we want to be consistent, we can modify the makeConverter function in
> UnivocityParser, but the performance may get worse. The modified code is as
> follows:
> {code:java}
> def makeConverter(
>     name: String,
>     dataType: DataType,
>     nullable: Boolean = true,
>     options: CSVOptions): ValueConverter = dataType match {
>   case _: ByteType => (d: String) =>
>     nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)
>   case _: ShortType => (d: String) =>
>     nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)
>   case _: IntegerType => (d: String) =>
>     nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
>   case _: LongType => (d: String) =>
>     nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toLong)
> {code}
>  






[jira] [Created] (SPARK-26567) Should we align CSV query results with hive text query results: an int field, if the input value is 1.0, hive text query results is 1, CSV query results is null

2019-01-07 Thread eaton (JIRA)
eaton created SPARK-26567:
-

 Summary: Should we align CSV query results with hive text query 
results: an int field, if the input value is 1.0, hive text query results is 1, 
CSV query results is null
 Key: SPARK-26567
 URL: https://issues.apache.org/jira/browse/SPARK-26567
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: eaton


If we want to be consistent, we can modify the makeConverter function in
UnivocityParser, but the performance may get worse. The modified code is as
follows:
{code:java}
def makeConverter(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    options: CSVOptions): ValueConverter = dataType match {
  case _: ByteType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toByte)
  case _: ShortType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toShort)
  case _: IntegerType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue())
  case _: LongType => (d: String) =>
    nullSafeDatum(d, name, nullable, options)(_.toDouble.intValue().toLong)
{code}






[jira] [Resolved] (SPARK-26562) countDistinct and user-defined function cannot be used in SELECT

2019-01-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26562.
--
Resolution: Cannot Reproduce

I can't reproduce this:

{code}
>>> df.select(F.sum(df['a']), F.count(df['a']), F.countDistinct(df['a']),
...           F.approx_count_distinct(df['a']), F.sum(func(df['a']))).show()
+------+--------+-----------------+------------------------+--------+
|sum(a)|count(a)|count(DISTINCT a)|approx_count_distinct(a)|sum((a))|
+------+--------+-----------------+------------------------+--------+
|     7|       3|                3|                       3|     3.0|
+------+--------+-----------------+------------------------+--------+
{code}

in the current master. It would be great if we could identify the JIRA that
fixed this and backport it if applicable. I'm leaving this resolved.

> countDistinct and user-defined function cannot be used in SELECT
> 
>
> Key: SPARK-26562
> URL: https://issues.apache.org/jira/browse/SPARK-26562
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Macbook Pro 10.14.2
> spark 2.3.1
>  
>Reporter: Ravi Kaushik
>Priority: Minor
>
> df = spark.createDataFrame([[1, 2, 3], [2, 3, 4], [4, 5, 6]], ['a', 'b', 'c'])
> from pyspark.sql import functions as F, types as T
> df.select(F.sum(df['a']), F.count(df['a']), F.countDistinct(df['a']),
> F.approx_count_distinct(df['a'])).show()
> func = F.udf(lambda x: 1.0, T.DoubleType())
>  
> df.select(F.sum(df['a']), F.count(df['a']), F.countDistinct(df['a']),
> F.approx_count_distinct(df['a']), F.sum(func(df['a']))).show()
>  
>  
> Error
>  
> 2019-01-07 18:30:50 ERROR Executor:91 - Exception in task 6.0 in stage 4.0 
> (TID 223)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: sum#45849
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:335)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>  at scala.collection.immutable.List.map(List.scala:285)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$$anonfun$bind$1.apply(GenerateMutableProjection.scala:38)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$$anonfun$bind$1.apply(GenerateMutableProjection.scala:38)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> 

[jira] [Commented] (SPARK-26558) java.util.NoSuchElementException while saving data into HDFS using Spark

2019-01-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736662#comment-16736662
 ] 

Hyukjin Kwon commented on SPARK-26558:
--

The exception itself comes from {{greenplum}}. It looks like there's simply
something wrong, or a bug, in {{GreenplumRowIterator}}. I think this isn't an
issue within Spark itself.

> java.util.NoSuchElementException while saving data into HDFS using Spark
> 
>
> Key: SPARK-26558
> URL: https://issues.apache.org/jira/browse/SPARK-26558
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.0.0
>Reporter: Sidhartha
>Priority: Major
> Attachments: OKVMg.png, k5EWv.png
>
>
> h1. !OKVMg.png!!k5EWv.png! How to fix java.util.NoSuchElementException while
> saving data into HDFS using Spark?
>  
> I'm trying to ingest a greenplum table into HDFS using spark-greenplum reader.
> Below are the versions of Spark & Scala I am using:
> spark-core: 2.0.0
>  spark-sql: 2.0.0
>  Scala version: 2.11.8
> To do that, I wrote the following code:
>  
> {code:java}
> val conf = new 
> SparkConf().setAppName("TEST_YEAR").set("spark.executor.heartbeatInterval", 
> "1200s") .set("spark.network.timeout", "12000s") 
> .set("spark.sql.inMemoryColumnarStorage.compressed", "true") 
> .set("spark.shuffle.compress", "true") .set("spark.shuffle.spill.compress", 
> "true") .set("spark.sql.orc.filterPushdown", "true") .set("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer") 
> .set("spark.kryoserializer.buffer.max", "512m") .set("spark.serializer", 
> classOf[org.apache.spark.serializer.KryoSerializer].getName) 
> .set("spark.streaming.stopGracefullyOnShutdown", "true") 
> .set("spark.yarn.driver.memoryOverhead", "8192") 
> .set("spark.yarn.executor.memoryOverhead", "8192") 
> .set("spark.sql.shuffle.partitions", "400") 
> .set("spark.dynamicAllocation.enabled", "false") 
> .set("spark.shuffle.service.enabled", "true") 
> .set("spark.sql.tungsten.enabled", "true") .set("spark.executor.instances", 
> "12") .set("spark.executor.memory", "13g") .set("spark.executor.cores", "4") 
> .set("spark.files.maxPartitionBytes", "268435468") 
> val flagCol = "del_flag" val spark = 
> SparkSession.builder().config(conf).master("yarn").enableHiveSupport().config("hive.exec.dynamic.partition",
>  "true").config("hive.exec.dynamic.partition.mode", 
> "nonstrict").getOrCreate() import spark.implicits._ 
> val dtypes = spark.read.format("jdbc").option("url", 
> hiveMetaConURL).option("dbtable", "(select source_type, hive_type from 
> hivemeta.types) as gpHiveDataTypes").option("user", 
> metaUserName).option("password", metaPassword).load() 
> val spColsDF = spark.read.format("jdbc").option("url", hiveMetaConURL) 
> .option("dbtable", "(select source_columns, precision_columns, 
> partition_columns from hivemeta.source_table where 
> tablename='gpschema.empdocs') as colsPrecision") .option("user", 
> metaUserName).option("password", metaPassword).load() 
> val dataMapper = dtypes.as[(String, String)].collect().toMap 
> val gpCols = spColsDF.select("source_columns").map(row => 
> row.getString(0)).collect.mkString(",") 
> val gpColumns = gpCols.split("\\|").map(e => e.split("\\:")).map(s => 
> s(0)).mkString(",") val splitColumns = gpCols.split("\\|").toList 
> val precisionCols = 
> spColsDF.select("precision_columns").collect().map(_.getString(0)).mkString(",")
>  val partition_columns = 
> spColsDF.select("partition_columns").collect.flatMap(x => 
> x.getAs[String](0).split(",")) 
> val prtn_String_columns = 
> spColsDF.select("partition_columns").collect().map(_.getString(0)).mkString(",")
>  val partCList = prtn_String_columns.split(",").map(x => col(x)) 
> var splitPrecisionCols = precisionCols.split(",") for (i <- 
> splitPrecisionCols) { precisionColsText += i.concat(s"::${textType} as 
> ").concat(s"${i}_text") textList += s"${i}_text:${textType}" } 
> val pCols = precisionColsText.mkString(",") 
> val allColumns = gpColumns.concat("," + pCols) 
> val allColumnsSeq = allColumns.split(",").toSeq 
> val allColumnsSeqC = allColumnsSeq.map(x => column(x)) 
> val gpColSeq = gpColumns.split(",").toSeq 
> def prepareFinalDF(splitColumns: List[String], textList: ListBuffer[String], 
> allColumns: String, dataMapper: Map[String, String], partition_columns: 
> Array[String], spark: SparkSession): DataFrame = { 
> val yearDF = 
> spark.read.format("io.pivotal.greenplum.spark.GreenplumRelationProvider").option("url",
>  connectionUrl) .option("dbtable", "empdocs") .option("dbschema","gpschema") 
> .option("user", devUserName).option("password", devPassword) 
> .option("partitionColumn","header_id") .load() .where("year=2017 and 
> month=12") .select(gpColSeq map col:_*) .withColumn(flagCol, lit(0)) 
> val totalCols: 

[jira] [Resolved] (SPARK-24196) Spark Thrift Server - SQL Client connections doesn't show db artefacts

2019-01-07 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24196.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 3.0.0

> Spark Thrift Server - SQL Client connections doesn't show db artefacts
> -
>
> Key: SPARK-24196
> URL: https://issues.apache.org/jira/browse/SPARK-24196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: rr
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: screenshot-1.png
>
>
> When connecting to Spark Thrift Server via JDBC, artefacts (db objects) are not
> showing up,
> whereas when connecting to hiveserver2 it shows the schemas, tables, columns,
> etc.
> SQL clients used: IBM Data Studio, DBeaver SQL Client






[jira] [Resolved] (SPARK-26558) java.util.NoSuchElementException while saving data into HDFS using Spark

2019-01-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26558.
--
Resolution: Invalid

> java.util.NoSuchElementException while saving data into HDFS using Spark
> 
>
> Key: SPARK-26558
> URL: https://issues.apache.org/jira/browse/SPARK-26558
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.0.0
>Reporter: Sidhartha
>Priority: Major
> Attachments: OKVMg.png, k5EWv.png
>
>
> h1. !OKVMg.png!!k5EWv.png! How to fix java.util.NoSuchElementException while
> saving data into HDFS using Spark?
>  
> I'm trying to ingest a greenplum table into HDFS using spark-greenplum reader.
> Below are the versions of Spark & Scala I am using:
> spark-core: 2.0.0
>  spark-sql: 2.0.0
>  Scala version: 2.11.8
> To do that, I wrote the following code:
>  
> {code:java}
> val conf = new 
> SparkConf().setAppName("TEST_YEAR").set("spark.executor.heartbeatInterval", 
> "1200s") .set("spark.network.timeout", "12000s") 
> .set("spark.sql.inMemoryColumnarStorage.compressed", "true") 
> .set("spark.shuffle.compress", "true") .set("spark.shuffle.spill.compress", 
> "true") .set("spark.sql.orc.filterPushdown", "true") .set("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer") 
> .set("spark.kryoserializer.buffer.max", "512m") .set("spark.serializer", 
> classOf[org.apache.spark.serializer.KryoSerializer].getName) 
> .set("spark.streaming.stopGracefullyOnShutdown", "true") 
> .set("spark.yarn.driver.memoryOverhead", "8192") 
> .set("spark.yarn.executor.memoryOverhead", "8192") 
> .set("spark.sql.shuffle.partitions", "400") 
> .set("spark.dynamicAllocation.enabled", "false") 
> .set("spark.shuffle.service.enabled", "true") 
> .set("spark.sql.tungsten.enabled", "true") .set("spark.executor.instances", 
> "12") .set("spark.executor.memory", "13g") .set("spark.executor.cores", "4") 
> .set("spark.files.maxPartitionBytes", "268435468") 
> val flagCol = "del_flag" val spark = 
> SparkSession.builder().config(conf).master("yarn").enableHiveSupport().config("hive.exec.dynamic.partition",
>  "true").config("hive.exec.dynamic.partition.mode", 
> "nonstrict").getOrCreate() import spark.implicits._ 
> val dtypes = spark.read.format("jdbc").option("url", 
> hiveMetaConURL).option("dbtable", "(select source_type, hive_type from 
> hivemeta.types) as gpHiveDataTypes").option("user", 
> metaUserName).option("password", metaPassword).load() 
> val spColsDF = spark.read.format("jdbc").option("url", hiveMetaConURL) 
> .option("dbtable", "(select source_columns, precision_columns, 
> partition_columns from hivemeta.source_table where 
> tablename='gpschema.empdocs') as colsPrecision") .option("user", 
> metaUserName).option("password", metaPassword).load() 
> val dataMapper = dtypes.as[(String, String)].collect().toMap 
> val gpCols = spColsDF.select("source_columns").map(row => 
> row.getString(0)).collect.mkString(",") 
> val gpColumns = gpCols.split("\\|").map(e => e.split("\\:")).map(s => 
> s(0)).mkString(",") val splitColumns = gpCols.split("\\|").toList 
> val precisionCols = 
> spColsDF.select("precision_columns").collect().map(_.getString(0)).mkString(",")
>  val partition_columns = 
> spColsDF.select("partition_columns").collect.flatMap(x => 
> x.getAs[String](0).split(",")) 
> val prtn_String_columns = 
> spColsDF.select("partition_columns").collect().map(_.getString(0)).mkString(",")
>  val partCList = prtn_String_columns.split(",").map(x => col(x)) 
> var splitPrecisionCols = precisionCols.split(",") for (i <- 
> splitPrecisionCols) { precisionColsText += i.concat(s"::${textType} as 
> ").concat(s"${i}_text") textList += s"${i}_text:${textType}" } 
> val pCols = precisionColsText.mkString(",") 
> val allColumns = gpColumns.concat("," + pCols) 
> val allColumnsSeq = allColumns.split(",").toSeq 
> val allColumnsSeqC = allColumnsSeq.map(x => column(x)) 
> val gpColSeq = gpColumns.split(",").toSeq 
> def prepareFinalDF(splitColumns: List[String], textList: ListBuffer[String], 
> allColumns: String, dataMapper: Map[String, String], partition_columns: 
> Array[String], spark: SparkSession): DataFrame = { 
> val yearDF = 
> spark.read.format("io.pivotal.greenplum.spark.GreenplumRelationProvider").option("url",
>  connectionUrl) .option("dbtable", "empdocs") .option("dbschema","gpschema") 
> .option("user", devUserName).option("password", devPassword) 
> .option("partitionColumn","header_id") .load() .where("year=2017 and 
> month=12") .select(gpColSeq map col:_*) .withColumn(flagCol, lit(0)) 
> val totalCols: List[String] = splitColumns ++ textList 
> val allColsOrdered = yearDF.columns.diff(partition_columns) ++ 
> partition_columns val allCols = allColsOrdered.map(colname => 
> 

[jira] [Updated] (SPARK-26431) Update availableSlots by availableCpus for barrier taskset

2019-01-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26431:
--
Target Version/s:   (was: 2.4.0)

> Update availableSlots by availableCpus for barrier taskset
> --
>
> Key: SPARK-26431
> URL: https://issues.apache.org/jira/browse/SPARK-26431
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
> Fix For: 2.4.0
>
>
> availableCpus decreases as tasks are allocated, so we should update
> availableSlots based on availableCpus for a barrier taskset, to avoid an
> unnecessary resourceOffer process.
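
As a rough illustration of the intended computation (a sketch only; the names follow the description, not the actual TaskSchedulerImpl code):

{code:scala}
// availableCpus(i) is the number of CPUs still free on the i-th offer in the
// current resourceOffers round; cpusPerTask corresponds to spark.task.cpus.
def availableSlots(availableCpus: Array[Int], cpusPerTask: Int): Int =
  availableCpus.map(_ / cpusPerTask).sum

// Recomputing this from the remaining CPUs after each allocation lets the
// scheduler skip a barrier taskset early instead of running a resourceOffer
// pass that cannot possibly launch the whole stage.
{code}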






[jira] [Updated] (SPARK-24464) Unit tests for MLlib's Instrumentation

2019-01-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24464:
--
Target Version/s:   (was: 2.4.0)

> Unit tests for MLlib's Instrumentation
> --
>
> Key: SPARK-24464
> URL: https://issues.apache.org/jira/browse/SPARK-24464
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We added Instrumentation to MLlib to log params and metrics during machine 
> learning training and inference. However, the code has zero test coverage, 
> which usually means bugs and regressions in the future. I created this JIRA 
> to discuss how we should test Instrumentation.
> cc: [~thunterdb] [~josephkb] [~lu.DB]






[jira] [Resolved] (SPARK-23693) SQL function uuid()

2019-01-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23693.
---
Resolution: Won't Fix

> SQL function uuid()
> ---
>
> Key: SPARK-23693
> URL: https://issues.apache.org/jira/browse/SPARK-23693
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Arseniy Tashoyan
>Priority: Minor
>
> Add function uuid() to org.apache.spark.sql.functions that returns 
> [Universally Unique 
> ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].
> Sometimes it is necessary to uniquely identify each row in a DataFrame.
> Currently the following ways are available:
>  * monotonically_increasing_id() function
>  * row_number() function over some window
>  * convert the DataFrame to RDD and zipWithIndex()
> All these approaches do not work when appending this DataFrame to another 
> DataFrame (union). Collisions may occur - two rows in different DataFrames 
> may have the same ID. Re-generating IDs on the resulting DataFrame is not an 
> option, because some data in some other system may already refer to old IDs.
> The proposed solution is to add new function:
> {code:scala}
> def uuid(): Column
> {code}
> that returns the String representation of a UUID.
> UUID is represented as a 128-bit number (two long numbers). Such numbers are 
> not supported in Scala or Java. In addition, some storage systems do not 
> support 128-bit numbers (Parquet's largest numeric type is INT96). This is 
> the reason for the uuid() function to return String.
> I already have a simple implementation based on 
> [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
> can share it as a PR.
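
Until such a function exists, a user-defined function along these lines gives the same effect (a sketch, not the implementation proposed here; asNondeterministic requires Spark 2.3+):

{code:scala}
import org.apache.spark.sql.functions.udf

// One random (version 4) UUID string per row. Marked non-deterministic so the
// optimizer does not collapse or re-execute it unexpectedly.
val uuidUdf = udf(() => java.util.UUID.randomUUID().toString).asNondeterministic()

// Usage: df.withColumn("id", uuidUdf())
{code}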






[jira] [Commented] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-07 Thread chenliang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736653#comment-16736653
 ] 

chenliang commented on SPARK-26543:
---

[~yhuai] [~aaronmarkham]  Could you please have a look at this?

> Support the coordinator to determine post-shuffle partitions more reasonably
> 
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Attachments: image-2019-01-05-13-18-30-487.png
>
>
> For Spark SQL, when we enable adaptive execution with 'set
> spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to
> determine the number of post-shuffle partitions. But under certain conditions
> the coordinator does not perform very well: some tasks are always retained and
> they work with a Shuffle Read Size / Records of 0.0B/0. We could increase
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this, but that
> is unreasonable, as targetPostShuffleInputSize should not be set too large. As
> shown below:
> !image-2019-01-05-13-18-30-487.png!
> We could have the ExchangeCoordinator filter out the useless (0B) partitions
> automatically.
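
For context, the two settings being discussed, as one would set them from an active SparkSession (a sketch; the 64 MB value is only illustrative):

{code:scala}
// Enable adaptive execution so the ExchangeCoordinator determines the number
// of post-shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")

// Raising the target input size per post-shuffle partition reduces the number
// of (possibly empty) partitions, but as the description notes it should not
// be set too large.
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")
{code}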






[jira] [Updated] (SPARK-26439) Introduce WorkerOffer reservation mechanism for Barrier TaskSet

2019-01-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26439:
--
   Flags:   (was: Important)
Target Version/s:   (was: 2.4.0)
   Fix Version/s: (was: 2.4.0)

Please don't set target/fix version

> Introduce WorkerOffer reservation mechanism for Barrier TaskSet
> ---
>
> Key: SPARK-26439
> URL: https://issues.apache.org/jira/browse/SPARK-26439
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
>  Labels: performance
>
> Currently, a barrier TaskSet has a hard requirement that its tasks can only be
> launched in a single resourceOffers round with enough slots (or sufficient
> resources), and even with enough slots this cannot be guaranteed, due to task
> locality delay scheduling. So it is very likely that a barrier TaskSet obtains
> a chunk of sufficient resources after all the trouble, but then gives it up
> just because one of its pending tasks cannot be scheduled. Furthermore, this
> causes severe resource competition between TaskSets and jobs, and introduces
> unclear semantics for DynamicAllocation.
> This JIRA tries to introduce a WorkerOffer reservation mechanism for barrier
> TaskSets, which allows a barrier TaskSet to reserve WorkerOffers in each
> resourceOffers round and to launch all of its tasks at once, as soon as it has
> accumulated sufficient resources. In this way, we relax the resource
> requirement for the barrier TaskSet. To avoid the deadlock that may be
> introduced by several barrier TaskSets holding their reserved WorkerOffers for
> a long time, we'll ask barrier TaskSets to forcibly release part of their
> reserved WorkerOffers on demand. So it is highly likely that each barrier
> TaskSet will eventually be launched.
> To integrate with DynamicAllocation, the most workable approach I can imagine
> is to add new events, e.g. ExecutorReservedEvent and ExecutorReleasedEvent,
> which behave like a busy executor with running tasks or an idle executor
> without running tasks, respectively. Thus, ExecutorAllocationManager would not
> let an executor go if it knows there are reserved resources on that
> executor.






[jira] [Updated] (SPARK-25346) Document Spark builtin data sources

2019-01-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25346:
--
Target Version/s:   (was: 2.4.0)

> Document Spark builtin data sources
> ---
>
> Key: SPARK-25346
> URL: https://issues.apache.org/jira/browse/SPARK-25346
> Project: Spark
>  Issue Type: Story
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> It would be nice to list the built-in data sources on the doc site, so users
> know what is available by default. However, I didn't find such a list in the
> 2.3.1 docs.
>  
> cc: [~hyukjin.kwon]






[jira] [Updated] (SPARK-25584) Document libsvm data source in doc site

2019-01-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25584:
--
Target Version/s:   (was: 2.4.0)

> Document libsvm data source in doc site
> ---
>
> Key: SPARK-25584
> URL: https://issues.apache.org/jira/browse/SPARK-25584
> Project: Spark
>  Issue Type: Story
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Currently, we only have Scala/Java API docs for the libsvm data source. It
> would be nice to have some documentation on the doc site, so Python/R users
> can also discover this feature.






[jira] [Resolved] (SPARK-26334) NullPointerException in CallMethodViaReflection when we apply reflect function for empty field

2019-01-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26334.
---
Resolution: Not A Problem

> NullPointerException  in CallMethodViaReflection when we apply reflect 
> function for empty field
> ---
>
> Key: SPARK-26334
> URL: https://issues.apache.org/jira/browse/SPARK-26334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: chenliang
>Priority: Major
> Attachments: 21_47_01__01_04_2019.jpg, screenshot-1.png
>
>
> In the table shown below:
> {code:sql}
> CREATE EXTERNAL TABLE `test_db4`(`a` string, `b` string, `url` string) 
> PARTITIONED BY (`dt` string);
> {code}
>  !screenshot-1.png! 
>   For the field `url`, some values are initialized to NULL.
>   When we apply the reflect function to `url`, it leads to a
> NullPointerException, as follows:
> {code:scala}
> select reflect('java.net.URLDecoder', 'decode', url ,'utf-8') from 
> mydemo.test_db4 where dt=20180920;
> {code}
> {panel:title=NPE}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 17, bigdata-nmg-hdfstest12.nmg01.diditaxi.com, executor 1): 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection.eval(CallMethodViaReflection.scala:95)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at java.net.URLDecoder.decode(URLDecoder.java:136)
> ... 21 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
> at 

[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2019-01-07 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736645#comment-16736645
 ] 

Dongjoon Hyun commented on SPARK-25823:
---

+1. Thank you, [~hyukjin.kwon].

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: correctness
>
> This is not a regression, because this occurs in new higher-order functions
> like `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap`
> allows duplicate keys. If we want to allow this difference in the new
> higher-order functions, we had better at least add a warning about it to these
> functions after the RC4 vote passes. Otherwise, this will surprise
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}






[jira] [Comment Edited] (SPARK-25823) map_filter can generate incorrect data

2019-01-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736582#comment-16736582
 ] 

Hyukjin Kwon edited comment on SPARK-25823 at 1/8/19 2:09 AM:
--

Looks [~Thincrs] bot is still active. I'm going to ask directly via emails. If 
the bot is still active, I'm going to open an infra JIRA to ban this bot.


was (Author: hyukjin.kwon):
Looks [~Thincrs] bot is still active. I'm going to ask directly via emails. If 
the bot is still active, I'm going to open an infra JIRA to ben this bot.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: correctness
>
> This is not a regression, because this occurs in new higher-order functions
> like `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap`
> allows duplicate keys. If we want to allow this difference in the new
> higher-order functions, we had better at least add a warning about it to these
> functions after the RC4 vote passes. Otherwise, this will surprise
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}






[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2019-01-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736582#comment-16736582
 ] 

Hyukjin Kwon commented on SPARK-25823:
--

Looks [~Thincrs] bot is still active. I'm going to ask directly via emails. If 
the bot is still active, I'm going to open an infra JIRA to ben this bot.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: correctness
>
> This is not a regression because it occurs in new higher-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> duplicate keys. If we want to allow this difference in the new higher-order 
> functions, we should at least add a warning about it to these functions after the 
> RC4 vote passes. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736580#comment-16736580
 ] 

Wenchen Fan commented on SPARK-26565:
-

The release tools have evolved. In the beginning, people needed to run 
*release-build.sh* multiple times, once for each step; now the whole process is 
automated by *do-release-docker.sh*.

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png, no-idea.jpg
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> another, and probably much better option, is to nuke the 
> spark-\{branch}-packaging builds and create new ones that just build things 
> w/o touching this incredible fragile shell scripting nightmare.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736540#comment-16736540
 ] 

Bryan Cutler commented on SPARK-26566:
--

Version 0.12.0 is slated to be released in mid January

> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> _This is just a placeholder for now to collect what needs to be fixed when we 
> upgrade next time_
> Version 0.12.0 includes the following:
>  * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
>  
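For context, a minimal sketch of the ARROW-4098 change noted above, assuming pyarrow 0.12.0 or later is installed; only the module path of open_stream changes.

{code:python}
# pyarrow.open_stream is deprecated in 0.12.0 in favour of pyarrow.ipc.open_stream
# (ARROW-4098). Reading record batches from the stream is otherwise unchanged.
import pyarrow as pa

def read_record_batches(source):
    # Before 0.12.0: reader = pa.open_stream(source)  (now emits a deprecation warning)
    reader = pa.ipc.open_stream(source)
    return list(reader)
{code}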



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23386) Enable direct application links before replay

2019-01-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23386.

Resolution: Won't Fix

I'm closing this for now as I believe SPARK-6951 handles a lot of the pain 
here. But if there's more that can be done, feel free to reopen.

> Enable direct application links before replay
> -
>
> Key: SPARK-23386
> URL: https://issues.apache.org/jira/browse/SPARK-23386
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>Priority: Minor
>
> In a deployment with tens of thousands of large event logs, it may take *many 
> hours* until all logs are replayed. Most of our users reach the SHS by clicking on 
> a link in a client log when an error occurs. Direct links currently don't work 
> until the event log is processed in a replay thread. This JIRA proposes to link 
> the appId to its event log already during the scan, without a full replay. This 
> makes on-demand retrievals accessible almost immediately upon SHS start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-26566:
-
Description: 
_This is just a placeholder for now to collect what needs to be fixed when we 
upgrade next time_

Version 0.12.0 includes the following:
 * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098

 

  was:
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support ARROW-2141
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 


> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> _This is just a placeholder for now to collect what needs to be fixed when we 
> upgrade next time_
> Version 0.12.0 includes the following:
>  * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26267) Kafka source may reprocess data

2019-01-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26267.
--
   Resolution: Fixed
Fix Version/s: 2.4.1

> Kafka source may reprocess data
> ---
>
> Key: SPARK-26267
> URL: https://issues.apache.org/jira/browse/SPARK-26267
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.1, 3.0.0
>
>
> Due to KAFKA-7703, when the Kafka source tries to get the latest offset, it may 
> instead get the earliest offset; it will then reprocess messages that had already 
> been processed once it gets the correct latest offset in the next batch.
> This usually happens when restarting a streaming query.
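For illustration only, a hedged PySpark sketch of the kind of query where the reprocessing would show up: a Kafka-backed stream restarted from an existing checkpoint. The broker address, topic, and paths are made-up placeholders, the spark-sql-kafka package is assumed to be on the classpath, and the sketch does not demonstrate the fix.

{code:python}
# A Kafka-backed structured streaming query restarted from a checkpoint. On
# restart the source asks the broker for the latest offsets, which is where
# KAFKA-7703 can hand back the earliest offset instead and cause reprocessing.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-restart").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
          .option("subscribe", "events")                        # placeholder topic
          .load())

query = (events.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("parquet")
         .option("path", "/tmp/events-out")
         .option("checkpointLocation", "/tmp/events-chk")  # reused across restarts
         .start())
query.awaitTermination()
{code}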



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-26566:
-
Affects Version/s: (was: 2.3.0)
   2.4.0

> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-26566:
-
Fix Version/s: (was: 2.4.0)

> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-26566:


Assignee: (was: Bryan Cutler)

> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-26566:
-
Target Version/s:   (was: 2.4.0)

> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-26566:


 Summary: Upgrade apache/arrow to 0.12.0
 Key: SPARK-26566
 URL: https://issues.apache.org/jira/browse/SPARK-26566
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler
 Fix For: 2.4.0


Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support ARROW-2141
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2019-01-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-25272.
--
Resolution: Won't Fix

> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.
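A minimal sketch of that idea, assuming plain unittest and a hypothetical test class name: the test is skipped exactly when pyarrow is installed, so its skip message in the Jenkins output confirms that the Arrow-backed tests were not silently skipped.

{code:python}
# A marker test that is skipped when pyarrow IS installed. Seeing this skip
# message in the test output therefore confirms that pyarrow was present and
# that the Arrow-backed tests (e.g. ArrowTests) actually ran.
import unittest

try:
    import pyarrow  # noqa: F401
    have_pyarrow = True
except ImportError:
    have_pyarrow = False

class PyArrowPresenceMarker(unittest.TestCase):
    @unittest.skipIf(have_pyarrow, "pyarrow is installed; Arrow tests were run")
    def test_pyarrow_missing_marker(self):
        # Only runs (and trivially passes) when pyarrow is absent.
        self.assertFalse(have_pyarrow)

if __name__ == "__main__":
    unittest.main()
{code}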



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-25823) map_filter can generate incorrect data

2019-01-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25823:
--
Comment: was deleted

(was: A user of thincrs has selected this issue. Deadline: Mon, Jan 14, 2019 
10:32 PM)

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: correctness
>
> This is not a regression because it occurs in new higher-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> duplicate keys. If we want to allow this difference in the new higher-order 
> functions, we should at least add a warning about it to these functions after the 
> RC4 vote passes. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736500#comment-16736500
 ] 

shane knapp edited comment on SPARK-26565 at 1/7/19 11:57 PM:
--

[~vanzin] – good to have confirmed that {{do-release-docker.sh}} is the 
official/only way to do a release.  it appears that there's a LOT of cruft in 
the repo WRT release process, and that should probably go away ASAP.

there is some discussion on the dev@ list about the packaging builds, and the 
general consensus was they could still be useful to test if packaging itself 
still works...  however, after looking at {{do-release-docker.sh}} i feel that 
maybe the package builds have reached the limit of their usefulness.

that all being said, the {{do-release.sh}} script is still used for docs and 
maven snapshots (but not packaging).


was (Author: shaneknapp):
[~vanzin] – good to have confirmed that `do-release-docker.sh` is the 
official/only way to do a release.  it appears that there's a LOT of cruft in 
the repo WRT release process, and that should probably go away ASAP.

there is some discussion on the dev@ list about the packaging builds, and the 
general consensus was they could still be useful to test if packaging itself 
still works...  however, after looking at `do-release-docker.sh` i feel that 
maybe the package builds have reached the limit of their usefulness.

that all being said, the `do-release.sh` script is still used for docs and 
maven snapshots (but not packaging).

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png, no-idea.jpg
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> another, and probably much better option, is to nuke the 
> spark-\{branch}-packaging builds and create new ones that just build things 
> w/o touching this incredibly fragile shell scripting nightmare.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736500#comment-16736500
 ] 

shane knapp commented on SPARK-26565:
-

[~vanzin] – good to have confirmed that `do-release-docker.sh` is the 
official/only way to do a release.  it appears that there's a LOT of cruft in 
the repo WRT release process, and that should probably go away ASAP.

there is some discussion on the dev@ list about the packaging builds, and the 
general consensus was they could still be useful to test if packaging itself 
still works...  however, after looking at `do-release-docker.sh` i feel that 
maybe the package builds have reached the limit of their usefulness.

that all being said, the `do-release.sh` script is still used for docs and 
maven snapshots (but not packaging).

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png, no-idea.jpg
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> another, and probably much better option, is to nuke the 
> spark-\{branch}-packaging builds and create new ones that just build things 
> w/o touching this incredibly fragile shell scripting nightmare.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736493#comment-16736493
 ] 

Marcelo Vanzin commented on SPARK-26565:


Stepping back a little, what is the purpose of that packaging build?

IIRC it was meant to publish an unofficial nightly "distribution" that people 
could download / use with a custom maven repo; and for that you need GPG keys.

If we remove the keys, is that build doing anything useful? If it is, can we 
just add that to the existing test scripts instead, and get rid of that build?

To answer one of the questions, the current "official" way to create a release 
is to run {{do-release-docker.sh}} from the master branch (regardless of which 
release you're creating); the web site may need some updates to reflect that.

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png, no-idea.jpg
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> another, and probably much better option, is to nuke the 
> spark-\{branch}-packaging builds and create new ones that just build things 
> w/o touching this incredibly fragile shell scripting nightmare.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26491) Use ConfigEntry for hardcoded configs for test categories.

2019-01-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26491:
--

Assignee: Marco Gaido

> Use ConfigEntry for hardcoded configs for test categories.
> --
>
> Key: SPARK-26491
> URL: https://issues.apache.org/jira/browse/SPARK-26491
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Marco Gaido
>Priority: Major
>
> Make the following hardcoded configs use ConfigEntry.
> {code}
> spark.test
> spark.testing
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26491) Use ConfigEntry for hardcoded configs for test categories.

2019-01-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26491.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23413
[https://github.com/apache/spark/pull/23413]

> Use ConfigEntry for hardcoded configs for test categories.
> --
>
> Key: SPARK-26491
> URL: https://issues.apache.org/jira/browse/SPARK-26491
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> Make the following hardcoded configs use ConfigEntry.
> {code}
> spark.test
> spark.testing
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736485#comment-16736485
 ] 

Sean Owen commented on SPARK-26565:
---

I've never done a release myself ... [~cloud_fan] [~vanzin] [~irashid] ?

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png, no-idea.jpg
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> another, and probably much better option, is to nuke the 
> spark-\{branch}-packaging builds and create new ones that just build things 
> w/o touching this incredibly fragile shell scripting nightmare.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736472#comment-16736472
 ] 

shane knapp edited comment on SPARK-26565 at 1/7/19 11:26 PM:
--

[~srowen] – herein lies my confusion...  after reading the release process page 
([https://spark.apache.org/release-process.html]) i don't see any mention of 
the release manager manually running this script.

i checked for anything in the spark repo that calls this script, and the only 
one i found was `do-release.sh`...  which also doesn't appear to be run 
(according to the release docs).
{noformat}
➜ spark git:(remove-package-signing) grep -r "release-build.sh" *
dev/create-release/do-release.sh: "$SELF/release-build.sh" package
dev/create-release/do-release.sh: "$SELF/release-build.sh" docs
dev/create-release/do-release.sh: "$SELF/release-build.sh" publish-release
dev/create-release/release-build.sh:usage: release-build.sh 
{noformat}
so, if release-build.sh is actually still in use during the release process 
(for 'package' *only*), then i will nuke my changes and add in a 
jenkins-specific flag to skip the GPG and svn parts, rather than deleting them 
outright.

a quick image to sum up my feelings about how all of the release tooling works:

!no-idea.jpg!


was (Author: shaneknapp):
[~srowen] – herein lies my confusion...  after reading the release process page 
([https://spark.apache.org/release-process.html]) i don't see any mention of 
the release manager manually running this script.

i checked for anything in the spark repo that calls this script, and the only 
one i found was `do-release.sh`...  which also doesn't appear to be run 
(according to the release docs).
{noformat}
➜ spark git:(remove-package-signing) grep -r "release-build.sh" *
dev/create-release/do-release.sh: "$SELF/release-build.sh" package
dev/create-release/do-release.sh: "$SELF/release-build.sh" docs
dev/create-release/do-release.sh: "$SELF/release-build.sh" publish-release
dev/create-release/release-build.sh:usage: release-build.sh 
{noformat}
so, if release-build.sh is actually still in use during the release process 
(package), then i will nuke my changes and add in a jenkins-specific flag to 
skip the GPG and svn parts, rather than deleting them outright.

a quick image to sum up my feelings about how all of the release tooling works:

!no-idea.jpg!

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png, no-idea.jpg
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> another, and probably much better option, is to nuke the 
> spark-\{branch}-packaging builds and create new ones that just build things 
> w/o touching this incredibly fragile shell scripting nightmare.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-26565:

Attachment: no-idea.jpg

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png, no-idea.jpg
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> another, and probably much better option, is to nuke the 
> spark-\{branch}-packaging builds and create new ones that just build things 
> w/o touching this incredibly fragile shell scripting nightmare.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736472#comment-16736472
 ] 

shane knapp commented on SPARK-26565:
-

[~srowen] – herein lies my confusion...  after reading the release process page 
([https://spark.apache.org/release-process.html]) i don't see any mention of 
the release manager manually running this script.

i checked for anything in the spark repo that calls this script, and the only 
one i found was `do-release.sh`...  which also doesn't appear to be run 
(according to the release docs).
{noformat}
➜ spark git:(remove-package-signing) grep -r "release-build.sh" *
dev/create-release/do-release.sh: "$SELF/release-build.sh" package
dev/create-release/do-release.sh: "$SELF/release-build.sh" docs
dev/create-release/do-release.sh: "$SELF/release-build.sh" publish-release
dev/create-release/release-build.sh:usage: release-build.sh 
{noformat}
so, if release-build.sh is actually still in use during the release process 
(package), then i will nuke my changes and add in a jenkins-specific flag to 
skip the GPG and svn parts, rather than deleting them outright.

a quick image to sum up my feelings about how all of the release tooling works:

!no-idea.jpg!

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png, no-idea.jpg
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> another, and probably much better option, is to nuke the 
> spark-\{branch}-packaging builds and create new ones that just build things 
> w/o touching this incredibly fragile shell scripting nightmare.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-26565:

Description: 
about a year+ ago, we stopped publishing releases directly from jenkins...

this means that the spark-\{branch}-packaging builds are failing due to gpg 
signing failures, and i would like to update these builds to *just* perform 
packaging.

example:

[https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]

i propose to change dev/create-release/release-build.sh...

when the script is called w/the 'package' option, remove ALL of the following 
bits:

1) gpg signing of the source tarball (lines 184-187)

2) gpg signing of the sparkR dist (lines 243-248)

3) gpg signing of the python dist (lines 256-261)

4) gpg signing of the regular binary dist (lines 264-271)

5) the svn push of the signed dists (lines 317-332)

 

another, and probably much better option, is to nuke the 
spark-\{branch}-packaging builds and create new ones that just build things w/o 
touching this incredibly fragile shell scripting nightmare.

  was:
about a year+ ago, we stopped publishing releases directly from jenkins...

this means that the spark-\{branch}-packaging builds are failing due to gpg 
signing failures, and i would like to update these builds to *just* perform 
packaging.

example:

[https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]

i propose to change dev/create-release/release-build.sh...

when the script is called w/the 'package' option, remove ALL of the following 
bits:

1) gpg signing of the source tarball (lines 184-187)

2) gpg signing of the sparkR dist (lines 243-248)

3) gpg signing of the python dist (lines 256-261)

4) gpg signing of the regular binary dist (lines 264-271)

5) the svn push of the signed dists (lines 317-332)

 

PR coming soon. 


> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> another, and probably much better option, is to nuke the 
> spark-\{branch}-packaging builds and create new ones that just build things 
> w/o touching this incredibly fragile shell scripting nightmare.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736465#comment-16736465
 ] 

Sean Owen commented on SPARK-26565:
---

Sure, if we're not making releases with it, no need to sign them.

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> PR coming soon. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26565:


Assignee: shane knapp  (was: Apache Spark)

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> PR coming soon. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26565:


Assignee: Apache Spark  (was: shane knapp)

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: Apache Spark
>Priority: Major
> Attachments: fine.png
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> PR coming soon. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-26565:

Description: 
about a year+ ago, we stopped publishing releases directly from jenkins...

this means that the spark-\{branch}-packaging builds are failing due to gpg 
signing failures, and i would like to update these builds to *just* perform 
packaging.

example:

[https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]

i propose to change dev/create-release/release-build.sh...

when the script is called w/the 'package' option, remove ALL of the following 
bits:

1) gpg signing of the source tarball (lines 184-187)

2) gpg signing of the sparkR dist (lines 243-248)

3) gpg signing of the python dist (lines 256-261)

4) gpg signing of the regular binary dist (lines 264-271)

5) the svn push of the signed dists (lines 317-332)

 

PR coming soon. 

  was:
about a year+ ago, we stopped publishing releases directly from jenkins...

this means that the spark-\{branch}-packaging builds are failing due to gpg 
signing failures, and i would like to update these builds to *just* perform 
packaging.

example:

[https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]

i propose to change dev/create-release/release-build.sh...

when the script is called w/the 'package' option, remove ALL of the following 
bits:

1) gpg signing of the source tarball (lines 184-187)

2) gpg signing of the sparkR dist (lines 243-248)

3) gpg signing of the python dist (lines 256-261)

4) gpg signing of the regular binary dist (lines 264-271)

5) the svn push of the signed dists (lines 317-332)

obligatory images:

!image-2019-01-07-14-46-20-907.png!!image-2019-01-07-14-46-40-836.png!

 


> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> PR coming soon. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread shane knapp (JIRA)
shane knapp created SPARK-26565:
---

 Summary: modify dev/create-release/release-build.sh to let jenkins 
build packages w/o publishing
 Key: SPARK-26565
 URL: https://issues.apache.org/jira/browse/SPARK-26565
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
Reporter: shane knapp
Assignee: shane knapp


about a year+ ago, we stopped publishing releases directly from jenkins...

this means that the spark-\{branch}-packaging builds are failing due to gpg 
signing failures, and i would like to update these builds to *just* perform 
packaging.

example:

[https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]

i propose to change dev/create-release/release-build.sh...

when the script is called w/the 'package' option, remove ALL of the following 
bits:

1) gpg signing of the source tarball (lines 184-187)

2) gpg signing of the sparkR dist (lines 243-248)

3) gpg signing of the python dist (lines 256-261)

4) gpg signing of the regular binary dist (lines 264-271)

5) the svn push of the signed dists (lines 317-332)

obligatory images:

!image-2019-01-07-14-46-20-907.png!!image-2019-01-07-14-46-40-836.png!

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-07 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-26565:

Attachment: fine.png

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, remove ALL of the following 
> bits:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
> obligatory images:
> !image-2019-01-07-14-46-20-907.png!!image-2019-01-07-14-46-40-836.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26269) YarnAllocator should have the same blacklist behaviour as YARN to maximize use of cluster resources

2019-01-07 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-26269:
--
Fix Version/s: 2.4.1

> YarnAllocator should have the same blacklist behaviour as YARN to maximize use 
> of cluster resources
> ---
>
> Key: SPARK-26269
> URL: https://issues.apache.org/jira/browse/SPARK-26269
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.1, 2.3.2, 2.4.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Minor
> Fix For: 2.4.1, 3.0.0
>
>
> Currently, YarnAllocator may put a node into its blacklist when that node has a 
> completed container whose exit status is not one of SUCCESS, PREEMPTED, 
> KILLED_EXCEEDED_VMEM, or KILLED_EXCEEDED_PMEM. However, for other exit statuses, 
> e.g. KILLED_BY_RESOURCEMANAGER, YARN does not consider the related nodes 
> candidates for blacklisting (see YARN's explanation for details: 
> https://github.com/apache/hadoop/blob/228156cfd1b474988bc4fedfbf7edddc87db41e3/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/Apps.java#L273).
>  So, relaxing the current blacklist rule to match YARN's blacklist behaviour would 
> maximize use of cluster resources.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26564) Fix misleading error message about spark.network.timeout and spark.executor.heartbeatInterval

2019-01-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26564:


Assignee: (was: Apache Spark)

> Fix misleading error message about spark.network.timeout and 
> spark.executor.heartbeatInterval
> -
>
> Key: SPARK-26564
> URL: https://issues.apache.org/jira/browse/SPARK-26564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Kengo Seki
>Priority: Trivial
>  Labels: starter
>
> I mistakenly set spark.executor.heartbeatInterval to the same value as 
> spark.network.timeout and got the following error:
> {code}
> java.lang.IllegalArgumentException: requirement failed: The value of 
> spark.network.timeout=120s must be no less than the value of 
> spark.executor.heartbeatInterval=120s.
> {code}
> But that message can be read as allowing the two values to be equal. "Greater 
> than" would be more precise than "no less than".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26564) Fix misleading error message about spark.network.timeout and spark.executor.heartbeatInterval

2019-01-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26564:


Assignee: Apache Spark

> Fix misleading error message about spark.network.timeout and 
> spark.executor.heartbeatInterval
> -
>
> Key: SPARK-26564
> URL: https://issues.apache.org/jira/browse/SPARK-26564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Kengo Seki
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: starter
>
> I mistakenly set spark.executor.heartbeatInterval to the same value as 
> spark.network.timeout and got the following error:
> {code}
> java.lang.IllegalArgumentException: requirement failed: The value of 
> spark.network.timeout=120s must be no less than the value of 
> spark.executor.heartbeatInterval=120s.
> {code}
> But that message can be read as allowing the two values to be equal. "Greater 
> than" would be more precise than "no less than".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2019-01-07 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736446#comment-16736446
 ] 

Thincrs commented on SPARK-25823:
-

A user of thincrs has selected this issue. Deadline: Mon, Jan 14, 2019 10:32 PM

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> duplicate keys. If we want to allow this difference in the new high-order 
> functions, we had better at least add a warning about this difference on these 
> functions after the RC4 vote passes. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26564) Fix misleading error message about spark.network.timeout and spark.executor.heartbeatInterval

2019-01-07 Thread Kengo Seki (JIRA)
Kengo Seki created SPARK-26564:
--

 Summary: Fix misleading error message about spark.network.timeout 
and spark.executor.heartbeatInterval
 Key: SPARK-26564
 URL: https://issues.apache.org/jira/browse/SPARK-26564
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0
Reporter: Kengo Seki


I mistakenly set spark.executor.heartbeatInterval to the same value as 
spark.network.timeout and got the following error:

{code}
java.lang.IllegalArgumentException: requirement failed: The value of 
spark.network.timeout=120s must be no less than the value of 
spark.executor.heartbeatInterval=120s.
{code}

But that message can be read as allowing the two values to be equal. "Greater than" 
would be more precise than "no less than".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-26537) update the release scripts to point to gitbox

2019-01-07 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp closed SPARK-26537.
---

> update the release scripts to point to gitbox
> -
>
> Key: SPARK-26537
> URL: https://issues.apache.org/jira/browse/SPARK-26537
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>
>
> we're seeing packaging build failures like this:  
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console]
> i did a quick skim through the repo, and found the offending urls to the old 
> apache git repos:
>  
> {code:java}
> (py35) ➜ spark git:(update-apache-repo) grep -r git-wip *
> dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git;
> dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git;
> pom.xml: 
> scm:git:https://git-wip-us.apache.org/repos/asf/spark.git
> {code}
> this affects all versions of spark, so it will need to be backported to all 
> released versions.
> i'll put together a pull request later today.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26065) Change query hint from a `LogicalPlan` to a field

2019-01-07 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-26065.
-
   Resolution: Fixed
 Assignee: Maryann Xue
Fix Version/s: 3.0.0

> Change query hint from a `LogicalPlan` to a field
> -
>
> Key: SPARK-26065
> URL: https://issues.apache.org/jira/browse/SPARK-26065
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Major
> Fix For: 3.0.0
>
>
> The existing query hint implementation relies on a logical plan node 
> {{ResolvedHint}} to store query hints in logical plans, and on {{Statistics}} 
> in physical plans. Since {{ResolvedHint}} is not really a logical operator 
> and can break the pattern matching for existing and future optimization 
> rules, it is an issue for the Optimizer just as the old {{AnalysisBarrier}} was for 
> the Analyzer.
> Given the fact that all our query hints are either 1) a join hint, i.e., 
> broadcast hint; or 2) a re-partition hint, which is indeed an operator, we 
> only need to add a hint field on the {{Join}} plan and that will be a good 
> enough solution for current hint usage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24701) SparkMaster WebUI allow all appids to be shown in detail on port 4040 rather than different ports per app

2019-01-07 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736366#comment-16736366
 ] 

Pablo Langa Blanco commented on SPARK-24701:


Hi [~toopt4]

I think you already have this functionality in the history server. For each application, 
Spark starts a new UI server (on a new port), while the history server runs on a single 
port and monitors all the applications.

As you can see in [https://spark.apache.org/docs/latest/monitoring.html], "The 
history server displays both completed and incomplete Spark jobs" and "Incomplete 
applications are only updated intermittently" (see spark.history.fs.update.interval).

Could that be a solution for you?
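
For reference, a minimal sketch of the configuration involved, assuming an HDFS log path of your choice (the spark.eventLog.* settings go on the applications; spark.history.fs.logDirectory is a history-server-side setting):

{code:java}
// Sketch: applications write event logs that a single-port history server can display.
import org.apache.spark.SparkConf

object EventLogConfSketch {
  def conf(): SparkConf = new SparkConf()
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "hdfs:///spark-logs")  // assumed log directory
  // On the history server side, point spark.history.fs.logDirectory at the same
  // path and optionally tune spark.history.fs.update.interval for incomplete apps.
}
{code}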

> SparkMaster WebUI allow all appids to be shown in detail on port 4040 rather 
> than different ports per app
> -
>
> Key: SPARK-24701
> URL: https://issues.apache.org/jira/browse/SPARK-24701
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: t oo
>Priority: Major
>  Labels: master, security, ui, web, web-ui
> Attachments: spark_ports.png
>
>
> Right now the detail for all application ids are shown on a diff port per app 
> id, ie. 4040, 4041, 4042...etc this is problematic for environments with 
> tight firewall settings. Proposing to allow 4040?appid=1,  4040?appid=2,  
> 4040?appid=3..etc for the master web ui just like what the History Web UI 
> does.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26563) Quick Start documentation provides example that doesn't work (Java)

2019-01-07 Thread Leonardo Colman Lopes (JIRA)
Leonardo Colman Lopes created SPARK-26563:
-

 Summary: Quick Start documentation provides example that doesn't 
work (Java)
 Key: SPARK-26563
 URL: https://issues.apache.org/jira/browse/SPARK-26563
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.4.0
Reporter: Leonardo Colman Lopes


In the [Self-Contained 
Application|https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications]
 page, the Java application won't work as shown.

Creating a spark session using

SparkSession spark = SparkSession.builder().appName("Simple 
Application").getOrCreate();

Doesn't work.
It must be created using

 SparkSession spark = SparkSession.builder().appName("Simple 
Application").config("spark.master", "local").getOrCreate();



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26563) Quick Start documentation provides example that doesn't work (Java)

2019-01-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26563:


Assignee: (was: Apache Spark)

> Quick Start documentation provides example that doesn't work (Java)
> ---
>
> Key: SPARK-26563
> URL: https://issues.apache.org/jira/browse/SPARK-26563
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Leonardo Colman Lopes
>Priority: Minor
>
> In the [Self-Contained 
> Application|https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications]
>  page, the Java application won't work as shown.
> Creating a spark session using
> SparkSession spark = SparkSession.builder().appName("Simple 
> Application").getOrCreate();
> Doesn't work.
> It must be created using
>  SparkSession spark = SparkSession.builder().appName("Simple 
> Application").config("spark.master", "local").getOrCreate();



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26563) Quick Start documentation provides example that doesn't work (Java)

2019-01-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736363#comment-16736363
 ] 

Apache Spark commented on SPARK-26563:
--

User 'Kerooker' has created a pull request for this issue:
https://github.com/apache/spark/pull/23487

> Quick Start documentation provides example that doesn't work (Java)
> ---
>
> Key: SPARK-26563
> URL: https://issues.apache.org/jira/browse/SPARK-26563
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Leonardo Colman Lopes
>Priority: Minor
>
> In the [Self-Contained 
> Application|https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications]
>  page, the Java application won't work as shown.
> Creating a spark session using
> SparkSession spark = SparkSession.builder().appName("Simple 
> Application").getOrCreate();
> Doesn't work.
> It must be created using
>  SparkSession spark = SparkSession.builder().appName("Simple 
> Application").config("spark.master", "local").getOrCreate();



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26563) Quick Start documentation provides example that doesn't work (Java)

2019-01-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26563:


Assignee: Apache Spark

> Quick Start documentation provides example that doesn't work (Java)
> ---
>
> Key: SPARK-26563
> URL: https://issues.apache.org/jira/browse/SPARK-26563
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Leonardo Colman Lopes
>Assignee: Apache Spark
>Priority: Minor
>
> In the [Self-Contained 
> Application|https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications]
>  page, the Java application won't work as shown.
> Creating a spark session using
> SparkSession spark = SparkSession.builder().appName("Simple 
> Application").getOrCreate();
> Doesn't work.
> It must be created using
>  SparkSession spark = SparkSession.builder().appName("Simple 
> Application").config("spark.master", "local").getOrCreate();



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26563) Quick Start documentation provides example that doesn't work (Java)

2019-01-07 Thread Leonardo Colman Lopes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736358#comment-16736358
 ] 

Leonardo Colman Lopes commented on SPARK-26563:
---

Pull request ready 
https://github.com/apache/spark/pull/23487

> Quick Start documentation provides example that doesn't work (Java)
> ---
>
> Key: SPARK-26563
> URL: https://issues.apache.org/jira/browse/SPARK-26563
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Leonardo Colman Lopes
>Priority: Minor
>
> In the [Self-Contained 
> Application|https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications]
>  page, the Java application won't work as shown.
> Creating a spark session using
> SparkSession spark = SparkSession.builder().appName("Simple 
> Application").getOrCreate();
> Doesn't work.
> It must be created using
>  SparkSession spark = SparkSession.builder().appName("Simple 
> Application").config("spark.master", "local").getOrCreate();



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25689) Move token renewal logic to driver in yarn-client mode

2019-01-07 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-25689.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23338
[https://github.com/apache/spark/pull/23338]

> Move token renewal logic to driver in yarn-client mode
> --
>
> Key: SPARK-25689
> URL: https://issues.apache.org/jira/browse/SPARK-25689
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, both in yarn-cluster and yarn-client mode, the YARN AM is 
> responsible for renewing delegation tokens. That differs from other RMs 
> (Mesos and later k8s when it supports this functionality), and is one of the 
> roadblocks towards fully sharing the same delegation token-related code.
> We should look at keeping the renewal logic within the driver in yarn-client 
> mode. That would also remove the need to distribute the user's keytab to the 
> AM when running in that particular mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25689) Move token renewal logic to driver in yarn-client mode

2019-01-07 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-25689:


Assignee: Marcelo Vanzin

> Move token renewal logic to driver in yarn-client mode
> --
>
> Key: SPARK-25689
> URL: https://issues.apache.org/jira/browse/SPARK-25689
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> Currently, both in yarn-cluster and yarn-client mode, the YARN AM is 
> responsible for renewing delegation tokens. That differs from other RMs 
> (Mesos and later k8s when it supports this functionality), and is one of the 
> roadblocks towards fully sharing the same delegation token-related code.
> We should look at keeping the renewal logic within the driver in yarn-client 
> mode. That would also remove the need to distribute the user's keytab to the 
> AM when running in that particular mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26551) Selecting one complex field and having is null predicate on another complex field can cause error

2019-01-07 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-26551:

Affects Version/s: 2.4.0

> Selecting one complex field and having is null predicate on another complex 
> field can cause error
> -
>
> Key: SPARK-26551
> URL: https://issues.apache.org/jira/browse/SPARK-26551
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> The query below can cause an error when doing schema pruning:
> {code:java}
> val query = sql("select * from contacts")
>   .where("name.middle is not null")
>   .select(
> "id",
> "name.first",
> "name.middle",
> "name.last"
>   )
>   .where("last = 'Jones'")
>   .select(count("id"))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26551) Selecting one complex field and having is null predicate on another complex field can cause error

2019-01-07 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-26551:
---

Assignee: Liang-Chi Hsieh

> Selecting one complex field and having is null predicate on another complex 
> field can cause error
> -
>
> Key: SPARK-26551
> URL: https://issues.apache.org/jira/browse/SPARK-26551
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> The query below can cause an error when doing schema pruning:
> {code:java}
> val query = sql("select * from contacts")
>   .where("name.middle is not null")
>   .select(
> "id",
> "name.first",
> "name.middle",
> "name.last"
>   )
>   .where("last = 'Jones'")
>   .select(count("id"))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26518) UI Application Info Race Condition Can Throw NoSuchElement

2019-01-07 Thread Russell Spitzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736091#comment-16736091
 ] 

Russell Spitzer commented on SPARK-26518:
-

Yeah, I basically came to the same conclusion: there is no easy way to instead 
show a "Now loading" message with the current architecture. I think it's 
probably fine to leave it as is; we only noticed this because one of our tests 
was specifically expecting either "Not Available" or a 200. Since this throws a 
server-side error instead, we took note.
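
For context, a minimal sketch (with assumed signatures) of the kind of guard at that single call site, returning an Option instead of calling next() on a possibly-empty iterator; as noted, other pages would still need similar handling:

{code:java}
// Sketch only; assumes the KVStore.view(...).closeableIterator() API used by AppStatusStore.
import org.apache.spark.util.kvstore.KVStore

def firstOrNone[T](store: KVStore, klass: Class[T]): Option[T] = {
  val it = store.view(klass).closeableIterator()
  try {
    if (it.hasNext) Some(it.next()) else None
  } finally {
    it.close()
  }
}
{code}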

> UI Application Info Race Condition Can Throw NoSuchElement
> --
>
> Key: SPARK-26518
> URL: https://issues.apache.org/jira/browse/SPARK-26518
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Russell Spitzer
>Priority: Trivial
>
> There is a slight race condition in the 
> [AppStatusStore|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala#L39]
> which calls `next` on the returned iterator even if it is empty, which it can be 
> for a short period of time after the UI is up but before the store is 
> populated.
> {code}
> 
> 
> Error 500 Server Error
> 
> HTTP ERROR 500
> Problem accessing /jobs/. Reason:
> Server ErrorCaused 
> by:java.util.NoSuchElementException
> at java.util.Collections$EmptyIterator.next(Collections.java:4189)
> at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:281)
> at 
> org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:38)
> at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:275)
> at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86)
> at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86)
> at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
> at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:724)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at org.spark_project.jetty.server.Server.handle(Server.java:531)
> at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:352)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
> at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
> at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:102)
> at 
> org.spark_project.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26561) Symbol 'type org.apache.spark.Logging' is missing from the classpath. This symbol is required by 'class org.apache.hadoop.hbase.spark.HBaseContext'

2019-01-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26561.

Resolution: Not A Bug

External libraries should not use internal Spark classes like {{Logging}}. This 
is a problem in the library you're using, not Spark.

> Symbol 'type org.apache.spark.Logging' is missing from the classpath. This 
> symbol is required by 'class org.apache.hadoop.hbase.spark.HBaseContext'
> ---
>
> Key: SPARK-26561
> URL: https://issues.apache.org/jira/browse/SPARK-26561
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: abhinav
>Priority: Major
>
> Symbol 'type org.apache.spark.Logging' is missing from the classpath. This 
> symbol is required by 'class 
>   org.apache.hadoop.hbase.spark.HBaseContext'. Make sure that type Logging is 
> in your classpath and check 
>   for conflicting dependencies with -Ylog-classpath. A full rebuild may help 
> if 'HBaseContext.class' was 
>   compiled against an incompatible version of org.apache.spark.
>  
>  
>  
> Getting the above issue while initializing HBaseContext. I am using Spark 2.3.0, 
> Scala 2.11.11 and HBase 1.2.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26518) UI Application Info Race Condition Can Throw NoSuchElement

2019-01-07 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736087#comment-16736087
 ] 

Pablo Langa Blanco commented on SPARK-26518:


I was trying to find an easy solution to this but I couldn't find one. The problem 
is not only at that point, because when you handle this error there are other 
elements that are not loaded yet and fail too. Another thing I tried was to find a 
way to know whether the KVStore is loaded and show a message instead of the error, 
but it's not easy. So, as this is an issue with trivial priority, it's not 
reasonable to change a lot of things. 

> UI Application Info Race Condition Can Throw NoSuchElement
> --
>
> Key: SPARK-26518
> URL: https://issues.apache.org/jira/browse/SPARK-26518
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Russell Spitzer
>Priority: Trivial
>
> There is a slight race condition in the 
> [AppStatusStore|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala#L39]
> which calls `next` on the returned iterator even if it is empty, which it can be 
> for a short period of time after the UI is up but before the store is 
> populated.
> {code}
> 
> 
> Error 500 Server Error
> 
> HTTP ERROR 500
> Problem accessing /jobs/. Reason:
> Server ErrorCaused 
> by:java.util.NoSuchElementException
> at java.util.Collections$EmptyIterator.next(Collections.java:4189)
> at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:281)
> at 
> org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:38)
> at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:275)
> at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86)
> at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86)
> at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
> at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:724)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at org.spark_project.jetty.server.Server.handle(Server.java:531)
> at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:352)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
> at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
> at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:102)
> at 
> org.spark_project.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24489) No check for invalid input type of weight data in ml.PowerIterationClustering

2019-01-07 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-24489.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Thanks for working on this, I've merged the fix into master :)

> No check for invalid input type of weight data in ml.PowerIterationClustering
> -
>
> Key: SPARK-24489
> URL: https://issues.apache.org/jira/browse/SPARK-24489
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
> Fix For: 3.0.0
>
>
> The test case below results in the following failure. Currently in ml.PIC, there 
> is no check for the data type of the weight column. We should check for a valid 
> data type of the weight column.
> {code:java}
>   test("invalid input types for weight") {
> val invalidWeightData = spark.createDataFrame(Seq(
>   (0L, 1L, "a"),
>   (2L, 3L, "b")
> )).toDF("src", "dst", "weight")
> val pic = new PowerIterationClustering()
>   .setWeightCol("weight")
> val result = pic.assignClusters(invalidWeightData)
>   }
> {code}
> {code:java}
> Job aborted due to stage failure: Task 0 in stage 8077.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 8077.0 (TID 882, localhost, executor 
> driver): scala.MatchError: [0,1,null] (of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>   at 
> org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
>   at 
> org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
> {code}
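
For illustration, a minimal sketch (not the merged change) of the kind of weight-column type validation the description above asks for:

{code:java}
// Sketch only: validate that the weight column is numeric before clustering.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.NumericType

def validateWeightCol(df: DataFrame, weightCol: String): Unit = {
  val field = df.schema(weightCol)
  require(field.dataType.isInstanceOf[NumericType],
    s"Weight column '$weightCol' must be of numeric type, but got ${field.dataType}.")
}
{code}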



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24489) No check for invalid input type of weight data in ml.PowerIterationClustering

2019-01-07 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-24489:
---

Assignee: shahid

> No check for invalid input type of weight data in ml.PowerIterationClustering
> -
>
> Key: SPARK-24489
> URL: https://issues.apache.org/jira/browse/SPARK-24489
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
>
> The test case below results in the following failure. Currently in ml.PIC, there 
> is no check for the data type of the weight column. We should check for a valid 
> data type of the weight column.
> {code:java}
>   test("invalid input types for weight") {
> val invalidWeightData = spark.createDataFrame(Seq(
>   (0L, 1L, "a"),
>   (2L, 3L, "b")
> )).toDF("src", "dst", "weight")
> val pic = new PowerIterationClustering()
>   .setWeightCol("weight")
> val result = pic.assignClusters(invalidWeightData)
>   }
> {code}
> {code:java}
> Job aborted due to stage failure: Task 0 in stage 8077.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 8077.0 (TID 882, localhost, executor 
> driver): scala.MatchError: [0,1,null] (of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>   at 
> org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
>   at 
> org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26457) Show hadoop configurations in HistoryServer environment tab

2019-01-07 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736024#comment-16736024
 ] 

Pablo Langa Blanco commented on SPARK-26457:


Ok, I didn't know that you were working on it. I'm going to comment on your pull 
request.

Thanks!!

> Show hadoop configurations in HistoryServer environment tab
> ---
>
> Key: SPARK-26457
> URL: https://issues.apache.org/jira/browse/SPARK-26457
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2, 2.4.0
> Environment: Maybe it is good to show some configurations in 
> HistoryServer environment tab for debugging some bugs about hadoop
>Reporter: deshanxiao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26225) Scan: track decoding time for row-based data sources

2019-01-07 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735954#comment-16735954
 ] 

Wenchen Fan commented on SPARK-26225:
-

I think it's hard to define the decoding time, as every data source may have its 
own definition.

For data source v1, I think we just need to update `RowDataSourceScanExec` and 
track the time of the unsafe projection that turns Row to InternalRow.

For data source v2, it's totally different. Spark needs to ask the data source 
to report the decoding time (or any other metrics). I'd like to defer it after 
the data source v2 metrics API is introduced.
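
As a rough illustration of the v1 idea, a generic sketch (not the actual RowDataSourceScanExec change) that accumulates per-record conversion time around a row-conversion function:

{code:java}
// Sketch: wrap an iterator so the time spent converting each record is accumulated
// via a caller-provided callback (e.g. backed by an SQLMetric).
def timeConversions[T, U](iter: Iterator[T], convert: T => U)(addNanos: Long => Unit): Iterator[U] =
  iter.map { record =>
    val start = System.nanoTime()
    val converted = convert(record)
    addNanos(System.nanoTime() - start)
    converted
  }
{code}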

> Scan: track decoding time for row-based data sources
> 
>
> Key: SPARK-26225
> URL: https://issues.apache.org/jira/browse/SPARK-26225
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Priority: Major
>
> Scan node should report decoding time for each record, if it is not too much 
> overhead.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26383) NPE when use DataFrameReader.jdbc with wrong URL

2019-01-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26383:
-

Assignee: clouds

> NPE when use DataFrameReader.jdbc with wrong URL
> 
>
> Key: SPARK-26383
> URL: https://issues.apache.org/jira/browse/SPARK-26383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: clouds
>Assignee: clouds
>Priority: Minor
>
> When passing a wrong URL to jdbc:
> {code:java}
> val opts = Map(
>   "url" -> "jdbc:mysql://localhost/db",
>   "dbtable" -> "table",
>   "driver" -> "org.postgresql.Driver"
> )
> var df = spark.read.format("jdbc").options(opts).load
> {code}
> It would throw an NPE instead of complaining that the connection failed. (Note 
> that the url and driver do not match here.)
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
> {code}
> As the [postgresql jdbc driver 
> document|https://jdbc.postgresql.org/development/privateapi/org/postgresql/Driver.html#connect-java.lang.String-java.util.Properties-]
>  says, the driver should return "null" if it realizes it is the wrong kind 
> of driver to connect to the given URL,
> while 
> [ConnectionFactory|https://github.com/apache/spark/blob/e743e848484bf7d97e1b4f33ea83f8520ae7da04/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L56]
>  does not check whether conn is null:
> {code:java}
> val conn: Connection = JdbcUtils.createConnectionFactory(options)()
> {code}
>  and tries to close the conn anyway:
> {code:java}
> try {
>   ...
> } finally {
>   conn.close()
> }
> {code}
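
For illustration, a self-contained sketch (an assumed helper, not the merged Spark change) of failing fast when a registered JDBC driver returns null for a URL it does not handle:

{code:java}
// Sketch only: JDBC drivers are documented to return null from connect() for URLs
// they do not handle, so fail with a clear message instead of a later NPE.
import java.sql.{Connection, Driver}
import java.util.Properties

def connectOrFail(driver: Driver, url: String, props: Properties): Connection = {
  val conn = driver.connect(url, props)
  if (conn == null) {
    throw new IllegalArgumentException(
      s"Driver ${driver.getClass.getName} does not accept JDBC URL '$url'; " +
        "check that the 'driver' and 'url' options match.")
  }
  conn
}
{code}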



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2019-01-07 Thread abhinav (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735942#comment-16735942
 ] 

abhinav commented on SPARK-13928:
-

Thanks a lot Wen




> Move org.apache.spark.Logging into org.apache.spark.internal.Logging
> 
>
> Key: SPARK-13928
> URL: https://issues.apache.org/jira/browse/SPARK-13928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.0.0
>
>
> Logging was made private in Spark 2.0. If we move it, then users would be 
> able to create a Logging trait themselves to avoid changing their own code. 
> Alternatively, we can also provide in a compatibility package that adds 
> logging.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26513) Trigger GC on executor node idle

2019-01-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26513.
---
Resolution: Won't Fix

> Trigger GC on executor node idle
> 
>
> Key: SPARK-26513
> URL: https://issues.apache.org/jira/browse/SPARK-26513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
>
>  
> Correct me if I'm wrong.
>  *Stage:*
>       On a large cluster, each stage would have some executors, where a few 
> executors finish a couple of tasks first and then wait for the whole stage, i.e. 
> for the remaining tasks, which are executed by different executor nodes in 
> the cluster. A stage is only completed when all tasks in the current stage 
> finish their execution, and the next stage's execution has to wait until all tasks 
> of the current stage are completed. 
>  
> Why don't we trigger GC when the executor node is waiting for the remaining 
> tasks to finish, i.e. when the executor is idle? The executor has to wait for the 
> remaining tasks to finish anyway, which can take at least a couple of seconds, 
> while a GC would take at most <300ms.
>  
> I have proposed a small code snippet which triggers GC when the set of running 
> tasks is empty and heap usage on the current executor node is more than a given 
> threshold.
> This could improve performance for long-running Spark jobs. 
> We referred to this paper 
> [https://www.computer.org/csdl/proceedings/hipc/2016/5411/00/07839705.pdf] 
> and we found performance improvements in our long-running Spark batch jobs.
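
For context only (the issue is resolved as Won't Fix above), a minimal sketch of the kind of idle-time GC trigger being proposed; the 0.6 threshold is an assumed placeholder, and System.gc() is only a hint to the JVM:

{code:java}
// Sketch only: request a GC when no tasks are running and heap usage exceeds a threshold.
def maybeTriggerGc(runningTasks: Int, heapUsageThreshold: Double = 0.6): Unit = {
  val rt = Runtime.getRuntime
  val usedFraction = (rt.totalMemory() - rt.freeMemory()).toDouble / rt.maxMemory()
  if (runningTasks == 0 && usedFraction > heapUsageThreshold) {
    System.gc()  // only a hint; the JVM may ignore it
  }
}
{code}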



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26383) NPE when use DataFrameReader.jdbc with wrong URL

2019-01-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26383.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23464
[https://github.com/apache/spark/pull/23464]

> NPE when use DataFrameReader.jdbc with wrong URL
> 
>
> Key: SPARK-26383
> URL: https://issues.apache.org/jira/browse/SPARK-26383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: clouds
>Assignee: clouds
>Priority: Minor
> Fix For: 3.0.0
>
>
> When passing a wrong URL to jdbc:
> {code:java}
> val opts = Map(
>   "url" -> "jdbc:mysql://localhost/db",
>   "dbtable" -> "table",
>   "driver" -> "org.postgresql.Driver"
> )
> var df = spark.read.format("jdbc").options(opts).load
> {code}
> It would throw an NPE instead of complaining that the connection failed. (Note 
> that the url and driver do not match here.)
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
> {code}
> As the [postgresql jdbc driver 
> document|https://jdbc.postgresql.org/development/privateapi/org/postgresql/Driver.html#connect-java.lang.String-java.util.Properties-]
>  says, the driver should return "null" if it realizes it is the wrong kind 
> of driver to connect to the given URL,
> while 
> [ConnectionFactory|https://github.com/apache/spark/blob/e743e848484bf7d97e1b4f33ea83f8520ae7da04/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L56]
>  does not check whether conn is null:
> {code:java}
> val conn: Connection = JdbcUtils.createConnectionFactory(options)()
> {code}
>  and tries to close the conn anyway:
> {code:java}
> try {
>   ...
> } finally {
>   conn.close()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26549) PySpark worker reuse take no effect for parallelize xrange

2019-01-07 Thread Yuanjian Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-26549:

Description: 
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
the PySpark worker reuse scenario, we found that worker reuse takes no effect 
for `sc.parallelize(xrange(...))`.
It happens because the specialized rdd.parallelize for xrange (SPARK-4398) 
generates its data directly from xrange and does not need to use the passed-in 
iterator. But this breaks the end-of-stream check in the Python worker and finally 
causes worker reuse to take no effect.


The relevant code blocks and more details are listed below:
The current specialized logic for xrange does not need the passed-in iterator; see 
context.py:
{code:java}
if isinstance(c, xrange):
...
def f(split, iterator):
return xrange(getStart(split), getStart(split + 1), step)
...
return self.parallelize([], numSlices).mapPartitionsWithIndex(f)
{code}
We get an unexpected value -1, which refers to END_OF_DATA_SECTION, while checking 
the end of stream. See the code in worker.py:
{code:java}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
# write a different value to tell JVM to not reuse this worker
write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
sys.exit(-1)
{code}
The code works well for parallelize(range) because the END_OF_DATA_SECTION has 
already been handled while loading the iterator from the socket stream; see the code in 
FramedSerializer:
{code:java}
def load_stream(self, stream):
while True:
try:
yield self._read_with_length(stream)
except EOFError:
return
...
def _read_with_length(self, stream):
length = read_int(stream)
if length == SpecialLengths.END_OF_DATA_SECTION:
raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in 
load_stream
elif length == SpecialLengths.NULL:
return None
obj = stream.read(length)
if len(obj) < length:
raise EOFError
return self.loads(obj)
{code}

  was:
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
PySpark worker reuse scenario, we found that the worker reuse takes no effect 
for `sc.parallelize(xrange(...))`.
It happened because of the specialize rdd.parallelize for xrange(SPARK-4398) 
generated data by xrange, which don't need the passed-in iterator. But this 
will break the end of stream checking in python worker and finally cause worker 
reuse take no effect. 
Relative code block list below:
Current specialize logic of xrange don't need the passed-in iterator, 
context.py:
{code:java}
if isinstance(c, xrange):
...
def f(split, iterator):
return xrange(getStart(split), getStart(split + 1), step)
...
return self.parallelize([], numSlices).mapPartitionsWithIndex(f)
{code}
We got an unexpected value -1 which refers to END_OF_DATA_SECTION while check 
end of stream. See the code in worker.py:
{code:java}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
# write a different value to tell JVM to not reuse this worker
write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
sys.exit(-1)
{code}
The code works well for parallelize(range) because the END_OF_DATA_SECTION has 
been handled during load iterator from the socket stream, see the code in 
FramedSerializer:
{code:java}
def load_stream(self, stream):
while True:
try:
yield self._read_with_length(stream)
except EOFError:
return
...
def _read_with_length(self, stream):
length = read_int(stream)
if length == SpecialLengths.END_OF_DATA_SECTION:
raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in 
load_stream
elif length == SpecialLengths.NULL:
return None
obj = stream.read(length)
if len(obj) < length:
raise EOFError
return self.loads(obj)
{code}


> PySpark worker reuse take no effect for parallelize xrange
> --
>
> Key: SPARK-26549
> URL: https://issues.apache.org/jira/browse/SPARK-26549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> During [the follow-up 
> work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
> PySpark worker reuse scenario, we found that the worker reuse takes no effect 
> for `sc.parallelize(xrange(...))`.
> It happened because of the specialize rdd.parallelize for xrange(SPARK-4398) 
> generated data by xrange, which don't need to use the passed-in iterator. But 
> this will break the end of stream checking 

[jira] [Commented] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2019-01-07 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735862#comment-16735862
 ] 

Wenchen Fan commented on SPARK-13928:
-

You need to check the HBase document and see which Spark versions it supports, 
I can't help you here.

Regarding "copy", I mean copy-paste the code from 
`org.apache.spark.internal.Logging` in the Spark project (check out the Spark 
repo from Github to find it), to `org.apache.spark.Logging` in your project.
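
To make the suggestion concrete, a hedged sketch of such a user-side compatibility shim (it lives in your own project, not in Spark, and assumes an slf4j dependency on the classpath):

{code:java}
// Sketch of a user-side org.apache.spark.Logging shim so that libraries compiled
// against the old trait can still link. Keep it in your own project only.
package org.apache.spark

import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass.getName)
  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
  protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
  protected def logError(msg: => String): Unit = if (log.isErrorEnabled) log.error(msg)
}
{code}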

> Move org.apache.spark.Logging into org.apache.spark.internal.Logging
> 
>
> Key: SPARK-13928
> URL: https://issues.apache.org/jira/browse/SPARK-13928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.0.0
>
>
> Logging was made private in Spark 2.0. If we move it, then users would be 
> able to create a Logging trait themselves to avoid changing their own code. 
> Alternatively, we can also provide in a compatibility package that adds 
> logging.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24725) Discuss necessary info and access in barrier mode + Mesos

2019-01-07 Thread Angel Conde (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735836#comment-16735836
 ] 

Angel Conde  commented on SPARK-24725:
--

Mesos already has this possibility via the MPI framework 
([https://github.com/apache/mesos/tree/master/mpi]). However, that 
implementation is 5 years old (and seems to be a proof of concept), and I do not 
know whether that approach could be used with Docker images as executors. 

Bests

> Discuss necessary info and access in barrier mode + Mesos
> -
>
> Key: SPARK-24725
> URL: https://issues.apache.org/jira/browse/SPARK-24725
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + Mesos. I'm not aware of 
> MPI support in Mesos. So we should find someone with good knowledge to lead 
> the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26549) PySpark worker reuse take no effect for parallelize xrange

2019-01-07 Thread Yuanjian Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-26549:

Description: 
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
PySpark worker reuse scenario, we found that the worker reuse takes no effect 
for `sc.parallelize(xrange(...))`.
It happened because of the specialize rdd.parallelize for xrange(SPARK-4398) 
generated data by xrange, which don't need the passed-in iterator. But this 
will break the end of stream checking in python worker and finally cause worker 
reuse take no effect. 
Relative code block list below:
Current specialize logic of xrange don't need the passed-in iterator, 
context.py:
{code:java}
if isinstance(c, xrange):
...
def f(split, iterator):
return xrange(getStart(split), getStart(split + 1), step)
...
return self.parallelize([], numSlices).mapPartitionsWithIndex(f)
{code}
We got an unexpected value -1 which refers to END_OF_DATA_SECTION while check 
end of stream. See the code in worker.py:
{code:java}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
# write a different value to tell JVM to not reuse this worker
write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
sys.exit(-1)
{code}
The code works well for parallelize(range) because the END_OF_DATA_SECTION has 
been handled during load iterator from the socket stream, see the code in 
FramedSerializer:
{code:java}
def load_stream(self, stream):
while True:
try:
yield self._read_with_length(stream)
except EOFError:
return
...
def _read_with_length(self, stream):
length = read_int(stream)
if length == SpecialLengths.END_OF_DATA_SECTION:
raise EOFError  # END_OF_DATA_SECTION raises EOFError here and is caught in 
load_stream
elif length == SpecialLengths.NULL:
return None
obj = stream.read(length)
if len(obj) < length:
raise EOFError
return self.loads(obj)
{code}
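
A quick way to observe whether the Python workers are actually reused is to compare 
worker PIDs across two jobs. This is a minimal sketch of my own, not part of the 
original report; the master URL, partition count and range size are arbitrary:
{code:python}
import os
from pyspark import SparkContext

sc = SparkContext("local[2]", "worker-reuse-check")

def worker_pid(_):
    # PID of the Python worker process that ran this task
    return os.getpid()

# Run the same kind of job twice; with spark.python.worker.reuse=true (the
# default) the second job should be served by the same worker processes,
# i.e. the two PID sets should overlap.
first = set(sc.parallelize(range(8), 2).map(worker_pid).collect())
second = set(sc.parallelize(range(8), 2).map(worker_pid).collect())
print("workers reused:", bool(first & second))

sc.stop()
{code}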

  was:
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
PySpark worker reuse scenario, we found that the worker reuse takes no effect 
for `sc.parallelize(xrange(...))`.
It happened because the specialize rdd.parallelize for xrange generated data by 
xrange, which don't need the passed-in iterator. But this will break the end of 
stream checking in python worker and finally cause worker reuse take no effect. 
Relative code block list below:
Current specialize logic of xrange don't need the passed-in iterator, 
context.py:
{code:java}
if isinstance(c, xrange):
...
def f(split, iterator):
return xrange(getStart(split), getStart(split + 1), step)
...
return self.parallelize([], numSlices).mapPartitionsWithIndex(f)
{code}
We got an unexpected value -1 which refers to END_OF_DATA_SECTION while check 
end of stream. See the code in worker.py:
{code:java}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
# write a different value to tell JVM to not reuse this worker
write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
sys.exit(-1)
{code}
The code works well for parallelize(range) because the END_OF_DATA_SECTION has 
been handled during load iterator from the socket stream, see the code in 
FramedSerializer:
{code:java}
def load_stream(self, stream):
while True:
try:
yield self._read_with_length(stream)
except EOFError:
return
...
def _read_with_length(self, stream):
length = read_int(stream)
if length == SpecialLengths.END_OF_DATA_SECTION:
raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in 
load_stream
elif length == SpecialLengths.NULL:
return None
obj = stream.read(length)
if len(obj) < length:
raise EOFError
return self.loads(obj)
{code}


> PySpark worker reuse take no effect for parallelize xrange
> --
>
> Key: SPARK-26549
> URL: https://issues.apache.org/jira/browse/SPARK-26549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> During [the follow-up 
> work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
> PySpark worker reuse scenario, we found that the worker reuse takes no effect 
> for `sc.parallelize(xrange(...))`.
> It happened because of the specialize rdd.parallelize for xrange(SPARK-4398) 
> generated data by xrange, which don't need the passed-in iterator. But this 
> will break the end of stream checking in python worker and finally cause 
> worker reuse 

[jira] [Updated] (SPARK-26549) PySpark worker reuse take no effect for parallelize xrange

2019-01-07 Thread Yuanjian Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-26549:

Description: 
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
PySpark worker reuse scenario, we found that the worker reuse takes no effect 
for `sc.parallelize(xrange(...))`.
It happened because the specialize rdd.parallelize for xrange generated data by 
xrange, which don't need the passed-in iterator. But this will break the end of 
stream checking in python worker and finally cause worker reuse take no effect. 
Relative code block list below:
Current specialize logic of xrange don't need the passed-in iterator, 
context.py:
{code:java}
if isinstance(c, xrange):
...
def f(split, iterator):
return xrange(getStart(split), getStart(split + 1), step)
...
return self.parallelize([], numSlices).mapPartitionsWithIndex(f)
{code}
We got an unexpected value -1 which refers to END_OF_DATA_SECTION while check 
end of stream. See the code in worker.py:
{code:java}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
# write a different value to tell JVM to not reuse this worker
write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
sys.exit(-1)
{code}
The code works well for parallelize(range) because the END_OF_DATA_SECTION has 
been handled during load iterator from the socket stream, see the code in 
FramedSerializer:
{code:java}
def load_stream(self, stream):
while True:
try:
yield self._read_with_length(stream)
except EOFError:
return
...
def _read_with_length(self, stream):
length = read_int(stream)
if length == SpecialLengths.END_OF_DATA_SECTION:
raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in 
load_stream
elif length == SpecialLengths.NULL:
return None
obj = stream.read(length)
if len(obj) < length:
raise EOFError
return self.loads(obj)
{code}

  was:
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
PySpark worker reuse scenario, we found that the worker reuse takes no effect 
for `sc.parallelize(xrange(...))`.
It happened because the specialize logic generated data by xrange, which don't 
need the passed-in iterator. But this will break the end of stream checking in 
python worker and finally cause worker reuse take no effect. 
Relative code block list below:
Current specialize logic of xrange don't need the passed-in iterator, 
context.py:
{code}
if isinstance(c, xrange):
...
def f(split, iterator):
return xrange(getStart(split), getStart(split + 1), step)
...
return self.parallelize([], numSlices).mapPartitionsWithIndex(f)
{code}
We got an unexpected value -1 which refers to END_OF_DATA_SECTION while check 
end of stream. See the code in worker.py:
{code}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
# write a different value to tell JVM to not reuse this worker
write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
sys.exit(-1)
{code}
The code works well for parallelize(range) because the END_OF_DATA_SECTION has 
been handled during load iterator from the socket stream, see the code in 
FramedSerializer:
{code}
def load_stream(self, stream):
while True:
try:
yield self._read_with_length(stream)
except EOFError:
return
...
def _read_with_length(self, stream):
length = read_int(stream)
if length == SpecialLengths.END_OF_DATA_SECTION:
raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in 
load_stream
elif length == SpecialLengths.NULL:
return None
obj = stream.read(length)
if len(obj) < length:
raise EOFError
return self.loads(obj)
{code}


> PySpark worker reuse take no effect for parallelize xrange
> --
>
> Key: SPARK-26549
> URL: https://issues.apache.org/jira/browse/SPARK-26549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> During [the follow-up 
> work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
> PySpark worker reuse scenario, we found that the worker reuse takes no effect 
> for `sc.parallelize(xrange(...))`.
> It happened because the specialize rdd.parallelize for xrange generated data 
> by xrange, which don't need the passed-in iterator. But this will break the 
> end of stream checking in python worker and finally cause worker reuse take 
> no effect. 
> Relative code block list below:
> Current 

[jira] [Updated] (SPARK-26549) PySpark worker reuse take no effect for parallelize xrange

2019-01-07 Thread Yuanjian Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-26549:

Description: 
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
PySpark worker reuse scenario, we found that the worker reuse takes no effect 
for `sc.parallelize(xrange(...))`.
It happened because the specialize logic generated data by xrange, which don't 
need the passed-in iterator. But this will break the end of stream checking in 
python worker and finally cause worker reuse take no effect. 
Relative code block list below:
Current specialize logic of xrange don't need the passed-in iterator, 
context.py:
{code}
if isinstance(c, xrange):
...
def f(split, iterator):
return xrange(getStart(split), getStart(split + 1), step)
...
return self.parallelize([], numSlices).mapPartitionsWithIndex(f)
{code}
We got an unexpected value -1 which refers to END_OF_DATA_SECTION while check 
end of stream. See the code in worker.py:
{code}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
# write a different value to tell JVM to not reuse this worker
write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
sys.exit(-1)
{code}
The code works well for parallelize(range) because the END_OF_DATA_SECTION has 
been handled during load iterator from the socket stream, see the code in 
FramedSerializer:
{code}
def load_stream(self, stream):
while True:
try:
yield self._read_with_length(stream)
except EOFError:
return
...
def _read_with_length(self, stream):
length = read_int(stream)
if length == SpecialLengths.END_OF_DATA_SECTION:
raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in 
load_stream
elif length == SpecialLengths.NULL:
return None
obj = stream.read(length)
if len(obj) < length:
raise EOFError
return self.loads(obj)
{code}

  was:
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
PySpark worker reuse scenario, we found that the worker reuse takes no effect 
for `sc.parallelize(xrange(...))`.
It happened because, we specialize xrange for performance in rdd.parallelize, 
but the specialize function don't need iterator



 during the python worker check end of the stream in Python3, we got an 
unexpected value -1 here which refers to END_OF_DATA_SECTION. See the code in 
worker.py:
{code:python}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
# write a different value to tell JVM to not reuse this worker
write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
sys.exit(-1)
{code}
The code works well for Python2 and PyPy because the END_OF_DATA_SECTION has 
been handled during load iterator from the socket stream, see the code in 
FramedSerializer:

{code:python}
def load_stream(self, stream):
while True:
try:
yield self._read_with_length(stream)
except EOFError:
return

...

def _read_with_length(self, stream):
length = read_int(stream)
if length == SpecialLengths.END_OF_DATA_SECTION:
raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in 
load_stream
elif length == SpecialLengths.NULL:
return None
obj = stream.read(length)
if len(obj) < length:
raise EOFError
return self.loads(obj)
{code}




> PySpark worker reuse take no effect for parallelize xrange
> --
>
> Key: SPARK-26549
> URL: https://issues.apache.org/jira/browse/SPARK-26549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> During [the follow-up 
> work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
> PySpark worker reuse scenario, we found that the worker reuse takes no effect 
> for `sc.parallelize(xrange(...))`.
> It happened because the specialize logic generated data by xrange, which 
> don't need the passed-in iterator. But this will break the end of stream 
> checking in python worker and finally cause worker reuse take no effect. 
> Relative code block list below:
> Current specialize logic of xrange don't need the passed-in iterator, 
> context.py:
> {code}
> if isinstance(c, xrange):
> ...
> def f(split, iterator):
> return xrange(getStart(split), getStart(split + 1), step)
> ...
> return self.parallelize([], numSlices).mapPartitionsWithIndex(f)
> {code}
> We got an unexpected value -1 which refers to END_OF_DATA_SECTION while check 
> end of stream. See the code in 

[jira] [Created] (SPARK-26562) countDistinct and user-defined function cannot be used in SELECT

2019-01-07 Thread Ravi Kaushik (JIRA)
Ravi Kaushik created SPARK-26562:


 Summary: countDistinct and user-defined function cannot be used in 
SELECT
 Key: SPARK-26562
 URL: https://issues.apache.org/jira/browse/SPARK-26562
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.1
 Environment: Macbook Pro 10.14.2

spark 2.3.1

 
Reporter: Ravi Kaushik


df=spark.createDataFrame([ [1,2,3], [2,3,4], [4,5,6] ], ['a', 'b', 'c'])

from pyspark.sql import functions as F, types as T

df.select(F.sum(df['a']), F.count(df['a']), F.countDistinct(df['a']), 
F.approx_count_distinct(df['a'])).show()

func = F.udf(lambda x: 1.0, T.DoubleType())

 

df.select(F.sum(df['a']), F.count(df['a']), F.countDistinct(df['a']), 
F.approx_count_distinct(df['a']), F.sum(func(df['a'])) ).show()
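
The error below comes from the second select above. A possible workaround sketch of my 
own (an assumption, not verified against this bug; `a_udf` is just an illustrative 
column name): materialize the UDF result with withColumn first, so the UDF output is an 
ordinary column by the time the distinct aggregate is planned.
{code:python}
# Hedged workaround sketch: precompute the UDF column, then aggregate it.
# Reuses df, F and T from the snippet above; whether this sidesteps the
# reported binding error is an assumption, not a verified fix.
func2 = F.udf(lambda x: 1.0, T.DoubleType())
df2 = df.withColumn("a_udf", func2(df["a"]))
df2.select(F.sum("a"), F.count("a"), F.countDistinct("a"),
           F.approx_count_distinct("a"), F.sum("a_udf")).show()
{code}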

 

 

Error

 

2019-01-07 18:30:50 ERROR Executor:91 - Exception in task 6.0 in stage 4.0 (TID 
223)

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: sum#45849

 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)

 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)

 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)

 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:335)

 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

 at scala.collection.immutable.List.foreach(List.scala:381)

 at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)

 at scala.collection.immutable.List.map(List.scala:285)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)

 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)

 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)

 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)

 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)

 at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$$anonfun$bind$1.apply(GenerateMutableProjection.scala:38)

 at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$$anonfun$bind$1.apply(GenerateMutableProjection.scala:38)

 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

 at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)

 at scala.collection.AbstractTraversable.map(Traversable.scala:104)

 at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.bind(GenerateMutableProjection.scala:38)

 at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44)

 at 
org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:383)

 at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4$$anonfun$5.apply(HashAggregateExec.scala:119)

 at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4$$anonfun$5.apply(HashAggregateExec.scala:118)

 at 

[jira] [Updated] (SPARK-26549) PySpark worker reuse take no effect for parallelize xrange

2019-01-07 Thread Yuanjian Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-26549:

Description: 
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
PySpark worker reuse scenario, we found that the worker reuse takes no effect 
for `sc.parallelize(xrange(...))`.
It happened because, we specialize xrange for performance in rdd.parallelize, 
but the specialize function don't need iterator



 during the python worker check end of the stream in Python3, we got an 
unexpected value -1 here which refers to END_OF_DATA_SECTION. See the code in 
worker.py:
{code:python}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
# write a different value to tell JVM to not reuse this worker
write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
sys.exit(-1)
{code}
The code works well for Python2 and PyPy because the END_OF_DATA_SECTION has 
been handled during load iterator from the socket stream, see the code in 
FramedSerializer:

{code:python}
def load_stream(self, stream):
while True:
try:
yield self._read_with_length(stream)
except EOFError:
return

...

def _read_with_length(self, stream):
length = read_int(stream)
if length == SpecialLengths.END_OF_DATA_SECTION:
raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in 
load_stream
elif length == SpecialLengths.NULL:
return None
obj = stream.read(length)
if len(obj) < length:
raise EOFError
return self.loads(obj)
{code}



  was:
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
PySpark worker reuse scenario, we found that the worker reuse takes no effect 
for Python3 while works properly for Python2 and PyPy.
It happened because, during the python worker check end of the stream in 
Python3, we got an unexpected value -1 here which refers to 
END_OF_DATA_SECTION. See the code in worker.py:
{code:python}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
# write a different value to tell JVM to not reuse this worker
write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
sys.exit(-1)
{code}
The code works well for Python2 and PyPy because the END_OF_DATA_SECTION has 
been handled during load iterator from the socket stream, see the code in 
FramedSerializer:

{code:python}
def load_stream(self, stream):
while True:
try:
yield self._read_with_length(stream)
except EOFError:
return

...

def _read_with_length(self, stream):
length = read_int(stream)
if length == SpecialLengths.END_OF_DATA_SECTION:
raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in 
load_stream
elif length == SpecialLengths.NULL:
return None
obj = stream.read(length)
if len(obj) < length:
raise EOFError
return self.loads(obj)
{code}




> PySpark worker reuse take no effect for parallelize xrange
> --
>
> Key: SPARK-26549
> URL: https://issues.apache.org/jira/browse/SPARK-26549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> During [the follow-up 
> work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
> PySpark worker reuse scenario, we found that the worker reuse takes no effect 
> for `sc.parallelize(xrange(...))`.
> It happened because, we specialize xrange for performance in rdd.parallelize, 
> but the specialize function don't need iterator
>  during the python worker check end of the stream in Python3, we got an 
> unexpected value -1 here which refers to END_OF_DATA_SECTION. See the code in 
> worker.py:
> {code:python}
> # check end of stream
> if read_int(infile) == SpecialLengths.END_OF_STREAM:
> write_int(SpecialLengths.END_OF_STREAM, outfile)
> else:
> # write a different value to tell JVM to not reuse this worker
> write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
> sys.exit(-1)
> {code}
> The code works well for Python2 and PyPy because the END_OF_DATA_SECTION has 
> been handled during load iterator from the socket stream, see the code in 
> FramedSerializer:
> {code:python}
> def load_stream(self, stream):
> while True:
> try:
> yield self._read_with_length(stream)
> except EOFError:
> return
> ...
> def _read_with_length(self, stream):
> length = read_int(stream)
> if length == SpecialLengths.END_OF_DATA_SECTION:
> raise EOFError #END_OF_DATA_SECTION raised EOF 

[jira] [Updated] (SPARK-26549) PySpark worker reuse take no effect for parallelize xrange

2019-01-07 Thread Yuanjian Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-26549:

Summary: PySpark worker reuse take no effect for parallelize xrange  (was: 
PySpark worker reuse take no effect for Python3)

> PySpark worker reuse take no effect for parallelize xrange
> --
>
> Key: SPARK-26549
> URL: https://issues.apache.org/jira/browse/SPARK-26549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> During [the follow-up 
> work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
> PySpark worker reuse scenario, we found that the worker reuse takes no effect 
> for Python3 while works properly for Python2 and PyPy.
> It happened because, during the python worker check end of the stream in 
> Python3, we got an unexpected value -1 here which refers to 
> END_OF_DATA_SECTION. See the code in worker.py:
> {code:python}
> # check end of stream
> if read_int(infile) == SpecialLengths.END_OF_STREAM:
> write_int(SpecialLengths.END_OF_STREAM, outfile)
> else:
> # write a different value to tell JVM to not reuse this worker
> write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
> sys.exit(-1)
> {code}
> The code works well for Python2 and PyPy because the END_OF_DATA_SECTION has 
> been handled during load iterator from the socket stream, see the code in 
> FramedSerializer:
> {code:python}
> def load_stream(self, stream):
> while True:
> try:
> yield self._read_with_length(stream)
> except EOFError:
> return
> ...
> def _read_with_length(self, stream):
> length = read_int(stream)
> if length == SpecialLengths.END_OF_DATA_SECTION:
> raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in 
> load_stream
> elif length == SpecialLengths.NULL:
> return None
> obj = stream.read(length)
> if len(obj) < length:
> raise EOFError
> return self.loads(obj)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26457) Show hadoop configurations in HistoryServer environment tab

2019-01-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26457:


Assignee: Apache Spark

> Show hadoop configurations in HistoryServer environment tab
> ---
>
> Key: SPARK-26457
> URL: https://issues.apache.org/jira/browse/SPARK-26457
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2, 2.4.0
> Environment: Maybe it is good to show some configurations in 
> HistoryServer environment tab for debugging some bugs about hadoop
>Reporter: deshanxiao
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23636) [SPARK 2.2] | Kafka Consumer | KafkaUtils.createRDD throws Exception - java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access

2019-01-07 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-23636.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Please reopen if the problem re-appears.

> [SPARK 2.2] | Kafka Consumer | KafkaUtils.createRDD throws Exception - 
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
> -
>
> Key: SPARK-23636
> URL: https://issues.apache.org/jira/browse/SPARK-23636
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Deepak
>Priority: Major
>  Labels: performance
> Fix For: 2.4.0
>
>
> h2.  
> h2. Summary
>  
> While using the KafkaUtils.createRDD API, we receive the error listed below, 
> specifically when 1 executor connects to 1 Kafka topic-partition but with 
> more than 1 core and fetches an Array of OffsetRanges.
>  
> _I've tagged this issue to "Structured Streaming" - as I could not find a 
> more appropriate component_ 
>  
> 
> h2. Error Faced
> {noformat}
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access{noformat}
>  Stack Trace
> {noformat}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in 
> stage 1.0 (TID 17, host, executor 16): 
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1629)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1528)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1508)
> at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.close(CachedKafkaConsumer.scala:59)
> at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer$.remove(CachedKafkaConsumer.scala:185)
> at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.(KafkaRDD.scala:204)
> at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:181)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323){noformat}
>  
> 
> h2. Config Used to simulate the error
> A session with : 
>  * Executors - 1
>  * Cores - 2 or More
>  * Kafka Topic - has only 1 partition
>  * While fetching - More than one Array of Offset Range , Example 
> {noformat}
> Array(OffsetRange("kafka_topic",0,608954201,608954202),
> OffsetRange("kafka_topic",0,608954202,608954203)
> ){noformat}
>  
> 
> h2. Was this approach working before?
>  
> This was working in Spark 1.6.2.
> However, from Spark 2.1 onwards the approach throws an exception.
>  
> 
> h2. Why are we fetching from Kafka as mentioned above?
>  
> This gives us the capability to establish a connection to the Kafka broker for 
> every Spark executor core, so each core can fetch/process its own set of 
> messages based on the specified offset ranges.
>  
>  
> 
> h2. Sample Code
>  
> {quote}scala snippet - on versions spark 2.2.0 or 2.1.0
> // Bunch of imports
> import kafka.serializer.\{DefaultDecoder, StringDecoder}
>  import org.apache.avro.generic.GenericRecord
>  import org.apache.kafka.clients.consumer.ConsumerRecord
>  import org.apache.kafka.common.serialization._
>  import org.apache.spark.rdd.RDD
>  import org.apache.spark.sql.\{DataFrame, Row, SQLContext}
>  import org.apache.spark.sql.Row
>  import org.apache.spark.sql.hive.HiveContext
>  import org.apache.spark.sql.types.\{StringType, StructField, StructType}
>  import org.apache.spark.streaming.kafka010._
>  import org.apache.spark.streaming.kafka010.KafkaUtils._
> {quote}
> {quote}// This forces two connections - from a single executor - to 
> topic-partition .
> // And with 2 cores assigned to 1 executor : each core has a task - pulling 
> respective offsets : OffsetRange("kafka_topic",0,1,2) & 
> OffsetRange("kafka_topic",0,2,3)
> val parallelizedRanges = Array(OffsetRange("kafka_topic",0,1,2), // Fetching 
> sample 2 records 
>  OffsetRange("kafka_topic",0,2,3) // Fetching sample 2 records 
>  )
>  
> // Initiate kafka properties
> val kafkaParams1: java.util.Map[String, Object] = new java.util.HashMap()
> // kafkaParams1.put("key","val") add all the parameters such as broker, 
> topic Not listing every property here.
>  
> // Create RDD
> val rDDConsumerRec: RDD[ConsumerRecord[String, String]] =
>  createRDD[String, String](sparkContext
>  , kafkaParams1, parallelizedRanges, LocationStrategies.PreferConsistent)
>  
> // Map Function
> val data: RDD[Row] = rDDConsumerRec.map \{ x => Row(x.topic().toString, 
> 

[jira] [Assigned] (SPARK-26457) Show hadoop configurations in HistoryServer environment tab

2019-01-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26457:


Assignee: (was: Apache Spark)

> Show hadoop configurations in HistoryServer environment tab
> ---
>
> Key: SPARK-26457
> URL: https://issues.apache.org/jira/browse/SPARK-26457
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2, 2.4.0
> Environment: Maybe it is good to show some configurations in 
> HistoryServer environment tab for debugging some bugs about hadoop
>Reporter: deshanxiao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2019-01-07 Thread abhinav (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735744#comment-16735744
 ] 

abhinav commented on SPARK-13928:
-

Wen, I have already included the "org.apache.spark.Logging" jar in my Eclipse 
Scala project.

Please let me know if you mean any different way of handling this.

Your help would be highly appreciated.

Thanks,

Abhinav

 

> Move org.apache.spark.Logging into org.apache.spark.internal.Logging
> 
>
> Key: SPARK-13928
> URL: https://issues.apache.org/jira/browse/SPARK-13928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.0.0
>
>
> Logging was made private in Spark 2.0. If we move it, then users would be 
> able to create a Logging trait themselves to avoid changing their own code. 
> Alternatively, we can also provide it in a compatibility package that adds 
> logging.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26457) Show hadoop configurations in HistoryServer environment tab

2019-01-07 Thread deshanxiao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735707#comment-16735707
 ] 

deshanxiao commented on SPARK-26457:


[~planga82]
Hi, thanks for your reply! I know that YARN provides all Hadoop configurations. 
But I think it would be fine for the HistoryServer to unify all configurations in 
one place. I care about the case where different Hadoop versions may have different 
behavior, or where some configurations need a specific Hadoop version. It would be 
convenient for us when debugging such problems.

Thanks a lot!

> Show hadoop configurations in HistoryServer environment tab
> ---
>
> Key: SPARK-26457
> URL: https://issues.apache.org/jira/browse/SPARK-26457
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2, 2.4.0
> Environment: Maybe it is good to show some configurations in 
> HistoryServer environment tab for debugging some bugs about hadoop
>Reporter: deshanxiao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26534) Closure Cleaner Bug

2019-01-07 Thread sam (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-26534:

Description: 
I've found a strange combination of closures where the closure cleaner doesn't 
seem to be smart enough to figure out how to remove a reference that is not 
used. I.e. we get a `org.apache.spark.SparkException: Task not serializable` 
for a Task that is perfectly serializable.  

 

In the example below, the only `val` that is actually needed for the closure of 
the `map` is `foo`, but it tries to serialise `thingy`.  What is odd is 
changing this code in a number of subtle ways eliminates the error, which I've 
tried to highlight using comments inline.

 
{code:java}
import org.apache.spark.sql._

object Test {
  val sparkSession: SparkSession =
SparkSession.builder.master("local").appName("app").getOrCreate()

  def apply(): Unit = {
import sparkSession.implicits._

val landedData: Dataset[String] = 
sparkSession.sparkContext.makeRDD(Seq("foo", "bar")).toDS()

// thingy has to be in this outer scope to reproduce, if in someFunc, 
cannot reproduce
val thingy: Thingy = new Thingy

// If not wrapped in someFunc cannot reproduce
val someFunc = () => {
  // If don't reference this foo inside the closer (e.g. just use identity 
function) cannot reproduce
  val foo: String = "foo"

  thingy.run(block = () => {
landedData.map(r => {
  r + foo
})
.count()
  })
}

someFunc()

  }
}

class Thingy {
  def run[R](block: () => R): R = {
block()
  }
}
{code}

The full trace if ran in `sbt console`

{code}
scala> class Thingy {
 |   def run[R](block: () => R): R = {
 | block()
 |   }
 | }
defined class Thingy

scala> 

scala> object Test {
 |   val sparkSession: SparkSession =
 | SparkSession.builder.master("local").appName("app").getOrCreate()
 | 
 |   def apply(): Unit = {
 | import sparkSession.implicits._
 | 
 | val landedData: Dataset[String] = 
sparkSession.sparkContext.makeRDD(Seq("foo", "bar")).toDS()
 | 
 | // thingy has to be in this outer scope to reproduce, if in 
someFunc, cannot reproduce
 | val thingy: Thingy = new Thingy
 | 
 | // If not wrapped in someFunc cannot reproduce
 | val someFunc = () => {
 |   // If don't reference this foo inside the closer (e.g. just use 
identity function) cannot reproduce
 |   val foo: String = "foo"
 | 
 |   thingy.run(block = () => {
 | landedData.map(r => {
 |   r + foo
 | })
 | .count()
 |   })
 | }
 | 
 | someFunc()
 | 
 |   }
 | }
defined object Test

scala> 

scala> 

scala> Test.apply()
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/01/07 11:27:19 INFO SparkContext: Running Spark version 2.3.1
19/01/07 11:27:20 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
19/01/07 11:27:20 INFO SparkContext: Submitted application: app
19/01/07 11:27:20 INFO SecurityManager: Changing view acls to: sams
19/01/07 11:27:20 INFO SecurityManager: Changing modify acls to: sams
19/01/07 11:27:20 INFO SecurityManager: Changing view acls groups to: 
19/01/07 11:27:20 INFO SecurityManager: Changing modify acls groups to: 
19/01/07 11:27:20 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(sams); groups 
with view permissions: Set(); users  with modify permissions: Set(sams); groups 
with modify permissions: Set()
19/01/07 11:27:20 INFO Utils: Successfully started service 'sparkDriver' on 
port 54066.
19/01/07 11:27:20 INFO SparkEnv: Registering MapOutputTracker
19/01/07 11:27:20 INFO SparkEnv: Registering BlockManagerMaster
19/01/07 11:27:20 INFO BlockManagerMasterEndpoint: Using 
org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/01/07 11:27:20 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/01/07 11:27:20 INFO DiskBlockManager: Created local directory at 
/private/var/folders/x9/r21b5ttd1wx8zq9qtckfp411n7085c/T/blockmgr-c35bdd46-4804-427b-a513-ee8778814f88
19/01/07 11:27:20 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
19/01/07 11:27:20 INFO SparkEnv: Registering OutputCommitCoordinator
19/01/07 11:27:20 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
19/01/07 11:27:20 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
http://10.197.196.44:4040
19/01/07 11:27:21 INFO Executor: Starting executor ID driver on host localhost
19/01/07 11:27:21 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 54067.
19/01/07 11:27:21 INFO NettyBlockTransferService: Server created on 

[jira] [Commented] (SPARK-26534) Closure Cleaner Bug

2019-01-07 Thread sam (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735699#comment-16735699
 ] 

sam commented on SPARK-26534:
-

[~viirya]

If I change to RDD I cannot reproduce either. This is further evidence that 
this is certainly a bug, since serialisation of closures ought to be 
independent of whether we use RDD or Dataset.

I have pasted the full output of my sbt console session to show the error for 
Dataset.

> Closure Cleaner Bug
> ---
>
> Key: SPARK-26534
> URL: https://issues.apache.org/jira/browse/SPARK-26534
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: sam
>Priority: Major
>
> I've found a strange combination of closures where the closure cleaner 
> doesn't seem to be smart enough to figure out how to remove a reference that 
> is not used. I.e. we get a `org.apache.spark.SparkException: Task not 
> serializable` for a Task that is perfectly serializable.  
>  
> In the example below, the only `val` that is actually needed for the closure 
> of the `map` is `foo`, but it tries to serialise `thingy`.  What is odd is 
> changing this code in a number of subtle ways eliminates the error, which 
> I've tried to highlight using comments inline.
>  
> {code:java}
> import org.apache.spark.sql._
> object Test {
>   val sparkSession: SparkSession =
> SparkSession.builder.master("local").appName("app").getOrCreate()
>   def apply(): Unit = {
> import sparkSession.implicits._
> val landedData: Dataset[String] = 
> sparkSession.sparkContext.makeRDD(Seq("foo", "bar")).toDS()
> // thingy has to be in this outer scope to reproduce, if in someFunc, 
> cannot reproduce
> val thingy: Thingy = new Thingy
> // If not wrapped in someFunc cannot reproduce
> val someFunc = () => {
>   // If don't reference this foo inside the closer (e.g. just use 
> identity function) cannot reproduce
>   val foo: String = "foo"
>   thingy.run(block = () => {
> landedData.map(r => {
>   r + foo
> })
> .count()
>   })
> }
> someFunc()
>   }
> }
> class Thingy {
>   def run[R](block: () => R): R = {
> block()
>   }
> }
> {code}
> The full trace if ran in `sbt console`
> {code}
> scala> class Thingy {
>  |   def run[R](block: () => R): R = {
>  | block()
>  |   }
>  | }
> defined class Thingy
> scala> 
> scala> object Test {
>  |   val sparkSession: SparkSession =
>  | SparkSession.builder.master("local").appName("app").getOrCreate()
>  | 
>  |   def apply(): Unit = {
>  | import sparkSession.implicits._
>  | 
>  | val landedData: Dataset[String] = 
> sparkSession.sparkContext.makeRDD(Seq("foo", "bar")).toDS()
>  | 
>  | // thingy has to be in this outer scope to reproduce, if in 
> someFunc, cannot reproduce
>  | val thingy: Thingy = new Thingy
>  | 
>  | // If not wrapped in someFunc cannot reproduce
>  | val someFunc = () => {
>  |   // If don't reference this foo inside the closer (e.g. just use 
> identity function) cannot reproduce
>  |   val foo: String = "foo"
>  | 
>  |   thingy.run(block = () => {
>  | landedData.map(r => {
>  |   r + foo
>  | })
>  | .count()
>  |   })
>  | }
>  | 
>  | someFunc()
>  | 
>  |   }
>  | }
> defined object Test
> scala> 
> scala> 
> scala> Test.apply()
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 19/01/07 11:27:19 INFO SparkContext: Running Spark version 2.3.1
> 19/01/07 11:27:20 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 19/01/07 11:27:20 INFO SparkContext: Submitted application: app
> 19/01/07 11:27:20 INFO SecurityManager: Changing view acls to: sams
> 19/01/07 11:27:20 INFO SecurityManager: Changing modify acls to: sams
> 19/01/07 11:27:20 INFO SecurityManager: Changing view acls groups to: 
> 19/01/07 11:27:20 INFO SecurityManager: Changing modify acls groups to: 
> 19/01/07 11:27:20 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(sams); groups 
> with view permissions: Set(); users  with modify permissions: Set(sams); 
> groups with modify permissions: Set()
> 19/01/07 11:27:20 INFO Utils: Successfully started service 'sparkDriver' on 
> port 54066.
> 19/01/07 11:27:20 INFO SparkEnv: Registering MapOutputTracker
> 19/01/07 11:27:20 INFO SparkEnv: Registering BlockManagerMaster
> 19/01/07 11:27:20 INFO BlockManagerMasterEndpoint: Using 
> org.apache.spark.storage.DefaultTopologyMapper for getting topology 
> information
> 19/01/07 11:27:20 INFO 

[jira] [Commented] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2019-01-07 Thread abhinav (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735691#comment-16735691
 ] 

abhinav commented on SPARK-13928:
-

Hi,

I am using HBase 1.2.0. Is that fine? Can you tell me the location of the Logging
class and where I need to copy it? Please describe how to copy the Logging
class.



Regards,
Abhinav





> Move org.apache.spark.Logging into org.apache.spark.internal.Logging
> 
>
> Key: SPARK-13928
> URL: https://issues.apache.org/jira/browse/SPARK-13928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.0.0
>
>
> Logging was made private in Spark 2.0. If we move it, then users would be 
> able to create a Logging trait themselves to avoid changing their own code. 
> Alternatively, we can also provide it in a compatibility package that adds 
> logging.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2019-01-07 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735665#comment-16735665
 ] 

Wenchen Fan commented on SPARK-13928:
-

What version of the HBase connector are you using? If it's an old version that 
doesn't support Spark 2.x, I'm afraid you need to downgrade to Spark 1.x, or 
use the workaround mentioned in the JIRA description: copy the `Logging` class 
to `org.apache.spark.Logging` in your application codebase.

> Move org.apache.spark.Logging into org.apache.spark.internal.Logging
> 
>
> Key: SPARK-13928
> URL: https://issues.apache.org/jira/browse/SPARK-13928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.0.0
>
>
> Logging was made private in Spark 2.0. If we move it, then users would be 
> able to create a Logging trait themselves to avoid changing their own code. 
> Alternatively, we can also provide it in a compatibility package that adds 
> logging.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26561) Symbol 'type org.apache.spark.Logging' is missing from the classpath. This symbol is required by 'class org.apache.hadoop.hbase.spark.HBaseContext'

2019-01-07 Thread abhinav (JIRA)
abhinav created SPARK-26561:
---

 Summary: Symbol 'type org.apache.spark.Logging' is missing from 
the classpath. This symbol is required by 'class 
org.apache.hadoop.hbase.spark.HBaseContext'
 Key: SPARK-26561
 URL: https://issues.apache.org/jira/browse/SPARK-26561
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: abhinav


Symbol 'type org.apache.spark.Logging' is missing from the classpath. This 
symbol is required by 'class 
  org.apache.hadoop.hbase.spark.HBaseContext'. Make sure that type Logging is 
in your classpath and check 
  for conflicting dependencies with -Ylog-classpath. A full rebuild may help if 
'HBaseContext.class' was 
  compiled against an incompatible version of org.apache.spark.

 

 

 

Getting the above issue while initializing HBaseContext. I am using Spark 2.3.0, 
Scala 2.11.11 and HBase 1.2.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26559) ML image can't work with numpy versions prior to 1.9

2019-01-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-26559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26559:
-
Fix Version/s: 3.0.0
   2.4.1

> ML image can't work with numpy versions prior to 1.9
> 
>
> Key: SPARK-26559
> URL: https://issues.apache.org/jira/browse/SPARK-26559
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> ML image currently can't work with numpy versions prior to 1.9.
> Current pyspark test can show it:
> {code:java}
> test_read_images (pyspark.ml.tests.test_image.ImageReaderTest) ... ERROR  
>     
> test_read_images_multiple_times 
> (pyspark.ml.tests.test_image.ImageReaderTest2) ... ok 
>     
>   
>     
> ==
>     
> ERROR: test_read_images (pyspark.ml.tests.test_image.ImageReaderTest) 
>     
> --
>     
> Traceback (most recent call last):
>   File 
> "/Users/viirya/docker_tmp/repos/spark-1/python/pyspark/ml/tests/test_image.py",
>  line 36, in test_read_images   
>     self.assertEqual(ImageSchema.toImage(array, origin=first_row[0]), 
> first_row)
>   
>   File "/Users/viirya/docker_tmp/repos/spark-1/python/pyspark/ml/image.py", 
> line 193, in toImage  
>     data = bytearray(array.astype(dtype=np.uint8).ravel().tobytes())  
>     
> AttributeError: 'numpy.ndarray' object has no attribute 'tobytes' 
>     
>   
>     
> --
> Ran 2 tests in 29.040s
>     
>   
>     
> FAILED (errors=1)
> {code}
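
For context, `ndarray.tobytes()` only exists from NumPy 1.9 onwards; older releases 
expose the equivalent `tostring()`. A small compatibility sketch of my own (not the 
actual patch) illustrating the guarded fallback:
{code:python}
import numpy as np

array = np.arange(12, dtype=np.uint8).reshape(3, 4)

# tobytes() was added in NumPy 1.9; tostring() is the pre-1.9 equivalent
# and returns the same raw bytes.
flat = array.astype(dtype=np.uint8).ravel()
raw = flat.tobytes() if hasattr(flat, "tobytes") else flat.tostring()
data = bytearray(raw)
print(len(data))  # 12
{code}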



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
