[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array

2018-04-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433406#comment-16433406
 ] 

Liang-Chi Hsieh commented on SPARK-23928:
-

If there is no assignee and no one has announced they are working on it, it is no problem for you to take a JIRA.

> High-order function: shuffle(x) → array
> ---
>
> Key: SPARK-23928
> URL: https://issues.apache.org/jira/browse/SPARK-23928
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Generate a random permutation of the given array x.
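
For context, a minimal sketch of how the proposed function could be exercised from the Scala shell once it lands (the SQL function name follows the Presto signature above and is an assumption until the feature is merged; {{spark}} is the shell's predefined SparkSession):

{code:scala}
// Hypothetical usage of the proposed shuffle() SQL function.
// Each call should return the same elements, in a random order.
spark.sql("SELECT shuffle(array(1, 2, 3, 4, 5))").show(false)
spark.sql("SELECT shuffle(array(1, 2, 3, 4, 5))").show(false)
{code}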



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23958) HadoopRDD filters empty files to avoid generating empty tasks that hurt Spark computing performance.

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23958:


Assignee: Apache Spark

> HadoopRDD filters empty files to avoid generating empty tasks that hurt 
> Spark computing performance.
> -
>
> Key: SPARK-23958
> URL: https://issues.apache.org/jira/browse/SPARK-23958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Assignee: Apache Spark
>Priority: Minor
>
> HadoopRDD should filter out empty files to avoid generating empty tasks that 
> hurt Spark computing performance.
> An empty file has a length of zero.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23958) HadoopRDD filters empty files to avoid generating empty tasks that hurt Spark computing performance.

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23958:


Assignee: (was: Apache Spark)

> HadoopRDD filters empty files to avoid generating empty tasks that hurt 
> Spark computing performance.
> -
>
> Key: SPARK-23958
> URL: https://issues.apache.org/jira/browse/SPARK-23958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> HadoopRDD should filter out empty files to avoid generating empty tasks that 
> hurt Spark computing performance.
> An empty file has a length of zero.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23958) HadoopRDD filters empty files to avoid generating empty tasks that hurt Spark computing performance.

2018-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1640#comment-1640
 ] 

Apache Spark commented on SPARK-23958:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/21036

> HadoopRDD filters empty files to avoid generating empty tasks that hurt 
> Spark computing performance.
> -
>
> Key: SPARK-23958
> URL: https://issues.apache.org/jira/browse/SPARK-23958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> HadoopRDD should filter out empty files to avoid generating empty tasks that 
> hurt Spark computing performance.
> An empty file has a length of zero.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23958) HadoopRDD filters empty files to avoid generating empty tasks that hurt Spark computing performance.

2018-04-10 Thread guoxiaolongzte (JIRA)
guoxiaolongzte created SPARK-23958:
--

 Summary: HadoopRDD filters empty files to avoid generating empty 
tasks that hurt Spark computing performance.
 Key: SPARK-23958
 URL: https://issues.apache.org/jira/browse/SPARK-23958
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: guoxiaolongzte


HadoopRDD should filter out empty files to avoid generating empty tasks that 
hurt Spark computing performance.

An empty file has a length of zero.
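
Until such a change lands in HadoopRDD itself, a user-level workaround with the same effect is to drop zero-length files before building the RDD. A rough sketch (Scala; the input path is illustrative and {{spark}} is an existing SparkSession):

{code:scala}
import org.apache.hadoop.fs.Path

// Illustrative input directory; replace with a real path.
val inputDir = new Path("hdfs:///data/input")
val fs = inputDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Keep only non-empty files (length > 0) so no empty tasks are generated.
val nonEmptyFiles = fs.listStatus(inputDir)
  .filter(status => status.isFile && status.getLen > 0)
  .map(_.getPath.toString)

val rdd = spark.sparkContext.textFile(nonEmptyFiles.mkString(","))
println(s"Reading ${nonEmptyFiles.length} non-empty files as ${rdd.getNumPartitions} partitions")
{code}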



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23955) typo in parameter name 'rawPredicition'

2018-04-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23955:
-
Priority: Trivial  (was: Minor)

> typo in parameter name 'rawPredicition'
> ---
>
> Key: SPARK-23955
> URL: https://issues.apache.org/jira/browse/SPARK-23955
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: John Bauer
>Priority: Trivial
>
> The classifier.py MultilayerPerceptronClassifier.__init__ API call had the typo 
> rawPredicition instead of rawPrediction.
> The typo is also present in the docs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-04-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433316#comment-16433316
 ] 

Nicholas Chammas commented on SPARK-23945:
--

I always looked at DataFrames and SQL as two different interfaces to the same 
underlying logical model, so I just assumed that the vision was for them to be 
equally powerful. Is that not the case?

So in the grand scheme of things I'd expect DataFrames to be able to do 
everything that SQL can and vice versa, but for the narrow purposes of this 
ticket I'm just interested in {{IN}} and {{NOT IN}}.
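
For what it's worth, the usual way to express these today without collecting values locally is a semi/anti join rather than a correlated subquery. A rough Scala sketch using the table and column names from the ticket description below (note the NULL handling differs slightly from SQL {{NOT IN}}):

{code:scala}
// NOT IN (SELECT name FROM table2), without collecting values locally:
val notIn = table1.join(table2.select("name"), Seq("name"), "left_anti")

// IN (SELECT name FROM table2) is the corresponding left semi join:
val isIn = table1.join(table2.select("name"), Seq("name"), "left_semi")

// Caveat: unlike SQL NOT IN, the anti join still returns rows when
// table2.name contains NULLs.
{code}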

> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
> SELECT name
> FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
> .where(
> ~col('name').isin(
> table2.select('name')
> )
> )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23847) Add asc_nulls_first, asc_nulls_last to PySpark

2018-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433314#comment-16433314
 ] 

Apache Spark commented on SPARK-23847:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/21035

> Add asc_nulls_first, asc_nulls_last to PySpark
> --
>
> Key: SPARK-23847
> URL: https://issues.apache.org/jira/browse/SPARK-23847
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> Column.scala and functions.scala have asc_nulls_first, asc_nulls_last, 
> desc_nulls_first and desc_nulls_last. Add the corresponding Python APIs in 
> PySpark. 
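
For reference, a rough sketch of the existing Scala API that the new PySpark functions mirror (the dataframe and column names are made up):

{code:scala}
import org.apache.spark.sql.functions.{asc_nulls_first, desc_nulls_last}

// Nulls sort before non-null values ascending, and after them descending.
val sorted = df.orderBy(asc_nulls_first("name"), desc_nulls_last("age"))
{code}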



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23955) typo in parameter name 'rawPredicition'

2018-04-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433302#comment-16433302
 ] 

Hyukjin Kwon commented on SPARK-23955:
--

Fixing a typo doesn't need a JIRA. Let's avoid this next time.

> typo in parameter name 'rawPredicition'
> ---
>
> Key: SPARK-23955
> URL: https://issues.apache.org/jira/browse/SPARK-23955
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: John Bauer
>Priority: Minor
>
> The classifier.py MultilayerPerceptronClassifier.__init__ API call had the typo 
> rawPredicition instead of rawPrediction.
> The typo is also present in the docs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23954) Converting a Spark dataframe containing int64 fields to an R dataframe leads to unpredictable errors.

2018-04-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433298#comment-16433298
 ] 

Hyukjin Kwon commented on SPARK-23954:
--

Can you check other JIRAs and see if there are duplicates? I am fairly sure there 
are duplicates of this, for example SPARK-14326 or SPARK-12360.

> Converting a Spark dataframe containing int64 fields to an R dataframe leads to 
> unpredictable errors.
> -
>
> Key: SPARK-23954
> URL: https://issues.apache.org/jira/browse/SPARK-23954
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: nicolas paris
>Priority: Minor
>
> Converting a Spark dataframe containing int64 fields to an R dataframe leads to 
> unpredictable errors. 
> The problem comes from R, which does not handle int64 natively. As a result, a 
> good workaround would be to convert bigint columns to strings when transforming 
> Spark dataframes into R dataframes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23950) Coalescing an empty dataframe to 1 partition

2018-04-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433296#comment-16433296
 ] 

Hyukjin Kwon commented on SPARK-23950:
--

This seems fixed in the current master. Let me leave this resolved, but it would be 
great if we could find which change fixed it and backport it if applicable.

> Coalescing an empty dataframe to 1 partition
> 
>
> Key: SPARK-23950
> URL: https://issues.apache.org/jira/browse/SPARK-23950
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Operating System: Windows 7
> Tested in Jupyter notebooks using Python 2.7.14 and Python 3.6.3.
> Hardware specs not relevant to the issue.
>Reporter: João Neves
>Priority: Major
>
> Coalescing an empty dataframe to 1 partition returns an error.
> The funny thing is that coalescing an empty dataframe to 2 or more partitions 
> seems to work.
> The test case is the following:
> {code}
> from pyspark.sql.types import StructType
> df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType([]))
> print(df.coalesce(2).count())
> print(df.coalesce(3).count())
> print(df.coalesce(4).count())
> df.coalesce(1).count(){code}
> Output:
> {code:java}
> 0
> 0
> 0
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> 7 print(df.coalesce(4).count())
> 8 
> > 9 print(df.coalesce(1).count())
> C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py in count(self)
> 425 2
> 426 """
> --> 427 return int(self._jdf.count())
> 428 
> 429 @ignore_unicode_prefix
> c:\python36\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
> 1131 answer = self.gateway_client.send_command(command)
> 1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
> 1134 
> 1135 for temp_arg in temp_args:
> C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
> 61 def deco(*a, **kw):
> 62 try:
> ---> 63 return f(*a, **kw)
> 64 except py4j.protocol.Py4JJavaError as e:
> 65 s = e.java_exception.toString()
> c:\python36\lib\site-packages\py4j\protocol.py in get_return_value(answer, 
> gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o176.count.
> : java.util.NoSuchElementException: next on empty iterator
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
> at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
> at scala.collection.IterableLike$class.head(IterableLike.scala:107)
> at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
> at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
> at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2435)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2434)
> at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
> at org.apache.spark.sql.Dataset.count(Dataset.scala:2434)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Unknown Source){code}
> Shouldn't this be consistent?
> Thank you very much.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19947) RFormulaModel always throws Exception on transforming data with NULL or Unseen labels

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19947.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

I'll mark this as complete. Those earlier PRs fixed some issues, and 
[SPARK-23562] should fix the rest.
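
For anyone landing here, a rough sketch of the handleInvalid parameter as it now exists on RFormula (the formula, dataframe, and column names are illustrative, and the exact semantics of "keep" for numeric columns are the subject of SPARK-23562):

{code:scala}
import org.apache.spark.ml.feature.RFormula

// "skip" drops rows with invalid (unseen or null) values instead of throwing;
// "keep" routes them to an extra bucket (or NaN for numeric columns).
val formula = new RFormula()
  .setFormula("label ~ f1 + f2")
  .setHandleInvalid("skip")

val model = formula.fit(trainingDf)
val scored = model.transform(scoringDf)   // no df.na.drop() needed
{code}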

> RFormulaModel always throws Exception on transforming data with NULL or 
> Unseen labels
> -
>
> Key: SPARK-19947
> URL: https://issues.apache.org/jira/browse/SPARK-19947
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Andrey Yatsuk
>Priority: Major
> Fix For: 2.4.0
>
>
> I have a trained ML model and a big data table in Parquet. I want to add a new 
> column to this table with predicted values. I can't lose any data, but the data 
> can have null values in it.
> The RFormulaModel.fit() method creates a new StringIndexer with the default 
> (handleInvalid="error") parameter. VectorAssembler also throws an Exception on 
> NULL values. So I must call df.na.drop() before transforming this DataFrame, 
> and I don't want to do that.
> We need to add to RFormula a new parameter like handleInvalid in StringIndexer,
> or add a transform(Seq): Vector method which the user can use as a UDF 
> in df.withColumn("predicted", functions.callUDF(rFormulaModel::transform, 
> Seq))



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23562.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

I think everything has been fixed, so I'll close this.  Thanks [~yogeshgarg] 
and [~huaxingao]!

> RFormula handleInvalid should handle invalid values in non-string columns.
> --
>
> Key: SPARK-23562
> URL: https://issues.apache.org/jira/browse/SPARK-23562
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently when handleInvalid is set to 'keep' or 'skip' this only applies to 
> String fields. Numeric fields that are null will either cause the transformer 
> to fail or might be null in the resulting label column.
> I'm not sure what the semantics of keep might be for numeric columns with 
> null values, but we should be able to at least support skip for these types.
> --> Discussed offline: null values can be converted to NaN values for "keep"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23562:
--
Shepherd: Joseph K. Bradley

> RFormula handleInvalid should handle invalid values in non-string columns.
> --
>
> Key: SPARK-23562
> URL: https://issues.apache.org/jira/browse/SPARK-23562
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently when handleInvalid is set to 'keep' or 'skip' this only applies to 
> String fields. Numeric fields that are null will either cause the transformer 
> to fail or might be null in the resulting label column.
> I'm not sure what the semantics of keep might be for numeric columns with 
> null values, but we should be able to at least support skip for these types.
> --> Discussed offline: null values can be converted to NaN values for "keep"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23337) withWatermark raises an exception on struct objects

2018-04-10 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433222#comment-16433222
 ] 

Michael Armbrust commented on SPARK-23337:
--

The checkpoint will only grow if you are doing an aggregation; otherwise the 
watermark will not affect the computation.

You can set a watermark on the nested column, you just need to project it to a 
top-level column using {{withColumn}} first.
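
Concretely, something along these lines (a sketch based on the schema and variable names in the report below):

{code:scala}
import org.apache.spark.sql.functions.col

// Project the nested field up to a top-level column, then watermark on it.
val withEventTime = jsonRow
  .withColumn("createTime", col("_source.createTime"))
  .withWatermark("createTime", "10 seconds")
{code}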

> withWatermark raises an exception on struct objects
> ---
>
> Key: SPARK-23337
> URL: https://issues.apache.org/jira/browse/SPARK-23337
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1
> Environment: Linux Ubuntu, Spark on standalone mode
>Reporter: Aydin Kocas
>Priority: Major
>
> Hi,
>  
> when using a nested object (I mean an object within a struct, here concretely 
> _source.createTime) from a JSON file as the parameter for the 
> withWatermark method, I get an exception (see below).
> Anything else works flawlessly with the nested object.
>  
> +*{color:#14892c}works:{color}*+ 
> {code:java}
> Dataset jsonRow = 
> spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("myTime",
>  "10 seconds").toDF();{code}
>  
> json structure:
> {code:java}
> root
>  |-- _id: string (nullable = true)
>  |-- _index: string (nullable = true)
>  |-- _score: long (nullable = true)
>  |-- myTime: timestamp (nullable = true)
> ..{code}
> +*{color:#d04437}does not work - nested json{color}:*+
> {code:java}
> Dataset jsonRow = 
> spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("_source.createTime",
>  "10 seconds").toDF();{code}
>  
> json structure:
>  
> {code:java}
> root
>  |-- _id: string (nullable = true)
>  |-- _index: string (nullable = true)
>  |-- _score: long (nullable = true)
>  |-- _source: struct (nullable = true)
>  | |-- createTime: timestamp (nullable = true)
> ..
>  
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree:
> 'EventTimeWatermark '_source.createTime, interval 10 seconds
> +- Deduplicate [_id#0], true
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@5dbbb292,json,List(),Some(StructType(StructField(_id,StringType,true),
>  StructField(_index,StringType,true), StructField(_score,LongType,true), 
> StructField(_source,StructType(StructField(additionalData,StringType,true), 
> StructField(client,StringType,true), 
> StructField(clientDomain,BooleanType,true), 
> StructField(clientVersion,StringType,true), 
> StructField(country,StringType,true), 
> StructField(countryName,StringType,true), 
> StructField(createTime,TimestampType,true), 
> StructField(externalIP,StringType,true), 
> StructField(hostname,StringType,true), 
> StructField(internalIP,StringType,true), 
> StructField(location,StringType,true), 
> StructField(locationDestination,StringType,true), 
> StructField(login,StringType,true), 
> StructField(originalRequestString,StringType,true), 
> StructField(password,StringType,true), 
> StructField(peerIdent,StringType,true), 
> StructField(peerType,StringType,true), 
> StructField(recievedTime,TimestampType,true), 
> StructField(sessionEnd,StringType,true), 
> StructField(sessionStart,StringType,true), 
> StructField(sourceEntryAS,StringType,true), 
> StructField(sourceEntryIp,StringType,true), 
> StructField(sourceEntryPort,StringType,true), 
> StructField(targetCountry,StringType,true), 
> StructField(targetCountryName,StringType,true), 
> StructField(targetEntryAS,StringType,true), 
> StructField(targetEntryIp,StringType,true), 
> StructField(targetEntryPort,StringType,true), 
> StructField(targetport,StringType,true), 
> StructField(username,StringType,true), 
> StructField(vulnid,StringType,true)),true), 
> StructField(_type,StringType,true))),List(),None,Map(path -> ./input/),None), 
> FileSource[./input/], [_id#0, _index#1, _score#2L, _source#3, _type#4]
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:854)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:796)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>  at 
> 

[jira] [Resolved] (SPARK-23944) Add Param set functions to LSHModel types

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23944.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21015
[https://github.com/apache/spark/pull/21015]

> Add Param set functions to LSHModel types
> -
>
> Key: SPARK-23944
> URL: https://issues.apache.org/jira/browse/SPARK-23944
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Assignee: Lu Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Two Param set methods (setInputCol, setOutputCol) are added to the two 
> LSHModel types, for min hash and random projections.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23944) Add Param set functions to LSHModel types

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-23944:
-

Assignee: Lu Wang

> Add Param set functions to LSHModel types
> -
>
> Key: SPARK-23944
> URL: https://issues.apache.org/jira/browse/SPARK-23944
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Assignee: Lu Wang
>Priority: Major
>
> Two Param set methods (setInputCol, setOutputCol) are added to the two 
> LSHModel types, for min hash and random projections.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23957) Sorts in subqueries are redundant and can be removed

2018-04-10 Thread Henry Robinson (JIRA)
Henry Robinson created SPARK-23957:
--

 Summary: Sorts in subqueries are redundant and can be removed
 Key: SPARK-23957
 URL: https://issues.apache.org/jira/browse/SPARK-23957
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Henry Robinson


Unless combined with a {{LIMIT}}, there's no correctness reason that planned 
and optimized subqueries should have any sort operators (since the result of 
the subquery is an unordered collection of tuples). 

For example:

{{SELECT count(1) FROM (select id FROM dft ORDER by id)}}

has the following plan:
{code:java}
== Physical Plan ==
*(3) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(2) Project
 +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
   +- *(1) Project [id#0L]
  +- *(1) FileScan parquet [id#0L] Batched: true, Format: 
Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct
{code}
... but the sort operator is redundant.

Less intuitively, the sort is also redundant in selections from an ordered 
subquery:

{{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}

has plan:
{code:java}
== Physical Plan ==
*(2) Sort [id#0L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
   +- *(1) Project [id#0L]
  +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct
{code}
... but again, since the subquery returns a bag of tuples, the sort is 
unnecessary.

We should consider adding an optimizer rule that removes a sort inside a 
subquery. SPARK-23375 is related, but removes sorts that are functionally 
redundant because they perform the same ordering.
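
To make the shape of such a rule concrete, here is a rough sketch using the experimental optimizer hook. It only handles the aggregate-over-sort pattern from the first plan above and is not meant as a complete or safe implementation (it ignores LIMIT, among other things):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Drop a Sort that sits directly under an Aggregate; the ordering cannot
// change the aggregate's result.
object RemoveSortBelowAggregate extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    case agg @ Aggregate(_, _, Sort(_, _, child)) => agg.copy(child = child)
  }
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.range(1000000).toDF("id").createOrReplaceTempView("dft")
spark.experimental.extraOptimizations ++= Seq(RemoveSortBelowAggregate)
spark.sql("SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)").explain()
{code}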
  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23871) add python api for VectorAssembler handleInvalid

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23871.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21003
[https://github.com/apache/spark/pull/21003]

> add python api for VectorAssembler handleInvalid
> 
>
> Key: SPARK-23871
> URL: https://issues.apache.org/jira/browse/SPARK-23871
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19680) Offsets out of range with no configured reset policy for partitions

2018-04-10 Thread Nicholas Verbeck (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433100#comment-16433100
 ] 

Nicholas Verbeck commented on SPARK-19680:
--

KAFKA-3370 is a good solution to the badly performing jobs problem from a central 
point.

However, regardless of that, Spark shouldn't just dictate functionality to 
users like that. It should instead leave it up to the user to assume 
responsibility if they wish to enable that setting, making notes and comments 
within Spark's docs of the potential issues until either Kafka and/or Spark can 
come up with a solution that isn't the removal of functionality. 

A side solution, until Kafka can be fixed, would be for Spark to evaluate the 
setting itself: if it is set, look up the current offsets at start and move them 
to the latest/earliest as requested, then switch to NONE for the continued run. 
This would prevent the issues that you appear to be wanting to prevent, while 
letting users keep the key part of the functionality they are looking for. 
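
To sketch what that could look like from the application side today (not a change in Spark itself): resolve the starting offsets once with a plain consumer, pass them explicitly to the direct stream, and leave the reset policy at "none" from then on. This assumes an existing StreamingContext {{ssc}}, a 0.10.1+ kafka-clients on the classpath (for beginningOffsets), and placeholder topic/group/deserializer names:

{code:scala}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "group.id" -> "toplist-job",
  "auto.offset.reset" -> "none",                 // fail fast once offsets are pinned
  "enable.auto.commit" -> (false: java.lang.Boolean),
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[ByteArrayDeserializer])

// Resolve "earliest" once, up front, with a plain consumer.
val probe = new KafkaConsumer[String, Array[Byte]](kafkaParams.asJava)
val partitions = probe.partitionsFor("SearchEvents").asScala
  .map(p => new TopicPartition(p.topic, p.partition))
val fromOffsets = probe.beginningOffsets(partitions.asJava).asScala
  .map { case (tp, off) => tp -> off.longValue }.toMap
probe.close()

val stream = KafkaUtils.createDirectStream[String, Array[Byte]](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, Array[Byte]](Seq("SearchEvents"), kafkaParams, fromOffsets))
{code}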

 

 

> Offsets out of range with no configured reset policy for partitions
> ---
>
> Key: SPARK-19680
> URL: https://issues.apache.org/jira/browse/SPARK-19680
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Schakmann Rene
>Priority: Major
>
> I'm using Spark Streaming with Kafka to actually create a toplist. I want to 
> read all the messages in Kafka. So I set
>"auto.offset.reset" -> "earliest"
> Nevertheless, when I start the job on our Spark cluster it is not working; I 
> get:
> Error:
> {code:title=error.log|borderStyle=solid}
>   Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, 
> most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, 
> executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
> Offsets out of range with no configured reset policy for partitions: 
> {SearchEvents-2=161803385}
> {code}
> This is somehow wrong because I did set the auto.offset.reset property
> Setup:
> Kafka Parameter:
> {code:title=Config.Scala|borderStyle=solid}
>   def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, 
> Object] = {
> Map(
>   "bootstrap.servers" -> 
> properties.getProperty("kafka.bootstrap.servers"),
>   "group.id" -> properties.getProperty("kafka.consumer.group"),
>   "auto.offset.reset" -> "earliest",
>   "spark.streaming.kafka.consumer.cache.enabled" -> "false",
>   "enable.auto.commit" -> "false",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer")
>   }
> {code}
> Job:
> {code:title=Job.Scala|borderStyle=solid}
>   def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, 
> Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: 
> Broadcast[KafkaSink[TopList]]): Unit = {
> getFilteredStream(stream.map(_.value()), windowDuration, 
> slideDuration).foreachRDD(rdd => {
>   val topList = new TopList
>   topList.setCreated(new Date())
>   topList.setTopListEntryList(rdd.take(TopListLength).toList)
>   CurrentLogger.info("TopList length: " + 
> topList.getTopListEntryList.size().toString)
>   kafkaSink.value.send(SendToTopicName, topList)
>   CurrentLogger.info("Last Run: " + System.currentTimeMillis())
> })
>   }
>   def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, 
> slideDuration: Int): DStream[TopListEntry] = {
> val Mapper = MapperObject.readerFor[SearchEventDTO]
> result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s))
>   .filter(s => s != null && s.getSearchRequest != null && 
> s.getSearchRequest.getSearchParameters != null && s.getVertical == 
> Vertical.BAP && 
> s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName))
>   .map(row => {
> val name = 
> row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase()
> (name, new TopListEntry(name, 1, row.getResultCount))
>   })
>   .reduceByKeyAndWindow(
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + 
> b.getMeanSearchHits),
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount - b.getSearchCount, a.getMeanSearchHits - 
> b.getMeanSearchHits),
> Minutes(windowDuration),
> Seconds(slideDuration))
>   .filter((x: (String, TopListEntry)) => x._2.getSearchCount > 200L)
>   .map(row => (row._2.getSearchCount, row._2))
>   .transform(rdd => 

[jira] [Created] (SPARK-23956) Use effective RPC port in AM registration

2018-04-10 Thread Gera Shegalov (JIRA)
Gera Shegalov created SPARK-23956:
-

 Summary: Use effective RPC port in AM registration 
 Key: SPARK-23956
 URL: https://issues.apache.org/jira/browse/SPARK-23956
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.3.0
Reporter: Gera Shegalov


AMs should use their actual RPC port in the AM registration, for better 
diagnostics in the Application Report.

{code}
18/04/10 14:56:21 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: localhost
ApplicationMaster RPC port: 58338
queue: default
start time: 1523397373659
final status: UNDEFINED
tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23871) add python api for VectorAssembler handleInvalid

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23871:
--
Shepherd: Joseph K. Bradley

> add python api for VectorAssembler handleInvalid
> 
>
> Key: SPARK-23871
> URL: https://issues.apache.org/jira/browse/SPARK-23871
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Assignee: Huaxin Gao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23871) add python api for VectorAssembler handleInvalid

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-23871:
-

Assignee: Huaxin Gao

> add python api for VectorAssembler handleInvalid
> 
>
> Key: SPARK-23871
> URL: https://issues.apache.org/jira/browse/SPARK-23871
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Assignee: Huaxin Gao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-23869) Spark 2.3.0 left outer join not emitting null values, instead waiting for the record in the other stream

2018-04-10 Thread bharath kumar avusherla (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bharath kumar avusherla closed SPARK-23869.
---

> Spark 2.3.0 left outer join not emitting null values, instead waiting for the 
> record in the other stream
> ---
>
> Key: SPARK-23869
> URL: https://issues.apache.org/jira/browse/SPARK-23869
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: bharath kumar avusherla
>Priority: Major
>
> A left outer join on two streams is not emitting the null outputs. It is just 
> waiting for the record to be added to the other stream. I used a socket stream to 
> test this. In our case we want to emit the records with null values that don't 
> match on id and/or don't fall in the time range condition.
> Details of the watermarks and intervals are:
> val ds1Map = ds1
>  .selectExpr("Id AS ds1_Id", "ds1_timestamp")
>  .withWatermark("ds1_timestamp","10 seconds")
> val ds2Map = ds2
>  .selectExpr("Id AS ds2_Id", "ds2_timestamp")
>  .withWatermark("ds2_timestamp", "20 seconds")
> val output = ds1Map.join( ds2Map,
>  expr(
>  """ ds1_Id = ds2_Id AND ds2_timestamp >= ds1_timestamp AND  ds2_timestamp <= 
> ds1_timestamp + interval 1 minutes """),
>  "leftOuter")
> val query = output.select("*")
>  .writeStream
> .outputMode(OutputMode.Append)
>  .format("console")
>  .option("checkpointLocation", "./ewe-spark-checkpoints/")
>  .start()
> query.awaitTermination()
> Thank you. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19680) Offsets out of range with no configured reset policy for partitions

2018-04-10 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433039#comment-16433039
 ] 

Cody Koeninger commented on SPARK-19680:


[~nerdynick]  If you submit a PR to add documentation I'd be happy to review it.

IMHO something like KAFKA-3370 is really where this issue "should" be fixed.  
In the absence of that, I think using time-based indexes would be the best 
workaround for making jobs easier to start.  If you have a constructive 
alternative suggestion that doesn't conflate offset reset at startup with 
silently losing data in the middle of a running app, I'd be happy to discuss it.

> Offsets out of range with no configured reset policy for partitions
> ---
>
> Key: SPARK-19680
> URL: https://issues.apache.org/jira/browse/SPARK-19680
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Schakmann Rene
>Priority: Major
>
> I'm using Spark Streaming with Kafka to actually create a toplist. I want to 
> read all the messages in Kafka. So I set
>"auto.offset.reset" -> "earliest"
> Nevertheless, when I start the job on our Spark cluster it is not working; I 
> get:
> Error:
> {code:title=error.log|borderStyle=solid}
>   Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, 
> most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, 
> executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
> Offsets out of range with no configured reset policy for partitions: 
> {SearchEvents-2=161803385}
> {code}
> This is somehow wrong because I did set the auto.offset.reset property
> Setup:
> Kafka Parameter:
> {code:title=Config.Scala|borderStyle=solid}
>   def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, 
> Object] = {
> Map(
>   "bootstrap.servers" -> 
> properties.getProperty("kafka.bootstrap.servers"),
>   "group.id" -> properties.getProperty("kafka.consumer.group"),
>   "auto.offset.reset" -> "earliest",
>   "spark.streaming.kafka.consumer.cache.enabled" -> "false",
>   "enable.auto.commit" -> "false",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer")
>   }
> {code}
> Job:
> {code:title=Job.Scala|borderStyle=solid}
>   def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, 
> Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: 
> Broadcast[KafkaSink[TopList]]): Unit = {
> getFilteredStream(stream.map(_.value()), windowDuration, 
> slideDuration).foreachRDD(rdd => {
>   val topList = new TopList
>   topList.setCreated(new Date())
>   topList.setTopListEntryList(rdd.take(TopListLength).toList)
>   CurrentLogger.info("TopList length: " + 
> topList.getTopListEntryList.size().toString)
>   kafkaSink.value.send(SendToTopicName, topList)
>   CurrentLogger.info("Last Run: " + System.currentTimeMillis())
> })
>   }
>   def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, 
> slideDuration: Int): DStream[TopListEntry] = {
> val Mapper = MapperObject.readerFor[SearchEventDTO]
> result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s))
>   .filter(s => s != null && s.getSearchRequest != null && 
> s.getSearchRequest.getSearchParameters != null && s.getVertical == 
> Vertical.BAP && 
> s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName))
>   .map(row => {
> val name = 
> row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase()
> (name, new TopListEntry(name, 1, row.getResultCount))
>   })
>   .reduceByKeyAndWindow(
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + 
> b.getMeanSearchHits),
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount - b.getSearchCount, a.getMeanSearchHits - 
> b.getMeanSearchHits),
> Minutes(windowDuration),
> Seconds(slideDuration))
>   .filter((x: (String, TopListEntry)) => x._2.getSearchCount > 200L)
>   .map(row => (row._2.getSearchCount, row._2))
>   .transform(rdd => rdd.sortByKey(ascending = false))
>   .map(row => new TopListEntry(row._2.getKeyword, row._2.getSearchCount, 
> row._2.getMeanSearchHits / row._2.getSearchCount))
>   }
>   def main(properties: Properties): Unit = {
> val sparkSession = SparkUtil.getDefaultSparkSession(properties, TaskName)
> val kafkaSink = 
> sparkSession.sparkContext.broadcast(KafkaSinkUtil.apply[TopList](SparkUtil.getDefaultSparkProperties(properties)))
> val 

[jira] [Commented] (SPARK-19680) Offsets out of range with no configured reset policy for partitions

2018-04-10 Thread Nicholas Verbeck (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433001#comment-16433001
 ] 

Nicholas Verbeck commented on SPARK-19680:
--

I just spent way too long on this. I thought I was doing something wrong or had a 
bug somewhere. I understand not throwing away data without notice. However, 
something does need to happen with this. If anything, at least the documentation 
needs to note this, or at the very least a comment within KafkaUtils.

However, I don't agree 100% with the throwing-away-of-data argument. You are 
overriding expected functionality provided by the KafkaConsumer. This is 
already well documented and covered within Kafka's documentation. I, as the 
user, already accept that this can happen by setting "auto.offset.reset" in the 
first place. That choice should not be forced upon me. A better option would be 
not to override the setting, and instead default to NONE rather than LATEST, 
allowing the user to make the choice themselves. 

The workaround of rewriting a bunch of code that already exists within the 
KafkaConsumer to provide the offsets at the start of the pipeline is not an 
acceptable answer to this.

> Offsets out of range with no configured reset policy for partitions
> ---
>
> Key: SPARK-19680
> URL: https://issues.apache.org/jira/browse/SPARK-19680
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Schakmann Rene
>Priority: Major
>
> I'm using Spark Streaming with Kafka to actually create a toplist. I want to 
> read all the messages in Kafka. So I set
>"auto.offset.reset" -> "earliest"
> Nevertheless, when I start the job on our Spark cluster it is not working; I 
> get:
> Error:
> {code:title=error.log|borderStyle=solid}
>   Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, 
> most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, 
> executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
> Offsets out of range with no configured reset policy for partitions: 
> {SearchEvents-2=161803385}
> {code}
> This is somehow wrong because I did set the auto.offset.reset property
> Setup:
> Kafka Parameter:
> {code:title=Config.Scala|borderStyle=solid}
>   def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, 
> Object] = {
> Map(
>   "bootstrap.servers" -> 
> properties.getProperty("kafka.bootstrap.servers"),
>   "group.id" -> properties.getProperty("kafka.consumer.group"),
>   "auto.offset.reset" -> "earliest",
>   "spark.streaming.kafka.consumer.cache.enabled" -> "false",
>   "enable.auto.commit" -> "false",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer")
>   }
> {code}
> Job:
> {code:title=Job.Scala|borderStyle=solid}
>   def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, 
> Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: 
> Broadcast[KafkaSink[TopList]]): Unit = {
> getFilteredStream(stream.map(_.value()), windowDuration, 
> slideDuration).foreachRDD(rdd => {
>   val topList = new TopList
>   topList.setCreated(new Date())
>   topList.setTopListEntryList(rdd.take(TopListLength).toList)
>   CurrentLogger.info("TopList length: " + 
> topList.getTopListEntryList.size().toString)
>   kafkaSink.value.send(SendToTopicName, topList)
>   CurrentLogger.info("Last Run: " + System.currentTimeMillis())
> })
>   }
>   def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, 
> slideDuration: Int): DStream[TopListEntry] = {
> val Mapper = MapperObject.readerFor[SearchEventDTO]
> result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s))
>   .filter(s => s != null && s.getSearchRequest != null && 
> s.getSearchRequest.getSearchParameters != null && s.getVertical == 
> Vertical.BAP && 
> s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName))
>   .map(row => {
> val name = 
> row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase()
> (name, new TopListEntry(name, 1, row.getResultCount))
>   })
>   .reduceByKeyAndWindow(
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + 
> b.getMeanSearchHits),
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount - b.getSearchCount, a.getMeanSearchHits - 
> b.getMeanSearchHits),
> Minutes(windowDuration),
> Seconds(slideDuration))
>   .filter((x: (String, TopListEntry)) 

[jira] [Commented] (SPARK-23926) High-order function: reverse(x) → array

2018-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432982#comment-16432982
 ] 

Apache Spark commented on SPARK-23926:
--

User 'mn-mikke' has created a pull request for this issue:
https://github.com/apache/spark/pull/21034

> High-order function: reverse(x) → array
> ---
>
> Key: SPARK-23926
> URL: https://issues.apache.org/jira/browse/SPARK-23926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array which has the reversed order of array x.
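
For context, a sketch of the intended behaviour once the pull request above is merged (reverse() already exists for strings; array support is what this issue adds, so the call below is an assumption until then; {{spark}} is the shell's SparkSession):

{code:scala}
// Expected once array support lands: [3, 2, 1]
spark.sql("SELECT reverse(array(1, 2, 3))").show(false)
{code}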



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23926) High-order function: reverse(x) → array

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23926:


Assignee: (was: Apache Spark)

> High-order function: reverse(x) → array
> ---
>
> Key: SPARK-23926
> URL: https://issues.apache.org/jira/browse/SPARK-23926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array which has the reversed order of array x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23926) High-order function: reverse(x) → array

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23926:


Assignee: Apache Spark

> High-order function: reverse(x) → array
> ---
>
> Key: SPARK-23926
> URL: https://issues.apache.org/jira/browse/SPARK-23926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array which has the reversed order of array x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20865) caching dataset throws "Queries with streaming sources must be executed with writeStream.start()"

2018-04-10 Thread hamroune zahir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432969#comment-16432969
 ] 

hamroune zahir commented on SPARK-20865:


This is a huge regression; in that sense we cannot use head, first, take, limit, 
cache, count...?!

> caching dataset throws "Queries with streaming sources must be executed with 
> writeStream.start()"
> -
>
> Key: SPARK-20865
> URL: https://issues.apache.org/jira/browse/SPARK-20865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.2, 2.1.0, 2.1.1
>Reporter: Martin Brišiak
>Priority: Major
> Fix For: 2.2.0, 2.3.0
>
>
> {code}
> SparkSession
>   .builder
>   .master("local[*]")
>   .config("spark.sql.warehouse.dir", "C:/tmp/spark")
>   .config("spark.sql.streaming.checkpointLocation", 
> "C:/tmp/spark/spark-checkpoint")
>   .appName("my-test")
>   .getOrCreate
>   .readStream
>   .schema(schema)
>   .json("src/test/data")
>   .cache
>   .writeStream
>   .start
>   .awaitTermination
> {code}
> While executing this sample in Spark I got an error. Without the .cache option it 
> worked as intended, but with the .cache option I got:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries 
> with streaming sources must be executed with writeStream.start();; 
> FileSource[src/test/data] at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:196)
>  at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
>  at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:33)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:33)
>  at 
> org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69)
>  at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:67)
>  at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
>  at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73)
>  at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
>  at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
>  at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:102)
>  at 
> org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65) 
> at 
> org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89) 
> at org.apache.spark.sql.Dataset.persist(Dataset.scala:2479) at 
> org.apache.spark.sql.Dataset.cache(Dataset.scala:2489) at 
> org.me.App$.main(App.scala:23) at org.me.App.main(App.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array

2018-04-10 Thread Dylan Guedes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432850#comment-16432850
 ] 

Dylan Guedes commented on SPARK-23931:
--

I would like to try this one.

> High-order function: zip(array1, array2[, ...]) → array
> 
>
> Key: SPARK-23931
> URL: https://issues.apache.org/jira/browse/SPARK-23931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the given arrays, element-wise, into a single array of rows. The M-th 
> element of the N-th argument will be the N-th field of the M-th output 
> element. If the arguments have an uneven length, missing values are filled 
> with NULL.
> {noformat}
> SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, 
> null), ROW(null, '3b')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23955) typo in parameter name 'rawPredicition'

2018-04-10 Thread John Bauer (JIRA)
John Bauer created SPARK-23955:
--

 Summary: typo in parameter name 'rawPredicition'
 Key: SPARK-23955
 URL: https://issues.apache.org/jira/browse/SPARK-23955
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: John Bauer


classifier.py MultilayerPerceptronClassifier.__init__ API call had a typo: 
rawPredicition instead of rawPrediction.

The typo is also present in the docs.
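A toy illustration of how this class of typo surfaces to users (hypothetical class, not the real pyspark signature):
{code:java}
class ToyClassifier:
    def __init__(self, rawPredicitionCol="rawPrediction"):  # note the misspelled keyword
        self.rawPredictionCol = rawPredicitionCol

ToyClassifier(rawPredictionCol="raw")
# TypeError: __init__() got an unexpected keyword argument 'rawPredictionCol'
{code}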



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23954) Converting spark dataframe containing int64 fields to R dataframes leads to impredictable errors.

2018-04-10 Thread nicolas paris (JIRA)
nicolas paris created SPARK-23954:
-

 Summary: Converting spark dataframe containing int64 fields to R 
dataframes leads to impredictable errors.
 Key: SPARK-23954
 URL: https://issues.apache.org/jira/browse/SPARK-23954
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: nicolas paris


Converting a Spark dataframe containing int64 fields to an R dataframe leads to 
unpredictable errors. 

The problem comes from R, which does not handle int64 natively. A good workaround 
would therefore be to convert bigint columns to strings when transforming Spark 
dataframes into R dataframes.
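A small sketch of why raw int64 values are unsafe once they land in R, whose numeric type is a 64-bit double (plain Python is used here only to show the precision limit):
{code:java}
big = 2 ** 53 + 1
float(big) == float(big - 1)  # True: distinct bigint values collapse once stored as doubles
str(big)                      # '9007199254740993': casting to string preserves the exact value
{code}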



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2018-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432796#comment-16432796
 ] 

Apache Spark commented on SPARK-19320:
--

User 'yanji84' has created a pull request for this issue:
https://github.com/apache/spark/pull/21033

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Priority: Major
>
> Currently the only configuration for using GPUs with Mesos is setting the 
> maximum number of GPUs a job will take from an offer, but this doesn't guarantee 
> exactly how many it gets.
> We should have a configuration that sets this.
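For context, the existing knob is a cap rather than a guarantee; a hedged sketch of setting it (spark.mesos.gpus.max is the current configuration, the guaranteed-amount setting proposed here does not exist yet):
{code:java}
from pyspark import SparkConf

# caps the GPUs accepted from a Mesos offer; it does not reserve a guaranteed amount
conf = SparkConf().setAppName("gpu-job").set("spark.mesos.gpus.max", "2")
{code}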



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23912) High-order function: array_distinct(x) → array

2018-04-10 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432780#comment-16432780
 ] 

Huaxin Gao commented on SPARK-23912:


I will work on this. Thanks!

> High-order function: array_distinct(x) → array
> --
>
> Key: SPARK-23912
> URL: https://issues.apache.org/jira/browse/SPARK-23912
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Remove duplicate values from the array x.
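A plain-Python sketch of the described semantics, dropping duplicates while keeping first-seen order (not the proposed Spark implementation):
{code:java}
list(dict.fromkeys([1, 2, 2, 3, 1]))
# [1, 2, 3]
{code}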



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21856:
--
Fix Version/s: 2.3.0

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Chunsheng Ji
>Priority: Minor
> Fix For: 2.3.0
>
>
> SPARK-12664 exposed probability in MultilayerPerceptronClassifier, so the 
> Python API also needs to be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes

2018-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432746#comment-16432746
 ] 

Apache Spark commented on SPARK-23529:
--

User 'madanadit' has created a pull request for this issue:
https://github.com/apache/spark/pull/21032

> Specify hostpath volume and mount the volume in Spark driver and executor 
> pods in Kubernetes
> 
>
> Key: SPARK-23529
> URL: https://issues.apache.org/jira/browse/SPARK-23529
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Suman Somasundar
>Assignee: Anirudh Ramanathan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array

2018-04-10 Thread H Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432739#comment-16432739
 ] 

H Lu commented on SPARK-23928:
--

Can I take this one?

> High-order function: shuffle(x) → array
> ---
>
> Key: SPARK-23928
> URL: https://issues.apache.org/jira/browse/SPARK-23928
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Generate a random permutation of the given array x.
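A plain-Python sketch of the described behaviour (not the proposed Spark implementation):
{code:java}
import random

random.sample([1, 2, 3, 4], k=4)
# e.g. [3, 1, 4, 2]: a random permutation that leaves the input list untouched
{code}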



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8571) spark streaming hanging processes upon build exit

2018-04-10 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp resolved SPARK-8571.

Resolution: Not A Problem

> spark streaming hanging processes upon build exit
> -
>
> Key: SPARK-8571
> URL: https://issues.apache.org/jira/browse/SPARK-8571
> Project: Spark
>  Issue Type: Bug
>  Components: Build, DStreams
> Environment: centos 6.6 amplab build system
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Minor
>  Labels: build, test
>
> over the past 3 months i've been noticing that there are occasionally hanging 
> processes on our build system workers after various spark builds have 
> finished.  these are all spark streaming processes.
> today i noticed a 3+ hour spark build that was timed out after 200 minutes 
> (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/),
>  and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on 
> amp-jenkins-worker-02.  after the timeout, it left the following process (and 
> all of its children) hanging.
> the process' CLI command was:
> {quote}
> [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714
> jenkins1714  733  2.7 21342148 3642740 ?Sl   07:52 1713:41 java 
> -Dderby.system.durability=test -Djava.awt.headless=true 
> -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp
>  -Dspark.driver.allowMultipleContexts=true 
> -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos
>  -Dspark.testing=1 -Dspark.ui.enabled=false 
> -Dspark.ui.showConsoleProgress=false 
> -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming
>  -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m 
> org.scalatest.tools.Runner -R 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes
>  
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes
>  -o -f 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt
>  -u 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/.
> {quote}
> stracing that process doesn't give us much:
> {quote}
> [root@amp-jenkins-worker-02 ~]# strace -p 1714
> Process 1714 attached - interrupt to quit
> futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL
> {quote}
> stracing its children gives us a *little* bit more...  some loop like this:
> {quote}
> 
> futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0
> futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, 
> ) = -1 ETIMEDOUT (Connection timed out)
> {quote}
> and others loop on prtrace_attach (no such process) or restart_syscall 
> (resuming interrupted call)
> even though this behavior has been solidly pinned to jobs timing out (which 
> ends w/an aborted, not failed, build), i've seen it happen for failed builds 
> as well.  if i see any hanging processes from failed (not aborted) builds, i 
> will investigate them and update this bug as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-23751:
-

Assignee: Weichen Xu

> Kolmogorov-Smirnoff test Python API in pyspark.ml
> -
>
> Key: SPARK-23751
> URL: https://issues.apache.org/jira/browse/SPARK-23751
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> Python wrapper for new DataFrame-based API for KS test



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23751.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20904
[https://github.com/apache/spark/pull/20904]

> Kolmogorov-Smirnoff test Python API in pyspark.ml
> -
>
> Key: SPARK-23751
> URL: https://issues.apache.org/jira/browse/SPARK-23751
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> Python wrapper for new DataFrame-based API for KS test



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit

2018-04-10 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432725#comment-16432725
 ] 

shane knapp commented on SPARK-8571:


just doing some email archaeology and found this.

no, it's not an issue anymore.

 

> spark streaming hanging processes upon build exit
> -
>
> Key: SPARK-8571
> URL: https://issues.apache.org/jira/browse/SPARK-8571
> Project: Spark
>  Issue Type: Bug
>  Components: Build, DStreams
> Environment: centos 6.6 amplab build system
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Minor
>  Labels: build, test
>
> over the past 3 months i've been noticing that there are occasionally hanging 
> processes on our build system workers after various spark builds have 
> finished.  these are all spark streaming processes.
> today i noticed a 3+ hour spark build that was timed out after 200 minutes 
> (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/),
>  and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on 
> amp-jenkins-worker-02.  after the timeout, it left the following process (and 
> all of its children) hanging.
> the process' CLI command was:
> {quote}
> [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714
> jenkins1714  733  2.7 21342148 3642740 ?Sl   07:52 1713:41 java 
> -Dderby.system.durability=test -Djava.awt.headless=true 
> -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp
>  -Dspark.driver.allowMultipleContexts=true 
> -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos
>  -Dspark.testing=1 -Dspark.ui.enabled=false 
> -Dspark.ui.showConsoleProgress=false 
> -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming
>  -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m 
> org.scalatest.tools.Runner -R 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes
>  
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes
>  -o -f 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt
>  -u 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/.
> {quote}
> stracing that process doesn't give us much:
> {quote}
> [root@amp-jenkins-worker-02 ~]# strace -p 1714
> Process 1714 attached - interrupt to quit
> futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL
> {quote}
> stracing its children gives us a *little* bit more...  some loop like this:
> {quote}
> 
> futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0
> futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, 
> ) = -1 ETIMEDOUT (Connection timed out)
> {quote}
> and others loop on prtrace_attach (no such process) or restart_syscall 
> (resuming interrupted call)
> even though this behavior has been solidly pinned to jobs timing out (which 
> ends w/an aborted, not failed, build), i've seen it happen for failed builds 
> as well.  if i see any hanging processes from failed (not aborted) builds, i 
> will investigate them and update this bug as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8696) Streaming API for Online LDA

2018-04-10 Thread Joey Frazee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432686#comment-16432686
 ] 

Joey Frazee commented on SPARK-8696:


Is there still interest in this? The two use cases I've seen for this are (1) 
low-latency, near real-time topic generation (imagine a dashboard or process 
depending on _always_ up-to-date topic distributions) and (2) the desire to do 
updates rather than fitting the entire dataset again because it is very large or 
very expensive to pre-process, though maybe merely having topic-word priors 
such as suggested in SPARK-9134 could be a good enough alternative for this 
second use case. I've seen both of those requirements appear in tandem.

 

> Streaming API for Online LDA
> 
>
> Key: SPARK-8696
> URL: https://issues.apache.org/jira/browse/SPARK-8696
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>Priority: Major
>
> Streaming LDA can be a natural extension from online LDA. 
> Yet for now we need to settle down the implementation for LDA prediction, to 
> support the predictOn method in the streaming version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23923) High-order function: cardinality(x) → bigint

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23923:


Assignee: Apache Spark

> High-order function: cardinality(x) → bigint
> 
>
> Key: SPARK-23923
> URL: https://issues.apache.org/jira/browse/SPARK-23923
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html and  
> https://prestodb.io/docs/current/functions/map.html.
> Returns the cardinality (size) of the array/map x.
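Spark SQL already exposes size() on arrays and maps, which covers the ground the new function would presumably mirror; a quick check (a sketch, assuming an active SparkSession):
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("SELECT size(array(1, 2, 3)) AS n, size(map('a', 1)) AS m").show()
# n = 3, m = 1
{code}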



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23923) High-order function: cardinality(x) → bigint

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23923:


Assignee: (was: Apache Spark)

> High-order function: cardinality(x) → bigint
> 
>
> Key: SPARK-23923
> URL: https://issues.apache.org/jira/browse/SPARK-23923
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html and  
> https://prestodb.io/docs/current/functions/map.html.
> Returns the cardinality (size) of the array/map x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23923) High-order function: cardinality(x) → bigint

2018-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432584#comment-16432584
 ] 

Apache Spark commented on SPARK-23923:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21031

> High-order function: cardinality(x) → bigint
> 
>
> Key: SPARK-23923
> URL: https://issues.apache.org/jira/browse/SPARK-23923
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html and  
> https://prestodb.io/docs/current/functions/map.html.
> Returns the cardinality (size) of the array/map x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23953) Add get_json_scalar function

2018-04-10 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-23953:


 Summary: Add get_json_scalar function
 Key: SPARK-23953
 URL: https://issues.apache.org/jira/browse/SPARK-23953
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0
Reporter: Timothy Chen


Besides get_json_object, which returns a JSON string as its return type, we should 
add a function "get_json_scalar" that returns a scalar type, assuming the path 
maps to a scalar (boolean, number, string or null). It returns null when the 
path points to an object or array structure.

This is also in Presto (https://prestodb.io/docs/current/functions/json.html).
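For comparison, the existing get_json_object already handles the path extraction but always comes back as JSON text; a sketch (assuming an active SparkSession named spark):
{code:java}
spark.sql("""
  SELECT get_json_object('{"a": 1, "b": [1, 2]}', '$.a') AS a,
         get_json_object('{"a": 1, "b": [1, 2]}', '$.b') AS b
""").show(truncate=False)
# both columns are strings of JSON text; the proposed get_json_scalar would instead
# return a typed 1 for '$.a' and NULL for '$.b'
{code}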



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-04-10 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432495#comment-16432495
 ] 

Attila Zsolt Piros commented on SPARK-16630:


Let me illustrate my problem with an example:
- the limit for blacklisted nodes is configured to 2
- we have one node blacklisted close to the YARN allocator ("host1" -> 
expiryTime1); this is the new code I am working on
- the scheduler requests new executors along with the (task-level) blacklisted 
nodes: "host2", "host3" 
(org.apache.spark.deploy.yarn.YarnAllocator#requestTotalExecutorsWithPreferredLocalities)

So I have to choose 2 nodes to communicate towards YARN. My idea is to pass 
expiryTime2 and expiryTime3 to the YarnAllocator so it can choose the 2 most 
relevant nodes (the ones which expire later are the more relevant ones).
For this, in the case class RequestExecutors the nodeBlacklist field type is 
changed from Set[String] to Map[String, Long].
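A tiny sketch of the selection step described above, i.e. keeping the entries that expire last when the limit is smaller than the blacklist (plain Python; the names are illustrative rather than Spark's):
{code:java}
import heapq

def pick_nodes_to_report(expiry_by_node, limit):
    # keep the `limit` nodes whose blacklist entries expire last (the freshest ones)
    return heapq.nlargest(limit, expiry_by_node, key=expiry_by_node.get)

pick_nodes_to_report({"host1": 170, "host2": 120, "host3": 150}, limit=2)
# ['host1', 'host3']
{code}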

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, it's possible that a node is messed up or misconfigured such that a 
> container won't launch on it.  For instance, the Spark external shuffle 
> handler didn't get loaded on it, or maybe it's just some other hardware or 
> Hadoop configuration issue. 
> It would be nice if we could recognize this happening and stop trying to launch 
> executors on it, since that could end up causing us to hit our max number of 
> executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23952) remove type parameter in DataReaderFactory

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23952:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove type parameter in DataReaderFactory
> --
>
> Key: SPARK-23952
> URL: https://issues.apache.org/jira/browse/SPARK-23952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23952) remove type parameter in DataReaderFactory

2018-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432472#comment-16432472
 ] 

Apache Spark commented on SPARK-23952:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21029

> remove type parameter in DataReaderFactory
> --
>
> Key: SPARK-23952
> URL: https://issues.apache.org/jira/browse/SPARK-23952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23864) Add Unsafe* copy methods to UnsafeWriter

2018-04-10 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-23864.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

> Add Unsafe* copy methods to UnsafeWriter
> 
>
> Key: SPARK-23864
> URL: https://issues.apache.org/jira/browse/SPARK-23864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23952) remove type parameter in DataReaderFactory

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23952:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove type parameter in DataReaderFactory
> --
>
> Key: SPARK-23952
> URL: https://issues.apache.org/jira/browse/SPARK-23952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23952) remove type parameter in DataReaderFactory

2018-04-10 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-23952:
---

 Summary: remove type parameter in DataReaderFactory
 Key: SPARK-23952
 URL: https://issues.apache.org/jira/browse/SPARK-23952
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-04-10 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432467#comment-16432467
 ] 

Thomas Graves commented on SPARK-16630:
---

Sorry, I don't follow. The list we get from the blacklist tracker is all nodes 
that are currently blacklisted and haven't yet reached the expiry that would 
unblacklist them.  You just union them with the YARN allocator list.  There is 
obviously some race condition there if one of the nodes is just about to be 
unblacklisted, but I don't see that as a major issue; the next allocation will 
not have it.  Is there something I'm missing?

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, it's possible that a node is messed up or misconfigured such that a 
> container won't launch on it.  For instance, the Spark external shuffle 
> handler didn't get loaded on it, or maybe it's just some other hardware or 
> Hadoop configuration issue. 
> It would be nice if we could recognize this happening and stop trying to launch 
> executors on it, since that could end up causing us to hit our max number of 
> executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column

2018-04-10 Thread Ed Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432463#comment-16432463
 ] 

Ed Lee commented on SPARK-20617:


Thank you for the clarification.  So conversely:
{code:java}
spark.sql("select null NOT in ('a')")
{code}
evaluates to null.  And when the filter is applied, null == False likewise 
evaluates to null rather than true, so the filter doesn't return those rows.

I see now that Pandas doesn't follow the SQL standard here:

test_df.query("col1 not in (['a'])")

 

> pyspark.sql filtering fails when using ~isin when there are nulls in column
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04, Python 3.5
>Reporter: Ed Lee
>Priority: Major
>
> Hello, I encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null 
> are missing:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add an OR isNull condition:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23922) High-order function: arrays_overlap(x, y) → boolean

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23922:


Assignee: (was: Apache Spark)

> High-order function: arrays_overlap(x, y) → boolean
> ---
>
> Key: SPARK-23922
> URL: https://issues.apache.org/jira/browse/SPARK-23922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Tests if arrays x and y have any non-null elements in common. Returns 
> null if there are no non-null elements in common but either array contains 
> null.
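A plain-Python sketch of the quoted semantics (not the proposed Spark implementation):
{code:java}
def arrays_overlap(x, y):
    xs, ys = set(x), set(y)
    if (xs - {None}) & (ys - {None}):
        return True   # at least one non-null element in common
    if None in xs or None in ys:
        return None   # unknown: a null element could have matched
    return False
{code}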



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23922) High-order function: arrays_overlap(x, y) → boolean

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23922:


Assignee: Apache Spark

> High-order function: arrays_overlap(x, y) → boolean
> ---
>
> Key: SPARK-23922
> URL: https://issues.apache.org/jira/browse/SPARK-23922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Tests if arrays x and y have any non-null elements in common. Returns 
> null if there are no non-null elements in common but either array contains 
> null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23922) High-order function: arrays_overlap(x, y) → boolean

2018-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432423#comment-16432423
 ] 

Apache Spark commented on SPARK-23922:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21028

> High-order function: arrays_overlap(x, y) → boolean
> ---
>
> Key: SPARK-23922
> URL: https://issues.apache.org/jira/browse/SPARK-23922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Tests if arrays x and y have any non-null elements in common. Returns 
> null if there are no non-null elements in common but either array contains 
> null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-04-10 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432415#comment-16432415
 ] 

Attila Zsolt Piros commented on SPARK-16630:


I would need the expiry times to choose the most relevant (freshest) subset 
of nodes to blacklist when the limit is less than the size of the union of all 
blacklist-able nodes. So the expiry time is only used for sorting.


> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, it's possible that a node is messed up or misconfigured such that a 
> container won't launch on it.  For instance, the Spark external shuffle 
> handler didn't get loaded on it, or maybe it's just some other hardware or 
> Hadoop configuration issue. 
> It would be nice if we could recognize this happening and stop trying to launch 
> executors on it, since that could end up causing us to hit our max number of 
> executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23943) Improve observability of MesosRestServer/MesosClusterDispatcher

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23943:


Assignee: (was: Apache Spark)

> Improve observability of MesosRestServer/MesosClusterDispatcher
> ---
>
> Key: SPARK-23943
> URL: https://issues.apache.org/jira/browse/SPARK-23943
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Mesos
>Affects Versions: 2.2.1, 2.3.0
> Environment:  
>  
>Reporter: paul mackles
>Priority: Minor
> Fix For: 2.4.0
>
>
> Two changes in this PR:
>  * A /health endpoint for a quick binary indication on the health of 
> MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a 
> marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. 
> Returns a 503 status if the server is unhealthy and a 200 if the server is 
> healthy
>  * A /status endpoint for a more detailed examination on the current state of 
> a MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool
> For both endpoints, regardless of status code, the following body is returned:
>  
> {code:java}
> {
>   "action" : "ServerStatusResponse",
>   "launchedDrivers" : 0,
>   "message" : "iamok",
>   "queuedDrivers" : 0,
>   "schedulerDriverStopped" : false,
>   "serverSparkVersion" : "2.3.1-SNAPSHOT",
>   "success" : true,
>   "pendingRetryDrivers" : 0
> }{code}
> Aside from surfacing all of the scheduler metrics, the response also includes 
> the status of the Mesos SchedulerDriver. On numerous occasions now, we have 
> observed scenarios where the Mesos SchedulerDriver quietly exits due to some 
> other failure. When this happens, jobs queue up and the only way to clean 
> things up is to restart the service. 
> With the above health check, marathon can be configured to automatically 
> restart the MesosClusterDispatcher service when the health check fails, 
> lessening the need for manual intervention.
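A hedged sketch of how a marathon-style probe could consume the proposed /health endpoint (the endpoint exists only in this proposal; the URL and port are assumptions):
{code:java}
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def dispatcher_is_healthy(base_url="http://localhost:8081"):
    try:
        # 200 means healthy per the description above; a 503 surfaces as HTTPError
        return urlopen(base_url + "/health", timeout=5).status == 200
    except (HTTPError, URLError):
        return False
{code}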



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23943) Improve observability of MesosRestServer/MesosClusterDispatcher

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23943:


Assignee: Apache Spark

> Improve observability of MesosRestServer/MesosClusterDispatcher
> ---
>
> Key: SPARK-23943
> URL: https://issues.apache.org/jira/browse/SPARK-23943
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Mesos
>Affects Versions: 2.2.1, 2.3.0
> Environment:  
>  
>Reporter: paul mackles
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.4.0
>
>
> Two changes in this PR:
>  * A /health endpoint for a quick binary indication on the health of 
> MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a 
> marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. 
> Returns a 503 status if the server is unhealthy and a 200 if the server is 
> healthy
>  * A /status endpoint for a more detailed examination on the current state of 
> a MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool
> For both endpoints, regardless of status code, the following body is returned:
>  
> {code:java}
> {
>   "action" : "ServerStatusResponse",
>   "launchedDrivers" : 0,
>   "message" : "iamok",
>   "queuedDrivers" : 0,
>   "schedulerDriverStopped" : false,
>   "serverSparkVersion" : "2.3.1-SNAPSHOT",
>   "success" : true,
>   "pendingRetryDrivers" : 0
> }{code}
> Aside from surfacing all of the scheduler metrics, the response also includes 
> the status of the Mesos SchedulerDriver. On numerous occasions now, we have 
> observed scenarios where the Mesos SchedulerDriver quietly exits due to some 
> other failure. When this happens, jobs queue up and the only way to clean 
> things up is to restart the service. 
> With the above health check, marathon can be configured to automatically 
> restart the MesosClusterDispatcher service when the health check fails, 
> lessening the need for manual intervention.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23943) Improve observability of MesosRestServer/MesosClusterDispatcher

2018-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432379#comment-16432379
 ] 

Apache Spark commented on SPARK-23943:
--

User 'pmackles' has created a pull request for this issue:
https://github.com/apache/spark/pull/21027

> Improve observability of MesosRestServer/MesosClusterDispatcher
> ---
>
> Key: SPARK-23943
> URL: https://issues.apache.org/jira/browse/SPARK-23943
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Mesos
>Affects Versions: 2.2.1, 2.3.0
> Environment:  
>  
>Reporter: paul mackles
>Priority: Minor
> Fix For: 2.4.0
>
>
> Two changes in this PR:
>  * A /health endpoint for a quick binary indication on the health of 
> MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a 
> marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. 
> Returns a 503 status if the server is unhealthy and a 200 if the server is 
> healthy
>  * A /status endpoint for a more detailed examination on the current state of 
> a MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool
> For both endpoints, regardless of status code, the following body is returned:
>  
> {code:java}
> {
>   "action" : "ServerStatusResponse",
>   "launchedDrivers" : 0,
>   "message" : "iamok",
>   "queuedDrivers" : 0,
>   "schedulerDriverStopped" : false,
>   "serverSparkVersion" : "2.3.1-SNAPSHOT",
>   "success" : true,
>   "pendingRetryDrivers" : 0
> }{code}
> Aside from surfacing all of the scheduler metrics, the response also includes 
> the status of the Mesos SchedulerDriver. On numerous occasions now, we have 
> observed scenarios where the Mesos SchedulerDriver quietly exits due to some 
> other failure. When this happens, jobs queue up and the only way to clean 
> things up is to restart the service. 
> With the above health check, marathon can be configured to automatically 
> restart the MesosClusterDispatcher service when the health check fails, 
> lessening the need for manual intervention.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23951) Use java classed in ExprValue and simplify a bunch of stuff

2018-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432375#comment-16432375
 ] 

Apache Spark commented on SPARK-23951:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/21026

> Use java classed in ExprValue and simplify a bunch of stuff
> ---
>
> Key: SPARK-23951
> URL: https://issues.apache.org/jira/browse/SPARK-23951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23951) Use java classed in ExprValue and simplify a bunch of stuff

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23951:


Assignee: Herman van Hovell  (was: Apache Spark)

> Use java classed in ExprValue and simplify a bunch of stuff
> ---
>
> Key: SPARK-23951
> URL: https://issues.apache.org/jira/browse/SPARK-23951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23951) Use java classed in ExprValue and simplify a bunch of stuff

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23951:


Assignee: Apache Spark  (was: Herman van Hovell)

> Use java classed in ExprValue and simplify a bunch of stuff
> ---
>
> Key: SPARK-23951
> URL: https://issues.apache.org/jira/browse/SPARK-23951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23888) speculative task should not run on a given host where another attempt is already running on

2018-04-10 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-23888:
-
Labels: speculation  (was: )

> speculative task should not run on a given host where another attempt is 
> already running on
> ---
>
> Key: SPARK-23888
> URL: https://issues.apache.org/jira/browse/SPARK-23888
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Major
>  Labels: speculation
> Fix For: 2.3.0
>
>
> There's a bug in:
> {code:java}
> /** Check whether a task is currently running an attempt on a given host */
>  private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = {
>taskAttempts(taskIndex).exists(_.host == host)
>  }
> {code}
> This will ignore hosts which have finished attempts, so we should check 
> whether the attempt is currently running on the given host. 
> And it is possible for a speculative task to run on a host where another 
> attempt failed here before.
> Assume we have only two machines: host1, host2.  We first run task0.0 on 
> host1. Then, due to  a long time waiting for task0.0, we launch a speculative 
> task0.1 on host2. And, task0.1 finally failed on host1, but it can not re-run 
> since there's already  a copy running on host2. After another long time, we 
> launch a new  speculative task0.2. And, now, we can run task0.2 on host1 
> again, since there's no more running attempt on host1.
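A generic sketch of the corrected check described above, counting only attempts that are still running (field names are illustrative, not Spark's TaskSetManager internals):
{code:java}
def has_running_attempt_on_host(task_attempts, task_index, host):
    # finished or failed attempts no longer block a speculative copy on that host
    return any(a.host == host and a.running for a in task_attempts[task_index])
{code}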



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23888) speculative task should not run on a given host where another attempt is already running on

2018-04-10 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-23888:
-
Component/s: Scheduler

> speculative task should not run on a given host where another attempt is 
> already running on
> ---
>
> Key: SPARK-23888
> URL: https://issues.apache.org/jira/browse/SPARK-23888
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Major
>  Labels: speculation
> Fix For: 2.3.0
>
>
> There's a bug in:
> {code:java}
> /** Check whether a task is currently running an attempt on a given host */
>  private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = {
>taskAttempts(taskIndex).exists(_.host == host)
>  }
> {code}
> This will ignore hosts which have finished attempts, so we should check 
> whether the attempt is currently running on the given host. 
> And it is possible for a speculative task to run on a host where another 
> attempt failed here before.
> Assume we have only two machines: host1, host2.  We first run task0.0 on 
> host1. Then, due to  a long time waiting for task0.0, we launch a speculative 
> task0.1 on host2. And, task0.1 finally failed on host1, but it can not re-run 
> since there's already  a copy running on host2. After another long time, we 
> launch a new  speculative task0.2. And, now, we can run task0.2 on host1 
> again, since there's no more running attempt on host1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-10 Thread Omri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432344#comment-16432344
 ] 

Omri edited comment on SPARK-23929 at 4/10/18 2:22 PM:
---

[~icexelloss], I couldn't recreate the problem I had where the order was mixed, 
but I have a different example to illustrate the problem.

Here the schema struct is [id,zeros,ones] but the user returned a data frame 
with [id,ones,zeros]. The column names are taken from the provided schema and 
not from the explicitly mentioned data frame.

 
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pandas as pd
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))  

schema = StructType([
    StructField("id", LongType()),
    StructField("zeros", DoubleType()),
    StructField("ones", DoubleType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)  
def constants(grp):
    return pd.DataFrame({"id":grp.iloc[0]['id'],"ones":1,"zeros":0},index = [0])
    
df.groupby("id").apply(constants).show()
{code}
results:
{code:java}
+---+-++
| id|zeros|ones|
+---+-++
|  1|  1.0| 0.0|
|  2|  1.0| 0.0|
+---+-++
{code}
So the user was expecting ones to have 1 and zeros to have 0 but they get wrong 
results.

 


was (Author: omri374):
[~icexelloss], I couldn't recreate the problem I had where the order was mixed, 
but I have a different example to illustrate the problem.

Here the schema struct is [id,zeros,ones] but the user returned a data frame 
with [id,ones,zeros]. The column names are taken from the provided schema and 
not from the explicitly mentioned data frame.

 
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pandas as pd
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))  

schema = StructType([
    StructField("id", LongType()),
    StructField("zeros", DoubleType()),
    StructField("ones", DoubleType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)  
def median_per_group(grp):
    return pd.DataFrame({"id":grp.iloc[0]['id'],"ones":1,"zeros":0},index = [0])
    
df.groupby("id").apply(median_per_group).show()
{code}
results:
{code:java}
+---+-++
| id|zeros|ones|
+---+-++
|  1|  1.0| 0.0|
|  2|  1.0| 0.0|
+---+-++
{code}
So the user was expecting ones to have 1 and zeros to have 0 but they get wrong 
results.

 

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-10 Thread Omri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432344#comment-16432344
 ] 

Omri commented on SPARK-23929:
--

[~icexelloss], I couldn't recreate the problem I had where the order was mixed, 
but I have a different example to illustrate the problem.

Here the schema struct is [id,zeros,ones] but the user returned a data frame 
with [id,ones,zeros]. The column names are taken from the provided schema and 
not from the explicitly mentioned data frame.

 
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pandas as pd
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))  

schema = StructType([
    StructField("id", LongType()),
    StructField("zeros", DoubleType()),
    StructField("ones", DoubleType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)  
def median_per_group(grp):
    return pd.DataFrame({"id":grp.iloc[0]['id'],"ones":1,"zeros":0},index = [0])
    
df.groupby("id").apply(median_per_group).show()
{code}
results:
{code:java}
+---+-++
| id|zeros|ones|
+---+-++
|  1|  1.0| 0.0|
|  2|  1.0| 0.0|
+---+-++
{code}
So the user was expecting ones to have 1 and zeros to have 0 but they get wrong 
results.
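One workaround while the mapping is positional is to reorder the returned columns to match the declared schema (a sketch reusing df and schema from the example above):
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd

field_order = [f.name for f in schema.fields]  # ['id', 'zeros', 'ones']

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def constants(grp):
    out = pd.DataFrame({"id": grp.iloc[0]["id"], "ones": 1, "zeros": 0}, index=[0])
    return out[field_order]  # columns now line up positionally with the schema

df.groupby("id").apply(constants).show()
{code}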

 

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23884) hasLaunchedTask should be true when launchedAnyTask be true

2018-04-10 Thread Yu Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432323#comment-16432323
 ] 

Yu Wang commented on SPARK-23884:
-

[~Ngone51] I had not submitted a patch before and wanted to submit one.

> hasLaunchedTask should be true when launchedAnyTask be true
> ---
>
> Key: SPARK-23884
> URL: https://issues.apache.org/jira/browse/SPARK-23884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Major
>  Labels: easyfix
> Fix For: 2.3.0
>
> Attachments: SPARK-23884.patch
>
>
> *hasLaunchedTask* should be *true* when *launchedAnyTask* is *true*, rather 
> than when *tasks' size > 0*.
> *tasks' size* would be greater than 0 as long as there are any *WorkOffers*, but 
> this does not ensure that any tasks were launched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23841) NodeIdCache should unpersist the last cached nodeIdsForInstances

2018-04-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23841.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20956
[https://github.com/apache/spark/pull/20956]

> NodeIdCache should unpersist the last cached nodeIdsForInstances
> 
>
> Key: SPARK-23841
> URL: https://issues.apache.org/jira/browse/SPARK-23841
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.4.0
>
>
> NodeIdCache forget to unpersist the last cached intermediate dataset:
>  
> {code:java}
> scala> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.classification._
> scala> val df = 
> spark.read.format("libsvm").load("/Users/zrf/Dev/OpenSource/spark/data/mllib/sample_libsvm_data.txt")
> 2018-04-02 11:48:25 WARN  LibSVMFileFormat:66 - 'numFeatures' option not 
> specified, determining the number of features by going though the input. If 
> you know the number in advance, please specify it via 'numFeatures' option to 
> avoid the extra scan.
> 2018-04-02 11:48:31 WARN  ObjectStore:568 - Failed to get database 
> global_temp, returning NoSuchObjectException
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val rf = new RandomForestClassifier().setCacheNodeIds(true)
> rf: org.apache.spark.ml.classification.RandomForestClassifier = 
> rfc_aab2b672546b
> scala> val rfm = rf.fit(df)
> rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = 
> RandomForestClassificationModel (uid=rfc_aab2b672546b) with 20 trees
> scala> sc.getPersistentRDDs
> res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(56 -> 
> MapPartitionsRDD[56] at map at NodeIdCache.scala:102){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23841) NodeIdCache should unpersist the last cached nodeIdsForInstances

2018-04-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23841:
-

Assignee: zhengruifeng

> NodeIdCache should unpersist the last cached nodeIdsForInstances
> 
>
> Key: SPARK-23841
> URL: https://issues.apache.org/jira/browse/SPARK-23841
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> NodeIdCache forget to unpersist the last cached intermediate dataset:
>  
> {code:java}
> scala> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.classification._
> scala> val df = 
> spark.read.format("libsvm").load("/Users/zrf/Dev/OpenSource/spark/data/mllib/sample_libsvm_data.txt")
> 2018-04-02 11:48:25 WARN  LibSVMFileFormat:66 - 'numFeatures' option not 
> specified, determining the number of features by going though the input. If 
> you know the number in advance, please specify it via 'numFeatures' option to 
> avoid the extra scan.
> 2018-04-02 11:48:31 WARN  ObjectStore:568 - Failed to get database 
> global_temp, returning NoSuchObjectException
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val rf = new RandomForestClassifier().setCacheNodeIds(true)
> rf: org.apache.spark.ml.classification.RandomForestClassifier = 
> rfc_aab2b672546b
> scala> val rfm = rf.fit(df)
> rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = 
> RandomForestClassificationModel (uid=rfc_aab2b672546b) with 20 trees
> scala> sc.getPersistentRDDs
> res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(56 -> 
> MapPartitionsRDD[56] at map at NodeIdCache.scala:102){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-04-10 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432244#comment-16432244
 ] 

Thomas Graves commented on SPARK-16630:
---

Yes, I think it would make sense as the union of all blacklisted nodes.

I'm not sure what you mean by your last question. The expiry is currently all 
handled in the BlacklistTracker; I wouldn't want to move that out into the YARN 
allocator. Just use the information passed to it, unless there is a case it 
doesn't cover?

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, it's possible that a node is messed up or misconfigured such that a 
> container won't launch on it.  For instance, the Spark external shuffle 
> handler didn't get loaded on it, or maybe it's just some other hardware or 
> Hadoop configuration issue. 
> It would be nice if we could recognize this happening and stop trying to launch 
> executors on it, since that could end up causing us to hit our max number of 
> executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2018-04-10 Thread Kingsley Jones (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432207#comment-16432207
 ] 

Kingsley Jones commented on SPARK-12216:


scala> val loader = Thread.currentThread.getContextClassLoader()
loader: ClassLoader = 
scala.tools.nsc.interpreter.IMain$TranslatingClassLoader@3a1a20f

scala> val parent1 = loader.getParent()
parent1: ClassLoader = 
scala.reflect.internal.util.ScalaClassLoader$URLClassLoader@66e6af49

scala> val parent2 = parent1.getParent()
parent2: ClassLoader = sun.misc.Launcher$AppClassLoader@5fcfe4b2

scala> val parent3 = parent2.getParent()
parent3: ClassLoader = sun.misc.Launcher$ExtClassLoader@5257226b

scala> val parent4 = parent3.getParent()
parent4: ClassLoader = null

I did experiment with trying to find the open ClassLoaders in the scala session 
(shown above).

Tab completion shows the exposed methods on the loaders, but there is no close 
method:

scala> loader.
clearAssertionStatus   getResource   getResources   
setClassAssertionStatus setPackageAssertionStatus
getParent  getResourceAsStream   loadClass  
setDefaultAssertionStatus

scala> parent1.
clearAssertionStatus   getResource   getResources   
setClassAssertionStatus setPackageAssertionStatus
getParent  getResourceAsStream   loadClass  
setDefaultAssertionStatus

There is no close method on any of these, so I could not try closing them prior 
to quitting the session.

This was just a simple hack to see if there was any way to use reflection to 
find the open ClassLoaders.

I thought perhaps it might be possible to walk this tree and then close the 
loaders within ShutdownHookManager?


> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-23337) withWatermark raises an exception on struct objects

2018-04-10 Thread Aydin Kocas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432200#comment-16432200
 ] 

Aydin Kocas commented on SPARK-23337:
-

Hi Michael, in my case it's a blocking issue and unfortunately not just 
annoying, because I can't use the watermarking functionality when doing a 
readStream on a JSON file. Without the time limitation provided by watermarking, 
I am concerned that my checkpoint dir will keep growing because there are no 
time boundaries.
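
One possible workaround (a PySpark sketch of my own, assuming a streaming 
DataFrame {{jsonRow}} shaped like the one in the description): promote the 
nested timestamp to a top-level column first and apply the watermark to that 
column.
{code:java}
from pyspark.sql.functions import col

# jsonRow is assumed to be the streaming DataFrame from the description,
# with a nested _source.createTime timestamp field.
watermarked = (jsonRow
    .withColumn("createTime", col("_source.createTime"))
    .withWatermark("createTime", "10 seconds"))
{code}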

> withWatermark raises an exception on struct objects
> ---
>
> Key: SPARK-23337
> URL: https://issues.apache.org/jira/browse/SPARK-23337
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1
> Environment: Linux Ubuntu, Spark on standalone mode
>Reporter: Aydin Kocas
>Priority: Major
>
> Hi,
>  
> when using a nested object (I mean an object within a struct, here concrete: 
> _source.createTime) from a json file as the parameter for the 
> withWatermark-method, I get an exception (see below).
> Anything else works flawlessly with the nested object.
>  
> +*{color:#14892c}works:{color}*+ 
> {code:java}
> Dataset jsonRow = 
> spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("myTime",
>  "10 seconds").toDF();{code}
>  
> json structure:
> {code:java}
> root
>  |-- _id: string (nullable = true)
>  |-- _index: string (nullable = true)
>  |-- _score: long (nullable = true)
>  |-- myTime: timestamp (nullable = true)
> ..{code}
> +*{color:#d04437}does not work - nested json{color}:*+
> {code:java}
> Dataset jsonRow = 
> spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("_source.createTime",
>  "10 seconds").toDF();{code}
>  
> json structure:
>  
> {code:java}
> root
>  |-- _id: string (nullable = true)
>  |-- _index: string (nullable = true)
>  |-- _score: long (nullable = true)
>  |-- _source: struct (nullable = true)
>  | |-- createTime: timestamp (nullable = true)
> ..
>  
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree:
> 'EventTimeWatermark '_source.createTime, interval 10 seconds
> +- Deduplicate [_id#0], true
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@5dbbb292,json,List(),Some(StructType(StructField(_id,StringType,true),
>  StructField(_index,StringType,true), StructField(_score,LongType,true), 
> StructField(_source,StructType(StructField(additionalData,StringType,true), 
> StructField(client,StringType,true), 
> StructField(clientDomain,BooleanType,true), 
> StructField(clientVersion,StringType,true), 
> StructField(country,StringType,true), 
> StructField(countryName,StringType,true), 
> StructField(createTime,TimestampType,true), 
> StructField(externalIP,StringType,true), 
> StructField(hostname,StringType,true), 
> StructField(internalIP,StringType,true), 
> StructField(location,StringType,true), 
> StructField(locationDestination,StringType,true), 
> StructField(login,StringType,true), 
> StructField(originalRequestString,StringType,true), 
> StructField(password,StringType,true), 
> StructField(peerIdent,StringType,true), 
> StructField(peerType,StringType,true), 
> StructField(recievedTime,TimestampType,true), 
> StructField(sessionEnd,StringType,true), 
> StructField(sessionStart,StringType,true), 
> StructField(sourceEntryAS,StringType,true), 
> StructField(sourceEntryIp,StringType,true), 
> StructField(sourceEntryPort,StringType,true), 
> StructField(targetCountry,StringType,true), 
> StructField(targetCountryName,StringType,true), 
> StructField(targetEntryAS,StringType,true), 
> StructField(targetEntryIp,StringType,true), 
> StructField(targetEntryPort,StringType,true), 
> StructField(targetport,StringType,true), 
> StructField(username,StringType,true), 
> StructField(vulnid,StringType,true)),true), 
> StructField(_type,StringType,true))),List(),None,Map(path -> ./input/),None), 
> FileSource[./input/], [_id#0, _index#1, _score#2L, _source#3, _type#4]
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:854)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:796)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>  

[jira] [Updated] (SPARK-23943) Improve observability of MesosRestServer/MesosClusterDispatcher

2018-04-10 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-23943:
-
Description: 
Two changes in this PR:
 * A /health endpoint for a quick binary indication of the health of 
MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a 
marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. 
Returns a 503 status if the server is unhealthy and a 200 if the server is 
healthy.
 * A /status endpoint for a more detailed examination of the current state of a 
MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool.

For both endpoints, regardless of status code, the following body is returned:

 
{code:java}
{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "iamok",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true,
  "pendingRetryDrivers" : 0
}{code}
Aside from surfacing all of the scheduler metrics, the response also includes 
the status of the Mesos SchedulerDriver. On numerous occasions now, we have 
observed scenarios where the Mesos SchedulerDriver quietly exits due to some 
other failure. When this happens, jobs queue up and the only way to clean 
things up is to restart the service. 

With the above health check, marathon can be configured to automatically 
restart the MesosClusterDispatcher service when the health check fails, 
lessening the need for manual intervention.

  was:
Two changes:

First, a more robust 
[health-check|http://mesosphere.github.io/marathon/docs/health-checks.html] 
for anyone who runs MesosClusterDispatcher as a marathon app. Specifically, 
this check verifies that the MesosSchedulerDriver is still running as we have 
seen certain cases where it stops (rather quietly) and the only way to revive 
it is a restart. With this health check, marathon will restart the dispatcher 
if the MesosSchedulerDriver stops running. The health check lives at the url 
"/health" and returns a 204 when the server is healthy and a 503 when it is not 
(e.g. the MesosSchedulerDriver stopped running).

Second, a server status endpoint that replies with some basic metrics about the 
server. The status endpoint resides at the url "/status" and responds with:
{code:java}
{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "server OK",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true
}{code}
As you can see, it includes a snapshot of the metrics/health of the scheduler. 
Useful for quick debugging/troubleshooting/monitoring. 


> Improve observability of MesosRestServer/MesosClusterDispatcher
> ---
>
> Key: SPARK-23943
> URL: https://issues.apache.org/jira/browse/SPARK-23943
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Mesos
>Affects Versions: 2.2.1, 2.3.0
> Environment:  
>  
>Reporter: paul mackles
>Priority: Minor
> Fix For: 2.4.0
>
>
> Two changes in this PR:
>  * A /health endpoint for a quick binary indication of the health of 
> MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a 
> marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. 
> Returns a 503 status if the server is unhealthy and a 200 if the server is 
> healthy.
>  * A /status endpoint for a more detailed examination of the current state of 
> a MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool.
> For both endpoints, regardless of status code, the following body is returned:
>  
> {code:java}
> {
>   "action" : "ServerStatusResponse",
>   "launchedDrivers" : 0,
>   "message" : "iamok",
>   "queuedDrivers" : 0,
>   "schedulerDriverStopped" : false,
>   "serverSparkVersion" : "2.3.1-SNAPSHOT",
>   "success" : true,
>   "pendingRetryDrivers" : 0
> }{code}
> Aside from surfacing all of the scheduler metrics, the response also includes 
> the status of the Mesos SchedulerDriver. On numerous occasions now, we have 
> observed scenarios where the Mesos SchedulerDriver quietly exits due to some 
> other failure. When this happens, jobs queue up and the only way to clean 
> things up is to restart the service. 
> With the above health check, marathon can be configured to automatically 
> restart the MesosClusterDispatcher service when the health check fails, 
> lessening the need for manual intervention.
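
Assuming the endpoints behave as described above (200 when healthy, 503 when 
unhealthy), an external liveness probe could be as simple as the following 
sketch; the URL is a placeholder for wherever the dispatcher's REST server is 
listening.
{code:java}
import urllib.request

def dispatcher_is_healthy(url):
    # True only if the /health endpoint answers with HTTP 200;
    # a 503 raises HTTPError and falls into the except branch.
    try:
        return urllib.request.urlopen(url, timeout=5).status == 200
    except Exception:
        return False

# e.g. dispatcher_is_healthy("http://<dispatcher-host>:<port>/health")
{code}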



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-04-10 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432165#comment-16432165
 ] 

Attila Zsolt Piros edited comment on SPARK-16630 at 4/10/18 12:15 PM:
--

I have a question regarding limiting the number of blacklisted nodes according to 
the cluster size. 
With this change there will be two sources of nodes to be blacklisted: 
- one list is coming from the scheduler (existing node-level blacklisting) 
- the other is computed here close to the YARN allocator (stored along with the 
expiry times)

I think it makes sense to have the limit apply to the complete list (union) of 
blacklisted nodes, am I right? 
If this limit is for the complete list, then regarding the subset I think the 
newly blacklisted nodes are more up to date than the earlier blacklisted ones. 
So I would pass the expiry times from the scheduler to the YARN allocator to 
build the subset of blacklisted nodes to be communicated to YARN. What is your 
opinion?


was (Author: attilapiros):
I have a question regarding limiting the number of blacklisted nodes according to 
the cluster size. 
With this change there will be two sources of nodes to be blacklisted: 
- one list is coming from the scheduler (existing node-level blacklisting) 
- the other is computed here close to the YARN (stored along with the expiry 
times)

I think it makes sense to have the limit apply to the complete list (union) of 
blacklisted nodes, am I right? 
If this limit is for the complete list, then regarding the subset I think the 
newly blacklisted nodes are more up to date than the earlier blacklisted ones. 
So I would pass the expiry times from the scheduler to the YARN allocator to 
build the subset of blacklisted nodes to be communicated to YARN. What is your 
opinion?

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, it's possible that a node is messed up or misconfigured such that a 
> container won't launch on it.  For instance, the Spark external shuffle 
> handler didn't get loaded on it, or maybe it's just some other hardware or 
> Hadoop configuration issue. 
> It would be nice if we could recognize this happening and stop trying to launch 
> executors on it, since that could end up causing us to hit our max number of 
> executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-04-10 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432165#comment-16432165
 ] 

Attila Zsolt Piros commented on SPARK-16630:


I have a question regarding limiting the number of blacklisted nodes according to 
the cluster size. 
With this change there will be two sources of nodes to be blacklisted: 
- one list is coming from the scheduler (existing node-level blacklisting) 
- the other is computed here close to the YARN (stored along with the expiry 
times)

I think it makes sense to have the limit apply to the complete list (union) of 
blacklisted nodes, am I right? 
If this limit is for the complete list, then regarding the subset I think the 
newly blacklisted nodes are more up to date than the earlier blacklisted ones. 
So I would pass the expiry times from the scheduler to the YARN allocator to 
build the subset of blacklisted nodes to be communicated to YARN. What is your 
opinion?

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, it's possible that a node is messed up or misconfigured such that a 
> container won't launch on it.  For instance, the Spark external shuffle 
> handler didn't get loaded on it, or maybe it's just some other hardware or 
> Hadoop configuration issue. 
> It would be nice if we could recognize this happening and stop trying to launch 
> executors on it, since that could end up causing us to hit our max number of 
> executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23884) hasLaunchedTask should be true when launchedAnyTask be true

2018-04-10 Thread wuyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432149#comment-16432149
 ] 

wuyi commented on SPARK-23884:
--

[~gentlewang] why?

> hasLaunchedTask should be true when launchedAnyTask be true
> ---
>
> Key: SPARK-23884
> URL: https://issues.apache.org/jira/browse/SPARK-23884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Major
>  Labels: easyfix
> Fix For: 2.3.0
>
> Attachments: SPARK-23884.patch
>
>
> *hasLaunchedTask* should be *true* when *launchedAnyTask* is *true*, rather 
> than when *tasks' size > 0*.
> *tasks' size* would be greater than 0 as long as there are any *WorkOffers*, but 
> this does not ensure that any tasks were launched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings

2018-04-10 Thread Yu Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Wang updated SPARK-23705:

Attachment: SPARK-23705.patch

> dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
> --
>
> Key: SPARK-23705
> URL: https://issues.apache.org/jira/browse/SPARK-23705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Khoa Tran
>Priority: Minor
>  Labels: beginner, easyfix, features, newbie, starter
> Attachments: SPARK-23705.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {code:java}
> // code placeholder
> package org.apache.spark.sql
> .
> .
> .
> class Dataset[T] private[sql](
> .
> .
> .
> def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
>   val colNames: Seq[String] = col1 +: cols
>   RelationalGroupedDataset(
> toDF(), colNames.map(colName => resolve(colName)), 
> RelationalGroupedDataset.GroupByType)
> }
> {code}
> should append a `.distinct` after `colNames` when used in `groupBy` 
>  
> Not sure if the community agrees with this or it's up to the users to perform 
> the distinct operation
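
Until something like this is decided, callers can deduplicate the column list 
themselves before calling groupBy; a small PySpark sketch of that caller-side 
workaround (the DataFrame and column list are made up for illustration):
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2.0), (1, 3.0), (2, 5.0)], ["id", "v"])

cols = ["id", "id"]                        # possibly non-distinct input
distinct_cols = list(dict.fromkeys(cols))  # drops duplicates, keeps order
df.groupBy(*distinct_cols).count().show()
{code}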



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings

2018-04-10 Thread Yu Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432142#comment-16432142
 ] 

Yu Wang commented on SPARK-23705:
-

[~khoatrantan2000] Could you assign this patch to me?

> dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
> --
>
> Key: SPARK-23705
> URL: https://issues.apache.org/jira/browse/SPARK-23705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Khoa Tran
>Priority: Minor
>  Labels: beginner, easyfix, features, newbie, starter
> Attachments: SPARK-23705.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {code:java}
> // code placeholder
> package org.apache.spark.sql
> .
> .
> .
> class Dataset[T] private[sql](
> .
> .
> .
> def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
>   val colNames: Seq[String] = col1 +: cols
>   RelationalGroupedDataset(
> toDF(), colNames.map(colName => resolve(colName)), 
> RelationalGroupedDataset.GroupByType)
> }
> {code}
> should append a `.distinct` after `colNames` when used in `groupBy` 
>  
> Not sure if the community agrees with this or it's up to the users to perform 
> the distinct operation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23884) hasLaunchedTask should be true when launchedAnyTask be true

2018-04-10 Thread Yu Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432118#comment-16432118
 ] 

Yu Wang edited comment on SPARK-23884 at 4/10/18 11:47 AM:
---

[~Ngone51] Could you assign this task to me?


was (Author: gentlewang):
[~Ngone51] Can you assign this task to me?

> hasLaunchedTask should be true when launchedAnyTask be true
> ---
>
> Key: SPARK-23884
> URL: https://issues.apache.org/jira/browse/SPARK-23884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Major
>  Labels: easyfix
> Fix For: 2.3.0
>
> Attachments: SPARK-23884.patch
>
>
> *hasLaunchedTask* should be *true* when *launchedAnyTask* is *true*, rather 
> than when *tasks' size > 0*.
> *tasks' size* would be greater than 0 as long as there are any *WorkOffers*, but 
> this does not ensure that any tasks were launched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23922) High-order function: arrays_overlap(x, y) → boolean

2018-04-10 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432124#comment-16432124
 ] 

Marco Gaido commented on SPARK-23922:
-

I will work on this.

> High-order function: arrays_overlap(x, y) → boolean
> ---
>
> Key: SPARK-23922
> URL: https://issues.apache.org/jira/browse/SPARK-23922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Tests if arrays x and y have any non-null elements in common. Returns 
> null if there are no non-null elements in common but either array contains 
> null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23918) High-order function: array_min(x) → x

2018-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432119#comment-16432119
 ] 

Apache Spark commented on SPARK-23918:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21025

> High-order function: array_min(x) → x
> -
>
> Key: SPARK-23918
> URL: https://issues.apache.org/jira/browse/SPARK-23918
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the minimum value of input array.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23918) High-order function: array_min(x) → x

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23918:


Assignee: (was: Apache Spark)

> High-order function: array_min(x) → x
> -
>
> Key: SPARK-23918
> URL: https://issues.apache.org/jira/browse/SPARK-23918
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the minimum value of input array.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23918) High-order function: array_min(x) → x

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23918:


Assignee: Apache Spark

> High-order function: array_min(x) → x
> -
>
> Key: SPARK-23918
> URL: https://issues.apache.org/jira/browse/SPARK-23918
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the minimum value of input array.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23884) hasLaunchedTask should be true when launchedAnyTask be true

2018-04-10 Thread Yu Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Wang updated SPARK-23884:

Attachment: SPARK-23884.patch

> hasLaunchedTask should be true when launchedAnyTask be true
> ---
>
> Key: SPARK-23884
> URL: https://issues.apache.org/jira/browse/SPARK-23884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Major
>  Labels: easyfix
> Fix For: 2.3.0
>
> Attachments: SPARK-23884.patch
>
>
> *hasLaunchedTask* should be *true* when *launchedAnyTask* is *true*, rather 
> than when *tasks' size > 0*.
> *tasks' size* would be greater than 0 as long as there are any *WorkOffers*, but 
> this does not ensure that any tasks were launched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23884) hasLaunchedTask should be true when launchedAnyTask be true

2018-04-10 Thread Yu Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432118#comment-16432118
 ] 

Yu Wang commented on SPARK-23884:
-

[~Ngone51] Can you assign this task to me?

> hasLaunchedTask should be true when launchedAnyTask be true
> ---
>
> Key: SPARK-23884
> URL: https://issues.apache.org/jira/browse/SPARK-23884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Major
>  Labels: easyfix
> Fix For: 2.3.0
>
>
> *hasLaunchedTask* should be *true* when *launchedAnyTask* is *true*, rather 
> than when *tasks' size > 0*.
> *tasks' size* would be greater than 0 as long as there are any *WorkOffers*, but 
> this does not ensure that any tasks were launched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks

2018-04-10 Thread jin xing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jin xing updated SPARK-23948:
-
Description: 
When SparkContext submits a map stage via "submitMapStage" to the DAGScheduler, 
"markMapStageJobAsFinished" is called only in 
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933
 and   
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314).

But consider the scenario below:
1. stage0 and stage1 are both "ShuffleMapStage"s and stage1 depends on stage0;
2. We submit stage1 by "submitMapStage";
3. While stage1 is running, a "FetchFailed" happens, and stage0 and stage1 get 
resubmitted as stage0_1 and stage1_1;
4. While stage0_1 is running, speculative tasks from the old stage1 come back as 
succeeded, but stage1 is not inside "runningStages". So even though all splits 
(including the speculative tasks) in stage1 succeeded, the job listener of stage1 
will not be called;
5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" there 
are no missing tasks. But in the current code, the job listener is not triggered.

  was:
When SparkContext submits a map stage via "submitMapStage" to the DAGScheduler, 
"markMapStageJobAsFinished" is called only in 
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933
 and   
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314).

But consider the scenario below:
1. stage0 and stage1 are both "ShuffleMapStage"s and stage1 depends on stage0;
2. We submit stage1 by "submitMapStage"; there are 10 missing tasks in stage1;
3. While stage1 is running, a "FetchFailed" happens, and stage0 and stage1 get 
resubmitted as stage0_1 and stage1_1;
4. While stage0_1 is running, speculative tasks from the old stage1 come back as 
succeeded, but stage1 is not inside "runningStages". So even though all splits 
(including the speculative tasks) in stage1 succeeded, the job listener of stage1 
will not be called;
5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" there 
are no missing tasks. But in the current code, the job listener is not triggered.


> Trigger mapstage's job listener in submitMissingTasks
> -
>
> Key: SPARK-23948
> URL: https://issues.apache.org/jira/browse/SPARK-23948
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: jin xing
>Priority: Major
>
> When SparkContext submits a map stage via "submitMapStage" to the DAGScheduler, 
> "markMapStageJobAsFinished" is called only in 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933
>  and   
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314).
> But consider the scenario below:
> 1. stage0 and stage1 are both "ShuffleMapStage"s and stage1 depends on stage0;
> 2. We submit stage1 by "submitMapStage";
> 3. While stage1 is running, a "FetchFailed" happens, and stage0 and stage1 get 
> resubmitted as stage0_1 and stage1_1;
> 4. While stage0_1 is running, speculative tasks from the old stage1 come back as 
> succeeded, but stage1 is not inside "runningStages". So even though all splits 
> (including the speculative tasks) in stage1 succeeded, the job listener of stage1 
> will not be called;
> 5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" there 
> are no missing tasks. But in the current code, the job listener is not triggered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23951) Use java classes in ExprValue and simplify a bunch of stuff

2018-04-10 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-23951:
-

 Summary: Use java classes in ExprValue and simplify a bunch of 
stuff
 Key: SPARK-23951
 URL: https://issues.apache.org/jira/browse/SPARK-23951
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23917) High-order function: array_max(x) → x

2018-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432084#comment-16432084
 ] 

Apache Spark commented on SPARK-23917:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21024

> High-order function: array_max(x) → x
> -
>
> Key: SPARK-23917
> URL: https://issues.apache.org/jira/browse/SPARK-23917
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Returns the maximum value of input array.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23917) High-order function: array_max(x) → x

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23917:


Assignee: (was: Apache Spark)

> High-order function: array_max(x) → x
> -
>
> Key: SPARK-23917
> URL: https://issues.apache.org/jira/browse/SPARK-23917
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Returns the maximum value of input array.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23917) High-order function: array_max(x) → x

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23917:


Assignee: Apache Spark

> High-order function: array_max(x) → x
> -
>
> Key: SPARK-23917
> URL: https://issues.apache.org/jira/browse/SPARK-23917
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Returns the maximum value of input array.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23949) makes "&&" supports the function of predicate operator "and"

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23949:


Assignee: Apache Spark

> makes "&&" supports the function of predicate operator "and"
> 
>
> Key: SPARK-23949
> URL: https://issues.apache.org/jira/browse/SPARK-23949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: hantiantian
>Assignee: Apache Spark
>Priority: Minor
>
> In MySQL, the symbol && supports the function of the predicate operator "and"; 
> maybe we can add support for this in Spark SQL.
> For example,
> select * from tbl where id==1 && age=10
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23949) makes "&&" supports the function of predicate operator "and"

2018-04-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23949:


Assignee: (was: Apache Spark)

> makes "&&" supports the function of predicate operator "and"
> 
>
> Key: SPARK-23949
> URL: https://issues.apache.org/jira/browse/SPARK-23949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: hantiantian
>Priority: Minor
>
> In MySQL, the symbol && supports the function of the predicate operator "and"; 
> maybe we can add support for this in Spark SQL.
> For example,
> select * from tbl where id==1 && age=10
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23949) makes "&&" supports the function of predicate operator "and"

2018-04-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432040#comment-16432040
 ] 

Apache Spark commented on SPARK-23949:
--

User 'httfighter' has created a pull request for this issue:
https://github.com/apache/spark/pull/21023

> makes "&&" supports the function of predicate operator "and"
> 
>
> Key: SPARK-23949
> URL: https://issues.apache.org/jira/browse/SPARK-23949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: hantiantian
>Priority: Minor
>
> In MySQL, the symbol && supports the function of the predicate operator "and"; 
> maybe we can add support for this in Spark SQL.
> For example,
> select * from tbl where id==1 && age=10
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-04-10 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432039#comment-16432039
 ] 

Herman van Hovell commented on SPARK-23945:
---

[~nchammas] we didn't add explicit dataset support because no-one asked for it, 
until now :)

What do you want to support here? {{(NOT) IN}} and {{EXISTS}}? Or do you also 
want to add support for scalar subqueries, and subqueries in filters?

> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
> SELECT name
> FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
> .where(
> ~col('name').isin(
> table2.select('name')
> )
> )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  
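
One way to write the query above without collecting values locally is a left 
anti join. A minimal PySpark sketch (assuming DataFrames {{table1}} and 
{{table2}} with a {{name}} column, and ignoring the subtle null semantics of 
SQL NOT IN):
{code:java}
# Keep the rows of table1 whose name does not appear in table2.
# Unlike SQL NOT IN, nulls in table2.name do not empty the result.
result = table1.join(table2.select("name"), on="name", how="left_anti")
result.show()
{code}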



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23950) Coalescing an empty dataframe to 1 partition

2018-04-10 Thread JIRA
João Neves created SPARK-23950:
--

 Summary: Coalescing an empty dataframe to 1 partition
 Key: SPARK-23950
 URL: https://issues.apache.org/jira/browse/SPARK-23950
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.1
 Environment: Operating System: Windows 7

Tested in Jupyter notebooks using Python 2.7.14 and Python 3.6.3.

Hardware specs not relevant to the issue.
Reporter: João Neves


Coalescing an empty dataframe to 1 partition and then counting it raises an error.

The funny thing is that coalescing an empty dataframe to 2 or more partitions 
seems to work.

The test case is the following:
{code}
from pyspark.sql.types import StructType

df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType([]))

print(df.coalesce(2).count())
print(df.coalesce(3).count())
print(df.coalesce(4).count())

df.coalesce(1).count(){code}
Output:
{code:java}
0
0
0
---
Py4JJavaError Traceback (most recent call last)
 in ()
7 print(df.coalesce(4).count())
8 
> 9 print(df.coalesce(1).count())

C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py in count(self)
425 2
426 """
--> 427 return int(self._jdf.count())
428 
429 @ignore_unicode_prefix

c:\python36\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134 
1135 for temp_arg in temp_args:

C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()

c:\python36\lib\site-packages\py4j\protocol.py in get_return_value(answer, 
gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o176.count.
: java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike$class.head(IterableLike.scala:107)
at 
scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
at 
scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2435)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2434)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2434)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Unknown Source){code}
Shouldn't this be consistent?

Thank you very much.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Created] (SPARK-23949) makes "&&" supports the function of predicate operator "and"

2018-04-10 Thread hantiantian (JIRA)
hantiantian created SPARK-23949:
---

 Summary: makes "&&" supports the function of predicate operator 
"and"
 Key: SPARK-23949
 URL: https://issues.apache.org/jira/browse/SPARK-23949
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: hantiantian


In MySQL, the symbol && supports the function of the predicate operator "and"; maybe 
we can add support for this in Spark SQL.

For example,

select * from tbl where id==1 && age=10

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23916) High-order function: array_join(x, delimiter, null_replacement) → varchar

2018-04-10 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431795#comment-16431795
 ] 

Kazuaki Ishizaki commented on SPARK-23916:
--

Sorry for my mistake regarding a PR with the wrong JIRA number.

> High-order function: array_join(x, delimiter, null_replacement) → varchar
> -
>
> Key: SPARK-23916
> URL: https://issues.apache.org/jira/browse/SPARK-23916
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Concatenates the elements of the given array using the delimiter and an 
> optional string to replace nulls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


