[jira] [Resolved] (SPARK-24244) Parse only required columns of CSV file

2018-05-24 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24244.
-
Resolution: Fixed

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> The uniVocity parser allows specifying only the required column names or
> indexes for parsing, for example:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in other columns
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> *UnivocityParser* needs to be modified to extract only the needed columns from
> the requiredSchema.
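
For illustration, here is a minimal Scala sketch of the idea, assuming uniVocity's existing {{CsvParserSettings.selectFields}} API and a Spark {{StructType}} required schema; it is not the actual patch, just the shape of the change:
{code:scala}
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
import org.apache.spark.sql.types.StructType

object SelectRequiredColumns {
  // Build a parser that only materializes the columns named in requiredSchema;
  // values in all other columns are skipped by uniVocity itself, analogous to
  // the selectIndexes example above.
  def parserFor(requiredSchema: StructType): CsvParser = {
    val settings = new CsvParserSettings()
    settings.selectFields(requiredSchema.fieldNames: _*)
    new CsvParser(settings)
  }
}
{code}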






[jira] [Updated] (SPARK-24244) Parse only required columns of CSV file

2018-05-24 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24244:

Priority: Major  (was: Minor)

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> The uniVocity parser allows specifying only the required column names or
> indexes for parsing, for example:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in other columns
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> *UnivocityParser* needs to be modified to extract only the needed columns from
> the requiredSchema.






[jira] [Resolved] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite

2018-05-24 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24368.
-
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 2.4.0

> Flaky tests: 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
> 
>
> Key: SPARK-24368
> URL: https://issues.apache.org/jira/browse/SPARK-24368
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed 
> very often.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/
> {code}
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** 
> ABORTED *** (1 millisecond)
> [info]   java.lang.IllegalStateException: LiveListenerBus is stopped.
> [info]   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
> [info]   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
> [info]   at 
> org.apache.spark.sql.internal.SharedState.(SharedState.scala:93)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at scala.Option.getOrElse(Option.scala:121)
> [info]   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
> [info]   at 
> org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
> [info]   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
> [info]   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
> [info]   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at scala.Option.map(Option.scala:146)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
> [info]   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:84)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:40)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.(UnivocityParserSuite.scala:30)
> [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> [info]   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> [info]   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> [info]   at java.lang.Class.newInstance(Class.java:442)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Assigned] (SPARK-24367) Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24367:


Assignee: Gengliang Wang

> Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY
> 
>
> Key: SPARK-24367
> URL: https://issues.apache.org/jira/browse/SPARK-24367
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> In the current Parquet version, the conf ENABLE_JOB_SUMMARY is deprecated.
> When writing Parquet files, the warning message "WARN
> org.apache.parquet.hadoop.ParquetOutputFormat: Setting
> parquet.enable.summary-metadata is deprecated, please use
> parquet.summary.metadata.level" keeps showing up.
> From
> [https://github.com/apache/parquet-mr/blame/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L164]
>  we can see that JOB_SUMMARY_LEVEL should be used instead.
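
As an illustration of the replacement, a short Scala sketch follows, assuming the Hadoop property names quoted in the warning above and the JobSummaryLevel values ALL, COMMON_ONLY, and NONE; the actual change belongs in Spark's Parquet write path rather than user code:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("summary-level").getOrCreate()

// Deprecated boolean flag (this is what triggers the warning above):
//   spark.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
// Level-based replacement; NONE disables summary files entirely.
spark.sparkContext.hadoopConfiguration.set("parquet.summary.metadata.level", "NONE")

spark.range(10).write.mode("overwrite").parquet("/tmp/summary-level-demo")
{code}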






[jira] [Resolved] (SPARK-24367) Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24367.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21411
[https://github.com/apache/spark/pull/21411]

> Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY
> 
>
> Key: SPARK-24367
> URL: https://issues.apache.org/jira/browse/SPARK-24367
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> In the current Parquet version, the conf ENABLE_JOB_SUMMARY is deprecated.
> When writing Parquet files, the warning message "WARN
> org.apache.parquet.hadoop.ParquetOutputFormat: Setting
> parquet.enable.summary-metadata is deprecated, please use
> parquet.summary.metadata.level" keeps showing up.
> From
> [https://github.com/apache/parquet-mr/blame/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L164]
>  we can see that JOB_SUMMARY_LEVEL should be used instead.






[jira] [Resolved] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23929.
--
Resolution: Duplicate

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by
> name. Currently, that is not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}






[jira] [Assigned] (SPARK-24235) create the top-of-task RDD sending rows to the remote buffer

2018-05-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24235:


Assignee: (was: Apache Spark)

> create the top-of-task RDD sending rows to the remote buffer
> 
>
> Key: SPARK-24235
> URL: https://issues.apache.org/jira/browse/SPARK-24235
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> [https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii]
>  
> Note that after
> [https://github.com/apache/spark/pull/21239|https://github.com/apache/spark/pull/21239], this
> will need to be responsible for incrementing its task's EpochTracker.






[jira] [Commented] (SPARK-24235) create the top-of-task RDD sending rows to the remote buffer

2018-05-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490047#comment-16490047
 ] 

Apache Spark commented on SPARK-24235:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/21428

> create the top-of-task RDD sending rows to the remote buffer
> 
>
> Key: SPARK-24235
> URL: https://issues.apache.org/jira/browse/SPARK-24235
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> [https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii]
>  
> Note that after
> [https://github.com/apache/spark/pull/21239|https://github.com/apache/spark/pull/21239], this
> will need to be responsible for incrementing its task's EpochTracker.






[jira] [Assigned] (SPARK-24235) create the top-of-task RDD sending rows to the remote buffer

2018-05-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24235:


Assignee: Apache Spark

> create the top-of-task RDD sending rows to the remote buffer
> 
>
> Key: SPARK-24235
> URL: https://issues.apache.org/jira/browse/SPARK-24235
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>
> [https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii]
>  
> Note that after
> [https://github.com/apache/spark/pull/21239|https://github.com/apache/spark/pull/21239], this
> will need to be responsible for incrementing its task's EpochTracker.






[jira] [Commented] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels

2018-05-24 Thread Cristian Consonni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490022#comment-16490022
 ] 

Cristian Consonni commented on SPARK-24324:
---

[~bryanc] said:
> As a workaround, you could write your UDF to make the pandas.DataFrame using 
> an OrderedDict like so:

I confirm that using an OrderedDict solves the issue for me. Thank you all for
your support!

> Pandas Grouped Map UserDefinedFunction mixes column labels
> --
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {{concat_hours}}. However, the function is mixing
> up the columns and I am not sure what is going on. Below is a minimal complete
> example that reproduces the issue on my system (the details of my environment
> are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  

[jira] [Created] (SPARK-24386) implement continuous processing coalesce(1)

2018-05-24 Thread Jose Torres (JIRA)
Jose Torres created SPARK-24386:
---

 Summary: implement continuous processing coalesce(1)
 Key: SPARK-24386
 URL: https://issues.apache.org/jira/browse/SPARK-24386
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres


[~marmbrus] suggested this as a good implementation checkpoint. If we do the 
shuffle reader and writer correctly, it should be easy to make a custom 
coalesce(1) execution for continuous processing using them, without having to 
implement the logic for shuffle writers finding out where shuffle readers are 
located. (The coalesce(1) can just get the RpcEndpointRef directly from the 
reader and pass it to the writers.)
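
A purely illustrative Scala sketch of that wiring; every name below (CoalesceOneSketch, ContinuousShuffleReader, ContinuousShuffleWriter, EndpointRef) is hypothetical and merely stands in for whatever the real shuffle reader and writer interfaces turn out to be:
{code:scala}
object CoalesceOneSketch {
  // Hypothetical stand-in for Spark's internal RpcEndpointRef.
  type EndpointRef = AnyRef

  trait ContinuousShuffleReader { def endpointRef: EndpointRef }            // hypothetical
  trait ContinuousShuffleWriter { def connectTo(ref: EndpointRef): Unit }   // hypothetical

  // coalesce(1) has exactly one output partition, and therefore one reader endpoint,
  // so each upstream writer can be handed that endpoint directly; no general
  // mechanism for writers to discover reader locations is needed.
  def wireCoalesceOne(reader: ContinuousShuffleReader,
                      writers: Seq[ContinuousShuffleWriter]): Unit = {
    writers.foreach(_.connectTo(reader.endpointRef))
  }
}
{code}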






[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-05-24 Thread Cristian Consonni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489972#comment-16489972
 ] 

Cristian Consonni commented on SPARK-23929:
---

This bug was referenced in [issue 
SPARK-24324|https://issues.apache.org/jira/browse/SPARK-24324], which is a 
duplicate of this one.

See also pull request  [#21427|https://github.com/apache/spark/pull/21427].

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by
> name. Currently, that is not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}






[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted

2018-05-24 Thread Lenin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489959#comment-16489959
 ] 

Lenin commented on SPARK-24383:
---

Yes it does have it.

```
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: 2018-05-24T22:58:35Z
  name: -driver-svc
  namespace: default
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: -driver
    uid: fcce2b12-5fa5-11e8-99a8-000d3afdbc16
  resourceVersion: "17895980"
  selfLink: /api/v1/namespaces/default/services/-driver-svc
  uid: fd01e9bc-5fa5-11e8-99a8-000d3afdbc16
spec:
  clusterIP: None
  ports:
  - name: driver-rpc-port
    port: 7078
    protocol: TCP
    targetPort: 7078
  - name: blockmanager
    port: 7079
    protocol: TCP
    targetPort: 7079
  selector:
    spark-app-selector: spark-c9e1543546bc431d9e083798efb9c870
    spark-role: driver
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```

> spark on k8s: "driver-svc" are not getting deleted
> --
>
> Key: SPARK-24383
> URL: https://issues.apache.org/jira/browse/SPARK-24383
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Lenin
>Priority: Major
>
> When the driver pod exits, the "*driver-svc" services created for the driver
> are not cleaned up. This causes an accumulation of services in the k8s layer;
> at some point no more services can be created.






[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-24 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489944#comment-16489944
 ] 

Hossein Falaki commented on SPARK-24359:


Thank you guys for the feedback. I updated the SPIP and the design document to
use snake_case everywhere. I also added a section to the design document
summarizing the CRAN release strategy. We can write integration tests that run
on Jenkins to detect when we need to re-publish SparkML to CRAN. The CRAN tests
will not include any integration tests that interact with the JVM.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also, we propose a code-gen approach for this
> package to minimize the work needed to expose future MLlib APIs, but sparklyr's
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following
> list of priorities. The API choice that addresses a higher-priority goal will
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages.
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the above example, the syntax translates
> nicely to a natural pipeline style with help from the very popular [magrittr
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-24 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Description: 
h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also, we propose a code-gen approach for this
package to minimize the work needed to expose future MLlib APIs, but sparklyr's
API is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 * create a pipeline by chaining individual components and specifying their 
parameters
 * tune a pipeline in parallel, taking advantage of Spark
 * inspect a pipeline’s parameters and evaluation metrics
 * repeatedly apply a pipeline
 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in the API, we will choose based on the following
list of priorities. The API choice that addresses a higher-priority goal will be
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages.
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All functions 
are snake_case (e.g., {{spark_logistic_regression()}} and {{set_max_iter()}}). 
If a constructor gets arguments, they will be named arguments. For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
When calls need to be chained, as in the above example, the syntax translates
nicely to a natural pipeline style with help from the very popular [magrittr
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For
example:
{code:java}
> logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with {{spark_}}.
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  
{code:java}
> pipeline <- spark_pipeline() %>% set_stages(stage1, stage2, stage3){code}
Where stage1, stage2, etc are S4 objects of a PipelineStage and pipeline is an 
object of type Pipeline.
h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into 
another SparkDataFrame.

*Example API:*
{code:java}
> tokenizer <- spark_tokenizer() %>%
    set_input_col("text") %>%
    set_output_col("words")

> tokenized.df <- tokenizer %>% transform(df) {code}
h2. 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-24 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Attachment: SparkML_ ML Pipelines in R-v2.pdf

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also, we propose a code-gen approach for this
> package to minimize the work needed to expose future MLlib APIs, but sparklyr's
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following
> list of priorities. The API choice that addresses a higher-priority goal will
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages.
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the above example, the syntax translates
> nicely to a natural pipeline style with help from the very popular [magrittr
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 
> > lr{code}
> h2. Namespace
> All new API will be under a new CRAN package, named SparkML. The package 
> should be usable without needing SparkR in the namespace. The package will 
> introduce a number of S4 classes that inherit from four basic classes. Here 
> we will list the basic types with a few examples. An object of any child 
> class can be 

[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted

2018-05-24 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489942#comment-16489942
 ] 

Yinan Li commented on SPARK-24383:
--

You can use {{kubectl get service  -o=yaml}} to get a 
YAML-formatted representation of the service and check if the {{metadata}} 
section contains an {{OwnerReference}} pointing to the driver pod.

> spark on k8s: "driver-svc" are not getting deleted
> --
>
> Key: SPARK-24383
> URL: https://issues.apache.org/jira/browse/SPARK-24383
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Lenin
>Priority: Major
>
> When the driver pod exits, the "*driver-svc" services created for the driver
> are not cleaned up. This causes an accumulation of services in the k8s layer;
> at some point no more services can be created.






[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted

2018-05-24 Thread Lenin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489939#comment-16489939
 ] 

Lenin commented on SPARK-24383:
---

I tried to check for that, but couldn't find where to look. I can see in the
code that it is being set.

> spark on k8s: "driver-svc" are not getting deleted
> --
>
> Key: SPARK-24383
> URL: https://issues.apache.org/jira/browse/SPARK-24383
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Lenin
>Priority: Major
>
> When the driver pod exits, the "*driver-svc" services created for the driver
> are not cleaned up. This causes an accumulation of services in the k8s layer;
> at some point no more services can be created.






[jira] [Updated] (SPARK-23754) StopIterator exception in Python UDF results in partial result

2018-05-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23754:
---
Target Version/s: 2.3.1

Adding target version so it actually shows up in searches...

> StopIterator exception in Python UDF results in partial result
> --
>
> Key: SPARK-23754
> URL: https://issues.apache.org/jira/browse/SPARK-23754
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Blocker
>
> Reproduce:
> {code:java}
> df = spark.range(0, 1000)
> from pyspark.sql.functions import udf
> def foo(x):
> raise StopIteration()
> df.withColumn('v', udf(foo)).show()
> # Results
> # +---+---+
> # | id|  v|
> # +---+---+
> # +---+---+{code}
> I think the task should fail in this case






[jira] [Assigned] (SPARK-14220) Build and test Spark against Scala 2.12

2018-05-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-14220:
--

Assignee: Marcelo Vanzin

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Marcelo Vanzin
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.






[jira] [Assigned] (SPARK-14220) Build and test Spark against Scala 2.12

2018-05-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-14220:
--

Assignee: (was: Marcelo Vanzin)

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.






[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Wenbo Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489820#comment-16489820
 ] 

Wenbo Zhao commented on SPARK-24373:


I guess we should use `planWithBarrier` in `RelationalGroupedDataset` or other
similar places. Any suggestions?

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see it prints "".
> In the above example, when you run explain, you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  






[jira] [Resolved] (SPARK-24350) ClassCastException in "array_position" function

2018-05-24 Thread Alex Vayda (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Vayda resolved SPARK-24350.

Resolution: Fixed

> ClassCastException in "array_position" function
> ---
>
> Key: SPARK-24350
> URL: https://issues.apache.org/jira/browse/SPARK-24350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Alex Vayda
>Priority: Major
> Fix For: 2.4.0
>
>
> When calling the {{array_position}} function with a wrong type for the 1st
> operand, a {{ClassCastException}} is thrown instead of an {{AnalysisException}}.
> Example:
> {code:sql}
> select array_position('foo', 'bar')
> {code}
> {noformat}
> java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot 
> be cast to org.apache.spark.sql.types.ArrayType
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.inputTypes(collectionOperations.scala:1398)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.checkInputDataTypes(collectionOperations.scala:1401)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {noformat}






[jira] [Comment Edited] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489060#comment-16489060
 ] 

Li Jin edited comment on SPARK-24373 at 5/24/18 9:00 PM:
-

Here is a reproduction:
{code:java}
val myUDF = udf((x: Long) => { println(""); x + 1 })

val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.count()
// No  printed
{code}
It appears the issue is related to the UDF:
{code:java}
val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.groupBy().count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
  +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(1) Project
  +- *(1) Range (0, 1, step=1, splits=2)
{code}
Without the UDF, count materializes the cache (the plan uses InMemoryTableScan):
{code:java}
val df1 = spark.range(0, 1).toDF("s").select($"s" + 1)
df1.cache()
df1.groupBy().count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
  +- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(1) InMemoryTableScan
+- InMemoryRelation [(s + 1)#179L], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) Project [(id#175L + 1) AS (s + 1)#179L]
  +- *(1) Range (0, 1, step=1, splits=2) ,None)
+- *(1) Project [(id#175L + 1) AS (s + 1)#179L]
   +- *(1) Range (0, 1, step=1, splits=2)

{code}
 


was (Author: icexelloss):
Here is a reproduction:
{code:java}
val myUDF = udf((x: Long) => { println(""); x + 1 })

val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.count()
// No  printed
{code}
It appears the issue is related to the UDF:

 
{code:java}
val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.groupBy().count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
  +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(1) Project
  +- *(1) Range (0, 1, step=1, splits=2)
{code}
Without the UDF, count materializes the cache (the plan uses InMemoryTableScan):

 
{code:java}
val df1 = spark.range(0, 1).toDF("s").select($"s" + 1)
df1.cache()
df1.groupBy().count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
  +- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(1) InMemoryTableScan
+- InMemoryRelation [(s + 1)#179L], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) Project [(id#175L + 1) AS (s + 1)#179L]
  +- *(1) Range (0, 1, step=1, splits=2) ,None)
+- *(1) Project [(id#175L + 1) AS (s + 1)#179L]
   +- *(1) Range (0, 1, step=1, splits=2)

{code}
 

 

 

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see it prints "".
> In the above example, when you run explain, you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, 

[jira] [Comment Edited] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489060#comment-16489060
 ] 

Li Jin edited comment on SPARK-24373 at 5/24/18 9:00 PM:
-

Here is a reproduction:
{code:java}
val myUDF = udf((x: Long) => { println(""); x + 1 })

val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.count()
// No  printed
{code}
It appears the issue is related to the UDF:

 
{code:java}
val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.groupBy().count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
  +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(1) Project
  +- *(1) Range (0, 1, step=1, splits=2)
{code}
Without the UDF, "count" materializes the cache:

 
{code:java}
val df1 = spark.range(0, 1).toDF("s").select($"s" + 1)
df1.cache()
df1.groupBy().count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
  +- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(1) InMemoryTableScan
+- InMemoryRelation [(s + 1)#179L], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) Project [(id#175L + 1) AS (s + 1)#179L]
  +- *(1) Range (0, 1, step=1, splits=2) ,None)
+- *(1) Project [(id#175L + 1) AS (s + 1)#179L]
   +- *(1) Range (0, 1, step=1, splits=2)

{code}
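
A possible workaround sketch (my own illustration, not taken from this ticket, and only an assumption about the Spark 2.3 behavior): force materialization with an action that executes the cached plan as-is, rather than relying on the optimized count:
{code:java}
// Hedged workaround sketch (an assumption, not from the ticket): execute the
// cached plan itself so the InMemoryRelation is built, instead of relying on
// count(), whose aggregate plan skips the InMemoryTableScan here.
val myUDF = udf((x: Long) => x + 1)   // simplified stand-in for the UDF above

val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.collect()   // scans df1's own plan and should populate the cache
df1.count()     // subsequent counts can then be served from the cached data
{code}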
 

 

 


was (Author: icexelloss):
This is a reproduction:
{code:java}
val myUDF = udf((x: Long) => { println(""); x + 1 })

val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.count()
// No  printed{code}

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see it prints "". 
> In the above example, when you do explain, you could see
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], 

[jira] [Comment Edited] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489060#comment-16489060
 ] 

Li Jin edited comment on SPARK-24373 at 5/24/18 8:51 PM:
-

This is a reproduction:
{code:java}
val myUDF = udf((x: Long) => { println(""); x + 1 })

val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s"))
df1.cache()
df1.count()
// No  printed{code}


was (Author: icexelloss):
This is a reproduction in a unit test:
{code:java}
test("cache and count") {
  var evalCount = 0
  val myUDF = udf((x: String) => { evalCount += 1; "result" })
  val df = spark.range(0, 1).select(myUDF($"id"))
  df.cache()
  df.count()
  assert(evalCount === 1)

  df.count()
  assert(evalCount === 1)
}
{code}

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see it prints "". 
> In the above example, when you do explain, you could see
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted

2018-05-24 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489776#comment-16489776
 ] 

Yinan Li commented on SPARK-24383:
--

Can you double check if the services have an {{OwnerReference}} pointing to a 
driver pod?

> spark on k8s: "driver-svc" are not getting deleted
> --
>
> Key: SPARK-24383
> URL: https://issues.apache.org/jira/browse/SPARK-24383
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Lenin
>Priority: Major
>
> When the driver pod exists, the "*driver-svc" services created for the driver 
> are not cleaned up. This causes accumulation of services in the k8s layer, at 
> one point no more services can be created. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-05-24 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489764#comment-16489764
 ] 

Jiang Xingbo commented on SPARK-24375:
--

We propose to add a new RDDBarrier and a BarrierTaskContext to support barrier 
scheduling in Spark; this also requires modifying how job scheduling works a 
bit to accommodate the new feature.

 

*Barrier Stage*: A barrier stage doesn't launch any of its tasks until the 
available slots (free CPU cores that can be used to launch pending tasks) are 
enough to launch all of its tasks at the same time, and it always retries the 
whole stage when any task(s) fail. One way to identify whether a stage is a 
barrier stage is to trace the RDDs that the stage runs on: if the stage contains 
an RDDBarrier, or at least one of the ancestor RDD(s) is an RDDBarrier, then the 
stage is a barrier stage; the tracing shall stop at ShuffleRDD(s).

 

*Schedule Barrier Tasks*: Currently the TaskScheduler schedules pending tasks on 
available slots on a best-effort basis, so normally not all tasks in the same 
stage get launched at the same time. We may add a check of the total available 
slots before scheduling tasks from a barrier stage taskset. It is still possible 
that only part of the tasks of a barrier stage taskset get launched due to task 
locality issues, so we have to check again before launch to ensure that all 
tasks in the same barrier stage get launched at the same time.

If we consider scheduling several jobs at the same time (both barrier and 
regular jobs), barrier tasks may be blocked by regular tasks (when the available 
slots are always fewer than what a barrier stage taskset requires), or one 
barrier stage taskset may block another barrier stage taskset (when a barrier 
stage taskset that requires fewer slots tends to be scheduled earlier). Currently 
we don't have a perfect solution for all these scenarios, but at least we may 
avoid the worst case of a huge barrier stage taskset being blocked forever on a 
busy cluster by using a time-based weight approach (conceptually, a taskset that 
has been pending for a longer time will be assigned a greater priority weight 
for scheduling).

 

*Task Barrier*: Barrier tasks shall allow users to insert a synchronization 
point in the middle of task execution. This can be achieved by introducing a 
global barrier operation in TaskContext, which makes the current task wait until 
all tasks in the same stage hit this barrier.
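
For illustration only, a rough sketch of how the proposed API could be used (the names follow this proposal; the exact signatures are assumptions, not a finalized design):
{code:java}
// Hypothetical usage sketch -- RDDBarrier / BarrierTaskContext are proposed here
// and do not exist in Spark 2.3.
val rdd = sc.parallelize(1 to 100, 4)

val result = rdd.barrier()                 // marks the stage as a barrier stage
  .mapPartitions { iter =>
    val ctx = BarrierTaskContext.get()     // barrier-aware task context
    val partial = iter.map(_ * 2).toArray  // some local work
    ctx.barrier()                          // wait until all tasks in the stage reach this point
    partial.iterator
  }
  .collect()
{code}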

 

*Task Failure*: To ensure correctness, a barrier stage always retries the whole 
stage when any task(s) fail. Thus, it's quite straightforward that we shall 
require killing all the running tasks of a failed stage, and that also guarantees 
at most one taskset shall be running for each single stage (no zombie tasks).

 

*Speculative Task*: Since we require launching all tasks in a barrier stage at 
the same time, there is no need to launch a speculative task for a barrier stage 
taskset.

 

*Share TaskInfo*: To share information between tasks in a barrier stage, we 
may update it in `TaskContext.localProperties`.


*Python Support*: Expose RDDBarrier and BarrierTaskContext to pyspark.

 

[~cloud_fan] maybe you want to give additional information I didn't cover 
above? (esp. PySpark)

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels

2018-05-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489759#comment-16489759
 ] 

Li Jin commented on SPARK-24324:


This is a dup of https://issues.apache.org/jira/browse/SPARK-23929; I am now 
convinced this behavior should be fixed.

> Pandas Grouped Map UserDefinedFunction mixes column labels
> --
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("day", StringType(), False),
>  

[jira] [Comment Edited] (SPARK-24036) Stateful operators in continuous processing

2018-05-24 Thread Jose Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489648#comment-16489648
 ] 

Jose Torres edited comment on SPARK-24036 at 5/24/18 8:23 PM:
--

I've been notified of 
[https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374?filter=allopenissues,]
  a SPIP for an API which would provide much of what we need here wrt letting 
tasks know where the appropriate shuffle endpoints are.


was (Author: joseph.torres):
I've been notified of 
[https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374,],
 a SPIP for an API which would provide much of what we need here wrt letting 
tasks know where the appropriate shuffle endpoints are.

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24036) Stateful operators in continuous processing

2018-05-24 Thread Jose Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489648#comment-16489648
 ] 

Jose Torres edited comment on SPARK-24036 at 5/24/18 8:22 PM:
--

I've been notified of 
[https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374,],
 a SPIP for an API which would provide much of what we need here wrt letting 
tasks know where the appropriate shuffle endpoints are.


was (Author: joseph.torres):
I've been notified of 
[https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374,] a SPIP for 
an API which would provide much of what we need here wrt letting tasks know 
where the appropriate shuffle endpoints are.

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing

2018-05-24 Thread Jose Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489727#comment-16489727
 ] 

Jose Torres commented on SPARK-24036:
-

That's out of scope - the shuffle reader and writer work in this Jira would 
still be needed on top.

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing

2018-05-24 Thread Arun Mahadevan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489700#comment-16489700
 ] 

Arun Mahadevan commented on SPARK-24036:


If I understand correctly, a continuous job would have a single stage with tasks 
running at the same time and shuffling data around (making use of the "TaskInfo" 
to figure out the endpoints). This means we cannot re-use the existing shuffle 
infra, since it makes sense only if there are multiple stages? Does SPARK-24374 
plan to provide the shuffle infra to move data around, or is that out of scope?

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24332) Fix places reading 'spark.network.timeout' as milliseconds

2018-05-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-24332.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21382
[https://github.com/apache/spark/pull/21382]

> Fix places reading 'spark.network.timeout' as milliseconds
> --
>
> Key: SPARK-24332
> URL: https://issues.apache.org/jira/browse/SPARK-24332
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.0
>
>
> There are some places reading "spark.network.timeout" using "getTimeAsMs" 
> rather than "getTimeAsSeconds". This will return a wrong value when the user 
> specifies a value without a time unit.
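
For reference, a minimal sketch of the unit mismatch (assuming only the standard SparkConf time accessors; this snippet is illustrative and not taken from the ticket):
{code:java}
import org.apache.spark.SparkConf

// "spark.network.timeout" is documented in seconds, so a bare "120" means 120 seconds.
val conf = new SparkConf().set("spark.network.timeout", "120")

conf.getTimeAsSeconds("spark.network.timeout")  // 120 -- seconds, as intended
conf.getTimeAsMs("spark.network.timeout")       // 120 -- read as 120 ms, off by a factor of 1000
{code}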



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted

2018-05-24 Thread Lenin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489665#comment-16489665
 ] 

Lenin commented on SPARK-24383:
---

It's not something I observed. I had a lot of dangling services.

> spark on k8s: "driver-svc" are not getting deleted
> --
>
> Key: SPARK-24383
> URL: https://issues.apache.org/jira/browse/SPARK-24383
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Lenin
>Priority: Major
>
> When the driver pod exists, the "*driver-svc" services created for the driver 
> are not cleaned up. This causes accumulation of services in the k8s layer, at 
> one point no more services can be created. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23416) Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-05-24 Thread Jose Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489661#comment-16489661
 ] 

Jose Torres commented on SPARK-23416:
-

Do you know how to drive that? I'm not sure what the process is.

> Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for 
> failOnDataLoss=false
> 
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Minor
> Fix For: 2.4.0
>
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> 

[jira] [Assigned] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels

2018-05-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24324:


Assignee: Apache Spark

> Pandas Grouped Map UserDefinedFunction mixes column labels
> --
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Assignee: Apache Spark
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("day", StringType(), False),
>  StructField("enc", StringType(), False)])
> from pyspark.sql.functions 

[jira] [Commented] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels

2018-05-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489650#comment-16489650
 ] 

Apache Spark commented on SPARK-24324:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/21427

> Pandas Grouped Map UserDefinedFunction mixes column labels
> --
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("day", StringType(), False),
>

[jira] [Assigned] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels

2018-05-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24324:


Assignee: (was: Apache Spark)

> Pandas Grouped Map UserDefinedFunction mixes column labels
> --
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("day", StringType(), False),
>  StructField("enc", StringType(), False)])
> from pyspark.sql.functions import pandas_udf, 

[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing

2018-05-24 Thread Jose Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489648#comment-16489648
 ] 

Jose Torres commented on SPARK-24036:
-

I've been notified of 
[https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374,] a SPIP for 
an API which would provide much of what we need here wrt letting tasks know 
where the appropriate shuffle endpoints are.

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24385) Trivially-true EqualNullSafe should be handled like EqualTo in Dataset.join

2018-05-24 Thread Daniel Shields (JIRA)
Daniel Shields created SPARK-24385:
--

 Summary: Trivially-true EqualNullSafe should be handled like 
EqualTo in Dataset.join
 Key: SPARK-24385
 URL: https://issues.apache.org/jira/browse/SPARK-24385
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.2.1
Reporter: Daniel Shields


Dataset.join(right: Dataset[_], joinExprs: Column, joinType: String) has 
special logic for resolving trivially-true predicates to both sides. It 
currently handles regular equals but not null-safe equals; the code should be 
updated to also handle null-safe equals.

Pyspark example:
{code:java}
df = spark.range(10)
df.join(df, 'id').collect() # This works.
df.join(df, df['id'] == df['id']).collect() # This works.
df.join(df, df['id'].eqNullSafe(df['id'])).collect() # This fails!!!

# This is a workaround that works.
df2 = df.withColumn('id', F.col('id'))
df.join(df2, df['id'].eqNullSafe(df2['id'])).collect(){code}
The relevant code in Dataset.join should look like this:
{code:java}
// Otherwise, find the trivially true predicates and automatically resolves 
them to both sides.
// By the time we get here, since we have already run analysis, all attributes 
should've been
// resolved and become AttributeReference.
val cond = plan.condition.map { _.transform {
  case catalyst.expressions.EqualTo(a: AttributeReference, b: 
AttributeReference) if a.sameRef(b) =>
catalyst.expressions.EqualTo(
  withPlan(plan.left).resolve(a.name),
  withPlan(plan.right).resolve(b.name))
  // This case is new!!!
  case catalyst.expressions.EqualNullSafe(a: AttributeReference, b: 
AttributeReference) if a.sameRef(b) =>
catalyst.expressions.EqualNullSafe(
  withPlan(plan.left).resolve(a.name),
  withPlan(plan.right).resolve(b.name))
}}
{code}
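
For completeness, a Scala-side sketch of the same trivially-true self-join (my own illustration, assuming the Scala API hits the same code path as the PySpark example above):
{code:java}
val df = spark.range(10)
df.join(df, df("id") === df("id")).collect()  // works: EqualTo is resolved to both sides
df.join(df, df("id") <=> df("id")).collect()  // fails without the EqualNullSafe case above
{code}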



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels

2018-05-24 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-24324:
-
Summary: Pandas Grouped Map UserDefinedFunction mixes column labels  (was: 
UserDefinedFunction mixes column labels)

> Pandas Grouped Map UserDefinedFunction mixes column labels
> --
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("day", StringType(), False),
>  StructField("enc", 

[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted

2018-05-24 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489601#comment-16489601
 ] 

Yinan Li commented on SPARK-24383:
--

The Kubernetes specific submission client adds an {{OwnerReference}} 
referencing the driver pod to the service so if you delete the driver pod, the 
corresponding service should be garbage collected. 

> spark on k8s: "driver-svc" are not getting deleted
> --
>
> Key: SPARK-24383
> URL: https://issues.apache.org/jira/browse/SPARK-24383
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Lenin
>Priority: Major
>
> When the driver pod exists, the "*driver-svc" services created for the driver 
> are not cleaned up. This causes accumulation of services in the k8s layer, at 
> one point no more services can be created. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

2018-05-24 Thread Misha Dmitriev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Misha Dmitriev updated SPARK-24356:
---
Attachment: SPARK-24356.01.patch

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> --
>
> Key: SPARK-24356
> URL: https://issues.apache.org/jira/browse/SPARK-24356
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
> Attachments: SPARK-24356.01.patch
>
>
> I recently analyzed a heap dump of Yarn Node Manager that was suffering from 
> high GC pressure due to high object churn. Analysis was done with the jxray 
> tool ([www.jxray.com|http://www.jxray.com/]) that checks a heap dump for a 
> number of well-known memory issues. One problem that it found in this dump is 
> 19.5% of memory wasted due to duplicate strings. Of these duplicates, more 
> than a half come from {{FileInputStream.path}} and {{File.path}}. All the 
> {{FileInputStream}} objects that JXRay shows are garbage - looks like they 
> are used for a very short period and then discarded (I guess there is a 
> separate question of whether that's a good pattern). But {{File}} instances 
> are traceable to 
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here 
> is the full reference chain:
>  
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>  
> Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very 
> similar, so I think {{FileInputStream}}s are generated by the 
> {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely 
> come from 
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>  
> To avoid duplicate strings in {{File.path}}'s in this case, it is suggested 
> that in the above code we create a File with a complete, normalized pathname 
> that has already been interned. This will prevent the code inside 
> {{java.io.File}} from modifying this string, so it will use the 
> interned copy and pass it to FileInputStream. Essentially, the current 
> line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)), 
> filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) + 
> File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489546#comment-16489546
 ] 

Hyukjin Kwon edited comment on SPARK-24358 at 5/24/18 6:31 PM:
---

Yea, I know the differences and I know the rationale here. We would need 
strong evidence and reasons to accept the divergence. Also, we need to take a 
look at the PySpark code base too and check for such divergence.
FWIW, I was trying to take a look and fix the differences among bytes, str and 
unicode, and I am currently stuck due to other work.




was (Author: hyukjin.kwon):
Yea, I know the differences and I know the rationale here. We would need 
strong evidence and reasons to accept the divergence. Also, we need to take a 
look at the PySpark code base too and check for such divergence.
FWIW, I was trying to take a look and fix the differences among bytes, str and 
unicode, and I am currently stuck due to other swarming work.



> createDataFrame in Python 3 should be able to infer bytes type as Binary type
> -
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>  Labels: Python3
>
> createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes 
> is just the immutable, hashable version of this same structure, it makes 
> sense for the same thing to apply there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489546#comment-16489546
 ] 

Hyukjin Kwon commented on SPARK-24358:
--

Yea, I know the differences and I know the rationale here. We would need 
strong evidence and reasons to accept the divergence. Also, we need to take a 
look at the PySpark code base too and check for such divergence.
FWIW, I was trying to take a look and fix the differences among bytes, str and 
unicode, and I am currently stuck due to other swarming work.



> createDataFrame in Python 3 should be able to infer bytes type as Binary type
> -
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>  Labels: Python3
>
> createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes 
> is just the immutable, hashable version of this same structure, it makes 
> sense for the same thing to apply there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Wenbo Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489534#comment-16489534
 ] 

Wenbo Zhao commented on SPARK-24373:


It is not apparent to me that they are the same issue, though.

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see it prints "". 
> In the above example, when you do explain, you could see
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24324) UserDefinedFunction mixes column labels

2018-05-24 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489533#comment-16489533
 ] 

Bryan Cutler commented on SPARK-24324:
--

I was able to reproduce it. The problem is that when pyspark processes the result 
from the UDF, it iterates over the columns by index, not by name.  Since you are 
returning a new pandas.DataFrame built from a dict, there is no guarantee of the 
column order when pandas makes the DataFrame (this is just the nature of the 
Python dict for Python < 3.6, I believe).  I can submit a fix that should not 
change any behaviors, I think.  As a workaround, you could write your UDF to 
build the pandas.DataFrame using an OrderedDict, like so:

 
{code:python}
from collections import OrderedDict

@pandas_udf(new_schema, PandasUDFType.GROUPED_MAP)
def myudf(x):
    foo = 'foo'
    return pd.DataFrame(OrderedDict([('lang', x.lang), ('page', x.page),
                                     ('foo', foo)])){code}
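
For completeness, a hypothetical usage sketch of that workaround (`sdf` and the 
grouping columns are illustrative stand-ins, not taken from the reporter's job):

{code:python}
# Assumes `sdf` is a Spark DataFrame with 'lang' and 'page' columns and that
# `new_schema` lists the output columns in the same order as the OrderedDict above.
result = sdf.groupby('lang', 'page').apply(myudf)
result.show()
{code}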

> UserDefinedFunction mixes column labels
> ---
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  

[jira] [Commented] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization

2018-05-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489524#comment-16489524
 ] 

Apache Spark commented on SPARK-24384:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/21426

> spark-submit --py-files with .py files doesn't work in client mode before 
> context initialization
> 
>
> Key: SPARK-24384
> URL: https://issues.apache.org/jira/browse/SPARK-24384
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In case the given Python file is a .py file (a zip file seems fine), it seems the 
> Python path is dynamically added only after the context is initialized.
> with this pyFile:
> {code}
> $ cat /home/spark/tmp.py
> def testtest():
> return 1
> {code}
> This works:
> {code}
> $ cat app.py
> import pyspark
> pyspark.sql.SparkSession.builder.getOrCreate()
> import tmp
> print("%s" % tmp.testtest())
> $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
> /home/spark/tmp.py app.py
> ...
> 1
> {code}
> but this doesn't:
> {code}
> $ cat app.py
> import pyspark
> import tmp
> pyspark.sql.SparkSession.builder.getOrCreate()
> print("%s" % tmp.testtest())
> $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
> /home/spark/tmp.py app.py
> Traceback (most recent call last):
>   File "/home/spark/spark/app.py", line 2, in 
> import tmp
> ImportError: No module named tmp
> {code}
> See 
> https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization

2018-05-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24384:


Assignee: (was: Apache Spark)

> spark-submit --py-files with .py files doesn't work in client mode before 
> context initialization
> 
>
> Key: SPARK-24384
> URL: https://issues.apache.org/jira/browse/SPARK-24384
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In case the given Python file is a .py file (a zip file seems fine), it seems the 
> Python path is dynamically added only after the context is initialized.
> with this pyFile:
> {code}
> $ cat /home/spark/tmp.py
> def testtest():
> return 1
> {code}
> This works:
> {code}
> $ cat app.py
> import pyspark
> pyspark.sql.SparkSession.builder.getOrCreate()
> import tmp
> print("%s" % tmp.testtest())
> $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
> /home/spark/tmp.py app.py
> ...
> 1
> {code}
> but this doesn't:
> {code}
> $ cat app.py
> import pyspark
> import tmp
> pyspark.sql.SparkSession.builder.getOrCreate()
> print("%s" % tmp.testtest())
> $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
> /home/spark/tmp.py app.py
> Traceback (most recent call last):
>   File "/home/spark/spark/app.py", line 2, in 
> import tmp
> ImportError: No module named tmp
> {code}
> See 
> https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization

2018-05-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24384:


Assignee: Apache Spark

> spark-submit --py-files with .py files doesn't work in client mode before 
> context initialization
> 
>
> Key: SPARK-24384
> URL: https://issues.apache.org/jira/browse/SPARK-24384
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> In case the given Python file is a .py file (a zip file seems fine), it seems the 
> Python path is dynamically added only after the context is initialized.
> with this pyFile:
> {code}
> $ cat /home/spark/tmp.py
> def testtest():
> return 1
> {code}
> This works:
> {code}
> $ cat app.py
> import pyspark
> pyspark.sql.SparkSession.builder.getOrCreate()
> import tmp
> print("%s" % tmp.testtest())
> $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
> /home/spark/tmp.py app.py
> ...
> 1
> {code}
> but this doesn't:
> {code}
> $ cat app.py
> import pyspark
> import tmp
> pyspark.sql.SparkSession.builder.getOrCreate()
> print("%s" % tmp.testtest())
> $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
> /home/spark/tmp.py app.py
> Traceback (most recent call last):
>   File "/home/spark/spark/app.py", line 2, in 
> import tmp
> ImportError: No module named tmp
> {code}
> See 
> https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-24 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489507#comment-16489507
 ] 

Joel Croteau commented on SPARK-24358:
--

This does mean that the current implementation has some compatibility issues 
with Python 3. In Python 2, a bytes will be inferred as a StringType, 
regardless of content. StringType and BinaryType are functionally identical, as 
they are both just arbitrary arrays of bytes, and Python 2 will handle any 
value of them just fine. In Python 3, attempting to infer the type of a bytes 
is an error, and Python 3 will convert a StringType to Unicode. Since not every 
byte string is valid Unicode, some errors may occur in processing StringTypes 
in Python 3 that worked fine in Python 2.
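
As a minimal sketch of the inference difference described above (assuming Python 3, 
Spark 2.3 and an existing SparkSession named `spark`; the exact error text may vary):

{code:python}
# bytearray is inferred as BinaryType, so this works:
spark.createDataFrame([(bytearray(b'abc'),)], ['b']).printSchema()
# root
#  |-- b: binary (nullable = true)

# bytes has no mapping in Python 3's type inference, so this raises a TypeError:
spark.createDataFrame([(b'abc',)], ['b'])
{code}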

> createDataFrame in Python 3 should be able to infer bytes type as Binary type
> -
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>  Labels: Python3
>
> createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes 
> is just the immutable, hashable version of this same structure, it makes 
> sense for the same thing to apply there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24384:
-
Component/s: Spark Submit

> spark-submit --py-files with .py files doesn't work in client mode before 
> context initialization
> 
>
> Key: SPARK-24384
> URL: https://issues.apache.org/jira/browse/SPARK-24384
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In case the given Python file is a .py file (a zip file seems fine), it seems the 
> Python path is dynamically added only after the context is initialized.
> with this pyFile:
> {code}
> $ cat /home/spark/tmp.py
> def testtest():
> return 1
> {code}
> This works:
> {code}
> $ cat app.py
> import pyspark
> pyspark.sql.SparkSession.builder.getOrCreate()
> import tmp
> print("%s" % tmp.testtest())
> $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
> /home/spark/tmp.py app.py
> ...
> 1
> {code}
> but this doesn't:
> {code}
> $ cat app.py
> import pyspark
> import tmp
> pyspark.sql.SparkSession.builder.getOrCreate()
> print("%s" % tmp.testtest())
> $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
> /home/spark/tmp.py app.py
> Traceback (most recent call last):
>   File "/home/spark/spark/app.py", line 2, in 
> import tmp
> ImportError: No module named tmp
> {code}
> See 
> https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization

2018-05-24 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-24384:


 Summary: spark-submit --py-files with .py files doesn't work in 
client mode before context initialization
 Key: SPARK-24384
 URL: https://issues.apache.org/jira/browse/SPARK-24384
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0, 2.4.0
Reporter: Hyukjin Kwon


In case the given Python file is a .py file (a zip file seems fine), it seems the 
Python path is dynamically added only after the context is initialized.

with this pyFile:

{code}
$ cat /home/spark/tmp.py
def testtest():
return 1
{code}

This works:

{code}
$ cat app.py
import pyspark
pyspark.sql.SparkSession.builder.getOrCreate()
import tmp
print("%s" % tmp.testtest())

$ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
/home/spark/tmp.py app.py
...
1
{code}

but this doesn't:

{code}
$ cat app.py
import pyspark
import tmp
pyspark.sql.SparkSession.builder.getOrCreate()
print("%s" % tmp.testtest())

$ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
/home/spark/tmp.py app.py
Traceback (most recent call last):
  File "/home/spark/spark/app.py", line 2, in 
import tmp
ImportError: No module named tmp
{code}

See 
https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23754) StopIterator exception in Python UDF results in partial result

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23754:
-
Priority: Blocker  (was: Major)

> StopIterator exception in Python UDF results in partial result
> --
>
> Key: SPARK-23754
> URL: https://issues.apache.org/jira/browse/SPARK-23754
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Blocker
>
> Reproduce:
> {code:java}
> df = spark.range(0, 1000)
> from pyspark.sql.functions import udf
> def foo(x):
> raise StopIteration()
> df.withColumn('v', udf(foo)).show()
> # Results
> # +---+---+
> # | id|  v|
> # +---+---+
> # +---+---+{code}
> I think the task should fail in this case



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

2018-05-24 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489417#comment-16489417
 ] 

Imran Rashid commented on SPARK-24356:
--

cc [~jinxing6...@126.com] [~elu] [~felixcheung] -- this could be a nice win to 
decrease GC pressure on the shuffle service; it might be related to issues you are 
running into.

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> --
>
> Key: SPARK-24356
> URL: https://issues.apache.org/jira/browse/SPARK-24356
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
>
> I recently analyzed a heap dump of Yarn Node Manager that was suffering from 
> high GC pressure due to high object churn. Analysis was done with the jxray 
> tool ([www.jxray.com|http://www.jxray.com/]) that checks a heap dump for a 
> number of well-known memory issues. One problem that it found in this dump is 
> 19.5% of memory wasted due to duplicate strings. Of these duplicates, more 
> than a half come from {{FileInputStream.path}} and {{File.path}}. All the 
> {{FileInputStream}} objects that JXRay shows are garbage - looks like they 
> are used for a very short period and then discarded (I guess there is a 
> separate question of whether that's a good pattern). But {{File}} instances 
> are traceable to 
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here 
> is the full reference chain:
>  
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>  
> Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very 
> similar, so I think {{FileInputStream}}s are generated by the 
> {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely 
> come from 
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>  
> To avoid duplicate strings in {{File.path}}'s in this case, it is suggested 
> that in the above code we create a File with a complete, normalized pathname, 
> that has been already interned. This will prevent the code inside 
> {{java.io.File}} from modifying this string, and thus it will use the 
> interned copy, and will pass it to FileInputStream. Essentially the current 
> line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)), 
> filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) + 
> File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode

2018-05-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489397#comment-16489397
 ] 

Hyukjin Kwon commented on SPARK-21945:
--

For another clarification, the launch execution codepath is diverted for shell 
(and also Yarn) specifically. So, I believe the merged PR itself is a-okay.

> pyspark --py-files doesn't work in yarn client mode
> ---
>
> Key: SPARK-21945
> URL: https://issues.apache.org/jira/browse/SPARK-21945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip  but it doesn't 
> properly add the zip file to the PYTHONPATH.
> I can work around by exporting PYTHONPATH.
> Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand  I don't see 
> this supported at all.   If that is the case perhaps it should be moved to 
> improvement.
> Note it works via spark-submit in both client and cluster mode to run python 
> script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489389#comment-16489389
 ] 

Marcelo Vanzin commented on SPARK-24373:


This  could be the same as SPARK-23309.

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see it prints "". 
> In the above example, when you do explain, you could see
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2

2018-05-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489391#comment-16489391
 ] 

Marcelo Vanzin commented on SPARK-23309:


[~kiszk] SPARK-24373 has some code.

> Spark 2.3 cached query performance 20-30% worse then spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing spark 2.3 rc2 and I am seeing a performance regression in sql 
> queries on cached data.
> The size of the data: 10.4GB input from hive orc files /18.8 GB cached/5592 
> partitions
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = 
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached") 
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>  
> On spark 2.2 I see queries times average 13 seconds
> On the same nodes I see spark 2.3 queries times average 17 seconds
> Note these are times of queries after the initial caching.  so just running 
> the last line again: 
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() 
> multiple times.
>  
> I also ran a query over more data (335GB input/587.5 GB cached) and saw a 
> similar discrepancy in the performance of querying cached data between spark 
> 2.3 and spark 2.2, where 2.2 was better by like 20%.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24372) Create script for preparing RCs

2018-05-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489378#comment-16489378
 ] 

Marcelo Vanzin commented on SPARK-24372:


I'm keeping the current version of the scripts here while I tweak them:
https://github.com/vanzin/spark/tree/SPARK-24372

[~tgraves] [~holdenkarau] you might be interested in checking those out, 
although there may be tweaks needed for older branches.

I also took a stab at incorporating [~felixcheung]'s docker image in that patch.

> Create script for preparing RCs
> ---
>
> Key: SPARK-24372
> URL: https://issues.apache.org/jira/browse/SPARK-24372
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> Currently, when preparing RCs, the RM has to invoke many scripts manually, 
> make sure that is being done in the correct environment, and set all the 
> correct environment variables, which differ from one script to the other.
> It will be much easier for RMs if all that was automated as much as possible.
> I'm working on something like this as part of releasing 2.3.1, and plan to 
> send my scripts for review after the release is done (i.e. after I make sure 
> the scripts are working properly).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode

2018-05-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489367#comment-16489367
 ] 

Hyukjin Kwon edited comment on SPARK-21945 at 5/24/18 4:50 PM:
---

To be more correct, the paths are added as-is, given my investigation so far. 
That's fine for a zip archive, but for a .py file the path shouldn't be added as-is 
(its parent directory should be added instead) ...
so for .py files, yes, we should copy them too.

It's weird, but I think this is all because we happened to support .py files in 
the same option, whereas PYTHONPATH doesn't expect a .py file.


was (Author: hyukjin.kwon):
To be more correct, the paths are added as-is, given my investigation so far. 
That's fine for a zip archive, but for a .py file the path shouldn't be added as-is 
(its parent directory should be added instead) ...
so for .py files, yes, we should copy them too.

It's weird, but I think this is all because we happened to support .py files in 
the same option, whereas PYTHONPATH doesn't expect a file.

> pyspark --py-files doesn't work in yarn client mode
> ---
>
> Key: SPARK-21945
> URL: https://issues.apache.org/jira/browse/SPARK-21945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip  but it doesn't 
> properly add the zip file to the PYTHONPATH.
> I can work around by exporting PYTHONPATH.
> Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand  I don't see 
> this supported at all.   If that is the case perhaps it should be moved to 
> improvement.
> Note it works via spark-submit in both client and cluster mode to run python 
> script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode

2018-05-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489369#comment-16489369
 ] 

Hyukjin Kwon commented on SPARK-21945:
--

Just for clarification, a zip file works fine because the Python paths are added 
as-is to PythonRunner, given my investigation so far.

> pyspark --py-files doesn't work in yarn client mode
> ---
>
> Key: SPARK-21945
> URL: https://issues.apache.org/jira/browse/SPARK-21945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip  but it doesn't 
> properly add the zip file to the PYTHONPATH.
> I can work around by exporting PYTHONPATH.
> Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand  I don't see 
> this supported at all.   If that is the case perhaps it should be moved to 
> improvement.
> Note it works via spark-submit in both client and cluster mode to run python 
> script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode

2018-05-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489367#comment-16489367
 ] 

Hyukjin Kwon commented on SPARK-21945:
--

To be more correct, the paths are added as-is, given my investigation so far. 
That's fine for a zip archive, but for a .py file the path shouldn't be added as-is 
(its parent directory should be added instead) ...
so for .py files, yes, we should copy them too.

It's weird, but I think this is all because we happened to support .py files in 
the same option, whereas PYTHONPATH doesn't expect a file.
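
As a plain-Python illustration of that last point (a minimal sketch, independent of 
Spark, reusing the /home/spark/tmp.py example from the issue):

{code:python}
import sys

# sys.path entries are expected to be directories or zip archives, not single files,
# so pointing it at the .py file itself does not make the module importable:
sys.path.insert(0, '/home/spark/tmp.py')
# import tmp  # would still fail with ImportError

# adding the parent directory instead makes the import work:
sys.path.insert(0, '/home/spark')
import tmp
print(tmp.testtest())  # 1
{code}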

> pyspark --py-files doesn't work in yarn client mode
> ---
>
> Key: SPARK-21945
> URL: https://issues.apache.org/jira/browse/SPARK-21945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip  but it doesn't 
> properly add the zip file to the PYTHONPATH.
> I can work around by exporting PYTHONPATH.
> Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand  I don't see 
> this supported at all.   If that is the case perhaps it should be moved to 
> improvement.
> Note it works via spark-submit in both client and cluster mode to run python 
> script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode

2018-05-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489338#comment-16489338
 ] 

Marcelo Vanzin commented on SPARK-21945:


It happens because the import happens before the context is initialized, and 
your fix only copies the files during initialization of the context.

To fix this case you'd have to add logic to perform the copy into the launcher 
library, which would be kinda weird...

> pyspark --py-files doesn't work in yarn client mode
> ---
>
> Key: SPARK-21945
> URL: https://issues.apache.org/jira/browse/SPARK-21945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip  but it doesn't 
> properly add the zip file to the PYTHONPATH.
> I can work around by exporting PYTHONPATH.
> Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand  I don't see 
> this supported at all.   If that is the case perhaps it should be moved to 
> improvement.
> Note it works via spark-submit in both client and cluster mode to run python 
> script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-05-24 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489333#comment-16489333
 ] 

Xiangrui Meng commented on SPARK-24374:
---

[~galv] Thanks for your feedback!
 * SPARK-20327 allows Spark to request GPU from YARN. Within Spark, we also 
need to address the issue of allocating custom resources to tasks. For 
example, a user can request that a task take 1 CPU core and 2 GPUs, and Spark 
needs to assign the executor as well as GPU devices to that task. This is 
relevant to the proposal here. But I want to consider it an orthogonal topic 
because barrier scheduling works with or without GPUs.
 * I picked version 3.0.0 because I'm not sure if there would be breaking 
changes. The proposed API doesn't have breaking changes. But this adds a new 
model to job scheduler and hence fault tolerance model. I want to see more 
design discussions first.
 * I think most MPI implementations use SSH. Spark standalone is usually 
configured with keyless SSH. But this might be an issue on YARN / Kube / Mesos. 
I thought about this but I want to treat it as a separate issue. Spark 
executors can talk to each other in all those cluster modes. So users should 
have a way to set up a hybrid cluster within the barrier stage. The question is 
what info we need to provide at the very minimum.
 * "barrier" comes from "MPI_Barrier". Please check the example code in the 
SPIP. We need to set barrier in two scenarios: 1) configure a Spark stage to be 
in the barrier mode, where Spark job scheduler knows to wait until all task 
slots are ready to launch them, 2) within the mapPartitions closure, we still 
need to provide users context.barrier() to wait for data exchange or user 
program to finish on all tasks.
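
For readers skimming the thread, a minimal sketch of how such a barrier stage might 
look from PySpark; the rdd.barrier() / BarrierTaskContext names follow the SPIP's 
direction but should be read as assumptions here, not a finalized API:

{code:python}
from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(range(8), 4)

def train_partition(iterator):
    data = list(iterator)
    context = BarrierTaskContext.get()
    context.barrier()  # every task in the stage reaches this point before any proceeds
    # ... hand off to the distributed training / MPI-style job here ...
    yield sum(data)

# all four tasks are launched together; if one fails, the whole stage is retried
print(rdd.barrier().mapPartitions(train_partition).collect())
{code}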

> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20712) [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has length greater than 4000 bytes

2018-05-24 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20712:

Target Version/s: 2.4.0

> [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has 
> length greater than 4000 bytes
> ---
>
> Key: SPARK-20712
> URL: https://issues.apache.org/jira/browse/SPARK-20712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0, 2.3.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I have following issue.
> I'm trying to read a table from Hive when one of the columns is nested, so its 
> schema has a length longer than 4000 bytes.
> Everything worked on Spark 2.0.2. On 2.1.1 I'm getting Exception:
> {code}
> >> spark.read.table("SOME_TABLE")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/opt/spark-2.1.1/python/pyspark/sql/readwriter.py", line 259, in table
> return self._df(self._jreader.table(tableName))
>   File 
> "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
>   File "/opt/spark-2.1.1/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", 
> line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o71.table.
> : org.apache.spark.SparkException: Cannot recognize hive type string: 
> SOME_VERY_LONG_FIELD_TYPE
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:628)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:628)
> at 
> 

[jira] [Updated] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24378:
-
Flags:   (was: Important)

> Incorrect examples for date_trunc function in spark 2.3.0
> -
>
> Key: SPARK-24378
> URL: https://issues.apache.org/jira/browse/SPARK-24378
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Prakhar Gupta
>Assignee: Yuming Wang
>Priority: Trivial
> Fix For: 2.3.1, 2.4.0
>
>
> Within the Spark documentation for Spark 2.3.0, the listed examples are incorrect.
> date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit 
> specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", 
> "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", 
> "WEEK", "QUARTER"]
> *Examples:*
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 
> > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }}
>  
> {{Examples should be date_trunc(format, value)}}
> {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24378:
-
Priority: Trivial  (was: Major)

> Incorrect examples for date_trunc function in spark 2.3.0
> -
>
> Key: SPARK-24378
> URL: https://issues.apache.org/jira/browse/SPARK-24378
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Prakhar Gupta
>Assignee: Yuming Wang
>Priority: Trivial
> Fix For: 2.3.1, 2.4.0
>
>
> Within the Spark documentation for Spark 2.3.0, the listed examples are incorrect.
> date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit 
> specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", 
> "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", 
> "WEEK", "QUARTER"]
> *Examples:*
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 
> > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }}
>  
> {{Examples should be date_trunc(format, value)}}
> {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24378:
-
Issue Type: Documentation  (was: Bug)

> Incorrect examples for date_trunc function in spark 2.3.0
> -
>
> Key: SPARK-24378
> URL: https://issues.apache.org/jira/browse/SPARK-24378
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Prakhar Gupta
>Assignee: Yuming Wang
>Priority: Trivial
> Fix For: 2.3.1, 2.4.0
>
>
> Within the Spark documentation for Spark 2.3.0, the listed examples are incorrect.
> date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit 
> specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", 
> "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", 
> "WEEK", "QUARTER"]
> *Examples:*
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 
> > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }}
>  
> {{Examples should be date_trunc(format, value)}}
> {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24378.
--
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.1

Fixed in https://github.com/apache/spark/pull/21423

> Incorrect examples for date_trunc function in spark 2.3.0
> -
>
> Key: SPARK-24378
> URL: https://issues.apache.org/jira/browse/SPARK-24378
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Prakhar Gupta
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Within the Spark documentation for Spark 2.3.0, the listed examples are incorrect.
> date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit 
> specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", 
> "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", 
> "WEEK", "QUARTER"]
> *Examples:*
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 
> > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }}
>  
> {{Examples should be date_trunc(format, value)}}
> {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24378:


Assignee: Yuming Wang

> Incorrect examples for date_trunc function in spark 2.3.0
> -
>
> Key: SPARK-24378
> URL: https://issues.apache.org/jira/browse/SPARK-24378
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Prakhar Gupta
>Assignee: Yuming Wang
>Priority: Major
>
> Within the Spark documentation for Spark 2.3.0, the listed examples are incorrect.
> date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit 
> specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", 
> "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", 
> "WEEK", "QUARTER"]
> *Examples:*
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 
> > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }}
>  
> {{Examples should be date_trunc(format, value)}}
> {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted

2018-05-24 Thread Lenin (JIRA)
Lenin created SPARK-24383:
-

 Summary: spark on k8s: "driver-svc" are not getting deleted
 Key: SPARK-24383
 URL: https://issues.apache.org/jira/browse/SPARK-24383
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Lenin


When the driver pod exits, the "*driver-svc" services created for the driver 
are not cleaned up. This causes an accumulation of services in the k8s layer, until 
at some point no more services can be created.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24381) Improve Unit Test Coverage of NOT IN subqueries

2018-05-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489242#comment-16489242
 ] 

Apache Spark commented on SPARK-24381:
--

User 'mgyucht' has created a pull request for this issue:
https://github.com/apache/spark/pull/21425

> Improve Unit Test Coverage of NOT IN subqueries
> ---
>
> Key: SPARK-24381
> URL: https://issues.apache.org/jira/browse/SPARK-24381
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Miles Yucht
>Priority: Major
>
> Today, the unit test coverage for NOT IN queries in SubquerySuite is somewhat 
> lacking. There are a couple test cases that exist, but it isn't necessarily 
> clear that those tests cover all of the subcomponents of null-aware anti 
> joins, i.e. where the subquery returns a null value, if specific columns of 
> either relation are null, etc. Also, it is somewhat difficult for a newcomer 
> to understand the intended behavior of a null-aware anti join without great 
> effort. We should make sure we have proper coverage as well as improve the 
> documentation of this particular subquery, especially with respect to null 
> behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24381) Improve Unit Test Coverage of NOT IN subqueries

2018-05-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24381:


Assignee: (was: Apache Spark)

> Improve Unit Test Coverage of NOT IN subqueries
> ---
>
> Key: SPARK-24381
> URL: https://issues.apache.org/jira/browse/SPARK-24381
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Miles Yucht
>Priority: Major
>
> Today, the unit test coverage for NOT IN queries in SubquerySuite is somewhat 
> lacking. There are a couple test cases that exist, but it isn't necessarily 
> clear that those tests cover all of the subcomponents of null-aware anti 
> joins, i.e. where the subquery returns a null value, if specific columns of 
> either relation are null, etc. Also, it is somewhat difficult for a newcomer 
> to understand the intended behavior of a null-aware anti join without great 
> effort. We should make sure we have proper coverage as well as improve the 
> documentation of this particular subquery, especially with respect to null 
> behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24381) Improve Unit Test Coverage of NOT IN subqueries

2018-05-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24381:


Assignee: Apache Spark

> Improve Unit Test Coverage of NOT IN subqueries
> ---
>
> Key: SPARK-24381
> URL: https://issues.apache.org/jira/browse/SPARK-24381
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Miles Yucht
>Assignee: Apache Spark
>Priority: Major
>
> Today, the unit test coverage for NOT IN queries in SubquerySuite is somewhat 
> lacking. There are a couple test cases that exist, but it isn't necessarily 
> clear that those tests cover all of the subcomponents of null-aware anti 
> joins, i.e. where the subquery returns a null value, if specific columns of 
> either relation are null, etc. Also, it is somewhat difficult for a newcomer 
> to understand the intended behavior of a null-aware anti join without great 
> effort. We should make sure we have proper coverage as well as improve the 
> documentation of this particular subquery, especially with respect to null 
> behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files

2018-05-24 Thread Trevor McKay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489186#comment-16489186
 ] 

Trevor McKay edited comment on SPARK-24091 at 5/24/18 3:39 PM:
---

I had a similar situation in a project.

One way to handle this is to allow a user-specified ConfigMap to be mounted at 
another location, then in the image startup script (entrypoint.sh) check for a 
non-empty directory. If there are files there, copy them one by one to 
$SPARK_HOME/conf. This gives you an override semantic for just those files 
included.  The ConfigMap volume can be marked as optional so that if it's not 
created the container will still start.

 

Another way to do this of course would be to somehow allow the user to modify 
the contents of the original ConfigMap mentioned above, but this might be tough 
depending on how it's named, when it's generated, etc etc. 

 For reference, this is how you make a ConfigMap volume optional (borrowing an 
example from OpenShift documentation)
{code:yaml}
apiVersion: v1
kind: Pod
metadata: 
  name: dapi-test-pod
spec: 
  containers: 
- name: test-container
  image: gcr.io/google_containers/busybox
  command: [ "/bin/sh", "cat", "/etc/config/special.how" ]
  volumeMounts: 
  - name: config-volume
mountPath: /etc/config
volumes: 
  - name: config-volume
configMap: 
  name: special-config
  optional: true
restartPolicy: Never{code}
 


was (Author: tmckay):
I had a similar situation in a project.

One way to handle this is to allow a user-specified ConfigMap to be mounted at 
another location, then in the image startup script (entrypoint.sh) check for a 
non-empty directory. If there are files there, copy them one by one to 
$SPARK_HOME/conf. This gives you an override semantic for just those files 
included.  The ConfigMap volume can be marked as optional so that if it's not 
created the container will still start.

 

Another way to do this of course would be to somehow allow the user to modify 
the contents of the original ConfigMap mentioned above, but this might be tough 
depending on how it's named, when it's generated, etc etc. 

 For reference, this is how you make a ConfigMap volume optional (borrowing an 
example from OpenShift documentation)

{code:yaml}
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
- name: test-container
  image: gcr.io/google_containers/busybox
  command: [ "/bin/sh", "cat", "/etc/config/special.how" ]
  volumeMounts:
  - name: config-volume
mountPath: /etc/config
volumes:
  - name: config-volume
configMap:
  name: special-config
  optional: true
restartPolicy: Never{code}
 

> Internally used ConfigMap prevents use of user-specified ConfigMaps carrying 
> Spark configs files
> 
>
> Key: SPARK-24091
> URL: https://issues.apache.org/jira/browse/SPARK-24091
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> The recent PR [https://github.com/apache/spark/pull/20669] for removing the 
> init-container introduced an internally used ConfigMap carrying Spark 
> configuration properties in a file for the driver. This ConfigMap gets 
> mounted under {{$SPARK_HOME/conf}} and the environment variable 
> {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much 
> prevents users from mounting their own ConfigMaps that carry custom Spark 
> configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}} and 
> leaves users with only the option of building custom images. IMO, it is very 
> useful to support mounting user-specified ConfigMaps for custom Spark 
> configuration files. This is worth further discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files

2018-05-24 Thread Trevor McKay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489186#comment-16489186
 ] 

Trevor McKay edited comment on SPARK-24091 at 5/24/18 3:38 PM:
---

I had a similar situation in a project.

One way to handle this is to allow a user-specified ConfigMap to be mounted at 
another location, then in the image startup script (entrypoint.sh) check for a 
non-empty directory. If there are files there, copy them one by one to 
$SPARK_HOME/conf. This gives you an override semantic for just those files 
included.  The ConfigMap volume can be marked as optional so that if it's not 
created the container will still start.

 

Another way to do this of course would be to somehow allow the user to modify 
the contents of the original ConfigMap mentioned above, but this might be tough 
depending on how it's named, when it's generated, etc etc. 

 For reference, this is how you make a ConfigMap volume optional (borrowing an 
example from OpenShift documentation)

{code:yaml}
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
- name: test-container
  image: gcr.io/google_containers/busybox
  command: [ "/bin/sh", "cat", "/etc/config/special.how" ]
  volumeMounts:
  - name: config-volume
mountPath: /etc/config
volumes:
  - name: config-volume
configMap:
  name: special-config
  optional: true
restartPolicy: Never{code}
 


was (Author: tmckay):
I had a similar situation in a project.

One way to handle this is to allow a user-specified ConfigMap to be mounted at 
another location, then in the image startup script (entrypoint.sh) check for a 
non-empty directory. If there are files there, copy them one by one to 
$SPARK_HOME/conf. This gives you an override semantic for just those files 
included.  The ConfigMap volume can be marked as optional so that if it's not 
created the container will still start.

 

Another way to do this of course would be to somehow allow the user to modify 
the contents of the original ConfigMap mentioned above, but this might be tough 
depending on how it's named, when it's generated, etc etc. 

 

 

> Internally used ConfigMap prevents use of user-specified ConfigMaps carrying 
> Spark configs files
> 
>
> Key: SPARK-24091
> URL: https://issues.apache.org/jira/browse/SPARK-24091
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> The recent PR [https://github.com/apache/spark/pull/20669] for removing the 
> init-container introduced an internally used ConfigMap carrying Spark 
> configuration properties in a file for the driver. This ConfigMap gets 
> mounted under {{$SPARK_HOME/conf}} and the environment variable 
> {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much 
> prevents users from mounting their own ConfigMaps that carry custom Spark 
> configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}} and 
> leaves users with only the option of building custom images. IMO, it is very 
> useful to support mounting user-specified ConfigMaps for custom Spark 
> configuration files. This is worth further discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24378:
-
Target Version/s:   (was: 2.3.0)

> Incorrect examples for date_trunc function in spark 2.3.0
> -
>
> Key: SPARK-24378
> URL: https://issues.apache.org/jira/browse/SPARK-24378
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Prakhar Gupta
>Priority: Major
>
> Within the Spark documentation for Spark 2.3.0, the listed examples are incorrect.
> date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit 
> specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", 
> "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", 
> "WEEK", "QUARTER"]
> *Examples:*
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 
> > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }}
>  
> {{Examples should be date_trunc(format, value)}}
> {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Wenbo Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489233#comment-16489233
 ] 

Wenbo Zhao edited comment on SPARK-24373 at 5/24/18 3:37 PM:
-

I turned on the log trace of RuleExecutor and found the following in my example of 
df1.count() after cache:

 
{code:java}
scala> df1.groupBy().count().explain(true)

=== Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$HandleNullInputsForUDF ===
Aggregate [count(1) AS count#40L] Aggregate [count(1) AS count#40L]
!+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
value1#5L] +- Project [value#2L, if (isnull(value#2L)) null else if 
(isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L] +- 
SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L] +- ExternalRDD [obj#1L]
{code}
 that is, the node
{code:java}
Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
value1#5L]{code}
becomes 
{code:java}
Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null 
else UDF(value#2L) AS value1#5L]{code}
 

This will cause a miss in the CacheManager? That could be confirmed by the log trace of the 
ColumnPruning rule applied later.

My question is: is that supposed to be protected by AnalysisBarrier?


was (Author: wbzhao):
I turned on the log trace of RuleExecutor and found the following in my example of 
df1.count() after cache:

 
{code:java}
scala> df1.groupBy().count().explain(true)

=== Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$HandleNullInputsForUDF ===
Aggregate [count(1) AS count#40L] Aggregate [count(1) AS count#40L]
!+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
value1#5L] +- Project [value#2L, if (isnull(value#2L)) null else if 
(isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L] +- 
SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L] +- ExternalRDD [obj#1L]
{code}
 that is, the node
{code:java}
Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
value1#5L]{code}
becomes 
{code:java}
Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null 
else UDF(value#2L) AS value1#5L]{code}
 

This will cause a miss in the CacheManager? That could be confirmed by the log trace of the 
ColumnPruning rule applied later.

My question is: is that supposed to be protected by AnalysisBarrier?

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> in Spark 2.2, you could see it prints "". 
> In the above example, when you do explain, you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> 

[jira] [Updated] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24378:
-
Fix Version/s: (was: 2.3.0)

> Incorrect examples for date_trunc function in spark 2.3.0
> -
>
> Key: SPARK-24378
> URL: https://issues.apache.org/jira/browse/SPARK-24378
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Prakhar Gupta
>Priority: Major
>
> Within the Spark documentation for Spark 2.3.0, the listed examples are incorrect.
> date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit 
> specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", 
> "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", 
> "WEEK", "QUARTER"]
> *Examples:*
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 
> > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > 
> SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }}
>  
> {{Examples should be date_trunc(format, value)}}
> {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Wenbo Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489233#comment-16489233
 ] 

Wenbo Zhao commented on SPARK-24373:


I turned on the log trace of RuleExecutor and found the following in my example of 
df1.count() after cache:

 
{code:java}
scala> df1.groupBy().count().explain(true)

=== Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$HandleNullInputsForUDF ===
Aggregate [count(1) AS count#40L] Aggregate [count(1) AS count#40L]
!+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
value1#5L] +- Project [value#2L, if (isnull(value#2L)) null else if 
(isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L] +- 
SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L] +- ExternalRDD [obj#1L]
{code}
 that is, the node
{code:java}
Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
value1#5L]{code}
becomes 
{code:java}
Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null 
else UDF(value#2L) AS value1#5L]{code}
 

This will cause a miss in the CacheManager? That could be confirmed by the log trace of the 
ColumnPruning rule applied later.

My question is: is that supposed to be protected by AnalysisBarrier?

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> in Spark 2.2, you could see it prints "". 
> In the above example, when you do explain, you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, 

[jira] [Created] (SPARK-24382) Spark Structured Streaming aggregation on old timestamp data

2018-05-24 Thread Karthik (JIRA)
Karthik created SPARK-24382:
---

 Summary: Spark Structured Streaming aggregation on old timestamp 
data
 Key: SPARK-24382
 URL: https://issues.apache.org/jira/browse/SPARK-24382
 Project: Spark
  Issue Type: Question
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Karthik


I am trying to aggregate the count of records every 10 seconds using 
Structured Streaming for the following incoming Kafka data:
{code:java}
{ 
"ts2" : "2018/05/01 00:02:50.041", 
"serviceGroupId" : "123", 
"userId" : "avv-0", 
"stream" : "", 
"lastUserActivity" : "00:02:50", 
"lastUserActivityCount" : "0" 
} 
{ 
"ts2" : "2018/05/01 00:09:02.079", 
"serviceGroupId" : "123", 
"userId" : "avv-0", 
"stream" : "", 
"lastUserActivity" : "00:09:02", 
"lastUserActivityCount" : "0" 
} 
{ 
"ts2" : "2018/05/01 00:09:02.086", 
"serviceGroupId" : "123", 
"userId" : "avv-2", 
"stream" : "", 
"lastUserActivity" : "00:09:02", 
"lastUserActivityCount" : "0" 
}
{code}
With the following logic
{code:java}
val sdvTuneInsAgg1 = df 
 .withWatermark("ts2", "10 seconds") 
 .groupBy(window(col("ts2"),"10 seconds")) 
 .agg(count("*") as "count") 
 .as[CountMetric1]

val query1 = sdvTuneInsAgg1.writeStream
.format("console")
.foreach(writer)
.start()
{code}
and I do not see any records inside the writer. The only anomaly is that 
the current date is 2018/05/24 while the records I am processing (ts2) have 
old dates. Will aggregation / count work in this scenario?
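
If it helps: the watermark advances with the maximum *event time* observed in the stream, not with the wall clock, so old ts2 values by themselves do not stop the aggregation; the watermark/window column just has to be a real timestamp. A rough sketch reusing the {{df}} from the snippet above (the to_timestamp step and its pattern are assumptions based on the sample JSON):

{code:java}
import org.apache.spark.sql.functions.{col, count, to_timestamp, window}

// Assumption: ts2 arrives as a string like "2018/05/01 00:02:50.041" and needs to be
// parsed into a TimestampType column before withWatermark/window can use it.
val withEventTime = df.withColumn("ts2", to_timestamp(col("ts2"), "yyyy/MM/dd HH:mm:ss.SSS"))

val sdvTuneInsAgg1 = withEventTime
  .withWatermark("ts2", "10 seconds")              // watermark trails the max ts2 seen so far
  .groupBy(window(col("ts2"), "10 seconds"))
  .agg(count("*").as("count"))
{code}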



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files

2018-05-24 Thread Trevor McKay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489186#comment-16489186
 ] 

Trevor McKay commented on SPARK-24091:
--

I had a similar situation in a project.

One way to handle this is to allow a user-specified ConfigMap to be mounted at 
another location, then in the image startup script (entrypoint.sh) check for a 
non-empty directory. If there are files there, copy them one by one to 
$SPARK_HOME/conf. This gives you an override semantic for just those files 
included.  The ConfigMap volume can be marked as optional so that if it's not 
created the container will still start.

 

Another way to do this of course would be to somehow allow the user to modify 
the contents of the original ConfigMap mentioned above, but this might be tough 
depending on how it's named, when it's generated, etc etc. 

 

 

> Internally used ConfigMap prevents use of user-specified ConfigMaps carrying 
> Spark configs files
> 
>
> Key: SPARK-24091
> URL: https://issues.apache.org/jira/browse/SPARK-24091
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> The recent PR [https://github.com/apache/spark/pull/20669] for removing the 
> init-container introduced an internally used ConfigMap carrying Spark 
> configuration properties in a file for the driver. This ConfigMap gets 
> mounted under {{$SPARK_HOME/conf}} and the environment variable 
> {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much 
> prevents users from mounting their own ConfigMaps that carry custom Spark 
> configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}} and 
> leaves users with only the option of building custom images. IMO, it is very 
> useful to support mounting user-specified ConfigMaps for custom Spark 
> configuration files. This is worth further discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive

2018-05-24 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489185#comment-16489185
 ] 

Jorge Machado edited comment on SPARK-17592 at 5/24/18 3:11 PM:


I'm hitting the same issue, I'm afraid, but in a slightly different way. When I have 
a dataframe (that comes from an Oracle DB) as parquet, I can see in the logs that 
a field is being saved as an integer:

 

{ "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", 
"nullable" : true, "metadata" : { } },... 

 

on Hue (which reads from Hive) I see:

!image-2018-05-24-17-10-24-515.png!


was (Author: jomach):
I'm hitting the same issue, I'm afraid, but in a slightly different way. When I have 
a dataframe as parquet, I can see in the logs that a field is being saved as an 
integer:

 

{ "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", 
"nullable" : true, "metadata" : { } },... 

 

on Hue (which reads from Hive) I see:

!image-2018-05-24-17-10-24-515.png!

> SQL: CAST string as INT inconsistent with Hive
> --
>
> Key: SPARK-17592
> URL: https://issues.apache.org/jira/browse/SPARK-17592
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>Priority: Major
> Attachments: image-2018-05-24-17-10-24-515.png
>
>
> Hello,
> there seems to be an inconsistency between Spark and Hive when casting a 
> string into an Int. 
> With Hive:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 0
> select cast("0.6" as INT) ;
> > 0
> {code}
> With Spark-SQL:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 1
> select cast("0.6" as INT) ;
> > 1
> {code}
> Hive seems to perform a floor(string.toDouble), while Spark seems to perform 
> a round(string.toDouble)
> I'm not sure there is any ISO standard for this; MySQL has the same behavior 
> as Hive, while PostgreSQL performs a string.toInt and throws a 
> NumberFormatException.
> Personally I think Hive is right, hence my posting this here.
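
For anyone needing Hive-like results in the meantime, a hedged workaround sketch (not from this ticket): cast the string to double first and floor it, instead of casting the string straight to int. The direct-cast values in the comments are what the report describes for Spark 2.0 and may differ by version; the floored variant gives Hive's 0, 0, 0.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, floor}

val spark = SparkSession.builder().master("local[1]").appName("cast-floor").getOrCreate()
import spark.implicits._

val df = Seq("0.4", "0.5", "0.6").toDF("s")
df.select(
  col("s").cast("int").as("direct_cast"),                     // per the report: 0, 1, 1 on Spark 2.0
  floor(col("s").cast("double")).cast("int").as("hive_like")  // floor of the double: 0, 0, 0
).show()

spark.stop()
{code}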



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive

2018-05-24 Thread Jorge Machado (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-17592:
--
Attachment: image-2018-05-24-17-10-24-515.png

> SQL: CAST string as INT inconsistent with Hive
> --
>
> Key: SPARK-17592
> URL: https://issues.apache.org/jira/browse/SPARK-17592
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>Priority: Major
> Attachments: image-2018-05-24-17-10-24-515.png
>
>
> Hello,
> there seems to be an inconsistency between Spark and Hive when casting a 
> string into an Int. 
> With Hive:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 0
> select cast("0.6" as INT) ;
> > 0
> {code}
> With Spark-SQL:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 1
> select cast("0.6" as INT) ;
> > 1
> {code}
> Hive seems to perform a floor(string.toDouble), while Spark seems to perform 
> a round(string.toDouble)
> I'm not sure there is any ISO standard for this; MySQL has the same behavior 
> as Hive, while PostgreSQL performs a string.toInt and throws a 
> NumberFormatException.
> Personally I think Hive is right, hence my posting this here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive

2018-05-24 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489185#comment-16489185
 ] 

Jorge Machado commented on SPARK-17592:
---

I'm hitting the same issue, I'm afraid, but in a slightly different way. When I have 
a dataframe as parquet, I can see in the logs that a field is being saved as an 
integer:

 

{ "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", 
"nullable" : true, "metadata" : { } },... 

 

on Hue (which reads from Hive) I see:

!image-2018-05-24-17-10-24-515.png!

> SQL: CAST string as INT inconsistent with Hive
> --
>
> Key: SPARK-17592
> URL: https://issues.apache.org/jira/browse/SPARK-17592
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>Priority: Major
> Attachments: image-2018-05-24-17-10-24-515.png
>
>
> Hello,
> there seems to be an inconsistency between Spark and Hive when casting a 
> string into an Int. 
> With Hive:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 0
> select cast("0.6" as INT) ;
> > 0
> {code}
> With Spark-SQL:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 1
> select cast("0.6" as INT) ;
> > 1
> {code}
> Hive seems to perform a floor(string.toDouble), while Spark seems to perform 
> a round(string.toDouble)
> I'm not sure there is any ISO standard for this; MySQL has the same behavior 
> as Hive, while PostgreSQL performs a string.toInt and throws a 
> NumberFormatException.
> Personally I think Hive is right, hence my posting this here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Andreas Weise (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489129#comment-16489129
 ] 

Andreas Weise commented on SPARK-24373:
---

We are also facing increased runtime duration for our SQL jobs (after upgrading 
from 2.2.1 to 2.3.0), but didn't trace it down to the root cause. This issue 
sounds reasonable to me, as we are also using cache() + count() quite often.

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> in Spark 2.2, you could see it prints "". 
> In the above example, when you do explain, you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24381) Improve Unit Test Coverage of NOT IN subqueries

2018-05-24 Thread Miles Yucht (JIRA)
Miles Yucht created SPARK-24381:
---

 Summary: Improve Unit Test Coverage of NOT IN subqueries
 Key: SPARK-24381
 URL: https://issues.apache.org/jira/browse/SPARK-24381
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.2
Reporter: Miles Yucht


Today, the unit test coverage for NOT IN queries in SubquerySuite is somewhat 
lacking. There are a couple test cases that exist, but it isn't necessarily 
clear that those tests cover all of the subcomponents of null-aware anti joins, 
i.e. where the subquery returns a null value, if specific columns of either 
relation are null, etc. Also, it is somewhat difficult for a newcomer to 
understand the intended behavior of a null-aware anti join without great 
effort. We should make sure we have proper coverage as well as improve the 
documentation of this particular subquery, especially with respect to null 
behavior.
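
A minimal sketch (not from the ticket) of the null cases such tests need to pin down; because the subquery result contains a NULL, the NOT IN predicate can never evaluate to TRUE and the first query returns no rows:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("not-in-nulls").getOrCreate()

spark.sql("CREATE OR REPLACE TEMPORARY VIEW t1 AS SELECT * FROM VALUES (1), (2), (CAST(NULL AS INT)) AS v(a)")
spark.sql("CREATE OR REPLACE TEMPORARY VIEW t2 AS SELECT * FROM VALUES (2), (CAST(NULL AS INT)) AS v(b)")

// t2 contains a NULL, so "a NOT IN (2, NULL)" is FALSE for a = 2 and NULL for every other row:
// the predicate is never TRUE and the result is empty.
spark.sql("SELECT * FROM t1 WHERE a NOT IN (SELECT b FROM t2)").show()

// Filtering the NULL out of the subquery restores the intuitive result: only a = 1 survives.
spark.sql("SELECT * FROM t1 WHERE a NOT IN (SELECT b FROM t2 WHERE b IS NOT NULL)").show()

spark.stop()
{code}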



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24329) Remove comments filtering before parsing of CSV files

2018-05-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489100#comment-16489100
 ] 

Hyukjin Kwon commented on SPARK-24329:
--

Fixed by explicitly adding a test where the code is valid.

> Remove comments filtering before parsing of CSV files
> -
>
> Key: SPARK-24329
> URL: https://issues.apache.org/jira/browse/SPARK-24329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> Comments and whitespace filtering has been performed by uniVocity parser 
> already according to parser settings:
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180
> It is not necessary to do the same before parsing. Need to inspect all places 
> where the filterCommentAndEmpty method is called, and remove the former one 
> if it duplicates filtering of uniVocity parser.
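
For context, a small standalone sketch (the settings shown are assumptions mirroring what CSVOptions configures) of the uniVocity parser dropping comment and empty lines on its own, which is what makes a separate pre-parse filter redundant:

{code:java}
import java.io.StringReader
import scala.collection.JavaConverters._

import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
settings.getFormat.setComment('#')   // lines starting with '#' are treated as comments
settings.setSkipEmptyLines(true)     // blank lines are dropped by the parser itself

val parser = new CsvParser(settings)
val rows = parser.parseAll(new StringReader("# a comment line\na,1\n\nb,2\n")).asScala

// Only the two data rows survive; the comment and the empty line never reach the caller.
rows.foreach(r => println(r.mkString(",")))
{code}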



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24329) Remove comments filtering before parsing of CSV files

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24329.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21394
[https://github.com/apache/spark/pull/21394]

> Remove comments filtering before parsing of CSV files
> -
>
> Key: SPARK-24329
> URL: https://issues.apache.org/jira/browse/SPARK-24329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> Comments and whitespace filtering has been performed by uniVocity parser 
> already according to parser settings:
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180
> It is not necessary to do the same before parsing. Need to inspect all places 
> where the filterCommentAndEmpty method is called, and remove the former one 
> if it duplicates filtering of uniVocity parser.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24329) Remove comments filtering before parsing of CSV files

2018-05-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24329:


Assignee: Maxim Gekk

> Remove comments filtering before parsing of CSV files
> -
>
> Key: SPARK-24329
> URL: https://issues.apache.org/jira/browse/SPARK-24329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> Comments and whitespace filtering has been performed by uniVocity parser 
> already according to parser settings:
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180
> It is not necessary to do the same before parsing. Need to inspect all places 
> where the filterCommentAndEmpty method is called, and remove the former one 
> if it duplicates filtering of uniVocity parser.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489060#comment-16489060
 ] 

Li Jin commented on SPARK-24373:


Here is a reproduction as a unit test:
{code:java}
test("cache and count") {
  var evalCount = 0
  val myUDF = udf((x: String) => { evalCount += 1; "result" })
  val df = spark.range(0, 1).select(myUDF($"id"))
  df.cache()
  df.count()
  assert(evalCount === 1)

  df.count()
  assert(evalCount === 1)
}
{code}

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> in Spark 2.2, you could see it prints "". 
> In the above example, when you do explain, you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24380) argument quoting/escaping broken

2018-05-24 Thread paul mackles (JIRA)
paul mackles created SPARK-24380:


 Summary: argument quoting/escaping broken
 Key: SPARK-24380
 URL: https://issues.apache.org/jira/browse/SPARK-24380
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Mesos
Affects Versions: 2.3.0, 2.2.0
Reporter: paul mackles
 Fix For: 2.4.0


When a configuration property contains shell characters that require quoting, 
the Mesos cluster scheduler generates the spark-submit argument like so:
{code:java}
--conf "spark.mesos.executor.docker.parameters="label=logging=|foo|""{code}
Note the quotes around the property value as well as the key=value pair. When 
using docker, this breaks the spark-submit command and causes the "|" to be 
interpreted as an actual shell PIPE. Spaces, semi-colons, etc also cause issues.

Although I haven't tried, I suspect this is also a potential security issue in 
that someone could exploit it to run arbitrary code on the host.

My patch is pretty minimal and just removes the outer quotes around the 
key=value pair, resulting in something like:
{code:java}
--conf spark.mesos.executor.docker.parameters="label=logging=|foo|"{code}
A more extensive fix might try wrapping the entire key=value pair in single 
quotes but I was concerned about backwards compatibility with that change.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24380) argument quoting/escaping broken in mesos cluster scheduler

2018-05-24 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-24380:
-
Summary: argument quoting/escaping broken in mesos cluster scheduler  (was: 
argument quoting/escaping broken)

> argument quoting/escaping broken in mesos cluster scheduler
> ---
>
> Key: SPARK-24380
> URL: https://issues.apache.org/jira/browse/SPARK-24380
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Affects Versions: 2.2.0, 2.3.0
>Reporter: paul mackles
>Priority: Critical
> Fix For: 2.4.0
>
>
> When a configuration property contains shell characters that require quoting, 
> the Mesos cluster scheduler generates the spark-submit argument like so:
> {code:java}
> --conf "spark.mesos.executor.docker.parameters="label=logging=|foo|""{code}
> Note the quotes around the property value as well as the key=value pair. When 
> using docker, this breaks the spark-submit command and causes the "|" to be 
> interpreted as an actual shell PIPE. Spaces, semi-colons, etc also cause 
> issues.
> Although I haven't tried, I suspect this is also a potential security issue 
> in that someone could exploit it to run arbitrary code on the host.
> My patch is pretty minimal and just removes the outer quotes around the 
> key=value pair, resulting in something like:
> {code:java}
> --conf spark.mesos.executor.docker.parameters="label=logging=|foo|"{code}
> A more extensive fix might try wrapping the entire key=value pair in single 
> quotes but I was concerned about backwards compatibility with that change.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24230) With Parquet 1.10 upgrade has errors in the vectorized reader

2018-05-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24230.
-
   Resolution: Fixed
 Assignee: Ryan Blue
Fix Version/s: 2.4.0
   2.3.1

> With Parquet 1.10 upgrade has errors in the vectorized reader
> -
>
> Key: SPARK-24230
> URL: https://issues.apache.org/jira/browse/SPARK-24230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ian O Connell
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> When reading some parquet files you can get an error like:
> java.io.IOException: expecting more rows but reached last block. Read 0 out 
> of 1194236
> This happens when looking for a needle that's pretty rare in a large haystack.
>  
> The issue here I believe is that the total row count is calculated at
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L229]
>  
> But we pass the blocks we filtered via 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups
> to the ParquetFileReader constructor.
>  
> However the ParquetFileReader constructor will filter the list of blocks 
> again using
>  
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L737]
>  
> If a block is filtered out by the latter method and not the former, the 
> vectorized reader will believe it should see more rows than it will.
> The fix I used locally is pretty straightforward:
> {code:java}
> for (BlockMetaData block : blocks) {
> this.totalRowCount += block.getRowCount();
> }
> {code}
> goes to
> {code:java}
> this.totalRowCount = this.reader.getRecordCount();
> {code}
> [~rdblue], do you know if this sounds right? The second filter method in 
> ParquetFileReader might filter out more blocks, leaving the count off.
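As a toy illustration (not parquet-mr code; the block counts are made up), the
mismatch boils down to summing row counts over a superset of the blocks the
reader will actually return:

{code:java}
import java.util.Arrays;
import java.util.List;

public class RowCountMismatchSketch {
  public static void main(String[] args) {
    // Row counts of the blocks that survived the first (RowGroupFilter) pass.
    List<Long> afterFirstFilter = Arrays.asList(1194236L, 500000L);
    // Suppose the ParquetFileReader constructor drops one more block.
    List<Long> readerWillReturn = Arrays.asList(500000L);

    long expected = afterFirstFilter.stream().mapToLong(Long::longValue).sum();
    long actual = readerWillReturn.stream().mapToLong(Long::longValue).sum();

    // The vectorized reader waits for "expected" rows but only ever gets
    // "actual", which matches the "expecting more rows" error above.
    System.out.println("expected=" + expected + ", actual=" + actual);
  }
}
{code}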



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24379) BroadcastExchangeExec should catch SparkOutOfMemory and re-throw SparkFatalException, which wraps SparkOutOfMemory inside.

2018-05-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24379:


Assignee: (was: Apache Spark)

> BroadcastExchangeExec should catch SparkOutOfMemory and re-throw 
> SparkFatalException, which wraps SparkOutOfMemory inside.
> --
>
> Key: SPARK-24379
> URL: https://issues.apache.org/jira/browse/SPARK-24379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: jin xing
>Priority: Major
>
> After SPARK-22827, Spark no longer fails the entire executor but only the 
> task that hits SparkOutOfMemoryError. The current BroadcastExchangeExec, 
> however, try-catches OutOfMemoryError. Consider the scenario below:
>  # SparkOutOfMemoryError (a subclass of OutOfMemoryError) is thrown in 
> scala.concurrent.Future;
>  # the SparkOutOfMemoryError is caught, and an OutOfMemoryError is wrapped in 
> SparkFatalException and re-thrown;
>  # ThreadUtils.awaitResult catches the SparkFatalException, and an 
> OutOfMemoryError is thrown;
>  # the OutOfMemoryError goes to 
> SparkUncaughtExceptionHandler.uncaughtException and the executor fails.
> So it makes more sense to catch SparkOutOfMemory and re-throw a 
> SparkFatalException that wraps the SparkOutOfMemory inside.
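A standalone sketch of the wrap-and-rethrow pattern being proposed (the
exception classes below are local stand-ins, not the real Spark types):

{code:java}
public class WrapOomSketch {
  // Local stand-ins for the real Spark classes (an assumption, not Spark source).
  static class SparkOutOfMemoryError extends OutOfMemoryError {
    SparkOutOfMemoryError(String msg) { super(msg); }
  }

  static class SparkFatalException extends RuntimeException {
    SparkFatalException(Throwable cause) { super(cause); }
  }

  static void buildBroadcastRelation() {
    throw new SparkOutOfMemoryError("cannot build broadcast relation");
  }

  public static void main(String[] args) {
    try {
      buildBroadcastRelation();
    } catch (SparkOutOfMemoryError oom) {
      // Wrap the task-level OOM so the code awaiting the future sees a
      // SparkFatalException instead of a raw OutOfMemoryError, which would
      // otherwise reach SparkUncaughtExceptionHandler and kill the executor.
      throw new SparkFatalException(oom);
    }
  }
}
{code}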



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24379) BroadcastExchangeExec should catch SparkOutOfMemory and re-throw SparkFatalException, which wraps SparkOutOfMemory inside.

2018-05-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488860#comment-16488860
 ] 

Apache Spark commented on SPARK-24379:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/21424

> BroadcastExchangeExec should catch SparkOutOfMemory and re-throw 
> SparkFatalException, which wraps SparkOutOfMemory inside.
> --
>
> Key: SPARK-24379
> URL: https://issues.apache.org/jira/browse/SPARK-24379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: jin xing
>Priority: Major
>
> After SPARK-22827, Spark no longer fails the entire executor but only the 
> task that hits SparkOutOfMemoryError. The current BroadcastExchangeExec, 
> however, try-catches OutOfMemoryError. Consider the scenario below:
>  # SparkOutOfMemoryError (a subclass of OutOfMemoryError) is thrown in 
> scala.concurrent.Future;
>  # the SparkOutOfMemoryError is caught, and an OutOfMemoryError is wrapped in 
> SparkFatalException and re-thrown;
>  # ThreadUtils.awaitResult catches the SparkFatalException, and an 
> OutOfMemoryError is thrown;
>  # the OutOfMemoryError goes to 
> SparkUncaughtExceptionHandler.uncaughtException and the executor fails.
> So it makes more sense to catch SparkOutOfMemory and re-throw a 
> SparkFatalException that wraps the SparkOutOfMemory inside.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24379) BroadcastExchangeExec should catch SparkOutOfMemory and re-throw SparkFatalException, which wraps SparkOutOfMemory inside.

2018-05-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24379:


Assignee: Apache Spark

> BroadcastExchangeExec should catch SparkOutOfMemory and re-throw 
> SparkFatalException, which wraps SparkOutOfMemory inside.
> --
>
> Key: SPARK-24379
> URL: https://issues.apache.org/jira/browse/SPARK-24379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: jin xing
>Assignee: Apache Spark
>Priority: Major
>
> After SPARK-22827, Spark no longer fails the entire executor but only the 
> task that hits SparkOutOfMemoryError. The current BroadcastExchangeExec, 
> however, try-catches OutOfMemoryError. Consider the scenario below:
>  # SparkOutOfMemoryError (a subclass of OutOfMemoryError) is thrown in 
> scala.concurrent.Future;
>  # the SparkOutOfMemoryError is caught, and an OutOfMemoryError is wrapped in 
> SparkFatalException and re-thrown;
>  # ThreadUtils.awaitResult catches the SparkFatalException, and an 
> OutOfMemoryError is thrown;
>  # the OutOfMemoryError goes to 
> SparkUncaughtExceptionHandler.uncaughtException and the executor fails.
> So it makes more sense to catch SparkOutOfMemory and re-throw a 
> SparkFatalException that wraps the SparkOutOfMemory inside.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0

2018-05-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488850#comment-16488850
 ] 

Apache Spark commented on SPARK-24378:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/21423

> Incorrect examples for date_trunc function in spark 2.3.0
> -
>
> Key: SPARK-24378
> URL: https://issues.apache.org/jira/browse/SPARK-24378
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Prakhar Gupta
>Priority: Major
> Fix For: 2.3.0
>
>
> In the Spark 2.3.0 documentation, the listed examples for date_trunc are 
> incorrect.
> date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit 
> specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", 
> "YYYY", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", 
> "WEEK", "QUARTER"]
> *Examples (as currently documented):*
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00}}
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00}}
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00}}
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00}}
> The examples should instead follow the documented signature, 
> date_trunc(fmt, ts), e.g.:
> {{SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359');}}
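As a quick sanity check of the correct argument order, here is a minimal
sketch (assuming a local Spark 2.3+ dependency on the classpath; this is an
illustration, not part of the documentation fix itself):

{code:java}
import org.apache.spark.sql.SparkSession;

public class DateTruncCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .appName("date_trunc check")
        .getOrCreate();
    // Format first, value second: shows 2015-01-01 00:00:00
    spark.sql("SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359')").show(false);
    spark.stop();
  }
}
{code}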



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0

2018-05-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24378:


Assignee: Apache Spark

> Incorrect examples for date_trunc function in spark 2.3.0
> -
>
> Key: SPARK-24378
> URL: https://issues.apache.org/jira/browse/SPARK-24378
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Prakhar Gupta
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.3.0
>
>
> In the Spark 2.3.0 documentation, the listed examples for date_trunc are 
> incorrect.
> date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit 
> specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", 
> "YYYY", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", 
> "WEEK", "QUARTER"]
> *Examples (as currently documented):*
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00}}
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00}}
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00}}
> {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00}}
> The examples should instead follow the documented signature, 
> date_trunc(fmt, ts), e.g.:
> {{SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359');}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


