[jira] [Resolved] (SPARK-24244) Parse only required columns of CSV file
[ https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-24244.
-----------------------------
    Resolution: Fixed

> Parse only required columns of CSV file
> ---------------------------------------
>
>                 Key: SPARK-24244
>                 URL: https://issues.apache.org/jira/browse/SPARK-24244
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 2.4.0
>
> The uniVocity parser allows specifying only the required columns, by name or by index, for parsing:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in the other columns.
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> *UnivocityParser* needs to be modified to extract only the columns needed by requiredSchema.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
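The column-pruning idea behind `selectIndexes()` can be illustrated without uniVocity or Spark. A minimal stdlib-only Python sketch (the helper name `parse_selected` is hypothetical, not part of any library): parse every row, but materialize only the selected indexes, in the order requested.

```python
import csv
import io

def parse_selected(text, indexes):
    """Parse CSV text, materializing only the columns at `indexes`,
    in the given order -- the idea behind uniVocity's selectIndexes()."""
    out = []
    for row in csv.reader(io.StringIO(text)):
        # Values at non-selected indexes are read but never kept.
        out.append([row[i] for i in indexes])
    return out

rows = parse_selected("a,b,c,d,e\n1,2,3,4,5\n", [4, 0, 1])
print(rows)  # [['e', 'a', 'b'], ['5', '1', '2']]
```

A real parser can go further and skip tokenizing the unwanted fields entirely, which is what makes the Spark-side change worthwhile.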
[jira] [Updated] (SPARK-24244) Parse only required columns of CSV file
[ https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-24244:
----------------------------
    Priority: Major  (was: Minor)
[jira] [Resolved] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
[ https://issues.apache.org/jira/browse/SPARK-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-24368.
-----------------------------
       Resolution: Fixed
         Assignee: Maxim Gekk
    Fix Version/s: 2.4.0

> Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-24368
>                 URL: https://issues.apache.org/jira/browse/SPARK-24368
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 2.4.0
>            Reporter: Xiao Li
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 2.4.0
>
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed very often.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/
> {code}
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** ABORTED *** (1 millisecond)
> [info]   java.lang.IllegalStateException: LiveListenerBus is stopped.
> [info]   at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
> [info]   at org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
> [info]   at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
> [info]   at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at scala.Option.getOrElse(Option.scala:121)
> [info]   at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
> [info]   at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
> [info]   at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
> [info]   at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
> [info]   at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
> [info]   at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at scala.Option.map(Option.scala:146)
> [info]   at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
> [info]   at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
> [info]   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:84)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:40)
> [info]   at org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.<init>(UnivocityParserSuite.scala:30)
> [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> [info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> [info]   at java.lang.Class.newInstance(Class.java:442)
> [info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)
> {code}
[jira] [Assigned] (SPARK-24367) Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY
[ https://issues.apache.org/jira/browse/SPARK-24367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-24367:
------------------------------------
    Assignee: Gengliang Wang

> Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-24367
>                 URL: https://issues.apache.org/jira/browse/SPARK-24367
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Gengliang Wang
>            Assignee: Gengliang Wang
>            Priority: Major
>             Fix For: 2.4.0
>
> In the current Parquet version, the conf ENABLE_JOB_SUMMARY is deprecated.
> When writing to Parquet files, the warning message "WARN org.apache.parquet.hadoop.ParquetOutputFormat: Setting parquet.enable.summary-metadata is deprecated, please use parquet.summary.metadata.level" keeps showing up.
> From https://github.com/apache/parquet-mr/blame/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L164 we can see that we should use JOB_SUMMARY_LEVEL instead.
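The migration boils down to translating the deprecated boolean key into the new level key. A stdlib-only Python sketch of that translation rule, as I read parquet-mr's handling (the `true -> ALL`, `false -> NONE` mapping and the `ALL` default are my assumptions here, not quoted from the issue; `summary_level` is a hypothetical helper):

```python
def summary_level(conf):
    """Resolve the job-summary level from a dict of Hadoop-style conf keys.
    Prefers the new parquet.summary.metadata.level key and falls back to
    the deprecated boolean parquet.enable.summary-metadata."""
    level = conf.get("parquet.summary.metadata.level")
    if level is not None:
        return level.upper()
    # Deprecated flag: assumed mapping true -> ALL, false -> NONE.
    old = conf.get("parquet.enable.summary-metadata")
    if old is not None:
        return "ALL" if old.lower() == "true" else "NONE"
    return "ALL"  # assumed default when neither key is set

print(summary_level({"parquet.enable.summary-metadata": "false"}))  # NONE
```

Writing only the new key avoids the deprecation warning quoted in the description.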
[jira] [Resolved] (SPARK-24367) Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY
[ https://issues.apache.org/jira/browse/SPARK-24367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-24367.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 21411
https://github.com/apache/spark/pull/21411
[jira] [Resolved] (SPARK-23929) pandas_udf schema mapped by position and not by name
[ https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-23929.
----------------------------------
    Resolution: Duplicate

> pandas_udf schema mapped by position and not by name
> ----------------------------------------------------
>
>                 Key: SPARK-23929
>                 URL: https://issues.apache.org/jira/browse/SPARK-23929
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>         Environment: PySpark
> Spark 2.3.0
>            Reporter: Omri
>            Priority: Major
>
> The struct returned by a pandas_udf should be mapped to the provided schema by name; currently it is mapped by position.
> Consider these two examples, where the only change is the order of the fields in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same, but they are different.
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+-------------------+
> | id|                  v|
> +---+-------------------+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+-------------------+
> {code}
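The mismatch in the report boils down to zipping returned columns to schema fields by position instead of by name. A stdlib-only illustration (no pandas or Spark involved) of the two mapping strategies:

```python
schema = ["v", "id"]                                  # declared field order
returned = {"id": [1, 1, 2], "v": [0.5, -0.5, 0.0]}   # UDF's column order

# Position-based mapping (the reported behaviour): the i-th returned
# column fills the i-th schema slot, regardless of its label.
by_position = dict(zip(schema, returned.values()))

# Name-based mapping (what the reporter expected):
by_name = {field: returned[field] for field in schema}

print(by_position)  # {'v': [1, 1, 2], 'id': [0.5, -0.5, 0.0]} -- labels crossed
print(by_name)      # {'v': [0.5, -0.5, 0.0], 'id': [1, 1, 2]}
```

When the declared order matches the returned order the two strategies agree, which is why the bug only surfaces once the field order is swapped.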
[jira] [Assigned] (SPARK-24235) create the top-of-task RDD sending rows to the remote buffer
[ https://issues.apache.org/jira/browse/SPARK-24235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24235:
------------------------------------
    Assignee: (was: Apache Spark)

> create the top-of-task RDD sending rows to the remote buffer
> ------------------------------------------------------------
>
>                 Key: SPARK-24235
>                 URL: https://issues.apache.org/jira/browse/SPARK-24235
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>            Reporter: Jose Torres
>            Priority: Major
>
> https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii
>
> Note that after https://github.com/apache/spark/pull/21239, this will need to be responsible for incrementing its task's EpochTracker.
[jira] [Commented] (SPARK-24235) create the top-of-task RDD sending rows to the remote buffer
[ https://issues.apache.org/jira/browse/SPARK-24235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490047#comment-16490047 ]

Apache Spark commented on SPARK-24235:
--------------------------------------

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/21428
[jira] [Assigned] (SPARK-24235) create the top-of-task RDD sending rows to the remote buffer
[ https://issues.apache.org/jira/browse/SPARK-24235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24235:
------------------------------------
    Assignee: Apache Spark
[jira] [Commented] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels
[ https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490022#comment-16490022 ]

Cristian Consonni commented on SPARK-24324:
-------------------------------------------

[~bryanc] said:
> As a workaround, you could write your UDF to make the pandas.DataFrame using an OrderedDict like so:

I confirm that using an OrderedDict solves the issue for me. Thank you all for your support!

> Pandas Grouped Map UserDefinedFunction mixes column labels
> ----------------------------------------------------------
>
>                 Key: SPARK-24324
>                 URL: https://issues.apache.org/jira/browse/SPARK-24324
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>         Environment: Python (using virtualenv):
> {noformat}
> $ python --version
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>
> Java:
> {noformat}
> $ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> {noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 16.04.4 LTS
> Release:        16.04
> Codename:       xenial
> {noformat}
>            Reporter: Cristian Consonni
>            Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's say that these are the data:
> {noformat}
> 
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform the views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour. So, if a page got 5 views at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>
> I have written a UDF called {{concat_hours}}.
> However, the function is mixing the columns and I am not sure what is going on. I wrote here a minimal complete example that reproduces the issue on my system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>                      StructField("page", StringType(), False),
>                      StructField("timestamp", TimestampType(), False),
>                      StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>                      header=False,
>                      schema=schema,
>                      timestampFormat="MMdd-HHmmss",
>                      sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
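The hour encoding described above (hour 0 -> 'A', hour 1 -> 'B', ..., each letter followed by that hour's view count) is easy to sketch in plain Python, independent of the PySpark reproduction. This is my stdlib-only reading of the transformation, not the reporter's actual `concat_hours` UDF:

```python
def concat_hours(hour_views):
    """Encode one (lang, page, day) group: a list of (hour, views)
    pairs becomes e.g. 'A5B7' -- 'A' for hour 0, 'B' for hour 1, ...,
    each letter immediately followed by that hour's view count."""
    return "".join(
        chr(ord("A") + hour) + str(views)
        for hour, views in sorted(hour_views)  # hours in chronological order
    )

# 5 views at 00:00 and 7 views at 01:00, given out of order:
print(concat_hours([(1, 7), (0, 5)]))  # A5B7
```

The bug report itself is about the grouped-map UDF's *output columns* being mixed up, not about this encoding; the OrderedDict workaround fixes the column-label mapping.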
[jira] [Created] (SPARK-24386) implement continuous processing coalesce(1)
Jose Torres created SPARK-24386:
-----------------------------------

             Summary: implement continuous processing coalesce(1)
                 Key: SPARK-24386
                 URL: https://issues.apache.org/jira/browse/SPARK-24386
             Project: Spark
          Issue Type: Sub-task
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jose Torres

[~marmbrus] suggested this as a good implementation checkpoint. If we do the shuffle reader and writer correctly, it should be easy to make a custom coalesce(1) execution for continuous processing using them, without having to implement the logic for shuffle writers finding out where shuffle readers are located. (The coalesce(1) can just get the RpcEndpointRef directly from the reader and pass it to the writers.)
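The shortcut described above — handing every writer a direct reference to the single reader's endpoint, so no reader-discovery protocol is needed — can be sketched as a toy in-process model. This is not Spark code; the class name and queue-based "endpoint" are illustrative assumptions:

```python
import queue

class SingleReaderEndpoint:
    """Toy stand-in for the single shuffle reader's RPC endpoint in a
    coalesce(1): every writer partition pushes rows here directly."""

    def __init__(self, num_writers):
        self.q = queue.Queue()
        self.remaining = num_writers  # writers yet to finish

    def send(self, row):
        self.q.put(("row", row))

    def writer_done(self):
        self.q.put(("done", None))

    def read_all(self):
        """Drain rows until every writer has signalled completion."""
        rows = []
        while self.remaining > 0:
            kind, row = self.q.get()
            if kind == "done":
                self.remaining -= 1
            else:
                rows.append(row)
        return rows

# Two "writer partitions" feed the one "reader partition":
ep = SingleReaderEndpoint(num_writers=2)
for r in (1, 2):
    ep.send(r)
ep.writer_done()
ep.send(3)
ep.writer_done()
print(ep.read_all())  # [1, 2, 3]
```

The general case (many readers) is what requires the discovery logic the issue wants to skip at this checkpoint.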
[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name
[ https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489972#comment-16489972 ]

Cristian Consonni commented on SPARK-23929:
-------------------------------------------

This bug was referenced in [issue SPARK-24324|https://issues.apache.org/jira/browse/SPARK-24324], which is a duplicate of this one. See also pull request [#21427|https://github.com/apache/spark/pull/21427].
[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted
[ https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489959#comment-16489959 ]

Lenin commented on SPARK-24383:
-------------------------------

Yes, it does have it.
```
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: 2018-05-24T22:58:35Z
  name: -driver-svc
  namespace: default
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: -driver
    uid: fcce2b12-5fa5-11e8-99a8-000d3afdbc16
  resourceVersion: "17895980"
  selfLink: /api/v1/namespaces/default/services/-driver-svc
  uid: fd01e9bc-5fa5-11e8-99a8-000d3afdbc16
spec:
  clusterIP: None
  ports:
  - name: driver-rpc-port
    port: 7078
    protocol: TCP
    targetPort: 7078
  - name: blockmanager
    port: 7079
    protocol: TCP
    targetPort: 7079
  selector:
    spark-app-selector: spark-c9e1543546bc431d9e083798efb9c870
    spark-role: driver
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```

> spark on k8s: "driver-svc" are not getting deleted
> --------------------------------------------------
>
>                 Key: SPARK-24383
>                 URL: https://issues.apache.org/jira/browse/SPARK-24383
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Lenin
>            Priority: Major
>
> When the driver pod exits, the "*driver-svc" services created for the driver are not cleaned up. This causes an accumulation of services in the k8s layer; at some point no more services can be created.
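The comment above is checking the precondition for Kubernetes garbage collection: a service whose `ownerReferences` points at the driver pod (with `controller: true`) should be cascade-deleted when that pod object is deleted. A plain-dict Python sketch of that check — not a Kubernetes client call, and `owned_by_pod` is a hypothetical helper:

```python
def owned_by_pod(svc, pod_name):
    """Return True if the service manifest (as a dict) lists the given
    pod as an owner, i.e. the k8s garbage collector can cascade-delete it."""
    refs = svc.get("metadata", {}).get("ownerReferences", [])
    return any(
        ref.get("kind") == "Pod" and ref.get("name") == pod_name
        for ref in refs
    )

svc = {"metadata": {"ownerReferences": [
    {"apiVersion": "v1", "controller": True, "kind": "Pod", "name": "-driver"},
]}}
print(owned_by_pod(svc, "-driver"))  # True
```

Note that the owner reference only triggers cleanup when the pod *object* is deleted, not when the pod merely completes, which is consistent with the accumulation described in the issue.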
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489944#comment-16489944 ]

Hossein Falaki commented on SPARK-24359:
----------------------------------------

Thank you guys for the feedback. I updated the SPIP and the design document to use snake_case everywhere. I also added a section to the design document to summarize the CRAN release strategy. We can write integration tests that run on Jenkins to detect when we need to re-publish SparkML to CRAN. CRAN tests will not include any integration tests that interact with the JVM.

> SPIP: ML Pipelines in R
> -----------------------
>
>                 Key: SPARK-24359
>                 URL: https://issues.apache.org/jira/browse/SPARK-24359
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 3.0.0
>            Reporter: Hossein Falaki
>            Priority: Major
>              Labels: SPIP
>         Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. Since Spark 1.5 the (new) SparkML API, which is based on [pipelines and parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], has matured significantly. It allows users to build and maintain complicated machine learning pipelines. A lot of this functionality is difficult to expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as part of Apache Spark. This new package will be built on top of SparkR's APIs to expose SparkML's pipeline APIs and functionality.
> *Why not SparkR?*
> The SparkR package contains ~300 functions. Many of these shadow functions in base and other popular CRAN packages. We think adding more functions to SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R users. sparklyr includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive. Also, we propose a code-gen approach for this package to minimize the work needed to expose future MLlib API, but sparklyr's API is manually written.
> h1. Target Personas
> * Existing SparkR users who need a more flexible SparkML API
> * R users (data scientists, statisticians) who wish to build Spark ML pipelines in R
> h1. Goals
> * R users can install SparkML from CRAN
> * R users will be able to import SparkML independently from SparkR
> * After setting up a Spark session, R users can
> ** create a pipeline by chaining individual components and specifying their parameters
> ** tune a pipeline in parallel, taking advantage of Spark
> ** inspect a pipeline's parameters and evaluation metrics
> ** repeatedly apply a pipeline
> * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers
> h1. Non-Goals
> * Adding new algorithms to the SparkML R package which do not exist in Scala
> * Parallelizing existing CRAN packages
> * Changing the existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following list of priorities. The API choice that addresses a higher-priority goal will be chosen.
> # *Comprehensive coverage of MLlib API:* Design choices that make R coverage of future ML algorithms difficult will be ruled out.
> # *Semantic clarity:* We attempt to minimize confusion with other packages. Between conciseness and clarity, we will choose clarity.
> # *Maintainability and testability:* API choices that require manual maintenance or make testing difficult should be avoided.
> # *Interoperability with the rest of Spark components:* We will keep the R API as thin as possible and keep all functionality implementation in JVM/Scala.
> # *Being natural to R users:* The ultimate users of this package are R users, and they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as the first argument of the method: do_something(obj, arg1, arg2). All functions are snake_case (e.g., {{spark_logistic_regression()}} and {{set_max_iter()}}). If a constructor gets arguments, they will be named arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1)
> {code}
> When calls need to be chained, like in the above example, the syntax translates nicely to a natural pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) ->
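The setter-chaining pattern the SPIP describes (each `set_*` call returns the object, so calls compose left to right, whether nested or piped through magrittr) is language-neutral. A minimal Python analogue of the pattern — the class and method names mirror the SPIP's R examples but are otherwise made up for illustration:

```python
class LogisticRegression:
    """Toy estimator with chainable setters, mirroring the proposed
    spark_logistic_regression() %>% set_max_iter(...) %>% ... style."""

    def __init__(self):
        self.params = {}

    def set_max_iter(self, n):
        self.params["max_iter"] = n
        return self  # returning self is what makes chaining work

    def set_reg_param(self, r):
        self.params["reg_param"] = r
        return self

lr = LogisticRegression().set_max_iter(10).set_reg_param(0.01)
print(lr.params)  # {'max_iter': 10, 'reg_param': 0.01}
```

The nested form in the SPIP's first example and the piped form in the second are equivalent precisely because every setter returns the (modified) object.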
[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-24359: --- Description: h1. Background and motivation SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. Since Spark 1.5 the (new) SparkML API which is based on [pipelines and parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o] has matured significantly. It allows users build and maintain complicated machine learning pipelines. A lot of this functionality is difficult to expose using the simple formula-based API in SparkR. We propose a new R package, _SparkML_, to be distributed along with SparkR as part of Apache Spark. This new package will be built on top of SparkR’s APIs to expose SparkML’s pipeline APIs and functionality. *Why not SparkR?* SparkR package contains ~300 functions. Many of these shadow functions in base and other popular CRAN packages. We think adding more functions to SparkR will degrade usability and make maintenance harder. *Why not sparklyr?* sparklyr is an R package developed by RStudio Inc. to expose Spark API to R users. sparklyr includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive. Also we propose a code-gen approach for this package to minimize work needed to expose future MLlib API, but sparkly’s API is manually written. h1. Target Personas * Existing SparkR users who need more flexible SparkML API * R users (data scientists, statisticians) who wish to build Spark ML pipelines in R h1. 
Goals * R users can install SparkML from CRAN * R users will be able to import SparkML independent from SparkR * After setting up a Spark session R users can * create a pipeline by chaining individual components and specifying their parameters * tune a pipeline in parallel, taking advantage of Spark * inspect a pipeline’s parameters and evaluation metrics * repeatedly apply a pipeline * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers h1. Non-Goals * Adding new algorithms to SparkML R package which do not exist in Scala * Parallelizing existing CRAN packages * Changing existing SparkR ML wrapping API h1. Proposed API Changes h2. Design goals When encountering trade-offs in API, we will chose based on the following list of priorities. The API choice that addresses a higher priority goal will be chosen. # *Comprehensive coverage of MLlib API:* Design choices that make R coverage of future ML algorithms difficult will be ruled out. * *Semantic clarity*: We attempt to minimize confusion with other packages. Between consciousness and clarity, we will choose clarity. * *Maintainability and testability:* API choices that require manual maintenance or make testing difficult should be avoided. * *Interoperability with rest of Spark components:* We will keep the R API as thin as possible and keep all functionality implementation in JVM/Scala. * *Being natural to R users:* Ultimate users of this package are R users and they should find it easy and natural to use. The API will follow familiar R function syntax, where the object is passed as the first argument of the method: do_something(obj, arg1, arg2). All functions are snake_case (e.g., {{spark_logistic_regression()}} and {{set_max_iter()}}). If a constructor gets arguments, they will be named arguments. 
For example: {code:java} > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code} When calls need to be chained, as in the above example, the syntax translates nicely to a natural pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example: {code:java} > spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.1) -> lr{code} h2. Namespace All new API will live in a new CRAN package, named SparkML. The package should be usable without needing SparkR in the namespace. The package will introduce a number of S4 classes that inherit from four basic classes. Here we will list the basic types with a few examples. An object of any child class can be instantiated with a function call that starts with {{spark_}}. h2. Pipeline & PipelineStage A pipeline object contains one or more stages. {code:java} > pipeline <- spark_pipeline() %>% set_stages(stage1, stage2, stage3){code} where stage1, stage2, etc. are S4 objects of type PipelineStage and pipeline is an object of type Pipeline. h2. Transformers A Transformer is an algorithm that can transform one SparkDataFrame into another SparkDataFrame. *Example API:* {code:java} > tokenizer <- spark_tokenizer() %>% set_input_col("text") %>% set_output_col("words") > tokenized.df <- tokenizer %>% transform(df) {code} h2.
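The code-gen approach mentioned in the proposal (generating snake_case wrappers rather than hand-writing one wrapper per setter) can be sketched in plain Python. The class, its setters, and the `to_snake` helper below are illustrative stand-ins, not part of any proposed SparkML API:

```python
import re

def to_snake(name):
    # "setMaxIter" -> "set_max_iter"
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

class Params:
    """Hypothetical stand-in for a JVM-backed pipeline stage with camelCase setters."""
    def __init__(self):
        self.values = {}
    def setMaxIter(self, v):
        self.values["maxIter"] = v
        return self  # returning self enables chaining
    def setRegParam(self, v):
        self.values["regParam"] = v
        return self

def add_snake_aliases(cls):
    # The code-gen step: expose every camelCase setter under a snake_case
    # alias, so wrappers for new setters never need to be written by hand.
    for attr in list(vars(cls)):
        if attr.startswith("set") and callable(getattr(cls, attr)):
            setattr(cls, to_snake(attr), getattr(cls, attr))
    return cls

add_snake_aliases(Params)
lr = Params().set_max_iter(10).set_reg_param(0.1)
```

The chained call mirrors the proposed `set_max_iter(10) %>% set_reg_param(0.1)` style, with method chaining standing in for the pipe.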
[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-24359: --- Attachment: SparkML_ ML Pipelines in R-v2.pdf > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly > API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. > Since Spark 1.5 the (new) SparkML API which is based on [pipelines and > parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o] > has matured significantly. It allows users to build and maintain complicated > machine learning pipelines. A lot of this functionality is difficult to > expose using the simple formula-based API in SparkR. > We propose a new R package, _SparkML_, to be distributed along with SparkR as > part of Apache Spark. This new package will be built on top of SparkR's APIs > to expose SparkML's pipeline APIs and functionality. > *Why not SparkR?* > SparkR package contains ~300 functions. Many of these shadow functions in > base and other popular CRAN packages. We think adding more functions to > SparkR will degrade usability and make maintenance harder. > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose Spark API to R > users. sparklyr includes MLlib API wrappers, but to the best of our knowledge > they are not comprehensive. Also we propose a code-gen approach for this > package to minimize work needed to expose future MLlib API, but sparklyr's API > is manually written. > h1. 
Target Personas > * Existing SparkR users who need more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML > pipelines in R > h1. Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independent from SparkR > * After setting up a Spark session R users can > * create a pipeline by chaining individual components and specifying their > parameters > * tune a pipeline in parallel, taking advantage of Spark > * inspect a pipeline's parameters and evaluation metrics > * repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and > Transformers > h1. Non-Goals > * Adding new algorithms to SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in API, we will choose based on the following > list of priorities. The API choice that addresses a higher priority goal will > be chosen. > # *Comprehensive coverage of MLlib API:* Design choices that make R coverage > of future ML algorithms difficult will be ruled out. > * *Semantic clarity*: We attempt to minimize confusion with other packages. > Between conciseness and clarity, we will choose clarity. > * *Maintainability and testability:* API choices that require manual > maintenance or make testing difficult should be avoided. > * *Interoperability with rest of Spark components:* We will keep the R API > as thin as possible and keep all functionality implementation in JVM/Scala. > * *Being natural to R users:* Ultimate users of this package are R users and > they should find it easy and natural to use. > The API will follow familiar R function syntax, where the object is passed as > the first argument of the method: do_something(obj, arg1, arg2). All > functions are snake_case (e.g., {{spark_logistic_regression()}} and > {{set_max_iter()}}). 
If a constructor gets arguments, they will be named > arguments. For example: > {code:java} > > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code} > When calls need to be chained, as in the above example, the syntax can nicely > translate to a natural pipeline style with help from the very popular [magrittr > package|https://cran.r-project.org/web/packages/magrittr/index.html]. For > example: > {code:java} > > spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.1) -> > > lr{code} > h2. Namespace > All new API will be under a new CRAN package, named SparkML. The package > should be usable without needing SparkR in the namespace. The package will > introduce a number of S4 classes that inherit from four basic classes. Here > we will list the basic types with a few examples. An object of any child > class can be
[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted
[ https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489942#comment-16489942 ] Yinan Li commented on SPARK-24383: -- You can use {{kubectl get service -o=yaml}} to get a YAML-formatted representation of the service and check if the {{metadata}} section contains a {{OwnerReference}} pointing to the driver pod. > spark on k8s: "driver-svc" are not getting deleted > -- > > Key: SPARK-24383 > URL: https://issues.apache.org/jira/browse/SPARK-24383 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Lenin >Priority: Major > > When the driver pod exists, the "*driver-svc" services created for the driver > are not cleaned up. This causes accumulation of services in the k8s layer, at > one point no more services can be created. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted
[ https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489939#comment-16489939 ] Lenin commented on SPARK-24383: --- I tried to check for that, but couldn't find where to look. I can see in the code that it is being set. > spark on k8s: "driver-svc" are not getting deleted > -- > > Key: SPARK-24383 > URL: https://issues.apache.org/jira/browse/SPARK-24383 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Lenin >Priority: Major > > When the driver pod exists, the "*driver-svc" services created for the driver > are not cleaned up. This causes accumulation of services in the k8s layer, at > one point no more services can be created. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
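For readers wanting to script the check Yinan Li describes, the ownership test can be done against `kubectl get service <name> -o=json` output. The service document below is a hypothetical, heavily trimmed example (real output has many more fields); the field names `metadata.ownerReferences`, `kind`, and `name` follow the Kubernetes API:

```python
import json

# Hypothetical, abbreviated `kubectl get service <name> -o=json` output.
raw = json.dumps({
    "metadata": {
        "name": "spark-app-driver-svc",
        "ownerReferences": [
            {"kind": "Pod", "name": "spark-app-driver", "controller": True}
        ],
    }
})

def owner_pod(svc_json):
    """Return the name of the owning Pod, or None if the service is orphaned.

    A driver service with no ownerReference to its driver pod will not be
    garbage-collected by Kubernetes when the pod is deleted."""
    refs = json.loads(svc_json).get("metadata", {}).get("ownerReferences", [])
    pods = [r["name"] for r in refs if r.get("kind") == "Pod"]
    return pods[0] if pods else None
```

If `owner_pod` returns None for an accumulated `*-driver-svc` service, the missing ownerReference is the likely reason it outlives the driver.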
[jira] [Updated] (SPARK-23754) StopIterator exception in Python UDF results in partial result
[ https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-23754: --- Target Version/s: 2.3.1 Adding target version so it actually shows up in searches... > StopIterator exception in Python UDF results in partial result > -- > > Key: SPARK-23754 > URL: https://issues.apache.org/jira/browse/SPARK-23754 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Priority: Blocker > > Reproduce: > {code:java} > df = spark.range(0, 1000) > from pyspark.sql.functions import udf > def foo(x): > raise StopIteration() > df.withColumn('v', udf(foo)('id')).show() > # Results > # +---+---+ > # | id| v| > # +---+---+ > # +---+---+{code} > I think the task should fail in this case -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
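The silent truncation can be reproduced without Spark at all. The loop below is an illustrative stand-in, not Spark's actual worker code: it shows how a row-pumping loop driven by the iterator protocol cannot distinguish a StopIteration raised *by* the UDF from normal input exhaustion:

```python
def apply_udf(rows, f):
    # Stand-in for a worker's evaluation loop: rows are pulled from an
    # iterator until StopIteration. A StopIteration escaping from f is
    # indistinguishable from "input exhausted", so the loop stops early
    # and returns a partial result instead of failing the task.
    out = []
    it = iter(rows)
    while True:
        try:
            out.append(f(next(it)))
        except StopIteration:
            break
    return out

def foo(x):
    if x >= 2:
        raise StopIteration()  # user bug that should fail the task
    return x

partial = apply_udf(range(1000), foo)  # truncated to [0, 1], no error raised
```

The fix direction is to wrap the user function so a StopIteration it raises is re-raised as a different exception (e.g. RuntimeError), which then propagates as a task failure rather than ending iteration.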
[jira] [Assigned] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-14220: -- Assignee: Marcelo Vanzin > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Marcelo Vanzin >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-14220: -- Assignee: (was: Marcelo Vanzin) > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489820#comment-16489820 ] Wenbo Zhao commented on SPARK-24373: I guess we should use `planWithBarrier` in the 'RelationalGroupedDataset' or other similar places. Any suggestions? > "df.cache() df.count()" no longer eagerly caches data > - > > Key: SPARK-24373 > URL: https://issues.apache.org/jira/browse/SPARK-24373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenbo Zhao >Priority: Major > > Here is the code to reproduce in local mode > {code:java} > scala> val df = sc.range(1, 2).toDF > df: org.apache.spark.sql.DataFrame = [value: bigint] > scala> val myudf = udf({x: Long => println(""); x + 1}) > myudf: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val df1 = df.withColumn("value1", myudf(col("value"))) > df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint] > scala> df1.cache > res0: df1.type = [value: bigint, value1: bigint] > scala> df1.count > res1: Long = 1 > scala> df1.count > res2: Long = 1 > scala> df1.count > res3: Long = 1 > {code} > > in Spark 2.2, you could see it prints "". > In the above example, when you do explain, 
you could see > {code:java} > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [value#2L, UDF('value) AS value1#5] > +- AnalysisBarrier > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > value: bigint, value1: bigint > Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > == Physical Plan == > *(1) InMemoryTableScan [value#2L, value1#5L] > +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > but the InMemoryTableScan is missing in the following explain() > {code:java} > scala> df1.groupBy().count().explain(true) > == Parsed Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS > value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > count: bigint > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) > null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Physical Plan == > *(2) HashAggregate(keys=[], 
functions=[count(1)], output=[count#170L]) > +- Exchange SinglePartition > +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#175L]) > +- *(1) Project > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
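The behavior discussed in this thread can be pictured with a toy model (no Spark involved): cached data is substituted into a query only when a subtree's plan matches a cached plan, so any re-analysis that rewrites an expression — such as the extra null guard wrapped around the UDF in the analyzed plans above — makes the lookup miss. The plan strings below are invented purely for illustration; real Spark matches canonicalized logical plans, not strings:

```python
# Toy plan cache keyed by an exact plan string. The failure mode it
# illustrates: the subtree being looked up must equal the subtree that
# was cached, so a re-analyzed (rewritten) expression misses the cache
# and the InMemoryTableScan never appears in the physical plan.
cached_plans = {}

def cache_plan(plan):
    cached_plans[plan] = "InMemoryRelation(" + plan + ")"

def use_cached(plan):
    # substitute the cached relation on an exact match, else keep the plan
    return cached_plans.get(plan, plan)

cache_plan("Project [UDF(value)]")
hit = use_cached("Project [UDF(value)]")
# After re-analysis the UDF gains a null guard and no longer matches:
miss = use_cached("Project [if (isnull(value)) null else UDF(value)]")
```

In the `hit` case the relation is served from cache; in the `miss` case the query recomputes from scratch, which matches the observed "cache never materialized" symptom.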
[jira] [Resolved] (SPARK-24350) ClassCastException in "array_position" function
[ https://issues.apache.org/jira/browse/SPARK-24350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Vayda resolved SPARK-24350. Resolution: Fixed > ClassCastException in "array_position" function > --- > > Key: SPARK-24350 > URL: https://issues.apache.org/jira/browse/SPARK-24350 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Alex Vayda >Priority: Major > Fix For: 2.4.0 > > > When calling {{array_position}} function with a wrong type of the 1st operand > a {{ClassCastException}} is thrown instead of {{AnalysisException}} > Example: > {code:sql} > select array_position('foo', 'bar') > {code} > {noformat} > java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot > be cast to org.apache.spark.sql.types.ArrayType > at > org.apache.spark.sql.catalyst.expressions.ArrayPosition.inputTypes(collectionOperations.scala:1398) > at > org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44) > at > org.apache.spark.sql.catalyst.expressions.ArrayPosition.checkInputDataTypes(collectionOperations.scala:1401) > at > org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168) > at > org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
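The fix direction for SPARK-24350 — validate the operand type during analysis instead of casting it blindly — can be sketched in plain Python. The type representation (tuples for array types) and the error messages below are invented for illustration and are not Spark's actual code:

```python
def check_array_position_inputs(left_type, right_type):
    """Return an error string (analysis-time failure) or None if valid.

    Array types are modeled as ("array", element_type); anything else is a
    scalar type name such as "string"."""
    if not (isinstance(left_type, tuple) and left_type[0] == "array"):
        # What should happen for array_position('foo', 'bar'): a clear
        # analysis error rather than a ClassCastException deep in the
        # expression's inputTypes computation.
        return "argument 1 requires array type, got " + str(left_type)
    if left_type[1] != right_type:
        return ("array element type " + left_type[1]
                + " does not match value type " + right_type)
    return None
```

With this shape of check, `array_position('foo', 'bar')` surfaces a readable type error to the user instead of an internal cast failure.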
[jira] [Comment Edited] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489060#comment-16489060 ] Li Jin edited comment on SPARK-24373 at 5/24/18 9:00 PM: - Here is a reproduction: {code:java} val myUDF = udf((x: Long) => { println(""); x + 1 }) val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s")) df1.cache() df1.count() // No printed {code} It appears the issue is related to UDF: {code:java} val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s")) df1.cache() df1.groupBy().count().explain() == Physical Plan == *(2) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition +- *(1) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(1) Project +- *(1) Range (0, 1, step=1, splits=2) {code} Without the UDF, it uses "count" to materialize the cache: {code:java} val df1 = spark.range(0, 1).toDF("s").select($"s" + 1) df1.cache() df1.groupBy().count().explain() == Physical Plan == *(2) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition +- *(1) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(1) InMemoryTableScan +- InMemoryRelation [(s + 1)#179L], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) Project [(id#175L + 1) AS (s + 1)#179L] +- *(1) Range (0, 1, step=1, splits=2) ,None) +- *(1) Project [(id#175L + 1) AS (s + 1)#179L] +- *(1) Range (0, 1, step=1, splits=2) {code} was (Author: icexelloss): Here is a reproduction: {code:java} val myUDF = udf((x: Long) => { println(""); x + 1 }) val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s")) df1.cache() df1.count() // No printed {code} It appears the issue is related to UDF: {code:java} val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s")) df1.cache() df1.groupBy().count().explain() == Physical Plan == *(2) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition +- *(1) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(1) Project +- *(1) Range (0, 1, step=1, splits=2) 
{code} Without the UDF, it uses "count" to materialize the cache: {code:java} val df1 = spark.range(0, 1).toDF("s").select($"s" + 1) df1.cache() df1.groupBy().count().explain() == Physical Plan == *(2) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition +- *(1) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(1) InMemoryTableScan +- InMemoryRelation [(s + 1)#179L], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) Project [(id#175L + 1) AS (s + 1)#179L] +- *(1) Range (0, 1, step=1, splits=2) ,None) +- *(1) Project [(id#175L + 1) AS (s + 1)#179L] +- *(1) Range (0, 1, step=1, splits=2) {code} > "df.cache() df.count()" no longer eagerly caches data > - > > Key: SPARK-24373 > URL: https://issues.apache.org/jira/browse/SPARK-24373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenbo Zhao >Priority: Major > > Here is the code to reproduce in local mode > {code:java} > scala> val df = sc.range(1, 2).toDF > df: org.apache.spark.sql.DataFrame = [value: bigint] > scala> val myudf = udf({x: Long => println(""); x + 1}) > myudf: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val df1 = df.withColumn("value1", myudf(col("value"))) > df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint] > scala> df1.cache > res0: df1.type = [value: bigint, value1: bigint] > scala> df1.count > res1: Long = 1 > scala> df1.count > res2: Long = 1 > scala> df1.count > res3: Long = 1 > {code} > > in Spark 2.2, you could see it prints "". > In the above example, when you do explain. 
You could see > {code:java} > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [value#2L, UDF('value) AS value1#5] > +- AnalysisBarrier > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > value: bigint, value1: bigint > Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > == Physical Plan == > *(1) InMemoryTableScan [value#2L, value1#5L] > +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory,
[jira] [Comment Edited] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489060#comment-16489060 ] Li Jin edited comment on SPARK-24373 at 5/24/18 9:00 PM: - Here is a reproduction: {code:java} val myUDF = udf((x: Long) => { println(""); x + 1 }) val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s")) df1.cache() df1.count() // No printed {code} It appears the issue is related to UDF: {code:java} val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s")) df1.cache() df1.groupBy().count().explain() == Physical Plan == *(2) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition +- *(1) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(1) Project +- *(1) Range (0, 1, step=1, splits=2) {code} Without the UDF, it uses "count" to materialize the cache: {code:java} val df1 = spark.range(0, 1).toDF("s").select($"s" + 1) df1.cache() df1.groupBy().count().explain() == Physical Plan == *(2) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition +- *(1) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(1) InMemoryTableScan +- InMemoryRelation [(s + 1)#179L], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) Project [(id#175L + 1) AS (s + 1)#179L] +- *(1) Range (0, 1, step=1, splits=2) ,None) +- *(1) Project [(id#175L + 1) AS (s + 1)#179L] +- *(1) Range (0, 1, step=1, splits=2) {code} was (Author: icexelloss): Here is a reproduction: {code:java} val myUDF = udf((x: Long) => { println(""); x + 1 }) val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s")) df1.cache() df1.count() // No printed{code} > "df.cache() df.count()" no longer eagerly caches data > - > > Key: SPARK-24373 > URL: https://issues.apache.org/jira/browse/SPARK-24373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenbo Zhao >Priority: Major > > Here is the code to reproduce in local mode > {code:java} > scala> val df = sc.range(1, 2).toDF > df: 
org.apache.spark.sql.DataFrame = [value: bigint] > scala> val myudf = udf({x: Long => println(""); x + 1}) > myudf: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val df1 = df.withColumn("value1", myudf(col("value"))) > df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint] > scala> df1.cache > res0: df1.type = [value: bigint, value1: bigint] > scala> df1.count > res1: Long = 1 > scala> df1.count > res2: Long = 1 > scala> df1.count > res3: Long = 1 > {code} > > in Spark 2.2, you could see it prints "". > In the above example, when you do explain, you could see > {code:java} > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [value#2L, UDF('value) AS value1#5] > +- AnalysisBarrier > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > value: bigint, value1: bigint > Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > == Physical Plan == > *(1) InMemoryTableScan [value#2L, value1#5L] > +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > but the InMemoryTableScan is missing in the following explain() > {code:java} > scala> df1.groupBy().count().explain(true) > == Parsed Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else 
UDF(value#2L) AS > value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > count: bigint > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) > null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Physical Plan == > *(2) HashAggregate(keys=[], functions=[count(1)],
[jira] [Comment Edited] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489060#comment-16489060 ] Li Jin edited comment on SPARK-24373 at 5/24/18 8:51 PM: - Here is a reproduction: {code:java} val myUDF = udf((x: Long) => { println(""); x + 1 }) val df1 = spark.range(0, 1).toDF("s").select(myUDF($"s")) df1.cache() df1.count() // No printed{code} was (Author: icexelloss): Here is a reproduction in a unit test: {code:java} test("cache and count") { var evalCount = 0 val myUDF = udf((x: String) => { evalCount += 1; "result" }) val df = spark.range(0, 1).select(myUDF($"id")) df.cache() df.count() assert(evalCount === 1) df.count() assert(evalCount === 1) } {code} > "df.cache() df.count()" no longer eagerly caches data > - > > Key: SPARK-24373 > URL: https://issues.apache.org/jira/browse/SPARK-24373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenbo Zhao >Priority: Major > > Here is the code to reproduce in local mode > {code:java} > scala> val df = sc.range(1, 2).toDF > df: org.apache.spark.sql.DataFrame = [value: bigint] > scala> val myudf = udf({x: Long => println(""); x + 1}) > myudf: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val df1 = df.withColumn("value1", myudf(col("value"))) > df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint] > scala> df1.cache > res0: df1.type = [value: bigint, value1: bigint] > scala> df1.count > res1: Long = 1 > scala> df1.count > res2: Long = 1 > scala> df1.count > res3: Long = 1 > {code} > > in Spark 2.2, you could see it prints "". > In the above example, when you do explain, 
you could see > {code:java} > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [value#2L, UDF('value) AS value1#5] > +- AnalysisBarrier > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > value: bigint, value1: bigint > Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > == Physical Plan == > *(1) InMemoryTableScan [value#2L, value1#5L] > +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > but the InMemoryTableScan is missing in the following explain() > {code:java} > scala> df1.groupBy().count().explain(true) > == Parsed Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS > value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > count: bigint > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) > null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Physical Plan == > *(2) HashAggregate(keys=[], 
functions=[count(1)], output=[count#170L]) > +- Exchange SinglePartition > +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#175L]) > +- *(1) Project > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted
[ https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489776#comment-16489776 ] Yinan Li commented on SPARK-24383: -- Can you double check if the services have an {{OwnerReference}} pointing to a driver pod? > spark on k8s: "driver-svc" are not getting deleted > -- > > Key: SPARK-24383 > URL: https://issues.apache.org/jira/browse/SPARK-24383 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Lenin >Priority: Major > > When the driver pod exists, the "*driver-svc" services created for the driver > are not cleaned up. This causes accumulation of services in the k8s layer, at > one point no more services can be created. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489764#comment-16489764 ] Jiang Xingbo commented on SPARK-24375: -- We propose to add a new RDDBarrier and BarrierTaskContext to support barrier scheduling in Spark; this also requires modifying how job scheduling works a bit to accommodate the new feature. *Barrier Stage*: A barrier stage doesn’t launch any of its tasks until the available slots (free CPU cores that can be used to launch pending tasks) satisfy the target of launching all the tasks at the same time, and it always retries the whole stage when any task fails. One way to identify whether a stage is a barrier stage is to trace the RDDs that the stage runs on: if the stage contains an RDDBarrier, or at least one of its ancestor RDDs is an RDDBarrier, then the stage is a barrier stage; the tracing shall stop at ShuffleRDDs. *Schedule Barrier Tasks*: Currently TaskScheduler schedules pending tasks on available slots by best effort, so normally not all tasks in the same stage get launched at the same time. We may add a check of total available slots before scheduling tasks from a barrier stage taskset. It is still possible that only part of the tasks of a barrier stage taskset get launched due to task locality issues, so we have to check again before launch to ensure that all tasks in the same barrier stage get launched at the same time. If we consider scheduling several jobs at the same time (both barrier and regular jobs), it may be possible that barrier tasks are blocked by regular tasks (when available slots are always fewer than required by a barrier stage taskset), or one barrier stage taskset may block another (when a barrier stage taskset that requires fewer slots is prone to be scheduled earlier). 
Currently we don’t have a perfect solution for all these scenarios, but at least we may avoid the worst case of a huge barrier stage taskset being blocked forever on a busy cluster by using a time-based weight approach (conceptually, a taskset that has been pending for a longer time is assigned a greater priority weight to be scheduled). *Task Barrier*: Barrier tasks shall allow users to insert a sync point in the middle of task execution; this can be achieved by introducing a global barrier operation in TaskContext, which makes the current task wait until all tasks in the same stage hit this barrier. *Task Failure*: To ensure correctness, a barrier stage always retries the whole stage when any task fails. Thus, it’s quite straightforward that we shall kill all the running tasks of a failed stage, which also guarantees that at most one taskset is running for each single stage (no zombie tasks). *Speculative Task*: Since we require launching all tasks in a barrier stage at the same time, there is no need to launch a speculative task for a barrier stage taskset. *Share TaskInfo*: To share information between tasks in a barrier stage, we may update them in `TaskContext.localProperties`. *Python Support*: Expose RDDBarrier and BarrierTaskContext to pyspark. [~cloud_fan] maybe you want to give additional information I didn't cover above? (esp. PySpark) > Design sketch: support barrier scheduling in Apache Spark > - > > Key: SPARK-24375 > URL: https://issues.apache.org/jira/browse/SPARK-24375 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > This task is to outline a design sketch for the barrier scheduling SPIP > discussion. It doesn't need to be a complete design before the vote. But it > should at least cover both Scala/Java and PySpark. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
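The "Task Barrier" semantics in the design sketch (every task in a stage blocks at a sync point until all tasks have reached it) can be illustrated with a toy pure-Python model; this is not Spark code, just `threading.Barrier` standing in for the proposed TaskContext barrier:

```python
# Toy model of barrier-stage semantics: each "task" blocks at barrier.wait()
# (a stand-in for the proposed TaskContext barrier operation) until every
# task in the stage has arrived, then all proceed together.
import threading

def run_barrier_stage(num_tasks=4):
    barrier = threading.Barrier(num_tasks)
    events, lock = [], threading.Lock()

    def task(task_id):
        with lock:
            events.append("before")
        barrier.wait()  # no task passes this point until all num_tasks arrive
        with lock:
            events.append("after")

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events

events = run_barrier_stage()
# Every "before" event precedes every "after" event: the barrier held each
# task until the whole stage reached the sync point.
```

This also makes the scheduling constraint concrete: if fewer than `num_tasks` slots were available, some tasks could never reach the barrier and the rest would wait forever, which is why a barrier stage must launch all its tasks at once.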
[jira] [Commented] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels
[ https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489759#comment-16489759 ] Li Jin commented on SPARK-24324: This is a dup of https://issues.apache.org/jira/browse/SPARK-23929, I am now convinced this behavior should be fixed. > Pandas Grouped Map UserDefinedFunction mixes column labels > -- > > Key: SPARK-24324 > URL: https://issues.apache.org/jira/browse/SPARK-24324 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Python (using virtualenv): > {noformat} > $ python --version > Python 3.6.5 > {noformat} > Modules installed: > {noformat} > arrow==0.12.1 > backcall==0.1.0 > bleach==2.1.3 > chardet==3.0.4 > decorator==4.3.0 > entrypoints==0.2.3 > findspark==1.2.0 > html5lib==1.0.1 > ipdb==0.11 > ipykernel==4.8.2 > ipython==6.3.1 > ipython-genutils==0.2.0 > ipywidgets==7.2.1 > jedi==0.12.0 > Jinja2==2.10 > jsonschema==2.6.0 > jupyter==1.0.0 > jupyter-client==5.2.3 > jupyter-console==5.2.0 > jupyter-core==4.4.0 > MarkupSafe==1.0 > mistune==0.8.3 > nbconvert==5.3.1 > nbformat==4.4.0 > notebook==5.5.0 > numpy==1.14.3 > pandas==0.22.0 > pandocfilters==1.4.2 > parso==0.2.0 > pbr==3.1.1 > pexpect==4.5.0 > pickleshare==0.7.4 > progressbar2==3.37.1 > prompt-toolkit==1.0.15 > ptyprocess==0.5.2 > pyarrow==0.9.0 > Pygments==2.2.0 > python-dateutil==2.7.2 > python-utils==2.3.0 > pytz==2018.4 > pyzmq==17.0.0 > qtconsole==4.3.1 > Send2Trash==1.5.0 > simplegeneric==0.8.1 > six==1.11.0 > SQLAlchemy==1.2.7 > stevedore==1.28.0 > terminado==0.8.1 > testpath==0.3.1 > tornado==5.0.2 > traitlets==4.3.2 > virtualenv==15.1.0 > virtualenv-clone==0.2.6 > virtualenvwrapper==4.7.2 > wcwidth==0.1.7 > webencodings==0.5.1 > widgetsnbextension==3.2.1 > {noformat} > > Java: > {noformat} > $ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > System: > 
{noformat} > $ lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 16.04.4 LTS > Release: 16.04 > Codename: xenial > {noformat} >Reporter: Cristian Consonni >Priority: Major > > I am working on Wikipedia page views (see [task T188041 on Wikimedia's > Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's > say that these are the data: > {noformat} > > {noformat} > For each combination of (lang, page, day(timestamp)) I need to transform the > views for each hour: > {noformat} > 00:00 -> A > 01:00 -> B > ... > {noformat} > and concatenate the number of views for that hour. So, if a page got 5 views > at 00:00 and 7 views at 01:00 it would become: > {noformat} > A5B7 > {noformat} > > I have written a UDF called {code:python}concat_hours{code} > However, the function is mixing the columns and I am not sure what is going > on. I wrote here a minimal complete example that reproduces the issue on my > system (the details of my environment are above). 
> {code:python} > #!/usr/bin/env python3 > # coding: utf-8 > input_data = b"""en Albert_Camus 20071210-00 150 > en Albert_Camus 20071210-01 148 > en Albert_Camus 20071210-02 197 > en Albert_Camus 20071211-20 145 > en Albert_Camus 20071211-21 131 > en Albert_Camus 20071211-22 154 > en Albert_Camus 20071211-230001 142 > en Albert_Caquot 20071210-02 1 > en Albert_Caquot 20071210-02 1 > en Albert_Caquot 20071210-040001 1 > en Albert_Caquot 20071211-06 1 > en Albert_Caquot 20071211-08 1 > en Albert_Caquot 20071211-15 3 > en Albert_Caquot 20071211-21 1""" > import tempfile > fp = tempfile.NamedTemporaryFile() > fp.write(input_data) > fp.seek(0) > import findspark > findspark.init() > import pyspark > from pyspark.sql.types import StructType, StructField > from pyspark.sql.types import StringType, IntegerType, TimestampType > from pyspark.sql import functions > sc = pyspark.SparkContext(appName="udf_example") > sqlctx = pyspark.SQLContext(sc) > schema = StructType([StructField("lang", StringType(), False), > StructField("page", StringType(), False), > StructField("timestamp", TimestampType(), False), > StructField("views", IntegerType(), False)]) > df = sqlctx.read.csv(fp.name, > header=False, > schema=schema, > timestampFormat="MMdd-HHmmss", > sep=' ') > df.count() > df.dtypes > df.show() > new_schema = StructType([StructField("lang", StringType(), False), > StructField("page", StringType(), False), > StructField("day", StringType(), False), >
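The column mix-up reported above is consistent with output columns being matched to the declared schema by position rather than by name. A minimal pure-Python sketch of that failure mode (hypothetical data and helper names, not pandas or Spark internals):

```python
# Suspected failure mode, in miniature: a grouped-map UDF builds its result
# in a different column order than the declared schema. Matching by POSITION
# puts values under the wrong labels; matching by NAME does not.

schema = ["lang", "page", "day", "enc"]

# The UDF happens to emit columns in a different order than the schema:
udf_result = {"page": "Albert_Camus", "lang": "en",
              "day": "20071210", "enc": "A150B148C197"}

def match_by_position(result, schema):
    # Pair schema labels with the result's own (different) column order.
    return dict(zip(schema, result.values()))

def match_by_name(result, schema):
    # Look each schema label up by name in the result.
    return {col: result[col] for col in schema}

mixed = match_by_position(udf_result, schema)    # labels shifted
correct = match_by_name(udf_result, schema)      # labels preserved
```

Under positional matching the "lang" column ends up holding the page title, which is exactly the kind of label swap described in the report.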
[jira] [Comment Edited] (SPARK-24036) Stateful operators in continuous processing
[ https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489648#comment-16489648 ] Jose Torres edited comment on SPARK-24036 at 5/24/18 8:23 PM: -- I've been notified of [https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374?filter=allopenissues,] a SPIP for an API which would provide much of what we need here wrt letting tasks know where the appropriate shuffle endpoints. was (Author: joseph.torres): I've been notified of [https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374,], a SPIP for an API which would provide much of what we need here wrt letting tasks know where the appropriate shuffle endpoints. > Stateful operators in continuous processing > --- > > Key: SPARK-24036 > URL: https://issues.apache.org/jira/browse/SPARK-24036 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The first iteration of continuous processing in Spark 2.3 does not work with > stateful operators. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24036) Stateful operators in continuous processing
[ https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489648#comment-16489648 ] Jose Torres edited comment on SPARK-24036 at 5/24/18 8:22 PM: -- I've been notified of [https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374,], a SPIP for an API which would provide much of what we need here wrt letting tasks know where the appropriate shuffle endpoints. was (Author: joseph.torres): I've been notified of [https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374,] a SPIP for an API which would provide much of what we need here wrt letting tasks know where the appropriate shuffle endpoints. > Stateful operators in continuous processing > --- > > Key: SPARK-24036 > URL: https://issues.apache.org/jira/browse/SPARK-24036 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The first iteration of continuous processing in Spark 2.3 does not work with > stateful operators. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing
[ https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489727#comment-16489727 ] Jose Torres commented on SPARK-24036: - That's out of scope - the shuffle reader and writer work in this Jira would still be needed on top. > Stateful operators in continuous processing > --- > > Key: SPARK-24036 > URL: https://issues.apache.org/jira/browse/SPARK-24036 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The first iteration of continuous processing in Spark 2.3 does not work with > stateful operators. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing
[ https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489700#comment-16489700 ] Arun Mahadevan commented on SPARK-24036: If I understand correctly, continuous job would have a single stage with tasks running at the same time shuffling data around (making use of the "TaskInfo" to figure out the endpoints). This means we cannot re-use the existing shuffle infra since it makes sense only if there are multiple stages ? Does SPARK-24374 plan to provide the shuffle infra to move data around or is that out of scope ? > Stateful operators in continuous processing > --- > > Key: SPARK-24036 > URL: https://issues.apache.org/jira/browse/SPARK-24036 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The first iteration of continuous processing in Spark 2.3 does not work with > stateful operators. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24332) Fix places reading 'spark.network.timeout' as milliseconds
[ https://issues.apache.org/jira/browse/SPARK-24332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-24332. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21382 [https://github.com/apache/spark/pull/21382] > Fix places reading 'spark.network.timeout' as milliseconds > -- > > Key: SPARK-24332 > URL: https://issues.apache.org/jira/browse/SPARK-24332 > Project: Spark > Issue Type: Improvement > Components: Mesos, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.4.0 > > > There are some places reading "spark.network.timeout" using "getTimeAsMs" > rather than "getTimeAsSeconds". This will return a wrong value when the user > specifies a value without a time unit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
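The bug fixed here is a default-unit mismatch: a unit-less value such as "120" is interpreted relative to the reading method's default unit. A hypothetical parser (the regex and names below are illustrative, mirroring the getTimeAsMs vs getTimeAsSeconds distinction, not Spark's actual config code) shows the factor-of-1000 error:

```python
# Sketch of the unit bug: reading a seconds-defaulted config with a
# milliseconds-defaulted reader shrinks unit-less values by 1000x.
# Explicit units are unaffected.
import re

UNITS_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}

def get_time_as_ms(value, default_unit):
    """Parse a duration string to milliseconds, using default_unit if none given."""
    m = re.fullmatch(r"(\d+)\s*([a-z]*)", value.strip())
    num, unit = int(m.group(1)), (m.group(2) or default_unit)
    return num * UNITS_MS[unit]

# "spark.network.timeout" documents seconds as its default unit:
correct = get_time_as_ms("120", default_unit="s")    # interpreted as 120 s
buggy   = get_time_as_ms("120", default_unit="ms")   # interpreted as 120 ms
```

A user writing "120" (expecting two minutes) would silently get a 120 ms timeout from the buggy read path, which is why the affected call sites were switched to the seconds-defaulted reader.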
[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted
[ https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489665#comment-16489665 ] Lenin commented on SPARK-24383: --- It's not something I observed. I had lots of dangling services. > spark on k8s: "driver-svc" are not getting deleted > -- > > Key: SPARK-24383 > URL: https://issues.apache.org/jira/browse/SPARK-24383 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Lenin >Priority: Major > > When the driver pod exists, the "*driver-svc" services created for the driver > are not cleaned up. This causes accumulation of services in the k8s layer, at > one point no more services can be created. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23416) Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489661#comment-16489661 ] Jose Torres commented on SPARK-23416: - Do you know how to drive that? I'm not sure what the process is. > Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for > failOnDataLoss=false > > > Key: SPARK-23416 > URL: https://issues.apache.org/jira/browse/SPARK-23416 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Minor > Fix For: 2.4.0 > > > I suspect this is a race condition latent in the DataSourceV2 write path, or > at least the interaction of that write path with StreamTest. > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/] > h3. Error Message > org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 16b2a2b1-acdd-44ec-902f-531169193169, runId = > 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job > aborted. > h3. Stacktrace > sbt.ForkMain$ForkError: > org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 16b2a2b1-acdd-44ec-902f-531169193169, runId = > 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job > aborted. at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing > job aborted. 
at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at > org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at >
[jira] [Assigned] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels
[ https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24324: Assignee: Apache Spark > Pandas Grouped Map UserDefinedFunction mixes column labels > -- > > Key: SPARK-24324 > URL: https://issues.apache.org/jira/browse/SPARK-24324 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Python (using virtualenv): > {noformat} > $ python --version > Python 3.6.5 > {noformat} > Modules installed: > {noformat} > arrow==0.12.1 > backcall==0.1.0 > bleach==2.1.3 > chardet==3.0.4 > decorator==4.3.0 > entrypoints==0.2.3 > findspark==1.2.0 > html5lib==1.0.1 > ipdb==0.11 > ipykernel==4.8.2 > ipython==6.3.1 > ipython-genutils==0.2.0 > ipywidgets==7.2.1 > jedi==0.12.0 > Jinja2==2.10 > jsonschema==2.6.0 > jupyter==1.0.0 > jupyter-client==5.2.3 > jupyter-console==5.2.0 > jupyter-core==4.4.0 > MarkupSafe==1.0 > mistune==0.8.3 > nbconvert==5.3.1 > nbformat==4.4.0 > notebook==5.5.0 > numpy==1.14.3 > pandas==0.22.0 > pandocfilters==1.4.2 > parso==0.2.0 > pbr==3.1.1 > pexpect==4.5.0 > pickleshare==0.7.4 > progressbar2==3.37.1 > prompt-toolkit==1.0.15 > ptyprocess==0.5.2 > pyarrow==0.9.0 > Pygments==2.2.0 > python-dateutil==2.7.2 > python-utils==2.3.0 > pytz==2018.4 > pyzmq==17.0.0 > qtconsole==4.3.1 > Send2Trash==1.5.0 > simplegeneric==0.8.1 > six==1.11.0 > SQLAlchemy==1.2.7 > stevedore==1.28.0 > terminado==0.8.1 > testpath==0.3.1 > tornado==5.0.2 > traitlets==4.3.2 > virtualenv==15.1.0 > virtualenv-clone==0.2.6 > virtualenvwrapper==4.7.2 > wcwidth==0.1.7 > webencodings==0.5.1 > widgetsnbextension==3.2.1 > {noformat} > > Java: > {noformat} > $ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > System: > {noformat} > $ lsb_release -a > No LSB modules are available. 
> Distributor ID: Ubuntu > Description: Ubuntu 16.04.4 LTS > Release: 16.04 > Codename: xenial > {noformat} >Reporter: Cristian Consonni >Assignee: Apache Spark >Priority: Major > > I am working on Wikipedia page views (see [task T188041 on Wikimedia's > Pharicator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's > say that these are the data: > {noformat} > > {noformat} > For each combination of (lang, page, day(timestamp)) I need to transform the > views for each hour: > {noformat} > 00:00 -> A > 01:00 -> B > ... > {noformat} > and concatenate the number of views for that hour. So, if a page got 5 views > at 00:00 and 7 views at 01:00 it would become: > {noformat} > A5B7 > {noformat} > > I have written a UDF called {code:python}concat_hours{code} > However, the function is mixing the columns and I am not sure what is going > on. I wrote here a minimal complete example that reproduces the issue on my > system (the details of my environment are above). > {code:python} > #!/usr/bin/env python3 > # coding: utf-8 > input_data = b"""en Albert_Camus 20071210-00 150 > en Albert_Camus 20071210-01 148 > en Albert_Camus 20071210-02 197 > en Albert_Camus 20071211-20 145 > en Albert_Camus 20071211-21 131 > en Albert_Camus 20071211-22 154 > en Albert_Camus 20071211-230001 142 > en Albert_Caquot 20071210-02 1 > en Albert_Caquot 20071210-02 1 > en Albert_Caquot 20071210-040001 1 > en Albert_Caquot 20071211-06 1 > en Albert_Caquot 20071211-08 1 > en Albert_Caquot 20071211-15 3 > en Albert_Caquot 20071211-21 1""" > import tempfile > fp = tempfile.NamedTemporaryFile() > fp.write(input_data) > fp.seek(0) > import findspark > findspark.init() > import pyspark > from pyspark.sql.types import StructType, StructField > from pyspark.sql.types import StringType, IntegerType, TimestampType > from pyspark.sql import functions > sc = pyspark.SparkContext(appName="udf_example") > sqlctx = pyspark.SQLContext(sc) > schema = StructType([StructField("lang", StringType(), False), 
> StructField("page", StringType(), False), > StructField("timestamp", TimestampType(), False), > StructField("views", IntegerType(), False)]) > df = sqlctx.read.csv(fp.name, > header=False, > schema=schema, > timestampFormat="MMdd-HHmmss", > sep=' ') > df.count() > df.dtypes > df.show() > new_schema = StructType([StructField("lang", StringType(), False), > StructField("page", StringType(), False), > StructField("day", StringType(), False), > StructField("enc", StringType(), False)]) > from pyspark.sql.functions
[jira] [Commented] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels
[ https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489650#comment-16489650 ] Apache Spark commented on SPARK-24324: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/21427 > Pandas Grouped Map UserDefinedFunction mixes column labels > -- > > Key: SPARK-24324 > URL: https://issues.apache.org/jira/browse/SPARK-24324 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Python (using virtualenv): > {noformat} > $ python --version > Python 3.6.5 > {noformat} > Modules installed: > {noformat} > arrow==0.12.1 > backcall==0.1.0 > bleach==2.1.3 > chardet==3.0.4 > decorator==4.3.0 > entrypoints==0.2.3 > findspark==1.2.0 > html5lib==1.0.1 > ipdb==0.11 > ipykernel==4.8.2 > ipython==6.3.1 > ipython-genutils==0.2.0 > ipywidgets==7.2.1 > jedi==0.12.0 > Jinja2==2.10 > jsonschema==2.6.0 > jupyter==1.0.0 > jupyter-client==5.2.3 > jupyter-console==5.2.0 > jupyter-core==4.4.0 > MarkupSafe==1.0 > mistune==0.8.3 > nbconvert==5.3.1 > nbformat==4.4.0 > notebook==5.5.0 > numpy==1.14.3 > pandas==0.22.0 > pandocfilters==1.4.2 > parso==0.2.0 > pbr==3.1.1 > pexpect==4.5.0 > pickleshare==0.7.4 > progressbar2==3.37.1 > prompt-toolkit==1.0.15 > ptyprocess==0.5.2 > pyarrow==0.9.0 > Pygments==2.2.0 > python-dateutil==2.7.2 > python-utils==2.3.0 > pytz==2018.4 > pyzmq==17.0.0 > qtconsole==4.3.1 > Send2Trash==1.5.0 > simplegeneric==0.8.1 > six==1.11.0 > SQLAlchemy==1.2.7 > stevedore==1.28.0 > terminado==0.8.1 > testpath==0.3.1 > tornado==5.0.2 > traitlets==4.3.2 > virtualenv==15.1.0 > virtualenv-clone==0.2.6 > virtualenvwrapper==4.7.2 > wcwidth==0.1.7 > webencodings==0.5.1 > widgetsnbextension==3.2.1 > {noformat} > > Java: > {noformat} > $ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > System: > {noformat} > 
$ lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 16.04.4 LTS > Release: 16.04 > Codename: xenial > {noformat} >Reporter: Cristian Consonni >Priority: Major > > I am working on Wikipedia page views (see [task T188041 on Wikimedia's > Pharicator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's > say that these are the data: > {noformat} > > {noformat} > For each combination of (lang, page, day(timestamp)) I need to transform the > views for each hour: > {noformat} > 00:00 -> A > 01:00 -> B > ... > {noformat} > and concatenate the number of views for that hour. So, if a page got 5 views > at 00:00 and 7 views at 01:00 it would become: > {noformat} > A5B7 > {noformat} > > I have written a UDF called {code:python}concat_hours{code} > However, the function is mixing the columns and I am not sure what is going > on. I wrote here a minimal complete example that reproduces the issue on my > system (the details of my environment are above). 
> {code:python} > #!/usr/bin/env python3 > # coding: utf-8 > input_data = b"""en Albert_Camus 20071210-00 150 > en Albert_Camus 20071210-01 148 > en Albert_Camus 20071210-02 197 > en Albert_Camus 20071211-20 145 > en Albert_Camus 20071211-21 131 > en Albert_Camus 20071211-22 154 > en Albert_Camus 20071211-230001 142 > en Albert_Caquot 20071210-02 1 > en Albert_Caquot 20071210-02 1 > en Albert_Caquot 20071210-040001 1 > en Albert_Caquot 20071211-06 1 > en Albert_Caquot 20071211-08 1 > en Albert_Caquot 20071211-15 3 > en Albert_Caquot 20071211-21 1""" > import tempfile > fp = tempfile.NamedTemporaryFile() > fp.write(input_data) > fp.seek(0) > import findspark > findspark.init() > import pyspark > from pyspark.sql.types import StructType, StructField > from pyspark.sql.types import StringType, IntegerType, TimestampType > from pyspark.sql import functions > sc = pyspark.SparkContext(appName="udf_example") > sqlctx = pyspark.SQLContext(sc) > schema = StructType([StructField("lang", StringType(), False), > StructField("page", StringType(), False), > StructField("timestamp", TimestampType(), False), > StructField("views", IntegerType(), False)]) > df = sqlctx.read.csv(fp.name, > header=False, > schema=schema, > timestampFormat="MMdd-HHmmss", > sep=' ') > df.count() > df.dtypes > df.show() > new_schema = StructType([StructField("lang", StringType(), False), > StructField("page", StringType(), False), > StructField("day", StringType(), False), >
[jira] [Assigned] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels
[ https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24324: Assignee: (was: Apache Spark) > Pandas Grouped Map UserDefinedFunction mixes column labels > -- > > Key: SPARK-24324 > URL: https://issues.apache.org/jira/browse/SPARK-24324 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Python (using virtualenv): > {noformat} > $ python --version > Python 3.6.5 > {noformat} > Modules installed: > {noformat} > arrow==0.12.1 > backcall==0.1.0 > bleach==2.1.3 > chardet==3.0.4 > decorator==4.3.0 > entrypoints==0.2.3 > findspark==1.2.0 > html5lib==1.0.1 > ipdb==0.11 > ipykernel==4.8.2 > ipython==6.3.1 > ipython-genutils==0.2.0 > ipywidgets==7.2.1 > jedi==0.12.0 > Jinja2==2.10 > jsonschema==2.6.0 > jupyter==1.0.0 > jupyter-client==5.2.3 > jupyter-console==5.2.0 > jupyter-core==4.4.0 > MarkupSafe==1.0 > mistune==0.8.3 > nbconvert==5.3.1 > nbformat==4.4.0 > notebook==5.5.0 > numpy==1.14.3 > pandas==0.22.0 > pandocfilters==1.4.2 > parso==0.2.0 > pbr==3.1.1 > pexpect==4.5.0 > pickleshare==0.7.4 > progressbar2==3.37.1 > prompt-toolkit==1.0.15 > ptyprocess==0.5.2 > pyarrow==0.9.0 > Pygments==2.2.0 > python-dateutil==2.7.2 > python-utils==2.3.0 > pytz==2018.4 > pyzmq==17.0.0 > qtconsole==4.3.1 > Send2Trash==1.5.0 > simplegeneric==0.8.1 > six==1.11.0 > SQLAlchemy==1.2.7 > stevedore==1.28.0 > terminado==0.8.1 > testpath==0.3.1 > tornado==5.0.2 > traitlets==4.3.2 > virtualenv==15.1.0 > virtualenv-clone==0.2.6 > virtualenvwrapper==4.7.2 > wcwidth==0.1.7 > webencodings==0.5.1 > widgetsnbextension==3.2.1 > {noformat} > > Java: > {noformat} > $ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > System: > {noformat} > $ lsb_release -a > No LSB modules are available. 
> Distributor ID: Ubuntu > Description: Ubuntu 16.04.4 LTS > Release: 16.04 > Codename: xenial > {noformat} >Reporter: Cristian Consonni >Priority: Major > > I am working on Wikipedia page views (see [task T188041 on Wikimedia's > Pharicator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's > say that these are the data: > {noformat} > > {noformat} > For each combination of (lang, page, day(timestamp)) I need to transform the > views for each hour: > {noformat} > 00:00 -> A > 01:00 -> B > ... > {noformat} > and concatenate the number of views for that hour. So, if a page got 5 views > at 00:00 and 7 views at 01:00 it would become: > {noformat} > A5B7 > {noformat} > > I have written a UDF called {code:python}concat_hours{code} > However, the function is mixing the columns and I am not sure what is going > on. I wrote here a minimal complete example that reproduces the issue on my > system (the details of my environment are above). > {code:python} > #!/usr/bin/env python3 > # coding: utf-8 > input_data = b"""en Albert_Camus 20071210-00 150 > en Albert_Camus 20071210-01 148 > en Albert_Camus 20071210-02 197 > en Albert_Camus 20071211-20 145 > en Albert_Camus 20071211-21 131 > en Albert_Camus 20071211-22 154 > en Albert_Camus 20071211-230001 142 > en Albert_Caquot 20071210-02 1 > en Albert_Caquot 20071210-02 1 > en Albert_Caquot 20071210-040001 1 > en Albert_Caquot 20071211-06 1 > en Albert_Caquot 20071211-08 1 > en Albert_Caquot 20071211-15 3 > en Albert_Caquot 20071211-21 1""" > import tempfile > fp = tempfile.NamedTemporaryFile() > fp.write(input_data) > fp.seek(0) > import findspark > findspark.init() > import pyspark > from pyspark.sql.types import StructType, StructField > from pyspark.sql.types import StringType, IntegerType, TimestampType > from pyspark.sql import functions > sc = pyspark.SparkContext(appName="udf_example") > sqlctx = pyspark.SQLContext(sc) > schema = StructType([StructField("lang", StringType(), False), > StructField("page", 
StringType(), False), > StructField("timestamp", TimestampType(), False), > StructField("views", IntegerType(), False)]) > df = sqlctx.read.csv(fp.name, > header=False, > schema=schema, > timestampFormat="MMdd-HHmmss", > sep=' ') > df.count() > df.dtypes > df.show() > new_schema = StructType([StructField("lang", StringType(), False), > StructField("page", StringType(), False), > StructField("day", StringType(), False), > StructField("enc", StringType(), False)]) > from pyspark.sql.functions import pandas_udf,
[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing
[ https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489648#comment-16489648 ] Jose Torres commented on SPARK-24036: - I've been notified of [https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24374], a SPIP for an API which would provide much of what we need here wrt letting tasks know where the appropriate shuffle endpoints are. > Stateful operators in continuous processing > --- > > Key: SPARK-24036 > URL: https://issues.apache.org/jira/browse/SPARK-24036 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The first iteration of continuous processing in Spark 2.3 does not work with > stateful operators. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24385) Trivially-true EqualNullSafe should be handled like EqualTo in Dataset.join
Daniel Shields created SPARK-24385: -- Summary: Trivially-true EqualNullSafe should be handled like EqualTo in Dataset.join Key: SPARK-24385 URL: https://issues.apache.org/jira/browse/SPARK-24385 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.2.1 Reporter: Daniel Shields Dataset.join(right: Dataset[_], joinExprs: Column, joinType: String) has special logic for resolving trivially-true predicates to both sides. It currently handles regular equals but not null-safe equals; the code should be updated to also handle null-safe equals. Pyspark example: {code:java} df = spark.range(10) df.join(df, 'id').collect() # This works. df.join(df, df['id'] == df['id']).collect() # This works. df.join(df, df['id'].eqNullSafe(df['id'])).collect() # This fails!!! # This is a workaround that works. df2 = df.withColumn('id', F.col('id')) df.join(df2, df['id'].eqNullSafe(df2['id'])).collect(){code} The relevant code in Dataset.join should look like this: {code:java} // Otherwise, find the trivially true predicates and automatically resolves them to both sides. // By the time we get here, since we have already run analysis, all attributes should've been // resolved and become AttributeReference. val cond = plan.condition.map { _.transform { case catalyst.expressions.EqualTo(a: AttributeReference, b: AttributeReference) if a.sameRef(b) => catalyst.expressions.EqualTo( withPlan(plan.left).resolve(a.name), withPlan(plan.right).resolve(b.name)) // This case is new!!! case catalyst.expressions.EqualNullSafe(a: AttributeReference, b: AttributeReference) if a.sameRef(b) => catalyst.expressions.EqualNullSafe( withPlan(plan.left).resolve(a.name), withPlan(plan.right).resolve(b.name)) }} {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
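For readers unfamiliar with the two operators involved: SQL's `=` propagates NULL, while `<=>` (Spark's `eqNullSafe`) always evaluates to true or false. The semantics the proposed `EqualNullSafe` case must preserve can be sketched in plain Python (hypothetical helper names, not Spark API; `None` stands in for SQL NULL):

```python
def eq(a, b):
    """SQL '=': NULL-propagating equality; returns None if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b

def eq_null_safe(a, b):
    """SQL '<=>': null-safe equality; never returns NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b
```

A trivially-true `a <=> a` on non-nullable columns is therefore just as resolvable to both join sides as `a = a`, which is why the same rewrite applies.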
[jira] [Updated] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels
[ https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-24324: - Summary: Pandas Grouped Map UserDefinedFunction mixes column labels (was: UserDefinedFunction mixes column labels) > Pandas Grouped Map UserDefinedFunction mixes column labels > -- > > Key: SPARK-24324 > URL: https://issues.apache.org/jira/browse/SPARK-24324 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Python (using virtualenv): > {noformat} > $ python --version > Python 3.6.5 > {noformat} > Modules installed: > {noformat} > arrow==0.12.1 > backcall==0.1.0 > bleach==2.1.3 > chardet==3.0.4 > decorator==4.3.0 > entrypoints==0.2.3 > findspark==1.2.0 > html5lib==1.0.1 > ipdb==0.11 > ipykernel==4.8.2 > ipython==6.3.1 > ipython-genutils==0.2.0 > ipywidgets==7.2.1 > jedi==0.12.0 > Jinja2==2.10 > jsonschema==2.6.0 > jupyter==1.0.0 > jupyter-client==5.2.3 > jupyter-console==5.2.0 > jupyter-core==4.4.0 > MarkupSafe==1.0 > mistune==0.8.3 > nbconvert==5.3.1 > nbformat==4.4.0 > notebook==5.5.0 > numpy==1.14.3 > pandas==0.22.0 > pandocfilters==1.4.2 > parso==0.2.0 > pbr==3.1.1 > pexpect==4.5.0 > pickleshare==0.7.4 > progressbar2==3.37.1 > prompt-toolkit==1.0.15 > ptyprocess==0.5.2 > pyarrow==0.9.0 > Pygments==2.2.0 > python-dateutil==2.7.2 > python-utils==2.3.0 > pytz==2018.4 > pyzmq==17.0.0 > qtconsole==4.3.1 > Send2Trash==1.5.0 > simplegeneric==0.8.1 > six==1.11.0 > SQLAlchemy==1.2.7 > stevedore==1.28.0 > terminado==0.8.1 > testpath==0.3.1 > tornado==5.0.2 > traitlets==4.3.2 > virtualenv==15.1.0 > virtualenv-clone==0.2.6 > virtualenvwrapper==4.7.2 > wcwidth==0.1.7 > webencodings==0.5.1 > widgetsnbextension==3.2.1 > {noformat} > > Java: > {noformat} > $ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > System: > {noformat} > $ lsb_release -a > No LSB 
modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 16.04.4 LTS > Release: 16.04 > Codename: xenial > {noformat} >Reporter: Cristian Consonni >Priority: Major > > I am working on Wikipedia page views (see [task T188041 on Wikimedia's > Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's > say that these are the data: > {noformat} > > {noformat} > For each combination of (lang, page, day(timestamp)) I need to transform the > views for each hour: > {noformat} > 00:00 -> A > 01:00 -> B > ... > {noformat} > and concatenate the number of views for that hour. So, if a page got 5 views > at 00:00 and 7 views at 01:00 it would become: > {noformat} > A5B7 > {noformat} > > I have written a UDF called {code:python}concat_hours{code} > However, the function is mixing the columns and I am not sure what is going > on. I wrote here a minimal complete example that reproduces the issue on my > system (the details of my environment are above). > {code:python} > #!/usr/bin/env python3 > # coding: utf-8 > input_data = b"""en Albert_Camus 20071210-00 150 > en Albert_Camus 20071210-01 148 > en Albert_Camus 20071210-02 197 > en Albert_Camus 20071211-20 145 > en Albert_Camus 20071211-21 131 > en Albert_Camus 20071211-22 154 > en Albert_Camus 20071211-230001 142 > en Albert_Caquot 20071210-02 1 > en Albert_Caquot 20071210-02 1 > en Albert_Caquot 20071210-040001 1 > en Albert_Caquot 20071211-06 1 > en Albert_Caquot 20071211-08 1 > en Albert_Caquot 20071211-15 3 > en Albert_Caquot 20071211-21 1""" > import tempfile > fp = tempfile.NamedTemporaryFile() > fp.write(input_data) > fp.seek(0) > import findspark > findspark.init() > import pyspark > from pyspark.sql.types import StructType, StructField > from pyspark.sql.types import StringType, IntegerType, TimestampType > from pyspark.sql import functions > sc = pyspark.SparkContext(appName="udf_example") > sqlctx = pyspark.SQLContext(sc) > schema = StructType([StructField("lang", StringType(), False), 
> StructField("page", StringType(), False), > StructField("timestamp", TimestampType(), False), > StructField("views", IntegerType(), False)]) > df = sqlctx.read.csv(fp.name, > header=False, > schema=schema, > timestampFormat="MMdd-HHmmss", > sep=' ') > df.count() > df.dtypes > df.show() > new_schema = StructType([StructField("lang", StringType(), False), > StructField("page", StringType(), False), > StructField("day", StringType(), False), > StructField("enc",
[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted
[ https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489601#comment-16489601 ] Yinan Li commented on SPARK-24383: -- The Kubernetes specific submission client adds an {{OwnerReference}} referencing the driver pod to the service, so if you delete the driver pod, the corresponding service should be garbage collected. > spark on k8s: "driver-svc" are not getting deleted > -- > > Key: SPARK-24383 > URL: https://issues.apache.org/jira/browse/SPARK-24383 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Lenin >Priority: Major > > When the driver pod exits, the "*driver-svc" services created for the driver > are not cleaned up. This causes accumulation of services in the k8s layer; at > one point no more services can be created. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
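For reference, an OwnerReference is a field on the dependent object's metadata that names its owner; Kubernetes' garbage collector deletes the dependent once the owner object is gone. A sketch of the shape of such a reference on the driver service (all concrete values below are hypothetical, not taken from Spark's submission client):

```python
# Shape of a Kubernetes ownerReference attached to the driver service's metadata.
# The name and uid here are placeholders; the real ones must match the driver pod.
owner_reference = {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "my-app-driver",        # hypothetical driver pod name
    "uid": "placeholder-pod-uid",   # must equal the driver pod's actual UID
    "controller": True,
}
service_metadata = {
    "name": "my-app-driver-svc",
    "ownerReferences": [owner_reference],
}
```

If services still accumulate, the usual cause is that the driver pod *object* itself is never deleted (e.g. it stays in Completed state), so from the garbage collector's point of view the owner still exists.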
[jira] [Updated] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer
[ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Misha Dmitriev updated SPARK-24356: --- Attachment: SPARK-24356.01.patch > Duplicate strings in File.path managed by FileSegmentManagedBuffer > -- > > Key: SPARK-24356 > URL: https://issues.apache.org/jira/browse/SPARK-24356 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Priority: Major > Attachments: SPARK-24356.01.patch > > > I recently analyzed a heap dump of Yarn Node Manager that was suffering from > high GC pressure due to high object churn. Analysis was done with the jxray > tool ([www.jxray.com|http://www.jxray.com/]) that checks a heap dump for a > number of well-known memory issues. One problem that it found in this dump is > 19.5% of memory wasted due to duplicate strings. Of these duplicates, more > than half come from {{FileInputStream.path}} and {{File.path}}. All the > {{FileInputStream}} objects that JXRay shows are garbage - looks like they > are used for a very short period and then discarded (I guess there is a > separate question of whether that's a good pattern). But {{File}} instances > are traceable to > {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field.
Here > is the full reference chain: > > {code:java} > ↖java.io.File.path > ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file > ↖{j.u.ArrayList} > ↖j.u.ArrayList$Itr.this$0 > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance > {code} > > Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very > similar, so I think {{FileInputStream}}s are generated by the > {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely > come from > [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263] > > To avoid duplicate strings in {{File.path}}'s in this case, it is suggested > that in the above code we create a File with a complete, normalized pathname, > that has been already interned. This will prevent the code inside > {{java.io.File}} from modifying this string, and thus it will use the > interned copy, and will pass it to FileInputStream. Essentially the current > line > {code:java} > return new File(new File(localDir, String.format("%02x", subDirId)), > filename);{code} > should be replaced with something like > {code:java} > String pathname = localDir + File.separator + String.format(...) + > File.separator + filename; > pathname = fileSystem.normalize(pathname).intern(); > return new File(pathname);{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
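The effect of the proposed interning can be demonstrated in plain Python with `sys.intern`, which plays the same role as `String.intern()` in the suggested Java fix: two equal strings built independently collapse to one canonical shared object, so the duplicate's memory can be reclaimed. (The path components below are illustrative, not Spark's actual layout.)

```python
import sys

# Build the same pathname twice at runtime, as the shuffle resolver would
# for repeated requests; each join produces a fresh string object.
parts = ["localDir", "0a", "shuffle_0_0_0.data"]
a = "/".join(parts)
b = "/".join(parts)

assert a == b
# After interning, both names reference a single canonical copy.
assert sys.intern(a) is sys.intern(b)
```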
[jira] [Comment Edited] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type
[ https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489546#comment-16489546 ] Hyukjin Kwon edited comment on SPARK-24358 at 5/24/18 6:31 PM: --- Yea, I know the differences and I know the rationale here. We should need strong evidences and reasons to accept the divergence. Also we need to take a look for PySpark codes bases too and check such divergence. FWIW, I was trying to take a look and fix the difference among bytes, str and unicode and I am currently stuck due to other works. was (Author: hyukjin.kwon): Yea, I know the differences and I know the rationale here. We should need strong evidences and reasons to accept the divergence. Also we need to take a look for PySpark codes bases too and check such divergence. FWIW, I was trying to take a look and fix the difference among bytes, str and unicode and I am currently stuck due to other swarming works. > createDataFrame in Python 3 should be able to infer bytes type as Binary type > - > > Key: SPARK-24358 > URL: https://issues.apache.org/jira/browse/SPARK-24358 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Joel Croteau >Priority: Minor > Labels: Python3 > > createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes > is just the immutable, hashable version of this same structure, it makes > sense for the same thing to apply there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type
[ https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489546#comment-16489546 ] Hyukjin Kwon commented on SPARK-24358: -- Yea, I know the differences and I know the rationale here. We would need strong evidence and reasons to accept the divergence. Also we need to take a look at the PySpark code base too and check for such divergence. FWIW, I was trying to take a look and fix the difference among bytes, str and unicode, and I am currently stuck due to other ongoing work. > createDataFrame in Python 3 should be able to infer bytes type as Binary type > - > > Key: SPARK-24358 > URL: https://issues.apache.org/jira/browse/SPARK-24358 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Joel Croteau >Priority: Minor > Labels: Python3 > > createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes > is just the immutable, hashable version of this same structure, it makes > sense for the same thing to apply there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489534#comment-16489534 ] Wenbo Zhao commented on SPARK-24373: It is not apparent to me that they are the same issue, though > "df.cache() df.count()" no longer eagerly caches data > - > > Key: SPARK-24373 > URL: https://issues.apache.org/jira/browse/SPARK-24373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenbo Zhao >Priority: Major > > Here is the code to reproduce in local mode > {code:java} > scala> val df = sc.range(1, 2).toDF > df: org.apache.spark.sql.DataFrame = [value: bigint] > scala> val myudf = udf({x: Long => println(""); x + 1}) > myudf: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val df1 = df.withColumn("value1", myudf(col("value"))) > df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint] > scala> df1.cache > res0: df1.type = [value: bigint, value1: bigint] > scala> df1.count > res1: Long = 1 > scala> df1.count > res2: Long = 1 > scala> df1.count > res3: Long = 1 > {code} > > in Spark 2.2, you could see it prints "". > In the above example, when you do explain. 
You could see > {code:java} > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [value#2L, UDF('value) AS value1#5] > +- AnalysisBarrier > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > value: bigint, value1: bigint > Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > == Physical Plan == > *(1) InMemoryTableScan [value#2L, value1#5L] > +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > but the InMemoryTableScan is missing in the following explain() > {code:java} > scala> df1.groupBy().count().explain(true) > == Parsed Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS > value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > count: bigint > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) > null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Physical Plan == > *(2) HashAggregate(keys=[], 
functions=[count(1)], output=[count#170L]) > +- Exchange SinglePartition > +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#175L]) > +- *(1) Project > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24324) UserDefinedFunction mixes column labels
[ https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489533#comment-16489533 ] Bryan Cutler commented on SPARK-24324: -- I was able to reproduce this; the problem is that when PySpark processes the result from the UDF, it iterates over the columns by index, not by name. Since you are returning a new pandas.DataFrame with input given by a dict, there is no guarantee of the column order when pandas makes the DataFrame (this is just the nature of the Python dict for Python < 3.6, I believe). I can submit a fix that should not change any behaviors, I think. As a workaround, you could write your UDF to make the pandas.DataFrame using an OrderedDict like so: {code:java} from collections import OrderedDict @pandas_udf(new_schema, PandasUDFType.GROUPED_MAP) def myudf(x): foo = 'foo' return pd.DataFrame(OrderedDict([('lang', x.lang), ('page', x.page), ('foo', 'foo')])){code} > UserDefinedFunction mixes column labels > --- > > Key: SPARK-24324 > URL: https://issues.apache.org/jira/browse/SPARK-24324 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Python (using virtualenv): > {noformat} > $ python --version > Python 3.6.5 > {noformat} > Modules installed: > {noformat} > arrow==0.12.1 > backcall==0.1.0 > bleach==2.1.3 > chardet==3.0.4 > decorator==4.3.0 > entrypoints==0.2.3 > findspark==1.2.0 > html5lib==1.0.1 > ipdb==0.11 > ipykernel==4.8.2 > ipython==6.3.1 > ipython-genutils==0.2.0 > ipywidgets==7.2.1 > jedi==0.12.0 > Jinja2==2.10 > jsonschema==2.6.0 > jupyter==1.0.0 > jupyter-client==5.2.3 > jupyter-console==5.2.0 > jupyter-core==4.4.0 > MarkupSafe==1.0 > mistune==0.8.3 > nbconvert==5.3.1 > nbformat==4.4.0 > notebook==5.5.0 > numpy==1.14.3 > pandas==0.22.0 > pandocfilters==1.4.2 > parso==0.2.0 > pbr==3.1.1 > pexpect==4.5.0 > pickleshare==0.7.4 > progressbar2==3.37.1 > prompt-toolkit==1.0.15 > ptyprocess==0.5.2 > pyarrow==0.9.0 > Pygments==2.2.0 > 
python-dateutil==2.7.2 > python-utils==2.3.0 > pytz==2018.4 > pyzmq==17.0.0 > qtconsole==4.3.1 > Send2Trash==1.5.0 > simplegeneric==0.8.1 > six==1.11.0 > SQLAlchemy==1.2.7 > stevedore==1.28.0 > terminado==0.8.1 > testpath==0.3.1 > tornado==5.0.2 > traitlets==4.3.2 > virtualenv==15.1.0 > virtualenv-clone==0.2.6 > virtualenvwrapper==4.7.2 > wcwidth==0.1.7 > webencodings==0.5.1 > widgetsnbextension==3.2.1 > {noformat} > > Java: > {noformat} > $ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > System: > {noformat} > $ lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 16.04.4 LTS > Release: 16.04 > Codename: xenial > {noformat} >Reporter: Cristian Consonni >Priority: Major > > I am working on Wikipedia page views (see [task T188041 on Wikimedia's > Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's > say that these are the data: > {noformat} > > {noformat} > For each combination of (lang, page, day(timestamp)) I need to transform the > views for each hour: > {noformat} > 00:00 -> A > 01:00 -> B > ... > {noformat} > and concatenate the number of views for that hour. So, if a page got 5 views > at 00:00 and 7 views at 01:00 it would become: > {noformat} > A5B7 > {noformat} > > I have written a UDF called {code:python}concat_hours{code} > However, the function is mixing the columns and I am not sure what is going > on. I wrote here a minimal complete example that reproduces the issue on my > system (the details of my environment are above). 
> {code:python} > #!/usr/bin/env python3 > # coding: utf-8 > input_data = b"""en Albert_Camus 20071210-00 150 > en Albert_Camus 20071210-01 148 > en Albert_Camus 20071210-02 197 > en Albert_Camus 20071211-20 145 > en Albert_Camus 20071211-21 131 > en Albert_Camus 20071211-22 154 > en Albert_Camus 20071211-230001 142 > en Albert_Caquot 20071210-02 1 > en Albert_Caquot 20071210-02 1 > en Albert_Caquot 20071210-040001 1 > en Albert_Caquot 20071211-06 1 > en Albert_Caquot 20071211-08 1 > en Albert_Caquot 20071211-15 3 > en Albert_Caquot 20071211-21 1""" > import tempfile > fp = tempfile.NamedTemporaryFile() > fp.write(input_data) > fp.seek(0) > import findspark > findspark.init() > import pyspark > from pyspark.sql.types import StructType, StructField > from pyspark.sql.types import StringType, IntegerType, TimestampType > from pyspark.sql import functions > sc = pyspark.SparkContext(appName="udf_example") > sqlctx = pyspark.SQLContext(sc) > schema = StructType([StructField("lang", StringType(), False), >
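Bryan Cutler's diagnosis above — the UDF result's columns are consumed positionally, while a plain dict does not guarantee key order on Python < 3.6 — can be seen without Spark. An `OrderedDict` pins the column order to insertion order, which is what the suggested workaround relies on:

```python
from collections import OrderedDict

# Building the frame's columns from an OrderedDict fixes their order,
# so positional consumption sees ("lang", "page", "foo") as intended.
columns = OrderedDict([("lang", "en"), ("page", "Albert_Camus"), ("foo", "foo")])
assert list(columns) == ["lang", "page", "foo"]

# A plain dict only guarantees this insertion order on CPython >= 3.6 / Python 3.7+,
# so code targeting older interpreters must not rely on it.
```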
[jira] [Commented] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization
[ https://issues.apache.org/jira/browse/SPARK-24384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489524#comment-16489524 ] Apache Spark commented on SPARK-24384: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21426 > spark-submit --py-files with .py files doesn't work in client mode before > context initialization > > > Key: SPARK-24384 > URL: https://issues.apache.org/jira/browse/SPARK-24384 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.3.0, 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > In case the given Python file is .py file (zip file seems fine), seems the > python path is dynamically added after the context is got initialized. > with this pyFile: > {code} > $ cat /home/spark/tmp.py > def testtest(): > return 1 > {code} > This works: > {code} > $ cat app.py > import pyspark > pyspark.sql.SparkSession.builder.getOrCreate() > import tmp > print("%s" % tmp.testtest()) > $ ./bin/spark-submit --master yarn --deploy-mode client --py-files > /home/spark/tmp.py app.py > ... > 1 > {code} > but this doesn't: > {code} > $ cat app.py > import pyspark > import tmp > pyspark.sql.SparkSession.builder.getOrCreate() > print("%s" % tmp.testtest()) > $ ./bin/spark-submit --master yarn --deploy-mode client --py-files > /home/spark/tmp.py app.py > Traceback (most recent call last): > File "/home/spark/spark/app.py", line 2, in > import tmp > ImportError: No module named tmp > {code} > See > https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
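The failure mode reproduces outside Spark: an import attempted before the module's directory is on `sys.path` fails, and the same import succeeds once the path is added (which, per the report, is what context initialization does for --py-files). A self-contained sketch (the module name `tmpmod` is hypothetical, standing in for `/home/spark/tmp.py`):

```python
import os
import sys
import tempfile

# Create a throwaway module, analogous to /home/spark/tmp.py.
d = tempfile.mkdtemp()
with open(os.path.join(d, "tmpmod.py"), "w") as f:
    f.write("def testtest():\n    return 1\n")

try:
    import tmpmod          # directory not yet on sys.path: fails
    imported_early = True
except ImportError:
    imported_early = False
assert not imported_early

sys.path.insert(0, d)      # roughly what context initialization does lazily
import tmpmod              # now it succeeds
assert tmpmod.testtest() == 1
```

This is why `import tmp` works after `getOrCreate()` but raises `ImportError` before it.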
[jira] [Assigned] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization
[ https://issues.apache.org/jira/browse/SPARK-24384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24384: Assignee: (was: Apache Spark) > spark-submit --py-files with .py files doesn't work in client mode before > context initialization > > > Key: SPARK-24384 > URL: https://issues.apache.org/jira/browse/SPARK-24384 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.3.0, 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > In case the given Python file is .py file (zip file seems fine), seems the > python path is dynamically added after the context is got initialized. > with this pyFile: > {code} > $ cat /home/spark/tmp.py > def testtest(): > return 1 > {code} > This works: > {code} > $ cat app.py > import pyspark > pyspark.sql.SparkSession.builder.getOrCreate() > import tmp > print("%s" % tmp.testtest()) > $ ./bin/spark-submit --master yarn --deploy-mode client --py-files > /home/spark/tmp.py app.py > ... > 1 > {code} > but this doesn't: > {code} > $ cat app.py > import pyspark > import tmp > pyspark.sql.SparkSession.builder.getOrCreate() > print("%s" % tmp.testtest()) > $ ./bin/spark-submit --master yarn --deploy-mode client --py-files > /home/spark/tmp.py app.py > Traceback (most recent call last): > File "/home/spark/spark/app.py", line 2, in > import tmp > ImportError: No module named tmp > {code} > See > https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization
[ https://issues.apache.org/jira/browse/SPARK-24384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24384: Assignee: Apache Spark > spark-submit --py-files with .py files doesn't work in client mode before > context initialization > > > Key: SPARK-24384 > URL: https://issues.apache.org/jira/browse/SPARK-24384 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.3.0, 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > In case the given Python file is .py file (zip file seems fine), seems the > python path is dynamically added after the context is got initialized. > with this pyFile: > {code} > $ cat /home/spark/tmp.py > def testtest(): > return 1 > {code} > This works: > {code} > $ cat app.py > import pyspark > pyspark.sql.SparkSession.builder.getOrCreate() > import tmp > print("%s" % tmp.testtest()) > $ ./bin/spark-submit --master yarn --deploy-mode client --py-files > /home/spark/tmp.py app.py > ... > 1 > {code} > but this doesn't: > {code} > $ cat app.py > import pyspark > import tmp > pyspark.sql.SparkSession.builder.getOrCreate() > print("%s" % tmp.testtest()) > $ ./bin/spark-submit --master yarn --deploy-mode client --py-files > /home/spark/tmp.py app.py > Traceback (most recent call last): > File "/home/spark/spark/app.py", line 2, in > import tmp > ImportError: No module named tmp > {code} > See > https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type
[ https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489507#comment-16489507 ] Joel Croteau commented on SPARK-24358: -- This does mean that the current implementation has some compatibility issues with Python 3. In Python 2, a bytes will be inferred as a StringType, regardless of content. StringType and BinaryType are functionally identical, as they are both just arbitrary arrays of bytes, and Python 2 will handle any value of them just fine. In Python 3, attempting to infer the type of a bytes is an error, and Python 3 will convert a StringType to Unicode. Since not every byte string is valid Unicode, some errors may occur in processing StringTypes in Python 3 that worked fine in Python 2. > createDataFrame in Python 3 should be able to infer bytes type as Binary type > - > > Key: SPARK-24358 > URL: https://issues.apache.org/jira/browse/SPARK-24358 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Joel Croteau >Priority: Minor > Labels: Python3 > > createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes > is just the immutable, hashable version of this same structure, it makes > sense for the same thing to apply there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
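The relationship the report relies on is easy to verify in plain Python: `bytes` is the immutable, hashable counterpart of `bytearray`, carrying the same byte content.

```python
b = b"abc"
ba = bytearray(b"abc")

# Same byte content...
assert bytes(ba) == b

# ...but only bytes is hashable (e.g. usable as a dict key).
assert hash(b) == hash(b"abc")
try:
    hash(ba)
    ba_hashable = True
except TypeError:
    ba_hashable = False
assert not ba_hashable

# And bytearray is mutable while bytes is not.
ba[0] = ord("z")
assert ba == bytearray(b"zbc")
```

Given that, inferring `BinaryType` for `bytes` wherever `bytearray` is already accepted is a small, consistent extension.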
[jira] [Updated] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization
[ https://issues.apache.org/jira/browse/SPARK-24384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24384: - Component/s: Spark Submit > spark-submit --py-files with .py files doesn't work in client mode before > context initialization > > > Key: SPARK-24384 > URL: https://issues.apache.org/jira/browse/SPARK-24384 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.3.0, 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > In case the given Python file is .py file (zip file seems fine), seems the > python path is dynamically added after the context is got initialized. > with this pyFile: > {code} > $ cat /home/spark/tmp.py > def testtest(): > return 1 > {code} > This works: > {code} > $ cat app.py > import pyspark > pyspark.sql.SparkSession.builder.getOrCreate() > import tmp > print("%s" % tmp.testtest()) > $ ./bin/spark-submit --master yarn --deploy-mode client --py-files > /home/spark/tmp.py app.py > ... > 1 > {code} > but this doesn't: > {code} > $ cat app.py > import pyspark > import tmp > pyspark.sql.SparkSession.builder.getOrCreate() > print("%s" % tmp.testtest()) > $ ./bin/spark-submit --master yarn --deploy-mode client --py-files > /home/spark/tmp.py app.py > Traceback (most recent call last): > File "/home/spark/spark/app.py", line 2, in > import tmp > ImportError: No module named tmp > {code} > See > https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization
Hyukjin Kwon created SPARK-24384: Summary: spark-submit --py-files with .py files doesn't work in client mode before context initialization Key: SPARK-24384 URL: https://issues.apache.org/jira/browse/SPARK-24384 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0, 2.4.0 Reporter: Hyukjin Kwon When the given Python file is a .py file (zip files seem fine), the Python path appears to be added dynamically only after the context is initialized. With this pyFile: {code} $ cat /home/spark/tmp.py def testtest(): return 1 {code} This works: {code} $ cat app.py import pyspark pyspark.sql.SparkSession.builder.getOrCreate() import tmp print("%s" % tmp.testtest()) $ ./bin/spark-submit --master yarn --deploy-mode client --py-files /home/spark/tmp.py app.py ... 1 {code} but this doesn't: {code} $ cat app.py import pyspark import tmp pyspark.sql.SparkSession.builder.getOrCreate() print("%s" % tmp.testtest()) $ ./bin/spark-submit --master yarn --deploy-mode client --py-files /home/spark/tmp.py app.py Traceback (most recent call last): File "/home/spark/spark/app.py", line 2, in import tmp ImportError: No module named tmp {code} See https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
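The failure mode above can be reproduced with plain Python: an import only succeeds once the module's directory is on sys.path, which is effectively what context initialization does for --py-files in client mode. A minimal stdlib-only sketch (the module name tmp_mod is made up for illustration):

```python
import os
import sys
import tempfile

# Write a throwaway module, mimicking a file shipped via --py-files.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "tmp_mod.py"), "w") as f:
    f.write("def testtest():\n    return 1\n")

# Importing before the path is set up fails, like 'import tmp' placed
# before SparkSession initialization in the report above.
try:
    import tmp_mod
except ImportError:
    print("import failed: path not set up yet")

# Conceptually, this is what happens when the context initializes.
sys.path.insert(0, workdir)
import tmp_mod
print(tmp_mod.testtest())  # 1
```

Moving the `import` after the path manipulation is exactly the workaround shown in the "This works" snippet above.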
[jira] [Updated] (SPARK-23754) StopIterator exception in Python UDF results in partial result
[ https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23754: - Priority: Blocker (was: Major) > StopIterator exception in Python UDF results in partial result > -- > > Key: SPARK-23754 > URL: https://issues.apache.org/jira/browse/SPARK-23754 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Priority: Blocker > > Reproduce: > {code:java} > df = spark.range(0, 1000) > from pyspark.sql.functions import udf > def foo(x): > raise StopIteration() > df.withColumn('v', udf(foo)).show() > # Results > # +---+---+ > # | id| v| > # +---+---+ > # +---+---+{code} > I think the task should fail in this case -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
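The swallowed-exception mechanics can be demonstrated without Spark: when a function applied inside an iterator pipeline raises StopIteration, the consumer interprets it as normal exhaustion and silently returns a partial result instead of failing. A pure-Python sketch of the same effect (list() plays the role of the worker draining the row iterator; this is an analogy, not PySpark internals):

```python
def bad_udf(x):
    # Mimics the reported UDF: raises StopIteration instead of a real error.
    if x == 3:
        raise StopIteration()
    return x

# The StopIteration escapes map.__next__ and is read by list() as
# "iterator exhausted" -- so iteration just stops early.
result = list(map(bad_udf, range(10)))
print(result)  # [0, 1, 2] -- silently truncated, no error raised
```

This is why the ticket argues the task should fail loudly rather than return a partial (here, empty) result.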
[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer
[ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489417#comment-16489417 ] Imran Rashid commented on SPARK-24356: -- cc [~jinxing6...@126.com] [~elu] [~felixcheung] -- this could be a nice win to decrease GC pressure on the shuffle service, might be related to issues you are running into. > Duplicate strings in File.path managed by FileSegmentManagedBuffer > -- > > Key: SPARK-24356 > URL: https://issues.apache.org/jira/browse/SPARK-24356 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Priority: Major > > I recently analyzed a heap dump of Yarn Node Manager that was suffering from > high GC pressure due to high object churn. Analysis was done with the jxray > tool ([www.jxray.com)|http://www.jxray.com)/] that checks a heap dump for a > number of well-known memory issues. One problem that it found in this dump is > 19.5% of memory wasted due to duplicate strings. Of these duplicates, more > than a half come from {{FileInputStream.path}} and {{File.path}}. All the > {{FileInputStream}} objects that JXRay shows are garbage - looks like they > are used for a very short period and then discarded (I guess there is a > separate question of whether that's a good pattern). But {{File}} instances > are traceable to > {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. 
Here > is the full reference chain: > > {code:java} > ↖java.io.File.path > ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file > ↖{j.u.ArrayList} > ↖j.u.ArrayList$Itr.this$0 > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance > {code} > > Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very > similar, so I think {{FileInputStream}}s are generated by the > {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely > come from > [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263] > > To avoid duplicate strings in {{File.path}}'s in this case, it is suggested > that in the above code we create a File with a complete, normalized pathname, > that has been already interned. This will prevent the code inside > {{java.io.File}} from modifying this string, and thus it will use the > interned copy, and will pass it to FileInputStream. Essentially the current > line > {code:java} > return new File(new File(localDir, String.format("%02x", subDirId)), > filename);{code} > should be replaced with something like > {code:java} > String pathname = localDir + File.separator + String.format(...) + > File.separator + filename; > pathname = fileSystem.normalize(pathname).intern(); > return new File(pathname);{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
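The suggested fix relies on string interning: equal path strings, built once in normalized form and interned, collapse to a single shared object. The snippet above uses Java's String.intern(); the same idea in Python via sys.intern, as an illustrative sketch (the path layout is made up to resemble the shuffle file naming):

```python
import os
import sys

# Build logically-equal shuffle file paths, as the shuffle service does
# repeatedly for the same block file.
def shuffle_path(sub_dir_id):
    return os.path.join("/data", "%02x" % sub_dir_id, "shuffle_0_0.index")

a = shuffle_path(1)
b = shuffle_path(1)
assert a == b          # equal content...
# ...but typically distinct objects, each holding its own copy of the chars.

# Normalizing and interning before storing makes duplicates share one
# object -- the memory win proposed for File.path above.
a_interned = sys.intern(os.path.normpath(a))
b_interned = sys.intern(os.path.normpath(b))
assert a_interned is b_interned   # one shared string, no duplicate storage
```

Interning before constructing the `File` also prevents the constructor from deriving yet another copy of the path, which is the point of normalizing first.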
[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489397#comment-16489397 ] Hyukjin Kwon commented on SPARK-21945: -- For another clarification, the launch execution codepath is diverted for shell (and also Yarn) specifically. So, I believe the merged PR itself is a-okay. > pyspark --py-files doesn't work in yarn client mode > --- > > Key: SPARK-21945 > URL: https://issues.apache.org/jira/browse/SPARK-21945 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > I tried running pyspark with --py-files pythonfiles.zip but it doesn't > properly add the zip file to the PYTHONPATH. > I can work around by exporting PYTHONPATH. > Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand I don't see > this supported at all. If that is the case perhaps it should be moved to > improvement. > Note it works via spark-submit in both client and cluster mode to run python > script. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489389#comment-16489389 ] Marcelo Vanzin commented on SPARK-24373: This could be the same as SPARK-23309. > "df.cache() df.count()" no longer eagerly caches data > - > > Key: SPARK-24373 > URL: https://issues.apache.org/jira/browse/SPARK-24373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenbo Zhao >Priority: Major > > Here is the code to reproduce in local mode > {code:java} > scala> val df = sc.range(1, 2).toDF > df: org.apache.spark.sql.DataFrame = [value: bigint] > scala> val myudf = udf({x: Long => println(""); x + 1}) > myudf: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val df1 = df.withColumn("value1", myudf(col("value"))) > df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint] > scala> df1.cache > res0: df1.type = [value: bigint, value1: bigint] > scala> df1.count > res1: Long = 1 > scala> df1.count > res2: Long = 1 > scala> df1.count > res3: Long = 1 > {code} > > in Spark 2.2, you could see it prints "". > In the above example, when you do explain. 
You could see > {code:java} > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [value#2L, UDF('value) AS value1#5] > +- AnalysisBarrier > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > value: bigint, value1: bigint > Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > == Physical Plan == > *(1) InMemoryTableScan [value#2L, value1#5L] > +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > but the InMemoryTableScan is missing in the following explain() > {code:java} > scala> df1.groupBy().count().explain(true) > == Parsed Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS > value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > count: bigint > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) > null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Physical Plan == > *(2) HashAggregate(keys=[], 
functions=[count(1)], output=[count#170L]) > +- Exchange SinglePartition > +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#175L]) > +- *(1) Project > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
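What the second explain output shows is that the optimizer rewrote count() to read straight from the source, pruning the projected UDF column (and with it the InMemoryRelation), so the cached plan never gets populated or reused. A rough pure-Python analogy of that pruning, with made-up names, not Spark internals:

```python
calls = []

def udf(x):
    # Side effect stands in for the println in the report above.
    calls.append(x)
    return x + 1

source = list(range(5))

# Evaluating the projection runs the UDF -- roughly what df1.count()
# did in Spark 2.2, which is why it printed.
projected = [(x, udf(x)) for x in source]
assert len(calls) == 5

# An optimizer that only needs a row count can skip the projection
# entirely (like the pruned plan above), so the UDF never runs.
calls.clear()
count = sum(1 for _ in source)
print(count, calls)  # 5 [] -- counted without evaluating the UDF
```

The regression is that the shortcut also bypasses the cache-population step the projection used to trigger.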
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489391#comment-16489391 ] Marcelo Vanzin commented on SPARK-23309: [~kiszk] SPARK-24373 has some code. > Spark 2.3 cached query performance 20-30% worse then spark 2.2 > -- > > Key: SPARK-23309 > URL: https://issues.apache.org/jira/browse/SPARK-23309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Priority: Major > > I was testing spark 2.3 rc2 and I am seeing a performance regression in sql > queries on cached data. > The size of the data: 10.4GB input from hive orc files /18.8 GB cached/5592 > partitions > Here is the example query: > val dailycached = spark.sql("select something from table where dt = > '20170301' AND something IS NOT NULL") > dailycached.createOrReplaceTempView("dailycached") > spark.catalog.cacheTable("dailyCached") > spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() > > On spark 2.2 I see queries times average 13 seconds > On the same nodes I see spark 2.3 queries times average 17 seconds > Note these are times of queries after the initial caching. so just running > the last line again: > spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() > multiple times. > > I also ran a query over more data (335GB input/587.5 GB cached) and saw a > similar discrepancy in the performance of querying cached data between spark > 2.3 and spark 2.2, where 2.2 was better by like 20%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24372) Create script for preparing RCs
[ https://issues.apache.org/jira/browse/SPARK-24372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489378#comment-16489378 ] Marcelo Vanzin commented on SPARK-24372: I'm keeping the current version of the scripts here while I tweak them: https://github.com/vanzin/spark/tree/SPARK-24372 [~tgraves] [~holdenkarau] you might be interested in checking those out, although there may be tweaks needed for older branches. I also took a stab at incorporating [~felixcheung]'s docker image in that patch. > Create script for preparing RCs > --- > > Key: SPARK-24372 > URL: https://issues.apache.org/jira/browse/SPARK-24372 > Project: Spark > Issue Type: New Feature > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > Currently, when preparing RCs, the RM has to invoke many scripts manually, > make sure that is being done in the correct environment, and set all the > correct environment variables, which differ from one script to the other. > It will be much easier for RMs if all that was automated as much as possible. > I'm working on something like this as part of releasing 2.3.1, and plan to > send my scripts for review after the release is done (i.e. after I make sure > the scripts are working properly). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489367#comment-16489367 ] Hyukjin Kwon edited comment on SPARK-21945 at 5/24/18 4:50 PM: --- To be more correct, the paths are added as are given my investigation so far. It's fine for zip archive but for .py file the paths shouldn't be added as are (but its parent directory) ... so for py files, yes, we should copy them too. It's weird but I think this is all because we happened to support .py file in the same option whereas PYTHONPATH doesn't expect a .py file. was (Author: hyukjin.kwon): To be more correct, the paths are added as are given my investigation so far. It's fine for zip archive but for .py file the paths shouldn't be added as are (but its parent directory) ... so for py files, yes, we should copy them too. It's weird but I think this is all because we happened to support .py file in the same option whereas PYTHONPATH doesn't expect a file. > pyspark --py-files doesn't work in yarn client mode > --- > > Key: SPARK-21945 > URL: https://issues.apache.org/jira/browse/SPARK-21945 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > I tried running pyspark with --py-files pythonfiles.zip but it doesn't > properly add the zip file to the PYTHONPATH. > I can work around by exporting PYTHONPATH. > Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand I don't see > this supported at all. If that is the case perhaps it should be moved to > improvement. > Note it works via spark-submit in both client and cluster mode to run python > script. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489369#comment-16489369 ] Hyukjin Kwon commented on SPARK-21945: -- Just for clarification, zip file works fine because the python paths are added as are to PythonRunner given my investigation so far. > pyspark --py-files doesn't work in yarn client mode > --- > > Key: SPARK-21945 > URL: https://issues.apache.org/jira/browse/SPARK-21945 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > I tried running pyspark with --py-files pythonfiles.zip but it doesn't > properly add the zip file to the PYTHONPATH. > I can work around by exporting PYTHONPATH. > Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand I don't see > this supported at all. If that is the case perhaps it should be moved to > improvement. > Note it works via spark-submit in both client and cluster mode to run python > script. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489367#comment-16489367 ] Hyukjin Kwon commented on SPARK-21945: -- To be more correct, the paths are added as are given my investigation so far. It's fine for zip archive but for .py file the paths shouldn't be added as are (but its parent directory) ... so for py files, yes, we should copy them too. It's weird but I think this is all because we happened to support .py file in the same option whereas PYTHONPATH doesn't expect a file. > pyspark --py-files doesn't work in yarn client mode > --- > > Key: SPARK-21945 > URL: https://issues.apache.org/jira/browse/SPARK-21945 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > I tried running pyspark with --py-files pythonfiles.zip but it doesn't > properly add the zip file to the PYTHONPATH. > I can work around by exporting PYTHONPATH. > Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand I don't see > this supported at all. If that is the case perhaps it should be moved to > improvement. > Note it works via spark-submit in both client and cluster mode to run python > script. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
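The zip-vs-.py asymmetry discussed in these comments comes from Python itself: sys.path entries may be directories or zip archives, but not paths to individual .py files — for a bare .py file, its parent directory must go on the path instead. A standard-library sketch (the module name mymod is made up):

```python
import os
import sys
import tempfile
import zipfile

workdir = tempfile.mkdtemp()

# A plain .py file, like one shipped with --py-files.
src = os.path.join(workdir, "mymod.py")
with open(src, "w") as f:
    f.write("VALUE = 42\n")

# A zip archive containing it.
archive = os.path.join(workdir, "deps.zip")
with zipfile.ZipFile(archive, "w") as z:
    z.write(src, "mymod.py")

# The zip path can go on sys.path as-is -- this is why a zip passed to
# --py-files works without extra handling.
sys.path.insert(0, archive)
import mymod
print(mymod.VALUE)  # 42

# For the bare .py file, sys.path would need workdir (the parent
# directory), not the file path itself -- hence the copying discussed above.
```

Adding the raw `.py` path to `sys.path` would simply be ignored by the import machinery, which matches the observed `ImportError`.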
[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489338#comment-16489338 ] Marcelo Vanzin commented on SPARK-21945: It happens because the import happens before the context is initialized, and your fix only copies the files during initialization of the context. To fix this case you'd have to add logic to perform the copy into the launcher library, which would be kinda weird... > pyspark --py-files doesn't work in yarn client mode > --- > > Key: SPARK-21945 > URL: https://issues.apache.org/jira/browse/SPARK-21945 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > I tried running pyspark with --py-files pythonfiles.zip but it doesn't > properly add the zip file to the PYTHONPATH. > I can work around by exporting PYTHONPATH. > Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand I don't see > this supported at all. If that is the case perhaps it should be moved to > improvement. > Note it works via spark-submit in both client and cluster mode to run python > script. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489333#comment-16489333 ] Xiangrui Meng commented on SPARK-24374: --- [~galv] Thanks for your feedback! * SPARK-20327 allows Spark to request GPU from YARN. Within Spark, we also need to address the issue of allocating custom resources to tasks. For example, a user can request that a task take 1 CPU core and 2 GPUs, and Spark needs to assign the executor as well as GPU devices to that task. This is relevant to the proposal here. But I want to consider it an orthogonal topic because barrier scheduling works with or without GPUs. * I picked version 3.0.0 because I'm not sure if there would be breaking changes. The proposed API doesn't have breaking changes. But this adds a new model to the job scheduler and hence to the fault tolerance model. I want to see more design discussions first. * I think most MPI implementations use SSH. Spark standalone is usually configured with keyless SSH. But this might be an issue on YARN / Kube / Mesos. I thought about this but I want to treat it as a separate issue. Spark executors can talk to each other in all those cluster modes. So users should have a way to set up a hybrid cluster within the barrier stage. The question is what info we need to provide at the very minimum. * "barrier" comes from "MPI_Barrier". Please check the example code in the SPIP. We need to set barrier in two scenarios: 1) configure a Spark stage to be in the barrier mode, where the Spark job scheduler knows to wait until all task slots are ready to launch them, 2) within the mapPartitions closure, we still need to provide users context.barrier() to wait for data exchange or user program to finish on all tasks. 
> SPIP: Support Barrier Scheduling in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Epic > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
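The barrier semantics being proposed mirror MPI_Barrier: every participant blocks at the rendezvous point until all have arrived, and only then does any of them proceed. Python's threading.Barrier gives a small single-process sketch of that contract (threads stand in for barrier-mode tasks; this is an analogy, not the proposed Spark API):

```python
import threading

NUM_TASKS = 4
barrier = threading.Barrier(NUM_TASKS)
arrived, resumed = [], []
lock = threading.Lock()

def task(i):
    with lock:
        arrived.append(i)   # local work before the rendezvous
    barrier.wait()          # like context.barrier(): block until all tasks reach here
    with lock:
        resumed.append(i)   # past this point, every task has arrived

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every task both arrived and resumed; none could pass the barrier alone.
assert sorted(arrived) == sorted(resumed) == list(range(NUM_TASKS))
```

The fault-tolerance wrinkle in the SPIP is what threading.Barrier handles with BrokenBarrierError: if one participant dies, the others cannot proceed, so the whole stage must be retried.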
[jira] [Updated] (SPARK-20712) [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has length greater than 4000 bytes
[ https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20712: Target Version/s: 2.4.0 > [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has > length greater than 4000 bytes > --- > > Key: SPARK-20712 > URL: https://issues.apache.org/jira/browse/SPARK-20712 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.1.2, 2.2.0, 2.3.0 >Reporter: Maciej Bryński >Priority: Critical > > Hi, > I have following issue. > I'm trying to read a table from hive when one of the column is nested so it's > schema has length longer than 4000 bytes. > Everything worked on Spark 2.0.2. On 2.1.1 I'm getting Exception: > {code} > >> spark.read.table("SOME_TABLE") > Traceback (most recent call last): > File "", line 1, in > File "/opt/spark-2.1.1/python/pyspark/sql/readwriter.py", line 259, in table > return self._df(self._jreader.table(tableName)) > File > "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 1133, in __call__ > File "/opt/spark-2.1.1/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o71.table. 
> : org.apache.spark.SparkException: Cannot recognize hive type string: > SOME_VERY_LONG_FIELD_TYPE > at > org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:359) > at > org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:628) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:628) > at >
[jira] [Updated] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24378: - Flags: (was: Important) > Incorrect examples for date_trunc function in spark 2.3.0 > - > > Key: SPARK-24378 > URL: https://issues.apache.org/jira/browse/SPARK-24378 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Prakhar Gupta >Assignee: Yuming Wang >Priority: Trivial > Fix For: 2.3.1, 2.4.0 > > > Within Spark documentation for spark 2.3.0, Listed examples are incorrect. > date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit > specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", > "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", > "WEEK", "QUARTER"] > *Examples:* > {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }} > > {{Examples should be date_trunc(format, value)}} > {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24378: - Priority: Trivial (was: Major) > Incorrect examples for date_trunc function in spark 2.3.0 > - > > Key: SPARK-24378 > URL: https://issues.apache.org/jira/browse/SPARK-24378 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Prakhar Gupta >Assignee: Yuming Wang >Priority: Trivial > Fix For: 2.3.1, 2.4.0 > > > Within Spark documentation for spark 2.3.0, Listed examples are incorrect. > date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit > specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", > "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", > "WEEK", "QUARTER"] > *Examples:* > {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }} > > {{Examples should be date_trunc(format, value)}} > {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24378: - Issue Type: Documentation (was: Bug) > Incorrect examples for date_trunc function in spark 2.3.0 > - > > Key: SPARK-24378 > URL: https://issues.apache.org/jira/browse/SPARK-24378 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Prakhar Gupta >Assignee: Yuming Wang >Priority: Trivial > Fix For: 2.3.1, 2.4.0 > > > Within Spark documentation for spark 2.3.0, Listed examples are incorrect. > date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit > specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", > "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", > "WEEK", "QUARTER"] > *Examples:* > {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }} > > {{Examples should be date_trunc(format, value)}} > {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24378. -- Resolution: Fixed Fix Version/s: 2.4.0 2.3.1 Fixed in https://github.com/apache/spark/pull/21423 > Incorrect examples for date_trunc function in spark 2.3.0 > - > > Key: SPARK-24378 > URL: https://issues.apache.org/jira/browse/SPARK-24378 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Prakhar Gupta >Assignee: Yuming Wang >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > Within Spark documentation for spark 2.3.0, Listed examples are incorrect. > date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit > specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", > "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", > "WEEK", "QUARTER"] > *Examples:* > {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }} > > {{Examples should be date_trunc(format, value)}} > {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24378: Assignee: Yuming Wang > Incorrect examples for date_trunc function in spark 2.3.0 > - > > Key: SPARK-24378 > URL: https://issues.apache.org/jira/browse/SPARK-24378 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Prakhar Gupta >Assignee: Yuming Wang >Priority: Major > > Within Spark documentation for spark 2.3.0, Listed examples are incorrect. > date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit > specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", > "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", > "WEEK", "QUARTER"] > *Examples:* > {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }} > > {{Examples should be date_trunc(format, value)}} > {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted
Lenin created SPARK-24383: - Summary: spark on k8s: "driver-svc" are not getting deleted Key: SPARK-24383 URL: https://issues.apache.org/jira/browse/SPARK-24383 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.3.0 Reporter: Lenin When the driver pod exits, the "*driver-svc" services created for the driver are not cleaned up. This causes an accumulation of services in the k8s layer; at some point, no more services can be created. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
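A periodic cleanup pass over leaked services can be sketched as follows. This is only a model of the idea in Python; the "-driver-svc" suffix convention and the pod/service names used here are assumptions for illustration, not Spark's actual Kubernetes resource management.

```python
def orphaned_driver_services(services, live_driver_pods):
    """Return driver services whose owning driver pod no longer exists.

    services: iterable of service names, assumed to follow the
              '<app>-driver-svc' convention (an assumption for this sketch).
    live_driver_pods: set of driver pod names still present in the cluster.
    """
    orphans = []
    for svc in services:
        if not svc.endswith("-driver-svc"):
            continue  # not a driver service
        driver_pod = svc[: -len("-svc")]  # strip '-svc' to get the pod name
        if driver_pod not in live_driver_pods:
            orphans.append(svc)
    return orphans

# hypothetical cluster state
orphans = orphaned_driver_services(
    ["spark-pi-1-driver-svc", "spark-pi-2-driver-svc", "web"],
    {"spark-pi-1-driver"},
)
```

A more robust fix (setting an ownerReference on the service so Kubernetes garbage-collects it with the driver pod) would avoid needing any sweep at all.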
[jira] [Commented] (SPARK-24381) Improve Unit Test Coverage of NOT IN subqueries
[ https://issues.apache.org/jira/browse/SPARK-24381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489242#comment-16489242 ] Apache Spark commented on SPARK-24381: -- User 'mgyucht' has created a pull request for this issue: https://github.com/apache/spark/pull/21425 > Improve Unit Test Coverage of NOT IN subqueries > --- > > Key: SPARK-24381 > URL: https://issues.apache.org/jira/browse/SPARK-24381 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: Miles Yucht >Priority: Major > > Today, the unit test coverage for NOT IN queries in SubquerySuite is somewhat > lacking. There are a couple test cases that exist, but it isn't necessarily > clear that those tests cover all of the subcomponents of null-aware anti > joins, i.e. where the subquery returns a null value, if specific columns of > either relation are null, etc. Also, it is somewhat difficult for a newcomer > to understand the intended behavior of a null-aware anti join without great > effort. We should make sure we have proper coverage as well as improve the > documentation of this particular subquery, especially with respect to null > behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
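For reference, the null-aware semantics that these tests need to cover can be modeled in a few lines. This is a Python sketch of SQL three-valued logic for `v NOT IN (subquery)`, not Spark's null-aware anti join implementation:

```python
def not_in_keeps_row(v, subquery_values):
    """Model of SQL `v NOT IN (subquery)`: a row survives the filter only
    when the predicate is definitely TRUE (UNKNOWN also filters it out)."""
    if not subquery_values:
        return True   # empty subquery: NOT IN is TRUE for every row
    if v is None:
        return False  # NULL left value: predicate is UNKNOWN
    if any(s == v for s in subquery_values if s is not None):
        return False  # a match exists: predicate is FALSE
    if any(s is None for s in subquery_values):
        return False  # no match, but a NULL is present: UNKNOWN
    return True       # no match and no NULLs: TRUE
```

The surprising case for newcomers is the fourth branch: a single NULL returned by the subquery makes `NOT IN` filter out every non-matching row, so the query can return nothing at all.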
[jira] [Assigned] (SPARK-24381) Improve Unit Test Coverage of NOT IN subqueries
[ https://issues.apache.org/jira/browse/SPARK-24381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24381: Assignee: (was: Apache Spark) > Improve Unit Test Coverage of NOT IN subqueries > --- > > Key: SPARK-24381 > URL: https://issues.apache.org/jira/browse/SPARK-24381 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: Miles Yucht >Priority: Major > > Today, the unit test coverage for NOT IN queries in SubquerySuite is somewhat > lacking. There are a couple test cases that exist, but it isn't necessarily > clear that those tests cover all of the subcomponents of null-aware anti > joins, i.e. where the subquery returns a null value, if specific columns of > either relation are null, etc. Also, it is somewhat difficult for a newcomer > to understand the intended behavior of a null-aware anti join without great > effort. We should make sure we have proper coverage as well as improve the > documentation of this particular subquery, especially with respect to null > behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24381) Improve Unit Test Coverage of NOT IN subqueries
[ https://issues.apache.org/jira/browse/SPARK-24381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24381: Assignee: Apache Spark > Improve Unit Test Coverage of NOT IN subqueries > --- > > Key: SPARK-24381 > URL: https://issues.apache.org/jira/browse/SPARK-24381 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: Miles Yucht >Assignee: Apache Spark >Priority: Major > > Today, the unit test coverage for NOT IN queries in SubquerySuite is somewhat > lacking. There are a couple test cases that exist, but it isn't necessarily > clear that those tests cover all of the subcomponents of null-aware anti > joins, i.e. where the subquery returns a null value, if specific columns of > either relation are null, etc. Also, it is somewhat difficult for a newcomer > to understand the intended behavior of a null-aware anti join without great > effort. We should make sure we have proper coverage as well as improve the > documentation of this particular subquery, especially with respect to null > behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files
[ https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489186#comment-16489186 ] Trevor McKay edited comment on SPARK-24091 at 5/24/18 3:39 PM: --- I had a similar situation in a project. One way to handle this is to allow a user-specified ConfigMap to be mounted at another location, then in the image startup script (entrypoint.sh) check for a non-empty directory. If there are files there, copy them one by one to $SPARK_HOME/conf. This gives you an override semantic for just those files included. The ConfigMap volume can be marked as optional so that if it's not created the container will still start. Another way to do this, of course, would be to somehow allow the user to modify the contents of the original ConfigMap mentioned above, but this might be tough depending on how it's named, when it's generated, etc. For reference, this is how you make a ConfigMap volume optional (borrowing an example from OpenShift documentation) {code:yaml} apiVersion: v1 kind: Pod metadata: name: dapi-test-pod spec: containers: - name: test-container image: gcr.io/google_containers/busybox command: [ "/bin/sh", "cat", "/etc/config/special.how" ] volumeMounts: - name: config-volume mountPath: /etc/config volumes: - name: config-volume configMap: name: special-config optional: true restartPolicy: Never{code} was (Author: tmckay): I had a similar situation in a project. One way to handle this is allow a user-specified ConfigMap to be mounted at another location, then in the image startup script (entrypoint.sh) check for a non-empty directory. If there are files there, copy them one by one to $SPARK_HOME/conf. This gives you an override semantic for just those files included. The ConfigMap volume can be marked as optional so that if it's not created the container will still start. 
Another way to do this of course would be to somehow allow the user to modify the contents of the original ConfigMap mentioned above, but this might be tough depending on how it's named, when it's generated, etc etc. For reference, this is how you make a ConfigMap volume optional (borrowing an example from OpenShift documentation) {code:yaml} {code:java} apiVersion: v1 kind: Pod metadata: name: dapi-test-pod spec: containers: - name: test-container image: gcr.io/google_containers/busybox command: [ "/bin/sh", "cat", "/etc/config/special.how" ] volumeMounts: - name: config-volume mountPath: /etc/config volumes: - name: config-volume configMap: name: special-config optional: true restartPolicy: Never{code} > Internally used ConfigMap prevents use of user-specified ConfigMaps carrying > Spark configs files > > > Key: SPARK-24091 > URL: https://issues.apache.org/jira/browse/SPARK-24091 > Project: Spark > Issue Type: Brainstorming > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > The recent PR [https://github.com/apache/spark/pull/20669] for removing the > init-container introduced a internally used ConfigMap carrying Spark > configuration properties in a file for the driver. This ConfigMap gets > mounted under {{$SPARK_HOME/conf}} and the environment variable > {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much > prevents users from mounting their own ConfigMaps that carry custom Spark > configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}} and > leaves users with only the option of building custom images. IMO, it is very > useful to support mounting user-specified ConfigMaps for custom Spark > configuration files. This worths further discussions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
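The override-copy semantic described in the comment above (copy files from an optionally mounted ConfigMap directory over $SPARK_HOME/conf at startup) would normally live in entrypoint.sh. Purely as a sketch of the logic, here it is in Python, with temporary directories standing in for the mounted volume and the conf directory:

```python
import shutil
import tempfile
from pathlib import Path

def overlay_conf(user_conf_dir, spark_conf_dir):
    """Copy every file from user_conf_dir over spark_conf_dir, file by file.
    If the optional volume was not mounted, the directory is absent or empty
    and nothing happens. Returns the names of the files copied."""
    src, dst = Path(user_conf_dir), Path(spark_conf_dir)
    if not src.is_dir():
        return []  # optional ConfigMap volume absent: keep defaults
    copied = []
    for f in sorted(src.iterdir()):
        if f.is_file():
            shutil.copy(f, dst / f.name)  # override just this file
            copied.append(f.name)
    return copied

# demo with temp dirs standing in for the mount point and $SPARK_HOME/conf
user = Path(tempfile.mkdtemp())
conf = Path(tempfile.mkdtemp())
(user / "log4j.properties").write_text("log4j.rootCategory=INFO\n")
(conf / "spark-defaults.conf").write_text("# shipped defaults\n")
copied = overlay_conf(user, conf)
```

Only the files the user supplies are overridden; everything else in the conf directory is left untouched, which is the semantic the comment describes.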
[jira] [Comment Edited] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files
[ https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489186#comment-16489186 ] Trevor McKay edited comment on SPARK-24091 at 5/24/18 3:38 PM: --- I had a similar situation in a project. One way to handle this is allow a user-specified ConfigMap to be mounted at another location, then in the image startup script (entrypoint.sh) check for a non-empty directory. If there are files there, copy them one by one to $SPARK_HOME/conf. This gives you an override semantic for just those files included. The ConfigMap volume can be marked as optional so that if it's not created the container will still start. Another way to do this of course would be to somehow allow the user to modify the contents of the original ConfigMap mentioned above, but this might be tough depending on how it's named, when it's generated, etc etc. For reference, this is how you make a ConfigMap volume optional (borrowing an example from OpenShift documentation) {code:yaml} {code:java} apiVersion: v1 kind: Pod metadata: name: dapi-test-pod spec: containers: - name: test-container image: gcr.io/google_containers/busybox command: [ "/bin/sh", "cat", "/etc/config/special.how" ] volumeMounts: - name: config-volume mountPath: /etc/config volumes: - name: config-volume configMap: name: special-config optional: true restartPolicy: Never{code} was (Author: tmckay): I had a similar situation in a project. One way to handle this is allow a user-specified ConfigMap to be mounted at another location, then in the image startup script (entrypoint.sh) check for a non-empty directory. If there are files there, copy them one by one to $SPARK_HOME/conf. This gives you an override semantic for just those files included. The ConfigMap volume can be marked as optional so that if it's not created the container will still start. 
Another way to do this of course would be to somehow allow the user to modify the contents of the original ConfigMap mentioned above, but this might be tough depending on how it's named, when it's generated, etc etc. > Internally used ConfigMap prevents use of user-specified ConfigMaps carrying > Spark configs files > > > Key: SPARK-24091 > URL: https://issues.apache.org/jira/browse/SPARK-24091 > Project: Spark > Issue Type: Brainstorming > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > The recent PR [https://github.com/apache/spark/pull/20669] for removing the > init-container introduced a internally used ConfigMap carrying Spark > configuration properties in a file for the driver. This ConfigMap gets > mounted under {{$SPARK_HOME/conf}} and the environment variable > {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much > prevents users from mounting their own ConfigMaps that carry custom Spark > configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}} and > leaves users with only the option of building custom images. IMO, it is very > useful to support mounting user-specified ConfigMaps for custom Spark > configuration files. This worths further discussions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24378: - Target Version/s: (was: 2.3.0) > Incorrect examples for date_trunc function in spark 2.3.0 > - > > Key: SPARK-24378 > URL: https://issues.apache.org/jira/browse/SPARK-24378 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Prakhar Gupta >Priority: Major > > Within Spark documentation for spark 2.3.0, Listed examples are incorrect. > date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit > specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", > "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", > "WEEK", "QUARTER"] > *Examples:* > {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }} > > {{Examples should be date_trunc(format, value)}} > {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489233#comment-16489233 ] Wenbo Zhao edited comment on SPARK-24373 at 5/24/18 3:37 PM: - I turned on the log trace of RuleExecutor and found the following in my example when running df1.count() after cache. {code:java} scala> df1.groupBy().count().explain(true) === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$HandleNullInputsForUDF === Aggregate [count(1) AS count#40L] Aggregate [count(1) AS count#40L] !+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] +- SerializeFromObject [input[0, bigint, false] AS value#2L] +- SerializeFromObject [input[0, bigint, false] AS value#2L] +- ExternalRDD [obj#1L] +- ExternalRDD [obj#1L] {code} that is, the node {code:java} Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]{code} becomes {code:java} Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]{code} This causes a miss in the CacheManager, which can be confirmed by the log trace of the later ColumnPruning rule application. My question is: is this supposed to be protected by AnalysisBarrier? was (Author: wbzhao): I turned on the log trace of RuleExecutor and found that in my example of df1.count() after cache. 
{code:java} scala> df1.groupBy().count().explain(true) === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$HandleNullInputsForUDF === Aggregate [count(1) AS count#40L] Aggregate [count(1) AS count#40L] !+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] +- SerializeFromObject [input[0, bigint, false] AS value#2L] +- SerializeFromObject [input[0, bigint, false] AS value#2L] +- ExternalRDD [obj#1L] +- ExternalRDD [obj#1L] {code} that is node {code:java} Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]{code} becomes {code:java} Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]{code} This will cause a miss in the CacheManager? which could be confirmed by later applying ColumnPrunning rule's log trace. May question is: is that supposed protected by AnalysisBarrier ? > "df.cache() df.count()" no longer eagerly caches data > - > > Key: SPARK-24373 > URL: https://issues.apache.org/jira/browse/SPARK-24373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenbo Zhao >Priority: Major > > Here is the code to reproduce in local mode > {code:java} > scala> val df = sc.range(1, 2).toDF > df: org.apache.spark.sql.DataFrame = [value: bigint] > scala> val myudf = udf({x: Long => println(""); x + 1}) > myudf: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val df1 = df.withColumn("value1", myudf(col("value"))) > df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint] > scala> df1.cache > res0: df1.type = [value: bigint, value1: bigint] > scala> df1.count > res1: Long = 1 > scala> df1.count > res2: Long = 1 > scala> df1.count > res3: Long = 1 > {code} > > in Spark 2.2, you could see it prints "". 
> In the above example, when you do explain. You could see > {code:java} > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [value#2L, UDF('value) AS value1#5] > +- AnalysisBarrier > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > value: bigint, value1: bigint > Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > == Physical Plan == > *(1) InMemoryTableScan [value#2L, value1#5L] > +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > but the ImMemoryTableScan is mising in the following explain() > {code:java} > scala>
[jira] [Updated] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24378: - Fix Version/s: (was: 2.3.0) > Incorrect examples for date_trunc function in spark 2.3.0 > - > > Key: SPARK-24378 > URL: https://issues.apache.org/jira/browse/SPARK-24378 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Prakhar Gupta >Priority: Major > > Within Spark documentation for spark 2.3.0, Listed examples are incorrect. > date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit > specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", > "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", > "WEEK", "QUARTER"] > *Examples:* > {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }} > > {{Examples should be date_trunc(format, value)}} > {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489233#comment-16489233 ] Wenbo Zhao commented on SPARK-24373: I turned on the log trace of RuleExecutor and found the following in my example when running df1.count() after cache. {code:java} scala> df1.groupBy().count().explain(true) === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$HandleNullInputsForUDF === Aggregate [count(1) AS count#40L] Aggregate [count(1) AS count#40L] !+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] +- SerializeFromObject [input[0, bigint, false] AS value#2L] +- SerializeFromObject [input[0, bigint, false] AS value#2L] +- ExternalRDD [obj#1L] +- ExternalRDD [obj#1L] {code} that is, the node {code:java} Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]{code} becomes {code:java} Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]{code} This causes a miss in the CacheManager, which can be confirmed by the log trace of the later ColumnPruning rule application. My question is: is this supposed to be protected by AnalysisBarrier? 
> "df.cache() df.count()" no longer eagerly caches data > - > > Key: SPARK-24373 > URL: https://issues.apache.org/jira/browse/SPARK-24373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenbo Zhao >Priority: Major > > Here is the code to reproduce in local mode > {code:java} > scala> val df = sc.range(1, 2).toDF > df: org.apache.spark.sql.DataFrame = [value: bigint] > scala> val myudf = udf({x: Long => println(""); x + 1}) > myudf: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val df1 = df.withColumn("value1", myudf(col("value"))) > df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint] > scala> df1.cache > res0: df1.type = [value: bigint, value1: bigint] > scala> df1.count > res1: Long = 1 > scala> df1.count > res2: Long = 1 > scala> df1.count > res3: Long = 1 > {code} > > in Spark 2.2, you could see it prints "". > In the above example, when you do explain. You could see > {code:java} > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [value#2L, UDF('value) AS value1#5] > +- AnalysisBarrier > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > value: bigint, value1: bigint > Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > == Physical Plan == > *(1) InMemoryTableScan [value#2L, value1#5L] > +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS 
value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > but the ImMemoryTableScan is mising in the following explain() > {code:java} > scala> df1.groupBy().count().explain(true) > == Parsed Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS > value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > count: bigint > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) > null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Physical Plan == > *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L]) > +- Exchange SinglePartition > +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#175L]) > +- *(1) Project > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe,
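The reported miss can be modeled abstractly: the CacheManager finds cached data by matching the analyzed plan, so when re-analysis wraps the null check a second time, the re-analyzed plan no longer equals the cached one and the InMemoryRelation is not substituted. A toy Python model of that lookup, with plans reduced to strings (purely illustrative; the real lookup compares canonicalized plan trees, not text):

```python
# toy cache keyed by the analyzed plan's shape
cache = {}

def cache_plan(plan):
    """Record an in-memory relation for this exact analyzed plan."""
    cache[plan] = "InMemoryRelation(" + plan + ")"

def lookup(plan):
    """Return the cached relation only on an exact plan match, else None."""
    return cache.get(plan)

# plan as originally analyzed and cached
plan_cached = "Project [if (isnull(v)) null else UDF(v)]"
cache_plan(plan_cached)

# plan after the UDF null check is applied a second time during re-analysis
plan_reanalyzed = "Project [if (isnull(v)) null else if (isnull(v)) null else UDF(v)]"
```

Looking up `plan_reanalyzed` misses even though the two plans are semantically identical, which is exactly why the InMemoryTableScan disappears from the second explain() output above.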
[jira] [Created] (SPARK-24382) Spark Structured Streaming aggregation on old timestamp data
Karthik created SPARK-24382: --- Summary: Spark Structured Streaming aggregation on old timestamp data Key: SPARK-24382 URL: https://issues.apache.org/jira/browse/SPARK-24382 Project: Spark Issue Type: Question Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Karthik I am trying to aggregate the count of records every 10 seconds using the structured streaming for the following incoming kafka data {code:java} { "ts2" : "2018/05/01 00:02:50.041", "serviceGroupId" : "123", "userId" : "avv-0", "stream" : "", "lastUserActivity" : "00:02:50", "lastUserActivityCount" : "0" } { "ts2" : "2018/05/01 00:09:02.079", "serviceGroupId" : "123", "userId" : "avv-0", "stream" : "", "lastUserActivity" : "00:09:02", "lastUserActivityCount" : "0" } { "ts2" : "2018/05/01 00:09:02.086", "serviceGroupId" : "123", "userId" : "avv-2", "stream" : "", "lastUserActivity" : "00:09:02", "lastUserActivityCount" : "0" } {code} With the following logic {code:java} val sdvTuneInsAgg1 = df .withWatermark("ts2", "10 seconds") .groupBy(window(col("ts2"),"10 seconds")) .agg(count("*") as "count") .as[CountMetric1] val query1 = sdvTuneInsAgg1.writeStream .format("console") .foreach(writer) .start() {code} and I do not see any records inside the writer. But, the only anomaly is that the current date is 2018/05/24 but the record that I am processing (ts2) has old dates. Will aggregation / count work in this scenario ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
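To the question itself: a watermark advances with the maximum event time observed, not with the wall clock, so "old" dates are fine as long as the data's own timestamps move forward. A simplified Python model of windowed counting under a 10-second watermark (an illustration of the semantics, not Structured Streaming's implementation; epoch seconds stand in for the ts2 column):

```python
def finalized_window_counts(event_times, delay_s=10, window_s=10):
    """Count events per tumbling window; drop events whose window is already
    past the watermark, and emit only windows the final watermark has closed.

    event_times: event-time stamps (epoch seconds) in arrival order.
    """
    max_event_time = float("-inf")
    counts = {}
    for t in event_times:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - delay_s   # event-time based, not wall clock
        window_start = (t // window_s) * window_s
        if window_start + window_s > watermark:
            counts[window_start] = counts.get(window_start, 0) + 1
        # else: event is too late, its window already closed
    final_watermark = max_event_time - delay_s
    # only windows fully behind the watermark are finalized and emitted
    return {w: c for w, c in counts.items() if w + window_s <= final_watermark}

finalized = finalized_window_counts([0, 3, 25])
```

With events at t=0, 3, and 25 the watermark reaches 15, so the [0, 10) window is finalized with a count of 2 while [20, 30) is still open. If no new data arrives to advance the event time past a window's end plus the delay, that window is never emitted, which is the most common reason for seeing no output in such jobs.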
[jira] [Commented] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files
[ https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489186#comment-16489186 ] Trevor McKay commented on SPARK-24091: -- I had a similar situation in a project. One way to handle this is allow a user-specified ConfigMap to be mounted at another location, then in the image startup script (entrypoint.sh) check for a non-empty directory. If there are files there, copy them one by one to $SPARK_HOME/conf. This gives you an override semantic for just those files included. The ConfigMap volume can be marked as optional so that if it's not created the container will still start. Another way to do this of course would be to somehow allow the user to modify the contents of the original ConfigMap mentioned above, but this might be tough depending on how it's named, when it's generated, etc etc. > Internally used ConfigMap prevents use of user-specified ConfigMaps carrying > Spark configs files > > > Key: SPARK-24091 > URL: https://issues.apache.org/jira/browse/SPARK-24091 > Project: Spark > Issue Type: Brainstorming > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > The recent PR [https://github.com/apache/spark/pull/20669] for removing the > init-container introduced a internally used ConfigMap carrying Spark > configuration properties in a file for the driver. This ConfigMap gets > mounted under {{$SPARK_HOME/conf}} and the environment variable > {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much > prevents users from mounting their own ConfigMaps that carry custom Spark > configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}} and > leaves users with only the option of building custom images. IMO, it is very > useful to support mounting user-specified ConfigMaps for custom Spark > configuration files. This worths further discussions. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489185#comment-16489185 ] Jorge Machado edited comment on SPARK-17592 at 5/24/18 3:11 PM: I'm hitting the same issue I'm afraid, but in a slightly different way. When I have a dataframe (that comes from an Oracle DB) as parquet, I can see in the logs that a field is being saved as integer: { "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", "nullable" : true, "metadata" : { } },... on Hue (which reads from Hive) I see: !image-2018-05-24-17-10-24-515.png! was (Author: jomach): I'm hitting the same issue I'm afraid, but in a slightly different way. When I have a dataframe as parquet, I can see in the logs that a field is being saved as integer: { "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", "nullable" : true, "metadata" : { } },... on Hue (which reads from Hive) I see: !image-2018-05-24-17-10-24-515.png! > SQL: CAST string as INT inconsistent with Hive > -- > > Key: SPARK-17592 > URL: https://issues.apache.org/jira/browse/SPARK-17592 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin >Priority: Major > Attachments: image-2018-05-24-17-10-24-515.png > > > Hello, > there seems to be an inconsistency between Spark and Hive when casting a > string into an Int.
> With Hive: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 0 > select cast("0.6" as INT) ; > > 0 > {code} > With Spark-SQL: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 1 > select cast("0.6" as INT) ; > > 1 > {code} > Hive seems to perform a floor(string.toDouble), while Spark seems to perform > a round(string.toDouble). > I'm not sure there is any ISO standard for this; MySQL has the same behavior > as Hive, while PostgreSQL performs a string.toInt and throws a > NumberFormatException. > Personally, I think Hive is right, hence my posting this here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
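A minimal model of the two behaviors reported above — Hive's results match flooring the parsed double, while the reported Spark 2.x results match rounding it. This is only an illustration of the observed outputs, not either engine's actual cast implementation:

```java
public class CastSketch {
    // Hive-like (as reported in the issue): floor the parsed double,
    // then narrow to int.
    static int hiveStyleCast(String s) {
        return (int) Math.floor(Double.parseDouble(s));
    }

    // Spark-2.x-like (as reported in the issue): round half up,
    // then narrow to int.
    static int sparkStyleCast(String s) {
        return (int) Math.round(Double.parseDouble(s));
    }

    public static void main(String[] args) {
        // The divergence only shows at .5 and above:
        System.out.println(hiveStyleCast("0.5"));  // 0
        System.out.println(sparkStyleCast("0.5")); // 1
    }
}
```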
[jira] [Updated] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Machado updated SPARK-17592: -- Attachment: image-2018-05-24-17-10-24-515.png > SQL: CAST string as INT inconsistent with Hive > -- > > Key: SPARK-17592 > URL: https://issues.apache.org/jira/browse/SPARK-17592 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin >Priority: Major > Attachments: image-2018-05-24-17-10-24-515.png > > > Hello, > there seems to be an inconsistency between Spark and Hive when casting a > string into an Int. > With Hive: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 0 > select cast("0.6" as INT) ; > > 0 > {code} > With Spark-SQL: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 1 > select cast("0.6" as INT) ; > > 1 > {code} > Hive seems to perform a floor(string.toDouble), while Spark seems to perform > a round(string.toDouble). > I'm not sure there is any ISO standard for this; MySQL has the same behavior > as Hive, while PostgreSQL performs a string.toInt and throws a > NumberFormatException. > Personally, I think Hive is right, hence my posting this here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489185#comment-16489185 ] Jorge Machado commented on SPARK-17592: --- I'm hitting the same issue I'm afraid, but in a slightly different way. When I have a dataframe as parquet, I can see in the logs that a field is being saved as integer: { "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", "nullable" : true, "metadata" : { } },... on Hue (which reads from Hive) I see: !image-2018-05-24-17-10-24-515.png! > SQL: CAST string as INT inconsistent with Hive > -- > > Key: SPARK-17592 > URL: https://issues.apache.org/jira/browse/SPARK-17592 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin >Priority: Major > Attachments: image-2018-05-24-17-10-24-515.png > > > Hello, > there seems to be an inconsistency between Spark and Hive when casting a > string into an Int. > With Hive: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 0 > select cast("0.6" as INT) ; > > 0 > {code} > With Spark-SQL: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 1 > select cast("0.6" as INT) ; > > 1 > {code} > Hive seems to perform a floor(string.toDouble), while Spark seems to perform > a round(string.toDouble). > I'm not sure there is any ISO standard for this; MySQL has the same behavior > as Hive, while PostgreSQL performs a string.toInt and throws a > NumberFormatException. > Personally, I think Hive is right, hence my posting this here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489129#comment-16489129 ] Andreas Weise commented on SPARK-24373: --- We are also facing increased runtime duration for our SQL jobs (after upgrading from 2.2.1 to 2.3.0), but didn't trace it down to the root cause. This issue sounds reasonable to me, as we are also using cache() + count() quite often. > "df.cache() df.count()" no longer eagerly caches data > - > > Key: SPARK-24373 > URL: https://issues.apache.org/jira/browse/SPARK-24373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenbo Zhao >Priority: Major > > Here is the code to reproduce in local mode > {code:java} > scala> val df = sc.range(1, 2).toDF > df: org.apache.spark.sql.DataFrame = [value: bigint] > scala> val myudf = udf({x: Long => println(""); x + 1}) > myudf: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val df1 = df.withColumn("value1", myudf(col("value"))) > df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint] > scala> df1.cache > res0: df1.type = [value: bigint, value1: bigint] > scala> df1.count > res1: Long = 1 > scala> df1.count > res2: Long = 1 > scala> df1.count > res3: Long = 1 > {code} > > in Spark 2.2, you could see it prints "". > In the above example, when you do explain. 
You could see > {code:java} > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [value#2L, UDF('value) AS value1#5] > +- AnalysisBarrier > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > value: bigint, value1: bigint > Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > == Physical Plan == > *(1) InMemoryTableScan [value#2L, value1#5L] > +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > but the InMemoryTableScan is missing in the following explain() > {code:java} > scala> df1.groupBy().count().explain(true) > == Parsed Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS > value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > count: bigint > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) > null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Physical Plan == > *(2) HashAggregate(keys=[], 
functions=[count(1)], output=[count#170L]) > +- Exchange SinglePartition > +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#175L]) > +- *(1) Project > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24381) Improve Unit Test Coverage of NOT IN subqueries
Miles Yucht created SPARK-24381: --- Summary: Improve Unit Test Coverage of NOT IN subqueries Key: SPARK-24381 URL: https://issues.apache.org/jira/browse/SPARK-24381 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.2 Reporter: Miles Yucht Today, the unit test coverage for NOT IN queries in SubquerySuite is somewhat lacking. There are a couple test cases that exist, but it isn't necessarily clear that those tests cover all of the subcomponents of null-aware anti joins, i.e. where the subquery returns a null value, if specific columns of either relation are null, etc. Also, it is somewhat difficult for a newcomer to understand the intended behavior of a null-aware anti join without great effort. We should make sure we have proper coverage as well as improve the documentation of this particular subquery, especially with respect to null behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
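The null behavior that makes NOT IN subqueries subtle follows SQL's three-valued logic, which a null-aware anti join must reproduce: a row passes `x NOT IN (subquery)` only when it is definitely TRUE, and a null on either side can make the predicate UNKNOWN instead. A small sketch of those semantics (the method name is hypothetical; a Java `null` Boolean models UNKNOWN):

```java
import java.util.Arrays;
import java.util.List;

public class NotInSketch {
    // SQL three-valued logic for `x NOT IN (v1, ..., vn)`:
    //  - empty list: TRUE (nothing to match)
    //  - null x with a non-empty list: UNKNOWN
    //  - x equal to some non-null vi: FALSE
    //  - no match but some vi is null: UNKNOWN (the null could have matched)
    //  - otherwise: TRUE
    static Boolean notIn(Integer x, List<Integer> values) {
        if (values.isEmpty()) return true;
        if (x == null) return null;
        boolean sawNull = false;
        for (Integer v : values) {
            if (v == null) sawNull = true;
            else if (v.equals(x)) return false;
        }
        return sawNull ? null : true;
    }

    public static void main(String[] args) {
        System.out.println(notIn(1, Arrays.asList(2, 3)));    // definitely not in: TRUE
        System.out.println(notIn(2, Arrays.asList(2, null))); // definite match: FALSE
        System.out.println(notIn(1, Arrays.asList(2, null))); // UNKNOWN (prints null)
    }
}
```

Since a WHERE clause keeps only rows where the predicate is TRUE, both the FALSE and the UNKNOWN cases are filtered out — which is exactly why a single null in the subquery result can make `NOT IN` return no rows at all.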
[jira] [Commented] (SPARK-24329) Remove comments filtering before parsing of CSV files
[ https://issues.apache.org/jira/browse/SPARK-24329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489100#comment-16489100 ] Hyukjin Kwon commented on SPARK-24329: -- Fixed by explicitly adding a test where the code is valid. > Remove comments filtering before parsing of CSV files > - > > Key: SPARK-24329 > URL: https://issues.apache.org/jira/browse/SPARK-24329 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Comment and whitespace filtering is already performed by the uniVocity parser > according to the parser settings: > https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180 > It is not necessary to do the same before parsing. Need to inspect all places > where the filterCommentAndEmpty method is called, and remove the former one > if it duplicates the filtering of the uniVocity parser. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24329) Remove comments filtering before parsing of CSV files
[ https://issues.apache.org/jira/browse/SPARK-24329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24329. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21394 [https://github.com/apache/spark/pull/21394] > Remove comments filtering before parsing of CSV files > - > > Key: SPARK-24329 > URL: https://issues.apache.org/jira/browse/SPARK-24329 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Comment and whitespace filtering is already performed by the uniVocity parser > according to the parser settings: > https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180 > It is not necessary to do the same before parsing. Need to inspect all places > where the filterCommentAndEmpty method is called, and remove the former one > if it duplicates the filtering of the uniVocity parser. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24329) Remove comments filtering before parsing of CSV files
[ https://issues.apache.org/jira/browse/SPARK-24329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24329: Assignee: Maxim Gekk > Remove comments filtering before parsing of CSV files > - > > Key: SPARK-24329 > URL: https://issues.apache.org/jira/browse/SPARK-24329 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Comment and whitespace filtering is already performed by the uniVocity parser > according to the parser settings: > https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180 > It is not necessary to do the same before parsing. Need to inspect all places > where the filterCommentAndEmpty method is called, and remove the former one > if it duplicates the filtering of the uniVocity parser. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489060#comment-16489060 ] Li Jin commented on SPARK-24373: This is a reproduce in unit test: {code:java} test("cache and count") { var evalCount = 0 val myUDF = udf((x: String) => { evalCount += 1; "result" }) val df = spark.range(0, 1).select(myUDF($"id")) df.cache() df.count() assert(evalCount === 1) df.count() assert(evalCount === 1) } {code} > "df.cache() df.count()" no longer eagerly caches data > - > > Key: SPARK-24373 > URL: https://issues.apache.org/jira/browse/SPARK-24373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenbo Zhao >Priority: Major > > Here is the code to reproduce in local mode > {code:java} > scala> val df = sc.range(1, 2).toDF > df: org.apache.spark.sql.DataFrame = [value: bigint] > scala> val myudf = udf({x: Long => println(""); x + 1}) > myudf: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val df1 = df.withColumn("value1", myudf(col("value"))) > df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint] > scala> df1.cache > res0: df1.type = [value: bigint, value1: bigint] > scala> df1.count > res1: Long = 1 > scala> df1.count > res2: Long = 1 > scala> df1.count > res3: Long = 1 > {code} > > in Spark 2.2, you could see it prints "". > In the above example, when you do explain. 
You could see > {code:java} > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [value#2L, UDF('value) AS value1#5] > +- AnalysisBarrier > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > value: bigint, value1: bigint > Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > == Physical Plan == > *(1) InMemoryTableScan [value#2L, value1#5L] > +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > but the InMemoryTableScan is missing in the following explain() > {code:java} > scala> df1.groupBy().count().explain(true) > == Parsed Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS > value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > count: bigint > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) > null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Physical Plan == > *(2) HashAggregate(keys=[], 
functions=[count(1)], output=[count#170L]) > +- Exchange SinglePartition > +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#175L]) > +- *(1) Project > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24380) argument quoting/escaping broken
paul mackles created SPARK-24380: Summary: argument quoting/escaping broken Key: SPARK-24380 URL: https://issues.apache.org/jira/browse/SPARK-24380 Project: Spark Issue Type: Bug Components: Deploy, Mesos Affects Versions: 2.3.0, 2.2.0 Reporter: paul mackles Fix For: 2.4.0 When a configuration property contains shell characters that require quoting, the Mesos cluster scheduler generates the spark-submit argument like so: {code:java} --conf "spark.mesos.executor.docker.parameters="label=logging=|foo|""{code} Note the quotes around the property value as well as the key=value pair. When using docker, this breaks the spark-submit command and causes the "|" to be interpreted as an actual shell PIPE. Spaces, semi-colons, etc also cause issues. Although I haven't tried, I suspect this is also a potential security issue in that someone could exploit it to run arbitrary code on the host. My patch is pretty minimal and just removes the outer quotes around the key=value pair, resulting in something like: {code:java} --conf spark.mesos.executor.docker.parameters="label=logging=|foo|"{code} A more extensive fix might try wrapping the entire key=value pair in single quotes but I was concerned about backwards compatibility with that change. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
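The "more extensive fix" mentioned above — wrapping the entire key=value pair in single quotes — needs exactly one escaping rule: inside a single-quoted POSIX shell string, every metacharacter (|, ;, spaces) is literal, and an embedded single quote is written by closing the quote, emitting an escaped quote, and reopening. A hedged sketch (the helper name is an assumption, not Spark's code):

```java
public class ShellQuote {
    // Wrap s in single quotes for a POSIX shell. Shell metacharacters such
    // as |, ;, and spaces lose their special meaning inside single quotes.
    // An embedded single quote becomes: '\'' (close, escaped quote, reopen).
    static String singleQuote(String s) {
        return "'" + s.replace("'", "'\\''") + "'";
    }

    public static void main(String[] args) {
        // The pipe is now literal, so it can no longer be interpreted
        // as a shell PIPE when spark-submit is assembled.
        System.out.println("--conf " + singleQuote(
                "spark.mesos.executor.docker.parameters=label=logging=|foo|"));
    }
}
```

The backwards-compatibility concern in the issue is real: values that users already pre-quoted for the old behavior would end up double-quoted under this scheme, which is presumably why the minimal patch only drops the outer quotes.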
[jira] [Updated] (SPARK-24380) argument quoting/escaping broken in mesos cluster scheduler
[ https://issues.apache.org/jira/browse/SPARK-24380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] paul mackles updated SPARK-24380: - Summary: argument quoting/escaping broken in mesos cluster scheduler (was: argument quoting/escaping broken) > argument quoting/escaping broken in mesos cluster scheduler > --- > > Key: SPARK-24380 > URL: https://issues.apache.org/jira/browse/SPARK-24380 > Project: Spark > Issue Type: Bug > Components: Deploy, Mesos >Affects Versions: 2.2.0, 2.3.0 >Reporter: paul mackles >Priority: Critical > Fix For: 2.4.0 > > > When a configuration property contains shell characters that require quoting, > the Mesos cluster scheduler generates the spark-submit argument like so: > {code:java} > --conf "spark.mesos.executor.docker.parameters="label=logging=|foo|""{code} > Note the quotes around the property value as well as the key=value pair. When > using docker, this breaks the spark-submit command and causes the "|" to be > interpreted as an actual shell PIPE. Spaces, semi-colons, etc also cause > issues. > Although I haven't tried, I suspect this is also a potential security issue > in that someone could exploit it to run arbitrary code on the host. > My patch is pretty minimal and just removes the outer quotes around the > key=value pair, resulting in something like: > {code:java} > --conf spark.mesos.executor.docker.parameters="label=logging=|foo|"{code} > A more extensive fix might try wrapping the entire key=value pair in single > quotes but I was concerned about backwards compatibility with that change. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24230) With Parquet 1.10 upgrade has errors in the vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-24230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24230. - Resolution: Fixed Assignee: Ryan Blue Fix Version/s: 2.4.0 2.3.1 > With Parquet 1.10 upgrade has errors in the vectorized reader > - > > Key: SPARK-24230 > URL: https://issues.apache.org/jira/browse/SPARK-24230 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Ian O Connell >Assignee: Ryan Blue >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > When reading some parquet files, you can get an error like: > java.io.IOException: expecting more rows but reached last block. Read 0 out > of 1194236 > This happens when looking for a needle that's pretty rare in a large haystack. > > The issue here I believe is that the total row count is calculated at > [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L229] > > But we pass the blocks we filtered via > org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups > to the ParquetFileReader constructor. > > However the ParquetFileReader constructor will filter the list of blocks > again using > > [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L737] > > if a block is filtered out by the latter method, and not the former, the > vectorized reader will believe it should see more rows than it will. > The fix I used locally is pretty straightforward: > {code:java} > for (BlockMetaData block : blocks) { > this.totalRowCount += block.getRowCount(); > } > {code} > goes to > {code:java} > this.totalRowCount = this.reader.getRecordCount(); > {code} > [~rdblue] do you know if this sounds right? The second filter method in the > ParquetFileReader might filter more blocks, leading to the count being off? 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24379) BroadcastExchangeExec should catch SparkOutOfMemory and re-throw SparkFatalException, which wraps SparkOutOfMemory inside.
[ https://issues.apache.org/jira/browse/SPARK-24379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24379: Assignee: (was: Apache Spark) > BroadcastExchangeExec should catch SparkOutOfMemory and re-throw > SparkFatalException, which wraps SparkOutOfMemory inside. > -- > > Key: SPARK-24379 > URL: https://issues.apache.org/jira/browse/SPARK-24379 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: jin xing >Priority: Major > > After SPARK-22827, Spark won't fail the entire executor but only fails the > task suffering a SparkOutOfMemoryError. The current BroadcastExchangeExec > try-catches OutOfMemoryError. Consider the scenario below: > # SparkOutOfMemoryError (a subclass of OutOfMemoryError) is thrown in > scala.concurrent.Future; > # SparkOutOfMemoryError is caught and an OutOfMemoryError is wrapped in > SparkFatalException and re-thrown; > # ThreadUtils.awaitResult catches SparkFatalException and an OutOfMemoryError > is thrown; > # The OutOfMemoryError will go to > SparkUncaughtExceptionHandler.uncaughtException and the Executor fails. > So it makes more sense to catch SparkOutOfMemory and re-throw > SparkFatalException, which wraps SparkOutOfMemory inside. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
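The catch-and-wrap pattern the issue proposes can be sketched with stub classes. The real SparkOutOfMemoryError and SparkFatalException live in Spark's internals; the stubs below only mirror their shape to illustrate the control flow, not the actual BroadcastExchangeExec code:

```java
public class OomWrapSketch {
    // Stubs mirroring the shape of Spark's internal classes (illustration only).
    static class SparkOutOfMemoryError extends OutOfMemoryError {
        SparkOutOfMemoryError(String msg) { super(msg); }
    }
    static class SparkFatalException extends RuntimeException {
        SparkFatalException(Throwable cause) { super(cause); }
    }

    // Proposed behavior: catch only the task-level SparkOutOfMemoryError and
    // re-throw it wrapped, so the awaiting caller can unwrap it and the
    // JVM-wide uncaught-exception handler never sees a raw OutOfMemoryError
    // (which would fail the whole executor).
    static void broadcastStep() {
        try {
            throw new SparkOutOfMemoryError("not enough memory to build the broadcast");
        } catch (SparkOutOfMemoryError oe) {
            throw new SparkFatalException(oe);
        }
    }

    public static void main(String[] args) {
        try {
            broadcastStep();
        } catch (SparkFatalException e) {
            // The original error survives as the cause; only the task fails.
            System.out.println(e.getCause().getClass().getSimpleName());
        }
    }
}
```

The point of the wrapper type is that it is a plain RuntimeException: it propagates through Future/awaitResult machinery as an ordinary task failure instead of tripping the fatal-error path reserved for OutOfMemoryError.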
[jira] [Commented] (SPARK-24379) BroadcastExchangeExec should catch SparkOutOfMemory and re-throw SparkFatalException, which wraps SparkOutOfMemory inside.
[ https://issues.apache.org/jira/browse/SPARK-24379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488860#comment-16488860 ] Apache Spark commented on SPARK-24379: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/21424 > BroadcastExchangeExec should catch SparkOutOfMemory and re-throw > SparkFatalException, which wraps SparkOutOfMemory inside. > -- > > Key: SPARK-24379 > URL: https://issues.apache.org/jira/browse/SPARK-24379 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: jin xing >Priority: Major > > After SPARK-22827, Spark won't fail the entire executor but only fails the > task suffering a SparkOutOfMemoryError. The current BroadcastExchangeExec > try-catches OutOfMemoryError. Consider the scenario below: > # SparkOutOfMemoryError (a subclass of OutOfMemoryError) is thrown in > scala.concurrent.Future; > # SparkOutOfMemoryError is caught and an OutOfMemoryError is wrapped in > SparkFatalException and re-thrown; > # ThreadUtils.awaitResult catches SparkFatalException and an OutOfMemoryError > is thrown; > # The OutOfMemoryError will go to > SparkUncaughtExceptionHandler.uncaughtException and the Executor fails. > So it makes more sense to catch SparkOutOfMemory and re-throw > SparkFatalException, which wraps SparkOutOfMemory inside. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24379) BroadcastExchangeExec should catch SparkOutOfMemory and re-throw SparkFatalException, which wraps SparkOutOfMemory inside.
[ https://issues.apache.org/jira/browse/SPARK-24379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24379: Assignee: Apache Spark > BroadcastExchangeExec should catch SparkOutOfMemory and re-throw > SparkFatalException, which wraps SparkOutOfMemory inside. > -- > > Key: SPARK-24379 > URL: https://issues.apache.org/jira/browse/SPARK-24379 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: jin xing >Assignee: Apache Spark >Priority: Major > > After SPARK-22827, Spark won't fail the entire executor but only fails the > task suffering a SparkOutOfMemoryError. The current BroadcastExchangeExec > try-catches OutOfMemoryError. Consider the scenario below: > # SparkOutOfMemoryError (a subclass of OutOfMemoryError) is thrown in > scala.concurrent.Future; > # SparkOutOfMemoryError is caught and an OutOfMemoryError is wrapped in > SparkFatalException and re-thrown; > # ThreadUtils.awaitResult catches SparkFatalException and an OutOfMemoryError > is thrown; > # The OutOfMemoryError will go to > SparkUncaughtExceptionHandler.uncaughtException and the Executor fails. > So it makes more sense to catch SparkOutOfMemory and re-throw > SparkFatalException, which wraps SparkOutOfMemory inside. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488850#comment-16488850 ] Apache Spark commented on SPARK-24378: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/21423 > Incorrect examples for date_trunc function in spark 2.3.0 > - > > Key: SPARK-24378 > URL: https://issues.apache.org/jira/browse/SPARK-24378 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Prakhar Gupta >Priority: Major > Fix For: 2.3.0 > > > Within the Spark documentation for Spark 2.3.0, the listed examples are incorrect. > date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit > specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", > "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", > "WEEK", "QUARTER"] > *Examples:* > {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }} > > {{The examples should be date_trunc(format, value)}} > {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
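What the corrected argument order computes can be modeled with java.time — truncation zeroes out every field below the chosen unit. This is an illustration of the truncation semantics only, not Spark's date_trunc implementation (method names are hypothetical):

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

public class DateTruncSketch {
    // Model of date_trunc('YEAR', ts): first day of the year, midnight.
    static LocalDateTime truncYear(LocalDateTime ts) {
        return ts.withDayOfYear(1).truncatedTo(ChronoUnit.DAYS);
    }

    // Model of date_trunc('MM', ts): first day of the month, midnight.
    static LocalDateTime truncMonth(LocalDateTime ts) {
        return ts.withDayOfMonth(1).truncatedTo(ChronoUnit.DAYS);
    }

    public static void main(String[] args) {
        LocalDateTime ts = LocalDateTime.parse("2015-03-05T09:32:05.359");
        System.out.println(truncYear(ts));  // 2015-01-01T00:00
        System.out.println(truncMonth(ts)); // 2015-03-01T00:00
    }
}
```

These match the expected results shown in the issue, confirming that only the documented argument order (format first, then timestamp) was wrong, not the expected outputs.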
[jira] [Assigned] (SPARK-24378) Incorrect examples for date_trunc function in spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24378: Assignee: Apache Spark > Incorrect examples for date_trunc function in spark 2.3.0 > - > > Key: SPARK-24378 > URL: https://issues.apache.org/jira/browse/SPARK-24378 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Prakhar Gupta >Assignee: Apache Spark >Priority: Major > Fix For: 2.3.0 > > > Within the Spark documentation for Spark 2.3.0, the listed examples are incorrect. > date_trunc(fmt, ts) - Returns timestamp {{ts}} truncated to the unit > specified by the format model {{fmt}}. {{fmt}} should be one of ["YEAR", > "", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", > "WEEK", "QUARTER"] > *Examples:* > {{> SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 2015-01-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'MM'); 2015-03-01T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'DD'); 2015-03-05T00:00:00 > > SELECT date_trunc('2015-03-05T09:32:05.359', 'HOUR'); 2015-03-05T09:00:00 }} > > {{The examples should be date_trunc(format, value)}} > {{SELECT date_trunc('YEAR','2015-03-05T09:32:05.359'); }} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org