[jira] [Updated] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels

2018-05-25 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24324:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22216

> Pandas Grouped Map UserDefinedFunction mixes column labels
> --
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Pharicator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("day", StringType(), False),
>  StructField("enc", StringType(), False)])
> from pyspark.sql.functions import 

[jira] [Updated] (SPARK-24324) Pandas Grouped Map UserDefinedFunction mixes column labels

2018-05-24 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-24324:
-
Summary: Pandas Grouped Map UserDefinedFunction mixes column labels  (was: 
UserDefinedFunction mixes column labels)

> Pandas Grouped Map UserDefinedFunction mixes column labels
> --
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Pharicator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("day", StringType(), False),
>  StructField("enc",