[ 
https://issues.apache.org/jira/browse/SPARK-21011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Landes updated SPARK-21011:
----------------------------------
    Description: 
I used PySpark to read in some CSV files (actually delimited by the backspace 
character, which might be relevant).  The resulting dataframe.show() gives me good 
data - all my columns are there, everything's great.

df = spark.read.option('delimiter', '\b').csv('<some S3 location>')
df.show() # all is good here
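For anyone without access to the S3 data, the input shape can be mocked up locally in plain Python (a sketch with made-up values mirroring the anonymized record below; the file name and values are hypothetical):

```python
# Sketch: build a local file delimited by the ASCII backspace character
# (\x08, i.e. '\b'), the same delimiter passed to spark.read above.
# All values here are made up for illustration.
row = ['3', 'Text Field', '12345', 'some-id', '150.00', 'UserName']
line = '\b'.join(row)

with open('sample.bsv', 'w') as f:
    f.write(line + '\n')

# Splitting on the same delimiter recovers all six columns.
with open('sample.bsv') as f:
    cols = f.readline().rstrip('\n').split('\b')
print(len(cols))  # 6
```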

Now, I want to filter this bad boy...  but I want to use RDD filters because 
they're just nicer to use.

my_rdd = df.rdd
my_rdd.take(5) #all my columns are still here

filtered_rdd = my_rdd.filter(<some filter criteria here>)
filtered_rdd.take(5)

My filtered_rdd is missing a column.  Specifically, _c2 has been mashed into 
_c1.

Here's a relevant record (anonymized) from the df.show():

|3  |Text Field     |12345|<some alphanumeric ID mess here>|150.00|UserName|2012-08-14 00:50:00|2015-02-24 01:23:45|2017-02-34 13:02:33|true|false|

...and here's the return from filtered_rdd.take():

Row(_c0=u'3', _c1=u'"Text Field"\x08"12345"', _c2=u'|<some alphanumeric ID mess here>', _c3=u'150.00', _c4=u'UserName', _c5=u'2012-08-14 00:50:00', _c6=u'2015-02-24 01:23:45', _c7=u'2017-02-34 13:02:33', _c8=u'true', _c9=u'false', _c10=None)

Look at _c1 there - it's been mishmashed together with what was formerly _c2 
(with an ASCII backspace - \x08 - in there)...  and poor old _c10 is left 
without a value.
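For what it's worth, the delimiter survives as a literal \x08 inside the merged field, so the two original columns can still be pulled apart in plain Python (a sketch using the sample value from the Row above), which might suggest the row text is being re-split somewhere without honoring the configured delimiter:

```python
# The corrupted _c1 value as returned by filtered_rdd.take() above.
corrupted = '"Text Field"\x08"12345"'

# The backspace delimiter is still embedded in the merged field, so
# splitting on it recovers the two original columns, quotes and all.
parts = corrupted.split('\x08')
print(parts)  # ['"Text Field"', '"12345"']
```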



> RDD filter can combine/corrupt columns
> --------------------------------------
>
>                 Key: SPARK-21011
>                 URL: https://issues.apache.org/jira/browse/SPARK-21011
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: Steven Landes
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
