[ https://issues.apache.org/jira/browse/SPARK-26240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcelo Vanzin updated SPARK-26240:
-----------------------------------
    Component/s:     (was: Spark Core)
                     SQL

> [pyspark] Updating illegal column names with withColumnRenamed does not change schema, causing pyspark.sql.utils.AnalysisException
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26240
>                 URL: https://issues.apache.org/jira/browse/SPARK-26240
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>        Environment: Ubuntu 16.04 LTS (x86_64/deb)
>            Reporter: Ying Wang
>            Priority: Major
>
> I am unfamiliar with the internals of Spark, but I tried to ingest a Parquet file with illegal column headers. After I called df = df.withColumnRenamed($COLUMN_NAME, $NEW_COLUMN_NAME) and then df.show(), pyspark errored out, and the failing attribute in the error was the old column name.
>
> Steps to reproduce:
> * Create a Parquet file from pandas using this dataframe schema:
> ```python
> In [10]: df.info()
> <class 'pandas.core.frame.DataFrame'>
> Int64Index: 1000 entries, 0 to 999
> Data columns (total 16 columns):
> Record_ID            1000 non-null int64
> registration_dttm    1000 non-null object
> id                   1000 non-null int64
> first_name           984 non-null object
> last_name            1000 non-null object
> email                984 non-null object
> gender               933 non-null object
> ip_address           1000 non-null object
> cc                   709 non-null float64
> country              1000 non-null object
> birthdate            803 non-null object
> salary               932 non-null float64
> title                803 non-null object
> comments             179 non-null object
> Unnamed: 14          10 non-null object
> Unnamed: 15          9 non-null object
> dtypes: float64(2), int64(2), object(12)
> memory usage: 132.8+ KB
> ```
> * Open a pyspark shell with `pyspark` and read in the Parquet file with `spark.read.format('parquet').load('/path/to/file.parquet')`.
> * Call `spark_df.show()` and note the error with column 'Unnamed: 14'.
> * Rename the column, replacing the illegal space character with an underscore: `spark_df = spark_df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')`
> * Call `spark_df.show()` again, and note that the error message still references attribute 'Unnamed: 14':
> ```python
> >>> df = spark.read.parquet('/home/yingw787/Downloads/userdata1.parquet')
> >>> newdf = df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')
> >>> newdf.show()
> Traceback (most recent call last):
>   File "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
>     return f(*a, **kw)
>   File "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.showString.
> : org.apache.spark.sql.AnalysisException: Attribute name "Unnamed: 14" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> ...
> ```
> I would have thought there would be a way to read in Parquet files such that illegal column names could be changed after the Spark DataFrame was generated, so this appears to be unintended behavior. Please let me know if I am wrong.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
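Since the post-hoc rename fails against the underlying Parquet schema, one workaround is to sanitize the column names on the pandas side before the file is ever written. A minimal sketch, assuming the reproduction above: `sanitize_column` is a hypothetical helper, and the invalid-character set is taken verbatim from the AnalysisException message quoted in the issue.

```python
# Characters Spark rejects in Parquet attribute names, per the
# AnalysisException message: " ,;{}()\n\t="
INVALID_CHARS = " ,;{}()\n\t="

def sanitize_column(name: str) -> str:
    """Replace characters Spark disallows in Parquet column names with '_'.

    Hypothetical helper: apply to pandas columns before writing, e.g.
    df.rename(columns=sanitize_column).to_parquet('/path/to/file.parquet'),
    so Spark never sees an illegal attribute name.
    """
    return "".join("_" if c in INVALID_CHARS else c for c in name)

# The reporter's problem column: the space is replaced, the colon
# (not in the invalid set) is kept, matching the 'Unnamed:_14' rename
# attempted in the issue.
print(sanitize_column("Unnamed: 14"))  # Unnamed:_14
```

This only helps when you control the writer; for files produced elsewhere, the error message's own suggestion (aliasing at read time) would be the avenue to explore.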