[jira] [Comment Edited] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090159#comment-16090159 ]

Stuart Reynolds edited comment on SPARK-21392 at 7/18/17 5:30 PM:
------------------------------------------------------------------

So, trying to look at the CSV was helpful.

{code:none}
#root = "/network/folder/mi" # succeeds
root = "mi"                  # fails
rdd.write.parquet(root+"mi", mode="overwrite")
rdd.write.csv(root+"minn.csv", mode="overwrite")
rdd2 = sqlc.read.parquet(root+"mi")
{code}

The above creates a folder on my local machine, but no data.

{code:none}
% ls -la mi minn.csv
mi:
total 12
drwxrwxr-x 2 builder builder 4096 Jul 17 10:42 .
drwxrwxr-x 5 builder builder 4096 Jul 17 10:42 ..
-rw-r--r-- 1 builder builder    0 Jul 17 10:42 _SUCCESS
-rw-r--r-- 1 builder builder    8 Jul 17 10:42 ._SUCCESS.crc

minn.csv/:
total 12
drwxrwxr-x 2 builder builder 4096 Jul 17 10:42 .
drwxrwxr-x 5 builder builder 4096 Jul 17 10:42 ..
-rw-r--r-- 1 builder builder    0 Jul 17 10:42 _SUCCESS
-rw-r--r-- 1 builder builder    8 Jul 17 10:42 ._SUCCESS.crc
{code}

Prepending the paths with a network folder that's available to Spark succeeds. So, is this just a "file not found" error with a terrible error message?


was (Author: stuartreynolds):

So, trying to look at the CSV was helpful.

{code:none}
#root = "/network/folder" # succeeds
root = ""                 # fails
rdd.write.parquet(root+"mi", mode="overwrite")
rdd.write.csv(root+"minn.csv", mode="overwrite")
rdd2 = sqlc.read.parquet(root+"mi")
{code}

The above creates a folder on my local machine, but no data.

{code:none}
% ls -la mi minn.csv
mi:
total 12
drwxrwxr-x 2 builder builder 4096 Jul 17 10:42 .
drwxrwxr-x 5 builder builder 4096 Jul 17 10:42 ..
-rw-r--r-- 1 builder builder    0 Jul 17 10:42 _SUCCESS
-rw-r--r-- 1 builder builder    8 Jul 17 10:42 ._SUCCESS.crc

minn.csv/:
total 12
drwxrwxr-x 2 builder builder 4096 Jul 17 10:42 .
drwxrwxr-x 5 builder builder 4096 Jul 17 10:42 ..
-rw-r--r-- 1 builder builder    0 Jul 17 10:42 _SUCCESS
-rw-r--r-- 1 builder builder    8 Jul 17 10:42 ._SUCCESS.crc
{code}

Prepending the paths with a network folder that's available to Spark succeeds. So, is this just a "file not found" error with a terrible error message?


> Unable to infer schema when loading large Parquet file
> -------------------------------------------------------
>
>                 Key: SPARK-21392
>                 URL: https://issues.apache.org/jira/browse/SPARK-21392
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.1, 2.2.0
>         Environment: Spark 2.1.1, Python 2.7.6
>            Reporter: Stuart Reynolds
>              Labels: parquet, pyspark
>
> The following boring code works up until I read the Parquet file back in.
>
> {code:none}
> import numpy as np
> import pandas as pd
> import pyspark
> from pyspark import SQLContext, SparkContext, SparkConf
>
> print pyspark.__version__
> sc = SparkContext(conf=SparkConf().setMaster('local'))
> df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
> print df
> sqlc = SQLContext(sc)
> df = sqlc.createDataFrame(df)
> df = df.createOrReplaceTempView("outcomes")
> rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
> print rdd.schema
> rdd.show()
> rdd.write.parquet("mi", mode="overwrite")
> rdd2 = sqlc.read.parquet("mi") # FAIL!
> {code}
>
> {code:none}
> # print pyspark.__version__
> 2.2.0
>
> # print df
>     eid  mi
> 0     0   0
> 1     1   1
> 2     2   2
> 3     3   3
> ...
> [100 rows x 2 columns]
>
> # print rdd.schema
> StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))
>
> # rdd.show()
> +---+---+
> |eid| mi|
> +---+---+
> |  0|  0|
> |  1|  1|
> |  2|  2|
> |  3|  3|
> |  4|  4|
> +---+---+
> {code}
>
> fails with:
>
> {code:none}
> rdd2 = sqlc.read.parquet("mixx")
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", line 291, in parquet
>     return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
>   File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 1133, in __call__
>     answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 69, in deco
>     raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
> {code}
>
> The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?
>
> Works with master='local', but fails when my cluster is specified.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
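For reference, a minimal sketch of the workaround described in the comment above, assuming the `rdd` and `sqlc` objects from the reproduction; the directory name is illustrative, not from the report. The point is to hand Spark a location that the driver and every executor resolve identically:

{code:none}
# Illustrative only: write to storage that every node can reach (a shared
# NFS mount here; HDFS or S3 would serve the same purpose). With a bare
# relative path like "mi", each node resolves it against its own working
# directory, so the driver may see only the _SUCCESS marker and no data.
root = "/network/folder/"            # assumed shared mount, as in the comment
rdd.write.parquet(root + "mi", mode="overwrite")
rdd2 = sqlc.read.parquet(root + "mi")
print rdd2.count()                   # returns rows instead of raising
{code}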
[jira] [Commented] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091854#comment-16091854 ]

Stuart Reynolds commented on SPARK-21392:
------------------------------------------

Okie dokey: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-save-parquet-file-td28874.html

I think there's still a bug here. (I suspect the filename given to the cluster can't be saved on the cluster -- but then the write should fail, not the read, and the error should be different.)
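One way to surface the failure at a clearer point (a sketch, not part of the original report): check for part files through the driver's configured Hadoop filesystem before reading, so an empty output directory fails with a filesystem error rather than a schema-inference error. The py4j `_jvm`/`_jsc` access below is a common but non-public PySpark idiom; the path is the one from the reproduction.

{code:none}
# Sketch: fail fast when the directory holds no data files, using the
# standard org.apache.hadoop.fs.FileSystem API via py4j.
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
out = Path("mi")
names = [s.getPath().getName() for s in fs.listStatus(out)] if fs.exists(out) else []
if not any(n.startswith("part-") for n in names):
    raise IOError("no part files under 'mi' -- the write produced no data")
rdd2 = sqlc.read.parquet("mi")
{code}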
[jira] [Commented] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090159#comment-16090159 ]

Stuart Reynolds commented on SPARK-21392:
------------------------------------------

So, trying to look at the CSV was helpful.

{code:none}
#root = "/network/folder" # succeeds
root = ""                 # fails
rdd.write.parquet(root+"mi", mode="overwrite")
rdd.write.csv(root+"minn.csv", mode="overwrite")
rdd2 = sqlc.read.parquet(root+"mi")
{code}

The above creates a folder on my local machine, but no data.

{code:none}
% ls -la mi minn.csv
mi:
total 12
drwxrwxr-x 2 builder builder 4096 Jul 17 10:42 .
drwxrwxr-x 5 builder builder 4096 Jul 17 10:42 ..
-rw-r--r-- 1 builder builder    0 Jul 17 10:42 _SUCCESS
-rw-r--r-- 1 builder builder    8 Jul 17 10:42 ._SUCCESS.crc

minn.csv/:
total 12
drwxrwxr-x 2 builder builder 4096 Jul 17 10:42 .
drwxrwxr-x 5 builder builder 4096 Jul 17 10:42 ..
-rw-r--r-- 1 builder builder    0 Jul 17 10:42 _SUCCESS
-rw-r--r-- 1 builder builder    8 Jul 17 10:42 ._SUCCESS.crc
{code}

Prepending the paths with a network folder that's available to Spark succeeds. So, is this just a "file not found" error with a terrible error message?
[jira] [Commented] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090134#comment-16090134 ]

Stuart Reynolds commented on SPARK-21392:
------------------------------------------

I've made the example self-contained and sourced from a pandas DataFrame. It seems to succeed with master=local, and fails on the cluster (the cluster's dashboard says it's also Spark 2.2.0).
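For what it's worth, a sketch that makes the local-versus-cluster difference explicit (the paths below are placeholders, not from this report): a bare path such as "mi" is resolved against the configured default filesystem, which typically differs between master='local' and a real cluster.

{code:none}
# Hypothetical paths -- spelling out the filesystem scheme removes the ambiguity.
rdd.write.parquet("file:///tmp/mi", mode="overwrite")           # driver-local disk
rdd.write.parquet("hdfs:///user/builder/mi", mode="overwrite")  # cluster filesystem
rdd2 = sqlc.read.parquet("hdfs:///user/builder/mi")             # readable by all nodes
{code}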
[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stuart Reynolds updated SPARK-21392:
-------------------------------------
    Description: 

The following boring code works up until I read the Parquet file back in.

{code:none}
import numpy as np
import pandas as pd
import pyspark
from pyspark import SQLContext, SparkContext, SparkConf

print pyspark.__version__
sc = SparkContext(conf=SparkConf().setMaster('local'))
df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
print df
sqlc = SQLContext(sc)
df = sqlc.createDataFrame(df)
df = df.createOrReplaceTempView("outcomes")
rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
print rdd.schema
rdd.show()
rdd.write.parquet("mi", mode="overwrite")
rdd2 = sqlc.read.parquet("mi") # FAIL!
{code}

{code:none}
# print pyspark.__version__
2.2.0

# print df
    eid  mi
0     0   0
1     1   1
2     2   2
3     3   3
...
[100 rows x 2 columns]

# print rdd.schema
StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))

# rdd.show()
+---+---+
|eid| mi|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
+---+---+
{code}

fails with:

{code:none}
rdd2 = sqlc.read.parquet("mixx")
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", line 291, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

Works with master='local', but fails when my cluster is specified.


  was:

The following boring code works up until I read the Parquet file back in.

{code:none}
import numpy as np
import pandas as pd
import pyspark
from pyspark import SQLContext, SparkContext, SparkConf

print pyspark.__version__
sc = SparkContext(conf=SparkConf().setMaster('local'))
df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
print df
sqlc = SQLContext(sc)
df = sqlc.createDataFrame(df)
df = df.createOrReplaceTempView("outcomes")
rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
print rdd.schema
rdd.show()
rdd.write.parquet("mi", mode="overwrite")
rdd2 = sqlc.read.parquet("mi")
{code}

{code:none}
# print pyspark.__version__
2.2.0

# print df
    eid  mi
0     0   0
1     1   1
2     2   2
3     3   3
---
[100 rows x 2 columns]

# print rdd.schema
StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))

# rdd.show()
+---+---+
|eid| mi|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
+---+---+
{code}

fails with:

{code:none}
rdd2 = sqlc.read.parquet("mixx")
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", line 291, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

in

{code:none}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

Works with master='local', but fails when my cluster is specified.
[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stuart Reynolds updated SPARK-21392:
-------------------------------------
    Description: 

The following boring code works up until I read the Parquet file back in.

{code:none}
import numpy as np
import pandas as pd
import pyspark
from pyspark import SQLContext, SparkContext, SparkConf

print pyspark.__version__
sc = SparkContext(conf=SparkConf().setMaster('local'))
df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
print df
sqlc = SQLContext(sc)
df = sqlc.createDataFrame(df)
df = df.createOrReplaceTempView("outcomes")
rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
print rdd.schema
rdd.show()
rdd.write.parquet("mi", mode="overwrite")
rdd2 = sqlc.read.parquet("mi")
{code}

{code:none}
# print pyspark.__version__
2.2.0

# print df
    eid  mi
0     0   0
1     1   1
2     2   2
3     3   3
---
[100 rows x 2 columns]

# print rdd.schema
StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))

# rdd.show()
+---+---+
|eid| mi|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
+---+---+
{code}

fails with:

{code:none}
rdd2 = sqlc.read.parquet("mixx")
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", line 291, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

in

{code:none}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

Works with master='local', but fails when my cluster is specified.


  was:

The following boring code works up until I read the Parquet file back in.

{code:none}
response = "mi_or_chd_5"

sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom

rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))

rdd.show()
#+---+-----------+
#|eid|mi_or_chd_5|
#+---+-----------+
#|226|       null|
#|442|       null|
#|978|          0|
#|851|          0|
#|428|          0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

in

{code:none}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the SQL query. The whole selected table is 500k rows with an int and a short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which claims it was fixed in 2.0.1 and 2.1.0. (Current bug is 2.1.1.)
[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stuart Reynolds updated SPARK-21392:
-------------------------------------
    Affects Version/s: 2.2.0
[jira] [Comment Edited] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084837#comment-16084837 ]

Stuart Reynolds edited comment on SPARK-21392 at 7/13/17 4:55 PM:
------------------------------------------------------------------

I've simplified the example a little -more and also found that limiting the query to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, it fails-.

In case it helps, my utility function get_sparkSQLContextWithTables loaded the full table 'outcomes' from Postgres into Spark, with 10 partitions, using:

{code:none}
index="eid"
index_min=min(eid)
index_max=max(eid)
{code}


was (Author: stuartreynolds):

I've simplified the example a little more and also found that limiting the query to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, it fails.

In case it helps, my utility function get_sparkSQLContextWithTables loaded the full table 'outcomes' from Postgres into Spark, with 10 partitions, using:

{code:none}
index="eid"
index_min=min(eid)
index_max=max(eid)
{code}
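A guess at what get_sparkSQLContextWithTables might do internally, reconstructed from the index/index_min/index_max hints in the comment above. The JDBC URL, credentials, and bounds below are placeholders, not values from this report; the `read.jdbc` signature itself is standard PySpark 2.x.

{code:none}
# Sketch of a partitioned JDBC load of "outcomes", split on eid.
df = sqlc.read.jdbc(
    url="jdbc:postgresql://dbhost:5432/mydb",   # placeholder
    table="outcomes",
    column="eid",          # index="eid"
    lowerBound=1,          # index_min = min(eid)
    upperBound=500000,     # index_max = max(eid)
    numPartitions=10,
    properties={"user": "builder", "password": "..."})
df.createOrReplaceTempView("outcomes")
{code}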
[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stuart Reynolds updated SPARK-21392:
-------------------------------------
    Description: 

The following boring code works up until I read the Parquet file back in.

{code:none}
response = "mi_or_chd_5"

sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom

rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))

rdd.show()
#+---+-----------+
#|eid|mi_or_chd_5|
#+---+-----------+
#|226|       null|
#|442|       null|
#|978|          0|
#|851|          0|
#|428|          0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

in

{code:none}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the SQL query. The whole selected table is 500k rows with an int and a short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which claims it was fixed in 2.0.1 and 2.1.0. (Current bug is 2.1.1.)


  was:

The following boring code works

{code:none}
response = "mi_or_chd_5"

sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom

rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))

rdd.show()
#+---+-----------+
#|eid|mi_or_chd_5|
#+---+-----------+
#|226|       null|
#|442|       null|
#|978|          0|
#|851|          0|
#|428|          0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

in

{code:none}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the SQL query. The whole selected table is 500k rows with an int and a short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which claims it was fixed in 2.0.1 and 2.1.0. (Current bug is 2.1.1.)
[jira] [Commented] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085987#comment-16085987 ]

Stuart Reynolds commented on SPARK-21392:
------------------------------------------

My bad -- I may have missed the error message. I upgraded to Spark 2.2.0 today and re-ran this. I get the error no matter what the number of rows.
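A quick driver-side check (illustrative; it only inspects the driver's local filesystem, and assumes the output path from the description) that distinguishes "no data was written here" from a genuine schema problem:

{code:none}
import glob

# A directory containing only _SUCCESS is exactly the state that triggers
# "Unable to infer schema for Parquet".
parts = glob.glob("mi_or_chd_5/part-*")
print "part files:", parts
if not parts:
    print "only _SUCCESS was written locally -- the data went elsewhere"
{code}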
[jira] [Comment Edited] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084837#comment-16084837 ]

Stuart Reynolds edited comment on SPARK-21392 at 7/12/17 10:27 PM:
-------------------------------------------------------------------

I've simplified the example a little more and also found that limiting the query to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, it fails.

In case it helps, my utility function get_sparkSQLContextWithTables loaded the full table 'outcomes' from Postgres into Spark, with 10 partitions, using:

{code:non}
index="eid"
index_min=min(eid)
index_max=max(eid)
{code}


was (Author: stuartreynolds):

I've simplified the example a little more and also found that limiting the query to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, it fails.
[jira] [Commented] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084837#comment-16084837 ]

Stuart Reynolds commented on SPARK-21392:
------------------------------------------

I've simplified the example a little more and also found that limiting the query to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, it fails.
[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stuart Reynolds updated SPARK-21392:
-------------------------------------
    Description: 

The following boring code works

{code:none}
response = "mi_or_chd_5"

sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom

rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))

rdd.show()
#+---+-----------+
#|eid|mi_or_chd_5|
#+---+-----------+
#|226|       null|
#|442|       null|
#|978|          0|
#|851|          0|
#|428|          0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

in

{code:none}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the SQL query. The whole selected table is 500k rows with an int and a short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which claims it was fixed in 2.0.1 and 2.1.0. (Current bug is 2.1.1.)


  was:

The following boring code works

{code:none}
response = "mi_or_chd_5"

sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom

rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))

rdd.show()
#+---+-----------+
#|eid|mi_or_chd_5|
#+---+-----------+
#|216|       null|
#|431|       null|
#|978|          0|
#|852|          0|
#|418|          0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

in

{code:none}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the SQL query. The whole selected table is 500k rows with an int and a short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which claims it was fixed in 2.0.1 and 2.1.0. (Current bug is 2.1.1.)
[jira] [Comment Edited] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084837#comment-16084837 ]

Stuart Reynolds edited comment on SPARK-21392 at 7/12/17 10:27 PM:
-------------------------------------------------------------------

I've simplified the example a little more and also found that limiting the query to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, it fails.

In case it helps, my utility function get_sparkSQLContextWithTables loaded the full table 'outcomes' from Postgres into Spark, with 10 partitions, using:

{code:none}
index="eid"
index_min=min(eid)
index_max=max(eid)
{code}


was (Author: stuartreynolds):

I've simplified the example a little more and also found that limiting the query to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, it fails.

In case it helps, my utility function get_sparkSQLContextWithTables loaded the full table 'outcomes' from Postgres into Spark, with 10 partitions, using:

{code:non}
index="eid"
index_min=min(eid)
index_max=max(eid)
{code}
[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stuart Reynolds updated SPARK-21392:
-------------------------------------
    Description: 

The following boring code works

{code:none}
response = "mi_or_chd_5"

sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom

rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))

rdd.show()
#+---+-----------+
#|eid|mi_or_chd_5|
#+---+-----------+
#|216|       null|
#|431|       null|
#|978|          0|
#|852|          0|
#|418|          0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

in

{code:none}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the SQL query. The whole selected table is 500k rows with an int and a short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which claims it was fixed in 2.0.1 and 2.1.0. (Current bug is 2.1.1.)


  was:

The following boring code works

{code:none}
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
                      from outcomes
                      where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,

{code:none}
outcome2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

in

{code:none}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which claims it was fixed in 2.0.1 and 2.1.0. (Current bug is 2.1.1.)


        Summary: Unable to infer schema when loading large Parquet file  (was: Unable to infer schema when loading Parquet file)
[jira] [Commented] (SPARK-21392) Unable to infer schema when loading Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084601#comment-16084601 ]

Stuart Reynolds commented on SPARK-21392:
------------------------------------------

Done. I'm simply trying to build a table of two columns ("eid" and whatever the named response variable is), save the table as a Parquet file (with the same name as the second column), and load it back up. Saving works, but loading fails, complaining that it can't infer the schema, even though I could print the schema before saving it:

{code:none}
StructType(
    List(
        StructField(eid,IntegerType,true),
        StructField(response,ShortType,true)))
{code}
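Since the schema was known before saving, one workaround (a sketch, assuming the same `sqlc` and `response` as in the description) is to hand that schema back to the reader instead of relying on inference:

{code:none}
from pyspark.sql.types import StructType, StructField, IntegerType, ShortType

# Rebuild the printed schema and pass it to the reader; this skips the
# inference step that raises AnalysisException, though an empty output
# directory will then simply read back as an empty DataFrame.
schema = StructType([
    StructField("eid", IntegerType(), True),
    StructField("response", ShortType(), True)])
outcome2 = sqlc.read.schema(schema).parquet(response)
{code}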
[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stuart Reynolds updated SPARK-21392: Description: The following boring code works {code:none} response = "mi_or_chd_5" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") >>> print outcome.schema StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) {code} But then, {code:none} outcome2 = sqlc.read.parquet(response) # fail {code} fails with: {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;' {code} in {code:none} /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw) {code} The documentation for parquet says the format is self describing, and the full schema was available when the parquet file was saved. What gives? Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1) was: The following boring code works {code:python} response = "mi_or_chd_5" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") >>> print outcome.schema StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) {code} But then, {code:python} outcome2 = sqlc.read.parquet(response) # fail {code} fails with: {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;' {code} in {code:python} /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw) {code} The documentation for parquet says the format is self describing, and the full schema was available when the parquet file was saved. What gives? Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1) > Unable to infer schema when loading Parquet file > > > Key: SPARK-21392 > URL: https://issues.apache.org/jira/browse/SPARK-21392 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.1 > Environment: Spark 2.1.1. python 2.7.6 >Reporter: Stuart Reynolds > Labels: parquet, pyspark > > The following boring code works > {code:none} > response = "mi_or_chd_5" > outcome = sqlc.sql("""select eid,{response} as response > from outcomes > where {response} IS NOT NULL""".format(response=response)) > outcome.write.parquet(response, mode="overwrite") > > >>> print outcome.schema > > StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) > {code} > > But then, > {code:none} > outcome2 = sqlc.read.parquet(response) # fail > {code} > fails with: > {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must > be specified manually.;' > {code} > in > {code:none} > /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc > in deco(*a, **kw) > {code} > The documentation for parquet says the format is self describing, and the > full schema was available when the parquet file was saved. What gives? > Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but > which claims it was fixed in 2.0.1, 2.1.0. 
(Current bug is 2.1.1) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
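One plausible reading of this failure, offered as an assumption rather than a confirmed diagnosis: when a cluster master is set, a bare relative path such as the value of {{response}} can resolve against different filesystems for the executors that write and the driver that reads, so the read side finds a directory with no Parquet footers and schema inference fails. A minimal sketch of the safer pattern, using a hypothetical shared HDFS location:
{code:python}
# Sketch, not from the ticket: write to one fully qualified URI that every
# node can reach, so writer and reader agree on the filesystem.
# "hdfs://namenode:8020/shared/" is a placeholder for a real shared location.
path = "hdfs://namenode:8020/shared/" + response
outcome.write.parquet(path, mode="overwrite")
outcome2 = sqlc.read.parquet(path)  # succeeds once the path is reachable
{code}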
[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stuart Reynolds updated SPARK-21392: Description: The following boring code works {code:python} response = "mi_or_chd_5" colname = "f_1000" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") col = sqlc.sql("""select eid,{colname} as {colname} from baseline_denull where {colname} IS NOT NULL""".format(colname=colname)) col.write.parquet(colname, mode="overwrite") >>> print outcome.schema StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) >>> print col.schema StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true))) {code} But then, {code:python} outcome2 = sqlc.read.parquet(response) # fail col2 = sqlc.read.parquet(colname) # fail {code} fails with: {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;' {code} in {code:python} /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw) {code} The documentation for parquet says the format is self describing, and the full schema was available when the parquet file was saved. What gives? Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1) was: The following boring code works {{response = "mi_or_chd_5" colname = "f_1000" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") col = sqlc.sql("""select eid,{colname} as {colname} from baseline_denull where {colname} IS NOT NULL""".format(colname=colname)) col.write.parquet(colname, mode="overwrite") >>> print outcome.schema StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) >>> print col.schema StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true))) }}. But then, {{ outcome2 = sqlc.read.parquet(response) # fail col2 = sqlc.read.parquet(colname) # fail }}. fails with: {{AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;' }}. in {{ /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw) }}. The documentation for parquet says the format is self describing, and the full schema was available when the parquet file was saved. What gives? Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1) > Unable to infer schema when loading Parquet file > > > Key: SPARK-21392 > URL: https://issues.apache.org/jira/browse/SPARK-21392 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.1 > Environment: Spark 2.1.1. 
python 2.7.6 >Reporter: Stuart Reynolds > Labels: parquet, pyspark > > The following boring code works > {code:python} >response = "mi_or_chd_5" > colname = "f_1000" > outcome = sqlc.sql("""select eid,{response} as response > from outcomes > where {response} IS NOT NULL""".format(response=response)) > outcome.write.parquet(response, mode="overwrite") > > col = sqlc.sql("""select eid,{colname} as {colname} > from baseline_denull > where {colname} IS NOT NULL""".format(colname=colname)) > col.write.parquet(colname, mode="overwrite") > >>> print outcome.schema > > StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) > >>> print col.schema > > StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true))) > {code} > > But then, > {code:python} > outcome2 = sqlc.read.parquet(response) # fail > col2 = sqlc.read.parquet(colname) # fail > {code} > fails with: > {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must > be specified manually.;' > {code} > in > {code:python} > /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc > in deco(*a, **kw) > {code} > The documentation for parquet says the format is self describing, and the > full schema was available when the parquet file was saved. What gives? > Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but > which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1) -- This
[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stuart Reynolds updated SPARK-21392: Description: The following boring code works {code:python} response = "mi_or_chd_5" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") >>> print outcome.schema StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) {code} But then, {code:python} outcome2 = sqlc.read.parquet(response) # fail {code} fails with: {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;' {code} in {code:python} /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw) {code} The documentation for parquet says the format is self describing, and the full schema was available when the parquet file was saved. What gives? Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1) was: The following boring code works {code:python} response = "mi_or_chd_5" colname = "f123" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") col = sqlc.sql("""select eid,{colname} as {colname} from baseline_denull where {colname} IS NOT NULL""".format(colname=colname)) col.write.parquet(colname, mode="overwrite") >>> print outcome.schema StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) >>> print col.schema StructType(List(StructField(eid,IntegerType,true),StructField(f123,DoubleType,true))) {code} But then, {code:python} outcome2 = sqlc.read.parquet(response) # fail col2 = sqlc.read.parquet(colname) # fail {code} fails with: {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;' {code} in {code:python} /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw) {code} The documentation for parquet says the format is self describing, and the full schema was available when the parquet file was saved. What gives? Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1) > Unable to infer schema when loading Parquet file > > > Key: SPARK-21392 > URL: https://issues.apache.org/jira/browse/SPARK-21392 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.1 > Environment: Spark 2.1.1. python 2.7.6 >Reporter: Stuart Reynolds > Labels: parquet, pyspark > > The following boring code works > {code:python} > response = "mi_or_chd_5" > outcome = sqlc.sql("""select eid,{response} as response > from outcomes > where {response} IS NOT NULL""".format(response=response)) > outcome.write.parquet(response, mode="overwrite") > > >>> print outcome.schema > > StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) > {code} > > But then, > {code:python} > outcome2 = sqlc.read.parquet(response) # fail > {code} > fails with: > {code:python}AnalysisException: u'Unable to infer schema for Parquet. 
It must > be specified manually.;' > {code} > in > {code:python} > /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc > in deco(*a, **kw) > {code} > The documentation for parquet says the format is self describing, and the > full schema was available when the parquet file was saved. What gives? > Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but > which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stuart Reynolds updated SPARK-21392: Description: The following boring code works {code:python} response = "mi_or_chd_5" colname = "f123" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") col = sqlc.sql("""select eid,{colname} as {colname} from baseline_denull where {colname} IS NOT NULL""".format(colname=colname)) col.write.parquet(colname, mode="overwrite") >>> print outcome.schema StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) >>> print col.schema StructType(List(StructField(eid,IntegerType,true),StructField(f123,DoubleType,true))) {code} But then, {code:python} outcome2 = sqlc.read.parquet(response) # fail col2 = sqlc.read.parquet(colname) # fail {code} fails with: {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;' {code} in {code:python} /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw) {code} The documentation for parquet says the format is self describing, and the full schema was available when the parquet file was saved. What gives? Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1) was: The following boring code works {code:python} response = "mi_or_chd_5" colname = "f_1000" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") col = sqlc.sql("""select eid,{colname} as {colname} from baseline_denull where {colname} IS NOT NULL""".format(colname=colname)) col.write.parquet(colname, mode="overwrite") >>> print outcome.schema StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) >>> print col.schema StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true))) {code} But then, {code:python} outcome2 = sqlc.read.parquet(response) # fail col2 = sqlc.read.parquet(colname) # fail {code} fails with: {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;' {code} in {code:python} /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw) {code} The documentation for parquet says the format is self describing, and the full schema was available when the parquet file was saved. What gives? Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1) > Unable to infer schema when loading Parquet file > > > Key: SPARK-21392 > URL: https://issues.apache.org/jira/browse/SPARK-21392 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.1 > Environment: Spark 2.1.1. 
python 2.7.6 >Reporter: Stuart Reynolds > Labels: parquet, pyspark > > The following boring code works > {code:python} > response = "mi_or_chd_5" > colname = "f123" > outcome = sqlc.sql("""select eid,{response} as response > from outcomes > where {response} IS NOT NULL""".format(response=response)) > outcome.write.parquet(response, mode="overwrite") > > col = sqlc.sql("""select eid,{colname} as {colname} > from baseline_denull > where {colname} IS NOT NULL""".format(colname=colname)) > col.write.parquet(colname, mode="overwrite") > >>> print outcome.schema > > StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) > >>> print col.schema > > StructType(List(StructField(eid,IntegerType,true),StructField(f123,DoubleType,true))) > {code} > > But then, > {code:python} > outcome2 = sqlc.read.parquet(response) # fail > col2 = sqlc.read.parquet(colname) # fail > {code} > fails with: > {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must > be specified manually.;' > {code} > in > {code:python} > /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc > in deco(*a, **kw) > {code} > The documentation for parquet says the format is self describing, and the > full schema was available when the parquet file was saved. What gives? > Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but > which claims it was fixed in
[jira] [Created] (SPARK-21392) Unable to infer schema when loading Parquet file
Stuart Reynolds created SPARK-21392: --- Summary: Unable to infer schema when loading Parquet file Key: SPARK-21392 URL: https://issues.apache.org/jira/browse/SPARK-21392 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.1 Environment: Spark 2.1.1. python 2.7.6 Reporter: Stuart Reynolds The following boring code works {code:python} response = "mi_or_chd_5" colname = "f_1000" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") col = sqlc.sql("""select eid,{colname} as {colname} from baseline_denull where {colname} IS NOT NULL""".format(colname=colname)) col.write.parquet(colname, mode="overwrite") >>> print outcome.schema StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) >>> print col.schema StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true))) {code} But then, {code:python} outcome2 = sqlc.read.parquet(response) # fail col2 = sqlc.read.parquet(colname) # fail {code} fails with: {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;' {code} in {code:none} /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw) {code} The documentation for parquet says the format is self describing, and the full schema was available when the parquet file was saved. What gives? Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
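The exception text itself suggests one immediate mitigation: pass the schema explicitly so the reader skips footer-based inference. A sketch assuming the schema printed above; note this only bypasses inference, and if no part files were actually written the read will simply come back empty:
{code:python}
from pyspark.sql.types import StructType, StructField, IntegerType, ShortType

# Mirrors the schema printed by "print outcome.schema" above.
schema = StructType([
    StructField("eid", IntegerType(), True),
    StructField("response", ShortType(), True),
])
outcome2 = sqlc.read.schema(schema).parquet(response)  # no inference needed
{code}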
[jira] [Updated] (SPARK-20847) Error reading NULL int[] element from postgres -- null pointer exception.
[ https://issues.apache.org/jira/browse/SPARK-20847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stuart Reynolds updated SPARK-20847: Description: -- maybe fixed already? https://github.com/apache/spark/commit/f174cdc7478d0b81f9cfa896284a5ec4c6bb952d {code:python} def query_int_array(): import pandas as pd from pyspark.sql import SQLContext user,password = ... , hostname = dbName = ... url = "jdbc:postgresql://{hostname}:5432/{dbName}".format(**locals()) properties = {'user': user, 'password': password} sql_create = """DROP TABLE IF EXISTS public._df10; CREATE TABLE IF NOT EXISTS public._df10 ( id integer, f_21 integer[] ); INSERT INTO public._df10(id, f_21) VALUES (1, ARRAY[1,2]) --OK ,(2, ARRAY[3,NULL]) --OK ,(3, NULL) --FAIL *< PROBLEM ;""" engine = sqlalchemy.create_engine('postgresql+psycopg2://{user}:{password}@{hostname}:5432/{dbName}'.format(**locals())) with engine.connect().execution_options(autocommit=True) as con: con.execute(sql_create) # Export postgres _df10 to spark as table df10 sc = get_spark_context(master="local") sqlContext = SQLContext(sc) df10 = sqlContext.read.format("jdbc"). \ option("url", url). \ option("driver", "org.postgresql.Driver"). \ option("useUnicode", "true"). \ option("continueBatchOnError","true"). \ option("useSSL", "false"). \ option("user", user). \ option("password", password). \ option("dbtable", "_df10"). \ load() df10.registerTempTable("df10") print "DF inferred from postgres:" df10.printSchema() df10.show() print "DF queried from postgres:" df10 = sqlContext.sql("select * from df10") df10.printSchema() df10.show() print df10.collect() {code} Explodes with: {noformat} DF inferred from postgres: root |-- id: integer (nullable = true) |-- f_21: array (nullable = true) ||-- element: integer (containsNull = true) 17/05/22 15:46:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:427) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:425) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/05/22 15:46:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:427) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:425) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286) at
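Until a fix such as the commit referenced above is available, one untested workaround sketch is to push a COALESCE into the JDBC source query so that NULL arrays never reach Spark's array getter. This changes semantics (a NULL row becomes an empty array); the subquery-as-dbtable form is standard Spark JDBC usage, and the variables reuse the ticket's own code:
{code:python}
# Untested workaround sketch: "dbtable" accepts a parenthesized subquery,
# letting postgres replace NULL arrays before rows reach the JDBC getter
# that currently dereferences them without a null check.
subquery = "(select id, coalesce(f_21, ARRAY[]::integer[]) as f_21 from public._df10) as t"
df10 = sqlContext.read.format("jdbc") \
    .option("url", url) \
    .option("driver", "org.postgresql.Driver") \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", subquery) \
    .load()
df10.show()  # row 3 now carries an empty array instead of NULL
{code}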
[jira] [Created] (SPARK-20847) Error reading NULL int[] element from postgres -- null pointer exception.
Stuart Reynolds created SPARK-20847: --- Summary: Error reading NULL int[] element from postgres -- null pointer exception. Key: SPARK-20847 URL: https://issues.apache.org/jira/browse/SPARK-20847 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Stuart Reynolds {code:python} def query_int_array(): import pandas as pd from pyspark.sql import SQLContext user,password = ... , hostname = dbName = ... url = "jdbc:postgresql://{hostname}:5432/{dbName}".format(**locals()) properties = {'user': user, 'password': password} sql_create = """DROP TABLE IF EXISTS public._df10; CREATE TABLE IF NOT EXISTS public._df10 ( id integer, f_21 integer[] ); INSERT INTO public._df10(id, f_21) VALUES (1, ARRAY[1,2]) --OK ,(2, ARRAY[3,NULL]) --OK ,(3, NULL) --FAIL *< PROBLEM ;""" engine = sqlalchemy.create_engine('postgresql+psycopg2://{user}:{password}@{hostname}:5432/{dbName}'.format(**locals())) with engine.connect().execution_options(autocommit=True) as con: con.execute(sql_create) # Export postgres _df10 to spark as table df10 sc = get_spark_context(master="local") sqlContext = SQLContext(sc) df10 = sqlContext.read.format("jdbc"). \ option("url", url). \ option("driver", "org.postgresql.Driver"). \ option("useUnicode", "true"). \ option("continueBatchOnError","true"). \ option("useSSL", "false"). \ option("user", user). \ option("password", password). \ option("dbtable", "_df10"). \ load() df10.registerTempTable("df10") print "DF inferred from postgres:" df10.printSchema() df10.show() print "DF queried from postgres:" df10 = sqlContext.sql("select * from df10") df10.printSchema() df10.show() print df10.collect() {code} Explodes with: {noformat} DF inferred from postgres: root |-- id: integer (nullable = true) |-- f_21: array (nullable = true) ||-- element: integer (containsNull = true) 17/05/22 15:46:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:427) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:425) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/05/22 15:46:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:427) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:425) at
[jira] [Updated] (SPARK-20846) Incorrect postgres sql array column schema inferred from table.
[ https://issues.apache.org/jira/browse/SPARK-20846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stuart Reynolds updated SPARK-20846: Summary: Incorrect postgres sql array column schema inferred from table. (was: Incorrect postgres sql schema inferred from table.) > Incorrect postgres sql array column schema inferred from table. > -- > > Key: SPARK-20846 > URL: https://issues.apache.org/jira/browse/SPARK-20846 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Stuart Reynolds > > When reading a table containing int[][] columns from postgres, the column is > inferred as int[] (should be int[][]). > {code:python} > from pyspark.sql import SQLContext > import pandas as pd > from dataIngest.util.sqlUtil import asSQLAlchemyEngine > user,password = ..., ... > url = "postgresql://hostname:5432/dbname" > url = 'jdbc:'+url > properties = {'user': user, 'password': password} > engine = ... sql alchemy engine ... > # Create pandas df with int[] and int[][] > df = pd.DataFrame({ > 'a1': [[1,2,None],[1,2,3], None], > 'b2': [[[1],[None],[3]], [[1],[2],[3]], None] > }) > # Store df into postgres as table _dfjunk > with engine.connect().execution_options(autocommit=True) as con: > con.execute(""" > DROP TABLE IF EXISTS _dfjunk; > > CREATE TABLE _dfjunk ( > a1 int[] NULL, > b2 int[][] NULL > ); > """) > df.to_sql("_dfjunk", con, index=None, if_exists="append") > # Let's access via spark > sc = get_spark_context(master="local") > sqlContext = SQLContext(sc) > print "pandas DF as spark DF:" > df = sqlContext.createDataFrame(df) > df.printSchema() > df.show() > df.registerTempTable("df") > print sqlContext.sql("select * from df").collect() > ### Export _dfjunk as table df3 > df3 = sqlContext.read.format("jdbc"). \ > option("url", url). \ > option("driver", "org.postgresql.Driver"). \ > option("useUnicode", "true"). \ > option("continueBatchOnError","true"). \ > option("useSSL", "false"). \ > option("user", user). \ > option("password", password). \ > option("dbtable", "_dfjunk").\ > load() > df3.registerTempTable("df3") > print "DF inferred from postgres:" > df3.printSchema() > df3.show() > print "DF queried from postgres:" > df3 = sqlContext.sql("select * from df3") > df3.printSchema() > df3.show() > print df3.collect() > {code} > Errors out with: > pandas DF as spark DF: > {noformat} > root > |-- a1: array (nullable = true) > ||-- element: long (containsNull = true) > |-- b2: array (nullable = true) > ||-- element: array (containsNull = true) > |||-- element: long (containsNull = true) <<< ** THIS IS > CORRECT > +++ > | a1| b2| > +++ > |[1, 2, null]|[WrappedArray(1),...| > | [1, 2, 3]|[WrappedArray(1),...| > |null|null| > +++ > [Row(a1=[1, 2, None], b2=[[1], [None], [3]]), Row(a1=[1, 2, 3], b2=[[1], [2], > [3]]), Row(a1=None, b2=None)] > DF inferred from postgres: > root > |-- a1: array (nullable = true) > ||-- element: integer (containsNull = true) > |-- b2: array (nullable = true) > ||-- element: integer (containsNull = true)<<< ** THIS IS > WRONG Is an array of arrays. 
> 17/05/22 15:00:39 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) > java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to > java.lang.Integer > at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101) > at > org.apache.spark.sql.catalyst.util.GenericArrayData.getInt(GenericArrayData.scala:62) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at
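A likely explanation, stated here as an assumption: JDBC describes the column as java.sql.Types.ARRAY with a base element type but no dimension count, so int[] and int[][] look identical to Spark's JDBC reader and the flatter type wins. If so, one workaround is to serialize the nested array server-side and parse it after loading:
{code:python}
# Workaround sketch under the assumption above: cast int[][] to text in a
# subquery, sidestepping the dimension-less ARRAY metadata entirely.
subquery = "(select a1, b2::text as b2_text from _dfjunk) as t"
df3 = sqlContext.read.format("jdbc") \
    .option("url", url) \
    .option("driver", "org.postgresql.Driver") \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", subquery) \
    .load()
# b2_text arrives as a postgres array literal such as '{{1},{NULL},{3}}'.
{code}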
[jira] [Updated] (SPARK-20846) Incorrect postgres sql schema inferred from table.
[ https://issues.apache.org/jira/browse/SPARK-20846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stuart Reynolds updated SPARK-20846: Description: When reading a table containing int[][] columns from postgres, the column is inferred as int[] (should be int[][]). {code:python} from pyspark.sql import SQLContext import pandas as pd from dataIngest.util.sqlUtil import asSQLAlchemyEngine user,password = ..., ... url = "postgresql://hostname:5432/dbname" url = 'jdbc:'+url properties = {'user': user, 'password': password} engine = ... sql alchemy engine ... Create pandas df with int[] and int[][] df = pd.DataFrame({ 'a1': [[1,2,None],[1,2,3], None], 'b2': [[[1],[None],[3]], [[1],[2],[3]], None] }) # Store df into postgres as table _dfjunk with engine.connect().execution_options(autocommit=True) as con: con.execute(""" DROP TABLE IF EXISTS _dfjunk; CREATE TABLE _dfjunk ( a1 int[] NULL, b2 int[][] NULL ); """) df.to_sql("_dfjunk", con, index=None, if_exists="append") # Let's access via spark sc = get_spark_context(master="local") sqlContext = SQLContext(sc) print "pandas DF as spark DF:" df = sqlContext.createDataFrame(df) df.printSchema() df.show() df.registerTempTable("df") print sqlContext.sql("select * from df").collect() ### Export _dfjunk as table df3 df3 = sqlContext.read.format("jdbc"). \ option("url", url). \ option("driver", "org.postgresql.Driver"). \ option("useUnicode", "true"). \ option("continueBatchOnError","true"). \ option("useSSL", "false"). \ option("user", user). \ option("password", password). \ option("dbtable", "_dfjunk").\ load() df3.registerTempTable("df3") print "DF inferred from postgres:" df3.printSchema() df3.show() print "DF queried from postgres:" df3 = sqlContext.sql("select * from df3") df3.printSchema() df3.show() print df3.collect() {code} Errors out with: pandas DF as spark DF: {noformat} root |-- a1: array (nullable = true) ||-- element: long (containsNull = true) |-- b2: array (nullable = true) ||-- element: array (containsNull = true) |||-- element: long (containsNull = true) <<< ** THIS IS CORRECT +++ | a1| b2| +++ |[1, 2, null]|[WrappedArray(1),...| | [1, 2, 3]|[WrappedArray(1),...| |null|null| +++ [Row(a1=[1, 2, None], b2=[[1], [None], [3]]), Row(a1=[1, 2, 3], b2=[[1], [2], [3]]), Row(a1=None, b2=None)] DF inferred from postgres: root |-- a1: array (nullable = true) ||-- element: integer (containsNull = true) |-- b2: array (nullable = true) ||-- element: integer (containsNull = true)<<< ** THIS IS WRONG Is an array of arrays. 
17/05/22 15:00:39 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101) at org.apache.spark.sql.catalyst.util.GenericArrayData.getInt(GenericArrayData.scala:62) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} was: When reading a table containing int[][] columns from postgres, the column is inferred as int[] (should be int[][]). from
[jira] [Updated] (SPARK-20846) Incorrect postgres sql schema inferred from table.
[ https://issues.apache.org/jira/browse/SPARK-20846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stuart Reynolds updated SPARK-20846: Description: When reading a table containing int[][] columns from postgres, the column is inferred as int[] (should be int[][]). from pyspark.sql import SQLContext import pandas as pd from dataIngest.util.sqlUtil import asSQLAlchemyEngine user,password = ..., ... url = "postgresql://hostname:5432/dbname" url = 'jdbc:'+url properties = {'user': user, 'password': password} engine = ... sql alchemy engine ... # Create pandas df with int[] and int[][] df = pd.DataFrame({ 'a1': [[1,2,None],[1,2,3], None], 'b2': [[[1],[None],[3]], [[1],[2],[3]], None] }) ### Store df into postgres as table _dfjunk with engine.connect().execution_options(autocommit=True) as con: con.execute(""" DROP TABLE IF EXISTS _dfjunk; CREATE TABLE _dfjunk ( a1 int[] NULL, b2 int[][] NULL ); """) df.to_sql("_dfjunk", con, index=None, if_exists="append") ### Let's access via spark sc = get_spark_context(master="local") sqlContext = SQLContext(sc) print "pandas DF as spark DF:" df = sqlContext.createDataFrame(df) df.printSchema() df.show() df.registerTempTable("df") print sqlContext.sql("select * from df").collect() ### Export _dfjunk as table df3 df3 = sqlContext.read.format("jdbc"). \ option("url", url). \ option("driver", "org.postgresql.Driver"). \ option("useUnicode", "true"). \ option("continueBatchOnError","true"). \ option("useSSL", "false"). \ option("user", user). \ option("password", password). \ option("dbtable", "_dfjunk").\ load() df3.registerTempTable("df3") print "DF inferred from postgres:" df3.printSchema() df3.show() print "DF queried from postgres:" df3 = sqlContext.sql("select * from df3") df3.printSchema() df3.show() print df3.collect() Errors out with: pandas DF as spark DF: root |-- a1: array (nullable = true) ||-- element: long (containsNull = true) |-- b2: array (nullable = true) ||-- element: array (containsNull = true) |||-- element: long (containsNull = true) <<< ** THIS IS CORRECT +++ | a1| b2| +++ |[1, 2, null]|[WrappedArray(1),...| | [1, 2, 3]|[WrappedArray(1),...| |null|null| +++ [Row(a1=[1, 2, None], b2=[[1], [None], [3]]), Row(a1=[1, 2, 3], b2=[[1], [2], [3]]), Row(a1=None, b2=None)] DF inferred from postgres: root |-- a1: array (nullable = true) ||-- element: integer (containsNull = true) |-- b2: array (nullable = true) ||-- element: integer (containsNull = true)<<< ** THIS IS WRONG Is an array of arrays. 
17/05/22 15:00:39 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101) at org.apache.spark.sql.catalyst.util.GenericArrayData.getInt(GenericArrayData.scala:62) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) was: When reading a table containing int[][] columns from postgres, the column is inferred as int[] (should be int[][]). from pyspark.sql import SQLContext
[jira] [Created] (SPARK-20846) Incorrect postgres sql schema inferred from table.
Stuart Reynolds created SPARK-20846: --- Summary: Incorrect postgres sql schema inferred from table. Key: SPARK-20846 URL: https://issues.apache.org/jira/browse/SPARK-20846 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Stuart Reynolds When reading a table containing int[][] columns from postgres, the column is inferred as int[] (should be int[][]). from pyspark.sql import SQLContext import pandas as pd from dataIngest.util.sqlUtil import asSQLAlchemyEngine user,password = ..., ... url = "postgresql://hostname:5432/dbname" url = 'jdbc:'+url properties = {'user': user, 'password': password} engine = ... sql alchemy engine ... # Create pandas df with int[] and int[][] df = pd.DataFrame({ 'a1': [[1,2,None],[1,2,3], None], 'b2': [[[1],[None],[3]], [[1],[2],[3]], None] }) # Store df into postgres as table _dfjunk with engine.connect().execution_options(autocommit=True) as con: con.execute(""" DROP TABLE IF EXISTS _dfjunk; CREATE TABLE _dfjunk ( a1 int[] NULL, b2 int[][] NULL ); """) df.to_sql("_dfjunk", con, index=None, if_exists="append") # Let's access via spark sc = get_spark_context(master="local") sqlContext = SQLContext(sc) print "pandas DF as spark DF:" df = sqlContext.createDataFrame(df) df.printSchema() df.show() df.registerTempTable("df") print sqlContext.sql("select * from df").collect() # Export _dfjunk as table df3 df3 = sqlContext.read.format("jdbc"). \ option("url", url). \ option("driver", "org.postgresql.Driver"). \ option("useUnicode", "true"). \ option("continueBatchOnError","true"). \ option("useSSL", "false"). \ option("user", user). \ option("password", password). \ option("dbtable", "_dfjunk").\ load() df3.registerTempTable("df3") print "DF inferred from postgres:" df3.printSchema() df3.show() print "DF queried from postgres:" df3 = sqlContext.sql("select * from df3") df3.printSchema() df3.show() print df3.collect() Errors out with: pandas DF as spark DF: root |-- a1: array (nullable = true) ||-- element: long (containsNull = true) |-- b2: array (nullable = true) ||-- element: array (containsNull = true) |||-- element: long (containsNull = true) <<< ** THIS IS CORRECT +++ | a1| b2| +++ |[1, 2, null]|[WrappedArray(1),...| | [1, 2, 3]|[WrappedArray(1),...| |null|null| +++ [Row(a1=[1, 2, None], b2=[[1], [None], [3]]), Row(a1=[1, 2, 3], b2=[[1], [2], [3]]), Row(a1=None, b2=None)] DF inferred from postgres: root |-- a1: array (nullable = true) ||-- element: integer (containsNull = true) |-- b2: array (nullable = true) ||-- element: integer (containsNull = true)<<< ** THIS IS WRONG Is an array of arrays. 
17/05/22 15:00:39 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101) at org.apache.spark.sql.catalyst.util.GenericArrayData.getInt(GenericArrayData.scala:62) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at
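To round out the text-cast workaround sketched earlier: a hypothetical parser that turns the two-level postgres array literal back into nested integer arrays on the Spark side. parse_int_2d and b2_text are illustrative names, not part of the ticket:
{code:python}
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# Hypothetical helper: parse '{{1},{NULL},{3}}' into [[1], [None], [3]].
def parse_int_2d(s):
    if s is None:
        return None
    rows = s.strip("{}").split("},{")
    return [[None if v == "NULL" else int(v) for v in row.split(",")]
            for row in rows]

parse_udf = udf(parse_int_2d, ArrayType(ArrayType(IntegerType())))
df3 = df3.withColumn("b2", parse_udf(df3["b2_text"]))  # restores int[][]
{code}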