[jira] [Comment Edited] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-18 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090159#comment-16090159
 ] 

Stuart Reynolds edited comment on SPARK-21392 at 7/18/17 5:30 PM:
--

So trying to look at the csv was helpful.

{code:none}
#root = "/network/folder/mi"  # succeeds
root = "mi"  # fails
rdd.write.parquet(root+"mi", mode="overwrite")
rdd.write.csv(root+"minn.csv", mode="overwrite")
rdd2 = sqlc.read.parquet(root+"mi")
{code}

The above creates a folder on my local machine, but no data.
{code:none}
% ls -la mi minn.csv
mi:
total 12
drwxrwxr-x 2 builder builder 4096 Jul 17 10:42 .
drwxrwxr-x 5 builder builder 4096 Jul 17 10:42 ..
-rw-r--r-- 1 builder builder    0 Jul 17 10:42 _SUCCESS
-rw-r--r-- 1 builder builder    8 Jul 17 10:42 ._SUCCESS.crc

minn.csv/:
total 12
drwxrwxr-x 2 builder builder 4096 Jul 17 10:42 .
drwxrwxr-x 5 builder builder 4096 Jul 17 10:42 ..
-rw-r--r-- 1 builder builder    0 Jul 17 10:42 _SUCCESS
-rw-r--r-- 1 builder builder    8 Jul 17 10:42 ._SUCCESS.crc
{code}

Prepending the paths with a network folder that's available to Spark succeeds.

So, is this just a "file not found error", with a terrible error message?
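For reference, a minimal sketch of the workaround implied above: qualify the output path with an explicit filesystem scheme (or an absolute path on storage every executor can reach), so the driver and the executors resolve the same location. The scheme and directory below are placeholders, not taken from the original report.

{code:none}
# Hypothetical example: write/read through a fully qualified URI instead of a
# bare relative name, so all nodes agree on where "mi" lives.
out = "hdfs:///user/someuser/mi"           # or e.g. "file:///network/folder/mi" for a shared mount
rdd.write.parquet(out, mode="overwrite")   # rdd is the DataFrame from the snippet above
rdd2 = sqlc.read.parquet(out)
rdd2.show()
{code}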


was (Author: stuartreynolds):
So trying to look at the csv was helpful.

{code:none}
#root = "/network/folder"  # succeeds
root = ""  # fails
rdd.write.parquet(root+"mi", mode="overwrite")
rdd.write.csv(root+"minn.csv", mode="overwrite")
rdd2 = sqlc.read.parquet(root+"mi")
{code}

The above creates a folder on my local machine, but no data.
{code:none}
% ls -la mi minn.csv
mi:
total 12
drwxrwxr-x 2 builder builder 4096 Jul 17 10:42 .
drwxrwxr-x 5 builder builder 4096 Jul 17 10:42 ..
-rw-r--r-- 1 builder builder    0 Jul 17 10:42 _SUCCESS
-rw-r--r-- 1 builder builder    8 Jul 17 10:42 ._SUCCESS.crc

minn.csv/:
total 12
drwxrwxr-x 2 builder builder 4096 Jul 17 10:42 .
drwxrwxr-x 5 builder builder 4096 Jul 17 10:42 ..
-rw-r--r-- 1 builder builder    0 Jul 17 10:42 _SUCCESS
-rw-r--r-- 1 builder builder    8 Jul 17 10:42 ._SUCCESS.crc
{code}

Prepending the paths with a network folder that's available to Spark succeeds.

So, is this just a "file not found error", with a terrible error message?

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1, 2.2.0
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works up until when I read in the parquet file.
> {code:none}
> import numpy as np
> import pandas as pd
> import pyspark
> from pyspark import SQLContext, SparkContext, SparkConf
> print pyspark.__version__
> sc = SparkContext(conf=SparkConf().setMaster('local'))
> df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
> print df
> sqlc = SQLContext(sc)
> df = sqlc.createDataFrame(df)
> df = df.createOrReplaceTempView("outcomes")
> rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
> print rdd.schema
> rdd.show()
> rdd.write.parquet("mi", mode="overwrite")
> rdd2 = sqlc.read.parquet("mi")  # FAIL!
> {code}
> {code:none}
> # print pyspark.__version__
> 2.2.0
> # print df
> eid  mi
> 0 0   0
> 1 1   1
> 2 2   2
> 3 3   3
> ...
> [100 rows x 2 columns]
> # print rdd.schema
> StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))
> # rdd.show()
> +---+---+
> |eid| mi|
> +---+---+
> |  0|  0|
> |  1|  1|
> |  2|  2|
> |  3|  3|
> |  4|  4|
> +---+---+
> {code}
> 
> fails with:
> {code:none}
> rdd2 = sqlc.read.parquet("mixx")
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", 
> line 291, in parquet
> return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
>   File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 
> 1133, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 
> 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It 
> must be specified manually.;'
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Works with master='local', but fails when my cluster is specified.






[jira] [Commented] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-18 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091854#comment-16091854
 ] 

Stuart Reynolds commented on SPARK-21392:
-

Okie dokey:

http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-save-parquet-file-td28874.html

I think there's still a bug here. (I suspect the filename given to the cluster 
can't be saved on the cluster, but in that case the write should fail, not the 
read, and the error should be different.)
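One way to separate the two failure modes (nothing written vs. written but unreadable) is to list the output directory through the Hadoop FileSystem API that Spark itself uses. A sketch, assuming sc is the running SparkContext; none of this comes from the original report:

{code:none}
# Hypothetical check: did the write actually produce part files?
jvm = sc._jvm
hadoop_conf = sc._jsc.hadoopConfiguration()
out_path = jvm.org.apache.hadoop.fs.Path("mi")
fs = out_path.getFileSystem(hadoop_conf)
statuses = fs.listStatus(out_path)
for i in range(len(statuses)):
    # A healthy Parquet output contains part-* files with non-zero length,
    # not just the _SUCCESS marker seen in the ls output above.
    status = statuses[i]
    print status.getPath().getName(), status.getLen()
{code}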

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1, 2.2.0
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works up until when I read in the parquet file.
> {code:none}
> import numpy as np
> import pandas as pd
> import pyspark
> from pyspark import SQLContext, SparkContext, SparkConf
> print pyspark.__version__
> sc = SparkContext(conf=SparkConf().setMaster('local'))
> df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
> print df
> sqlc = SQLContext(sc)
> df = sqlc.createDataFrame(df)
> df = df.createOrReplaceTempView("outcomes")
> rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
> print rdd.schema
> rdd.show()
> rdd.write.parquet("mi", mode="overwrite")
> rdd2 = sqlc.read.parquet("mi")  # FAIL!
> {code}
> {code:none}
> # print pyspark.__version__
> 2.2.0
> # print df
> eid  mi
> 0 0   0
> 1 1   1
> 2 2   2
> 3 3   3
> ...
> [100 rows x 2 columns]
> # print rdd.schema
> StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))
> # rdd.show()
> +---+---+
> |eid| mi|
> +---+---+
> |  0|  0|
> |  1|  1|
> |  2|  2|
> |  3|  3|
> |  4|  4|
> +---+---+
> {code}
> 
> fails with:
> {code:none}
> rdd2 = sqlc.read.parquet("mixx")
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", 
> line 291, in parquet
> return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
>   File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 
> 1133, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 
> 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It 
> must be specified manually.;'
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Works with master='local', but fails when my cluster is specified.






[jira] [Commented] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-17 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090159#comment-16090159
 ] 

Stuart Reynolds commented on SPARK-21392:
-

So trying to look at the csv was helpful.

{code:none}
#root = "/network/folder"  # succeeds
root = ""  # fails
rdd.write.parquet(root+"mi", mode="overwrite")
rdd.write.csv(root+"minn.csv", mode="overwrite")
rdd2 = sqlc.read.parquet(root+"mi")
{code}

The above creates a folder on my local machine, but no data.
{code:none}
% ls -la mi minn.csv
mi:
total 12
drwxrwxr-x 2 builder builder 4096 Jul 17 10:42 .
drwxrwxr-x 5 builder builder 4096 Jul 17 10:42 ..
-rw-r--r-- 1 builder builder    0 Jul 17 10:42 _SUCCESS
-rw-r--r-- 1 builder builder    8 Jul 17 10:42 ._SUCCESS.crc

minn.csv/:
total 12
drwxrwxr-x 2 builder builder 4096 Jul 17 10:42 .
drwxrwxr-x 5 builder builder 4096 Jul 17 10:42 ..
-rw-r--r-- 1 builder builder    0 Jul 17 10:42 _SUCCESS
-rw-r--r-- 1 builder builder    8 Jul 17 10:42 ._SUCCESS.crc
{code}

Prepending the paths with a network folder that's available to Spark succeeds.

So, is this just a "file not found error", with a terrible error message?

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1, 2.2.0
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works up until when I read in the parquet file.
> {code:none}
> import numpy as np
> import pandas as pd
> import pyspark
> from pyspark import SQLContext, SparkContext, SparkConf
> print pyspark.__version__
> sc = SparkContext(conf=SparkConf().setMaster('local'))
> df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
> print df
> sqlc = SQLContext(sc)
> df = sqlc.createDataFrame(df)
> df = df.createOrReplaceTempView("outcomes")
> rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
> print rdd.schema
> rdd.show()
> rdd.write.parquet("mi", mode="overwrite")
> rdd2 = sqlc.read.parquet("mi")  # FAIL!
> {code}
> {code:none}
> # print pyspark.__version__
> 2.2.0
> # print df
> eid  mi
> 0 0   0
> 1 1   1
> 2 2   2
> 3 3   3
> ...
> [100 rows x 2 columns]
> # print rdd.schema
> StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))
> # rdd.show()
> +---+---+
> |eid| mi|
> +---+---+
> |  0|  0|
> |  1|  1|
> |  2|  2|
> |  3|  3|
> |  4|  4|
> +---+---+
> {code}
> 
> fails with:
> {code:none}
> rdd2 = sqlc.read.parquet("mixx")
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", 
> line 291, in parquet
> return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
>   File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 
> 1133, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 
> 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It 
> must be specified manually.;'
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Works with master='local', but fails when my cluster is specified.






[jira] [Commented] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-17 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090134#comment-16090134
 ] 

Stuart Reynolds commented on SPARK-21392:
-

I've made the example self-contained and sourced the data from a pandas 
DataFrame.

It seems to succeed with master=local, and fails on the cluster (the cluster's 
dashboard says it's also Spark 2.2.0).



> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1, 2.2.0
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works up until when I read in the parquet file.
> {code:none}
> import numpy as np
> import pandas as pd
> import pyspark
> from pyspark import SQLContext, SparkContext, SparkConf
> print pyspark.__version__
> sc = SparkContext(conf=SparkConf().setMaster('local'))
> df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
> print df
> sqlc = SQLContext(sc)
> df = sqlc.createDataFrame(df)
> df = df.createOrReplaceTempView("outcomes")
> rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
> print rdd.schema
> rdd.show()
> rdd.write.parquet("mi", mode="overwrite")
> rdd2 = sqlc.read.parquet("mi")  # FAIL!
> {code}
> {code:none}
> # print pyspark.__version__
> 2.2.0
> # print df
> eid  mi
> 0 0   0
> 1 1   1
> 2 2   2
> 3 3   3
> ...
> [100 rows x 2 columns]
> # print rdd.schema
> StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))
> # rdd.show()
> +---+---+
> |eid| mi|
> +---+---+
> |  0|  0|
> |  1|  1|
> |  2|  2|
> |  3|  3|
> |  4|  4|
> +---+---+
> {code}
> 
> fails with:
> {code:none}
> rdd2 = sqlc.read.parquet("mixx")
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", 
> line 291, in parquet
> return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
>   File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 
> 1133, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 
> 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It 
> must be specified manually.;'
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Works with master='local', but fails when my cluster is specified.






[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-17 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works up until I read the parquet file back in.

{code:none}
import numpy as np
import pandas as pd
import pyspark
from pyspark import SQLContext, SparkContext, SparkConf

print pyspark.__version__
sc = SparkContext(conf=SparkConf().setMaster('local'))
df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
print df
sqlc = SQLContext(sc)
df = sqlc.createDataFrame(df)
df = df.createOrReplaceTempView("outcomes")
rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
print rdd.schema
rdd.show()
rdd.write.parquet("mi", mode="overwrite")
rdd2 = sqlc.read.parquet("mi")  # FAIL!
{code}

{code:none}
# print pyspark.__version__
2.2.0

# print df
eid  mi
0 0   0
1 1   1
2 2   2
3 3   3
...

[100 rows x 2 columns]

# print rdd.schema
StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))

# rdd.show()
+---+---+
|eid| mi|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
+---+---+
{code}

fails with:

{code:none}
rdd2 = sqlc.read.parquet("mixx")
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", line 
291, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 
1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 69, 
in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It 
must be specified manually.;'
{code}


The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Works with master='local', but fails when my cluster is specified.


  was:
The following boring code works up until I read the parquet file back in.

{code:none}
import numpy as np
import pandas as pd
import pyspark
from pyspark import SQLContext, SparkContext, SparkConf

print pyspark.__version__
sc = SparkContext(conf=SparkConf().setMaster('local'))
df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
print df
sqlc = SQLContext(sc)
df = sqlc.createDataFrame(df)
df = df.createOrReplaceTempView("outcomes")
rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
print rdd.schema
rdd.show()
rdd.write.parquet("mi", mode="overwrite")
rdd2 = sqlc.read.parquet("mi")
{code}

{code:none}
# print pyspark.__version__
2.2.0

# print df
eid  mi
0 0   0
1 1   1
2 2   2
3 3   3
---

[100 rows x 2 columns]

# print rdd.schema
StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))

# rdd.show()
+---+---+
|eid| mi|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
+---+---+
{code}

fails with:

{code:none}
rdd2 = sqlc.read.parquet("mixx")
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", line 
291, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 
1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 69, 
in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It 
must be specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Works with master='local', but fails when my cluster is specified.



> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1, 2.2.0
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works up until when I read in the parquet file.
> {code:none}
> import numpy as np
> import pandas as pd
> import pyspark
> from pyspark import SQLContext, SparkContext, SparkConf
> print pyspark.__version__
> sc = SparkContext(conf=SparkConf().setMaster('local'))
> df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
> print df
> sqlc = SQLContext(sc)
> df = sqlc.createDataFrame(df)
> df = 

[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-17 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works up until I read the parquet file back in.

{code:none}
import numpy as np
import pandas as pd
import pyspark
from pyspark import SQLContext, SparkContext, SparkConf

print pyspark.__version__
sc = SparkContext(conf=SparkConf().setMaster('local'))
df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
print df
sqlc = SQLContext(sc)
df = sqlc.createDataFrame(df)
df = df.createOrReplaceTempView("outcomes")
rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
print rdd.schema
rdd.show()
rdd.write.parquet("mi", mode="overwrite")
rdd2 = sqlc.read.parquet("mi")
{code}

{code:none}
# print pyspark.__version__
2.2.0

# print df
eid  mi
0 0   0
1 1   1
2 2   2
3 3   3
---

[100 rows x 2 columns]

# print rdd.schema
StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))

# rdd.show()
+---+---+
|eid| mi|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
+---+---+
{code}

fails with:

{code:none}
rdd2 = sqlc.read.parquet("mixx")
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", line 
291, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 
1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 69, 
in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It 
must be specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Works with master='local', but fails when my cluster is specified.


  was:
The following boring code works up until I read the parquet file back in.

{code:none}
response = "mi_or_chd_5"
sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom


rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>>
StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
rdd.show()
#+---+---+
#|eid|mi_or_chd_5|
#+---+---+
#|226|   null|
#|442|   null|
#|978|  0|
#|851|  0|
#|428|  0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the sql query. The whole 
selected table is 500k rows with an int and short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which 
claims it was fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)



> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1, 2.2.0
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works up until when I read in the parquet file.
> {code:none}
> import numpy as np
> import pandas as pd
> import pyspark
> from pyspark import SQLContext, SparkContext, SparkConf
> print pyspark.__version__
> sc = SparkContext(conf=SparkConf().setMaster('local'))
> df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
> print df
> sqlc = SQLContext(sc)
> df = sqlc.createDataFrame(df)
> df = df.createOrReplaceTempView("outcomes")
> rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
> print rdd.schema
> rdd.show()
> rdd.write.parquet("mi", mode="overwrite")
> rdd2 = sqlc.read.parquet("mi")
> {code}
> {code:none}
> # print pyspark.__version__
> 2.2.0
> # print df
> eid  mi
> 0 0   0
> 1 1   1
> 2 2   2
> 3 3   3
> ---
> [100 rows x 2 columns]
> # print rdd.schema
> StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))
> # rdd.show()
> +---+---+
> |eid| mi|

[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-13 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Affects Version/s: 2.2.0

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1, 2.2.0
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works up until when I read in the parquet file.
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|226|   null|
> #|442|   null|
> #|978|  0|
> #|851|  0|
> #|428|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the sql query. The whole 
> selected table is 500k rows with an int and short column.
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Comment Edited] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-13 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084837#comment-16084837
 ] 

Stuart Reynolds edited comment on SPARK-21392 at 7/13/17 4:55 PM:
--

I've simplified the example a little -more and also found that limiting the 
query size to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, 
it fails-.

In case it helps, my utility function get_sparkSQLContextWithTables loaded the 
full table 'outcomes' from Postgres into Spark with 10 partitions, using:
{code:none}
index="eid"
index_min=min(eid)
index_max=max(eid)
{code}
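For context, here is a guess at roughly what that helper does, expressed with the public JDBC reader; the connection URL, credentials, and bounds below are placeholders, not the helper's actual internals:

{code:none}
# Hypothetical sketch of a partitioned JDBC load similar to
# get_sparkSQLContextWithTables; URL, credentials and bounds are made up.
eid_min, eid_max = 0, 500000   # assumed to be looked up from the table first
outcomes = sqlc.read.jdbc(
    url="jdbc:postgresql://dbhost:5432/mydb",
    table="outcomes",
    column="eid",              # index="eid"
    lowerBound=eid_min,        # index_min=min(eid)
    upperBound=eid_max,        # index_max=max(eid)
    numPartitions=10,
    properties={"user": "spark", "password": "..."})
outcomes.createOrReplaceTempView("outcomes")
{code}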


was (Author: stuartreynolds):
I've simplified the example a little more and also found that limiting the 
query size to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, 
it fails.

In case it helps, my utility function get_sparkSQLContextWithTables loaded the 
full table 'outcomes' from Postgres into Spark with 10 partitions, using:
{code:none}
index="eid"
index_min=min(eid)
index_max=max(eid)
{code}

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|226|   null|
> #|442|   null|
> #|978|  0|
> #|851|  0|
> #|428|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the sql query. The whole 
> selected table is 500k rows with an int and short column.
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-13 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works up until I read the parquet file back in.

{code:none}
response = "mi_or_chd_5"
sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom


rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>>
StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
rdd.show()
#+---+---+
#|eid|mi_or_chd_5|
#+---+---+
#|226|   null|
#|442|   null|
#|978|  0|
#|851|  0|
#|428|  0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the sql query. The whole 
selected table is 500k rows with an int and short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which 
claims it was fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)


  was:
The following boring code works

{code:none}
response = "mi_or_chd_5"
sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom


rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>>
StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
rdd.show()
#+---+---+
#|eid|mi_or_chd_5|
#+---+---+
#|226|   null|
#|442|   null|
#|978|  0|
#|851|  0|
#|428|  0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the sql query. The whole 
selected table is 500k rows with an int and short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which 
claims it was fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)



> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works up until when I read in the parquet file.
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|226|   null|
> #|442|   null|
> #|978|  0|
> #|851|  0|
> #|428|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the sql query. The whole 
> selected table is 500k rows with an int and short column.
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)




[jira] [Commented] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-13 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085987#comment-16085987
 ] 

Stuart Reynolds commented on SPARK-21392:
-

My bad -- I may have missed the error message. 
I upgraded to Spark 2.2.0 today and re-ran this; now I get the error no matter 
how many rows I select.

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|226|   null|
> #|442|   null|
> #|978|  0|
> #|851|  0|
> #|428|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the sql query. The whole 
> selected table is 500k rows with an int and short column.
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Comment Edited] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084837#comment-16084837
 ] 

Stuart Reynolds edited comment on SPARK-21392 at 7/12/17 10:27 PM:
---

I've simplified the example a little more and also found that limiting the 
query size to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, 
it fails.

In case it helps, my utility function get_sparkSQLContextWithTables loaded the 
full table 'outcomes' from Postgres into Spark with 10 partitions, using:
{code:none}
index="eid"
index_min=min(eid)
index_max=max(eid)
{code}


was (Author: stuartreynolds):
I've simplified the example a little more and also found that limiting the 
query size to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, 
it fails.

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|216|   null|
> #|431|   null|
> #|978|  0|
> #|852|  0|
> #|418|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the sql query. The whole 
> selected table is 500k rows with an int and short column.
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Commented] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084837#comment-16084837
 ] 

Stuart Reynolds commented on SPARK-21392:
-

I've simplified the example a little more and also found that limiting the 
query size to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, 
it fails.

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|216|   null|
> #|431|   null|
> #|978|  0|
> #|852|  0|
> #|418|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the sql query. The whole 
> selected table is 500k rows with an int and short column.
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:none}
response = "mi_or_chd_5"
sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom


rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>>
StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
rdd.show()
#+---+---+
#|eid|mi_or_chd_5|
#+---+---+
#|226|   null|
#|442|   null|
#|978|  0|
#|851|  0|
#|428|  0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the sql query. The whole 
selected table is 500k rows with an int and short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which 
claims it was fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)


  was:
The following boring code works

{code:none}
response = "mi_or_chd_5"
sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom


rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>>
StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
rdd.show()
#+---+---+
#|eid|mi_or_chd_5|
#+---+---+
#|216|   null|
#|431|   null|
#|978|  0|
#|852|  0|
#|418|  0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the sql query. The whole 
selected table is 500k rows with an int and short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which 
claims it was fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)



> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|226|   null|
> #|442|   null|
> #|978|  0|
> #|851|  0|
> #|428|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the sql query. The whole 
> selected table is 500k rows with an int and short column.
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Comment Edited] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084837#comment-16084837
 ] 

Stuart Reynolds edited comment on SPARK-21392 at 7/12/17 10:27 PM:
---

I've simplified the example a little more and also found that limiting the 
query size to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, 
it fails.

In case it helps, my utility function get_sparkSQLContextWithTables loaded the 
full table 'outcomes' from Postgres into Spark with 10 partitions, using:
{code:none}
index="eid"
index_min=min(eid)
index_max=max(eid)
{code}


was (Author: stuartreynolds):
I've simplified the example a little more and also found that limiting the 
query size to 100 rows succeeds, whereas if I select all 500k rows * 2 columns, 
it fails.

In case it helps, my utility function get_sparkSQLContextWithTables loaded the 
full table 'outcomes' from Postgres into Spark with 10 partitions, using:
{code:none}
index="eid"
index_min=min(eid)
index_max=max(eid)
{code}

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|216|   null|
> #|431|   null|
> #|978|  0|
> #|852|  0|
> #|418|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the sql query. The whole 
> selected table is 500k rows with an int and short column.
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Updated] (SPARK-21392) Unable to infer schema when loading large Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:none}
response = "mi_or_chd_5"
sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom


rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>>
StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
rdd.show()
#+---+---+
#|eid|mi_or_chd_5|
#+---+---+
#|216|   null|
#|431|   null|
#|978|  0|
#|852|  0|
#|418|  0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the sql query. The whole 
selected table is 500k rows with an int and short column.

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which 
claims it was fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)


  was:
The following boring code works

{code:none}
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:none}
outcome2 = sqlc.read.parquet(response)  # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which 
claims it was fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)


Summary: Unable to infer schema when loading large Parquet file  (was: 
Unable to infer schema when loading Parquet file)

> Unable to infer schema when loading large Parquet file
> --
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+---+
> #|eid|mi_or_chd_5|
> #+---+---+
> #|216|   null|
> #|431|   null|
> #|978|  0|
> #|852|  0|
> #|418|  0|
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
> 
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the sql query. The whole 
> selected table is 500k rows with an int and short column.
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Commented] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084601#comment-16084601
 ] 

Stuart Reynolds commented on SPARK-21392:
-

Done.

I'm simply trying to build a table of two columns ("eid" and whatever the 
response variable is named), save the table as a parquet file (with the same 
name as the second column), and load it back up.

Saving works but loading fails, complaining it can't infer the schema, even 
though I could print the schema before saving it:
{code:none}
   StructType(
   List(
StructField(eid,IntegerType,true),
StructField(response,ShortType,true)))
{code}
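For what it's worth, the reader can also be given that schema explicitly, which sidesteps inference (though it doesn't explain why inference fails). A sketch, assuming the same sqlc and response variables used above:

{code:none}
from pyspark.sql.types import StructType, StructField, IntegerType, ShortType

# Hypothetical workaround: supply the schema that was printed before saving,
# instead of letting the reader infer it from the (apparently empty) output.
schema = StructType([
    StructField("eid", IntegerType(), True),
    StructField("response", ShortType(), True)])
outcome2 = sqlc.read.schema(schema).parquet(response)
{code}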

> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> {code}
> 
> But then,
> {code:none}
> outcome2 = sqlc.read.parquet(response)  # fail
> {code}
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)






[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:none}
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:none}
outcome2 = sqlc.read.parquet(response)  # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)


  was:
The following boring code works

{code:python}
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:python}
outcome2 = sqlc.read.parquet(response)  # fail
{code}

fails with:

{code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
be specified manually.;'
{code}

in 

{code:python} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> {code}
> 
> But then,
> {code:none}
> outcome2 = sqlc.read.parquet(response)  # fail
> {code}
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:none}
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:none}
outcome2 = sqlc.read.parquet(response)  # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)


  was:
The following boring code works

{code:none}
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:none}
outcome2 = sqlc.read.parquet(response)  # fail
{code}

fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:none}
> response = "mi_or_chd_5"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> {code}
> 
> But then,
> {code:none}
> outcome2 = sqlc.read.parquet(response)  # fail
> {code}
> fails with:
> {code:none}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:python}
response = "mi_or_chd_5"
colname = "f_1000"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

col = sqlc.sql("""select eid,{colname} as {colname}
from baseline_denull
where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

>>> print col.schema

StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
{code}

But then,
{code:python}
outcome2 = sqlc.read.parquet(response)  # fail
col2 = sqlc.read.parquet(colname) # fail
{code}

fails with:

{code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
be specified manually.;'
{code}

in 

{code:python} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)


  was:
The following boring code works

{{response = "mi_or_chd_5"
colname = "f_1000"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

col = sqlc.sql("""select eid,{colname} as {colname}
from baseline_denull
where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

>>> print col.schema

StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
}}.

But then,
{{
outcome2 = sqlc.read.parquet(response)  # fail
col2 = sqlc.read.parquet(colname) # fail
}}.

fails with:

{{AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
}}.

in 

{{
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
}}.

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:python}
>response = "mi_or_chd_5"
> colname = "f_1000"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> col = sqlc.sql("""select eid,{colname} as {colname}
> from baseline_denull
> where {colname} IS NOT NULL""".format(colname=colname))
> col.write.parquet(colname, mode="overwrite")
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> >>> print col.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
> {code}
> 
> But then,
> {code:python}
> outcome2 = sqlc.read.parquet(response)  # fail
> col2 = sqlc.read.parquet(colname) # fail
> {code}
> fails with:
> {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:python} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



--
This 

[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:python}
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:python}
outcome2 = sqlc.read.parquet(response)  # fail
{code}

fails with:

{code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
be specified manually.;'
{code}

in 

{code:python} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)


  was:
The following boring code works

{code:python}
response = "mi_or_chd_5"
colname = "f123"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

col = sqlc.sql("""select eid,{colname} as {colname}
from baseline_denull
where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

>>> print col.schema

StructType(List(StructField(eid,IntegerType,true),StructField(f123,DoubleType,true)))
{code}

But then,
{code:python}
outcome2 = sqlc.read.parquet(response)  # fail
col2 = sqlc.read.parquet(colname) # fail
{code}

fails with:

{code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
be specified manually.;'
{code}

in 

{code:python} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:python}
> response = "mi_or_chd_5"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> {code}
> 
> But then,
> {code:python}
> outcome2 = sqlc.read.parquet(response)  # fail
> {code}
> fails with:
> {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:python} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:

Description: 
The following boring code works

{code:python}
response = "mi_or_chd_5"
colname = "f123"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

col = sqlc.sql("""select eid,{colname} as {colname}
from baseline_denull
where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

>>> print col.schema

StructType(List(StructField(eid,IntegerType,true),StructField(f123,DoubleType,true)))
{code}

But then,
{code:python}
outcome2 = sqlc.read.parquet(response)  # fail
col2 = sqlc.read.parquet(colname) # fail
{code}

fails with:

{code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
be specified manually.;'
{code}

in 

{code:python} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)


  was:
The following boring code works

{code:python}
   response = "mi_or_chd_5"
colname = "f_1000"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

col = sqlc.sql("""select eid,{colname} as {colname}
from baseline_denull
where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

>>> print col.schema

StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
{code}

But then,
{code:python}
outcome2 = sqlc.read.parquet(response)  # fail
col2 = sqlc.read.parquet(colname) # fail
{code}

fails with:

{code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
be specified manually.;'
{code}

in 

{code:python} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



> Unable to infer schema when loading Parquet file
> 
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
>Reporter: Stuart Reynolds
>  Labels: parquet, pyspark
>
> The following boring code works
> {code:python}
> response = "mi_or_chd_5"
> colname = "f123"
> outcome = sqlc.sql("""select eid,{response} as response
> from outcomes
> where {response} IS NOT NULL""".format(response=response))
> outcome.write.parquet(response, mode="overwrite")
> 
> col = sqlc.sql("""select eid,{colname} as {colname}
> from baseline_denull
> where {colname} IS NOT NULL""".format(colname=colname))
> col.write.parquet(colname, mode="overwrite")
> >>> print outcome.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> >>> print col.schema
> 
> StructType(List(StructField(eid,IntegerType,true),StructField(f123,DoubleType,true)))
> {code}
> 
> But then,
> {code:python}
> outcome2 = sqlc.read.parquet(response)  # fail
> col2 = sqlc.read.parquet(colname) # fail
> {code}
> fails with:
> {code:python}AnalysisException: u'Unable to infer schema for Parquet. It must 
> be specified manually.;'
> {code}
> in 
> {code:python} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but 
> which claims it was fixed in 

[jira] [Created] (SPARK-21392) Unable to infer schema when loading Parquet file

2017-07-12 Thread Stuart Reynolds (JIRA)
Stuart Reynolds created SPARK-21392:
---

 Summary: Unable to infer schema when loading Parquet file
 Key: SPARK-21392
 URL: https://issues.apache.org/jira/browse/SPARK-21392
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.1
 Environment: Spark 2.1.1. python 2.7.6

Reporter: Stuart Reynolds


The following boring code works

{{response = "mi_or_chd_5"
colname = "f_1000"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

col = sqlc.sql("""select eid,{colname} as {colname}
from baseline_denull
where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")

>>> print outcome.schema

StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

>>> print col.schema

StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
}}.

But then,
{{
outcome2 = sqlc.read.parquet(response)  # fail
col2 = sqlc.read.parquet(colname) # fail
}}.

fails with:

{{AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
}}.

in 

{{
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
}}.

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20847) Error reading NULL int[] element from postgres -- null pointer exception.

2017-05-22 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-20847:

Description: 
-- maybe fixed already? 
https://github.com/apache/spark/commit/f174cdc7478d0b81f9cfa896284a5ec4c6bb952d

{code:python}
def query_int_array():
import pandas as pd
import sqlalchemy  # needed for create_engine below
from pyspark.sql import SQLContext

user,password = ... , 
hostname = 
dbName = ...

url = "jdbc:postgresql://{hostname}:5432/{dbName}".format(**locals())
properties = {'user': user, 'password': password}

sql_create = """DROP TABLE IF EXISTS public._df10;
CREATE TABLE IF NOT EXISTS public._df10 (
id  integer,
f_21 integer[]
);

INSERT INTO public._df10(id, f_21) VALUES
 (1, ARRAY[1,2])        --OK
,(2, ARRAY[3,NULL])     --OK
,(3, NULL)              --FAIL   *< PROBLEM
;"""

engine = 
sqlalchemy.create_engine('postgresql+psycopg2://{user}:{password}@{hostname}:5432/{dbName}'.format(**locals()))
with engine.connect().execution_options(autocommit=True) as con:
con.execute(sql_create)


# Export postgres _df10 to spark as table df10
sc = get_spark_context(master="local")
sqlContext = SQLContext(sc)
df10 = sqlContext.read.format("jdbc"). \
option("url", url). \
option("driver", "org.postgresql.Driver"). \
option("useUnicode", "true"). \
option("continueBatchOnError","true"). \
option("useSSL", "false"). \
option("user", user). \
option("password", password). \
option("dbtable", "_df10"). \
load()
df10.registerTempTable("df10")

print "DF inferred from postgres:"
df10.printSchema()
df10.show()

print "DF queried from postgres:"
df10 = sqlContext.sql("select * from df10")
df10.printSchema()
df10.show()
print df10.collect()
{code}

Explodes with:

{noformat}

DF inferred from postgres:
root
 |-- id: integer (nullable = true)
 |-- f_21: array (nullable = true)
 ||-- element: integer (containsNull = true)

17/05/22 15:46:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:427)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:425)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/22 15:46:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
localhost, executor driver): java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:427)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:425)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
at 
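
A possible workaround sketch, assuming the NULL row (id=3) is what triggers the NPE: have Postgres replace NULL arrays before Spark sees them, by pointing the JDBC source at a subquery instead of the raw table.

{code:python}
# Sketch only: COALESCE NULL int[] values to empty arrays on the Postgres
# side, so the JDBC getter never receives a NULL array. Reuses url, user,
# password and sqlContext from the repro above.
df10 = sqlContext.read.format("jdbc") \
    .option("url", url) \
    .option("driver", "org.postgresql.Driver") \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable",
            "(SELECT id, COALESCE(f_21, ARRAY[]::integer[]) AS f_21 "
            "FROM public._df10) AS t") \
    .load()
{code}

Note this changes the data (NULL becomes an empty array), so it is a stopgap rather than a fix.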

[jira] [Created] (SPARK-20847) Error reading NULL int[] element from postgres -- null pointer exception.

2017-05-22 Thread Stuart Reynolds (JIRA)
Stuart Reynolds created SPARK-20847:
---

 Summary: Error reading NULL int[] element from postgres -- null 
pointer exception.
 Key: SPARK-20847
 URL: https://issues.apache.org/jira/browse/SPARK-20847
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Stuart Reynolds



{code:python}
def query_int_array():
import pandas as pd
import sqlalchemy  # needed for create_engine below
from pyspark.sql import SQLContext

user,password = ... , 
hostname = 
dbName = ...

url = "jdbc:postgresql://{hostname}:5432/{dbName}".format(**locals())
properties = {'user': user, 'password': password}

sql_create = """DROP TABLE IF EXISTS public._df10;
CREATE TABLE IF NOT EXISTS public._df10 (
id  integer,
f_21 integer[]
);

INSERT INTO public._df10(id, f_21) VALUES
 (1, ARRAY[1,2])        --OK
,(2, ARRAY[3,NULL])     --OK
,(3, NULL)              --FAIL   *< PROBLEM
;"""

engine = 
sqlalchemy.create_engine('postgresql+psycopg2://{user}:{password}@{hostname}:5432/{dbName}'.format(**locals()))
with engine.connect().execution_options(autocommit=True) as con:
con.execute(sql_create)


# Export postgres _df10 to spark as table df10
sc = get_spark_context(master="local")
sqlContext = SQLContext(sc)
df10 = sqlContext.read.format("jdbc"). \
option("url", url). \
option("driver", "org.postgresql.Driver"). \
option("useUnicode", "true"). \
option("continueBatchOnError","true"). \
option("useSSL", "false"). \
option("user", user). \
option("password", password). \
option("dbtable", "_df10"). \
load()
df10.registerTempTable("df10")

print "DF inferred from postgres:"
df10.printSchema()
df10.show()

print "DF queried from postgres:"
df10 = sqlContext.sql("select * from df10")
df10.printSchema()
df10.show()
print df10.collect()
{code}

Explodes with:

{noformat}

DF inferred from postgres:
root
 |-- id: integer (nullable = true)
 |-- f_21: array (nullable = true)
 ||-- element: integer (containsNull = true)

17/05/22 15:46:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:427)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:425)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/22 15:46:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
localhost, executor driver): java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:427)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:425)
at 

[jira] [Updated] (SPARK-20846) Incorrect posgres sql array column schema inferred from table.

2017-05-22 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-20846:

Summary: Incorrect posgres sql array column schema inferred from table.  
(was: Incorrect posgres sql schema inferred from table.)

> Incorrect posgres sql array column schema inferred from table.
> --
>
> Key: SPARK-20846
> URL: https://issues.apache.org/jira/browse/SPARK-20846
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Stuart Reynolds
>
> When reading a table containing int[][] columns from postgres, the column is 
> inferred as int[] (should be int[][]).
> {code:python}
> from pyspark.sql import SQLContext
> import pandas as pd
> from dataIngest.util.sqlUtil import asSQLAlchemyEngine
> user,password = ..., ...
> url = "postgresql://hostname:5432/dbname"
> url = 'jdbc:'+url
> properties = {'user': user, 'password': password}
> engine = ... sql alchemy engine ...
> # Create pandas df with int[] and int[][]
> df = pd.DataFrame({
> 'a1': [[1,2,None],[1,2,3], None],
> 'b2':  [[[1],[None],[3]], [[1],[2],[3]], None]
> })
> # Store df into postgres as table _dfjunk
> with engine.connect().execution_options(autocommit=True) as con:
> con.execute("""
> DROP TABLE IF EXISTS _dfjunk;
> 
> CREATE TABLE _dfjunk (
>   a1 int[] NULL,
>   b2 int[][] NULL
> );
> """)
> df.to_sql("_dfjunk", con, index=None, if_exists="append")
> # Let's access via spark
> sc = get_spark_context(master="local")
> sqlContext = SQLContext(sc)
> print "pandas DF as spark DF:"
> df = sqlContext.createDataFrame(df)
> df.printSchema()
> df.show()
> df.registerTempTable("df")
> print sqlContext.sql("select * from df").collect()
> ### Export _dfjunk as table df3
> df3 = sqlContext.read.format("jdbc"). \
> option("url", url). \
> option("driver", "org.postgresql.Driver"). \
> option("useUnicode", "true"). \
> option("continueBatchOnError","true"). \
> option("useSSL", "false"). \
> option("user", user). \
> option("password", password). \
> option("dbtable", "_dfjunk").\
> load()
> df3.registerTempTable("df3")
> print "DF inferred from postgres:"
> df3.printSchema()
> df3.show()
> print "DF queried from postgres:"
> df3 = sqlContext.sql("select * from df3")
> df3.printSchema()
> df3.show()
> print df3.collect()
> {code}
> Errors out with:
> pandas DF as spark DF:
> {noformat}
> root
>  |-- a1: array (nullable = true)
>  ||-- element: long (containsNull = true)
>  |-- b2: array (nullable = true)
>  ||-- element: array (containsNull = true)
>  |||-- element: long (containsNull = true)  <<< ** THIS IS 
> CORRECT 
> +++
> |  a1|  b2|
> +++
> |[1, 2, null]|[WrappedArray(1),...|
> |   [1, 2, 3]|[WrappedArray(1),...|
> |null|null|
> +++
> [Row(a1=[1, 2, None], b2=[[1], [None], [3]]), Row(a1=[1, 2, 3], b2=[[1], [2], 
> [3]]), Row(a1=None, b2=None)]
> DF inferred from postgres:
> root
>  |-- a1: array (nullable = true)
>  ||-- element: integer (containsNull = true)
>  |-- b2: array (nullable = true)
>  ||-- element: integer (containsNull = true)<<< ** THIS IS 
> WRONG Is an array of arrays.
> 17/05/22 15:00:39 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to 
> java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getInt(GenericArrayData.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 

[jira] [Updated] (SPARK-20846) Incorrect posgres sql schema inferred from table.

2017-05-22 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-20846:

Description: 
When reading a table containing int[][] columns from postgres, the column is 
inferred as int[] (should be int[][]).

{code:python}
from pyspark.sql import SQLContext
import pandas as pd
from dataIngest.util.sqlUtil import asSQLAlchemyEngine


user,password = ..., ...
url = "postgresql://hostname:5432/dbname"
url = 'jdbc:'+url
properties = {'user': user, 'password': password}
engine = ... sql alchemy engine ...

# Create pandas df with int[] and int[][]
df = pd.DataFrame({
'a1': [[1,2,None],[1,2,3], None],
'b2':  [[[1],[None],[3]], [[1],[2],[3]], None]
})

# Store df into postgres as table _dfjunk
with engine.connect().execution_options(autocommit=True) as con:
con.execute("""
DROP TABLE IF EXISTS _dfjunk;

CREATE TABLE _dfjunk (
  a1 int[] NULL,
  b2 int[][] NULL
);
""")
df.to_sql("_dfjunk", con, index=None, if_exists="append")

# Let's access via spark
sc = get_spark_context(master="local")
sqlContext = SQLContext(sc)

print "pandas DF as spark DF:"
df = sqlContext.createDataFrame(df)
df.printSchema()
df.show()
df.registerTempTable("df")
print sqlContext.sql("select * from df").collect()

### Export _dfjunk as table df3
df3 = sqlContext.read.format("jdbc"). \
option("url", url). \
option("driver", "org.postgresql.Driver"). \
option("useUnicode", "true"). \
option("continueBatchOnError","true"). \
option("useSSL", "false"). \
option("user", user). \
option("password", password). \
option("dbtable", "_dfjunk").\
load()
df3.registerTempTable("df3")

print "DF inferred from postgres:"
df3.printSchema()
df3.show()

print "DF queried from postgres:"
df3 = sqlContext.sql("select * from df3")
df3.printSchema()
df3.show()
print df3.collect()
{code}

Errors out with:

pandas DF as spark DF:

{noformat}
root
 |-- a1: array (nullable = true)
 ||-- element: long (containsNull = true)
 |-- b2: array (nullable = true)
 ||-- element: array (containsNull = true)
 |||-- element: long (containsNull = true)  <<< ** THIS IS CORRECT 


+++
|  a1|  b2|
+++
|[1, 2, null]|[WrappedArray(1),...|
|   [1, 2, 3]|[WrappedArray(1),...|
|null|null|
+++

[Row(a1=[1, 2, None], b2=[[1], [None], [3]]), Row(a1=[1, 2, 3], b2=[[1], [2], 
[3]]), Row(a1=None, b2=None)]
DF inferred from postgres:
root
 |-- a1: array (nullable = true)
 ||-- element: integer (containsNull = true)
 |-- b2: array (nullable = true)
 ||-- element: integer (containsNull = true)    <<< ** THIS IS WRONG: the data is an array of arrays.

17/05/22 15:00:39 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to 
java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at 
org.apache.spark.sql.catalyst.util.GenericArrayData.getInt(GenericArrayData.scala:62)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}
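
To make the mismatch concrete, a small sketch of the element types involved, going by the printSchema output above: b2 should come back as a nested array, but the JDBC source infers a flat one.

{code:python}
from pyspark.sql.types import ArrayType, IntegerType

# What a postgres int[][] column should map to (array of arrays)...
expected_b2 = ArrayType(ArrayType(IntegerType(), containsNull=True), containsNull=True)

# ...versus what the JDBC source actually infers (flat int array), which is
# why elements later fail the cast to java.lang.Integer.
inferred_b2 = ArrayType(IntegerType(), containsNull=True)
{code}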

  was:
When reading a table containing int[][] columns from postgres, the column is 
inferred as int[] (should be int[][]).

from 

[jira] [Updated] (SPARK-20846) Incorrect posgres sql schema inferred from table.

2017-05-22 Thread Stuart Reynolds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-20846:

Description: 
When reading a table containing int[][] columns from postgres, the column is 
inferred as int[] (should be int[][]).

from pyspark.sql import SQLContext
import pandas as pd
from dataIngest.util.sqlUtil import asSQLAlchemyEngine


user,password = ..., ...
url = "postgresql://hostname:5432/dbname"
url = 'jdbc:'+url
properties = {'user': user, 'password': password}
engine = ... sql alchemy engine ...

# Create pandas df with int[] and int[][]
df = pd.DataFrame({
'a1': [[1,2,None],[1,2,3], None],
'b2':  [[[1],[None],[3]], [[1],[2],[3]], None]
})

### Store df into postgres as table _dfjunk
with engine.connect().execution_options(autocommit=True) as con:
con.execute("""
DROP TABLE IF EXISTS _dfjunk;

CREATE TABLE _dfjunk (
  a1 int[] NULL,
  b2 int[][] NULL
);
""")
df.to_sql("_dfjunk", con, index=None, if_exists="append")

### Let's access via spark
sc = get_spark_context(master="local")
sqlContext = SQLContext(sc)

print "pandas DF as spark DF:"
df = sqlContext.createDataFrame(df)
df.printSchema()
df.show()
df.registerTempTable("df")
print sqlContext.sql("select * from df").collect()

### Export _dfjunk as table df3
df3 = sqlContext.read.format("jdbc"). \
option("url", url). \
option("driver", "org.postgresql.Driver"). \
option("useUnicode", "true"). \
option("continueBatchOnError","true"). \
option("useSSL", "false"). \
option("user", user). \
option("password", password). \
option("dbtable", "_dfjunk").\
load()
df3.registerTempTable("df3")

print "DF inferred from postgres:"
df3.printSchema()
df3.show()

print "DF queried from postgres:"
df3 = sqlContext.sql("select * from df3")
df3.printSchema()
df3.show()
print df3.collect()


Errors out with:

pandas DF as spark DF:
root
 |-- a1: array (nullable = true)
 ||-- element: long (containsNull = true)
 |-- b2: array (nullable = true)
 ||-- element: array (containsNull = true)
 |||-- element: long (containsNull = true)  <<< ** THIS IS CORRECT 


+++
|  a1|  b2|
+++
|[1, 2, null]|[WrappedArray(1),...|
|   [1, 2, 3]|[WrappedArray(1),...|
|null|null|
+++

[Row(a1=[1, 2, None], b2=[[1], [None], [3]]), Row(a1=[1, 2, 3], b2=[[1], [2], 
[3]]), Row(a1=None, b2=None)]
DF inferred from postgres:
root
 |-- a1: array (nullable = true)
 ||-- element: integer (containsNull = true)
 |-- b2: array (nullable = true)
 ||-- element: integer (containsNull = true)<<< ** THIS IS 
WRONG Is an array of arrays.

17/05/22 15:00:39 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to 
java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at 
org.apache.spark.sql.catalyst.util.GenericArrayData.getInt(GenericArrayData.scala:62)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

  was:
When reading a table containing int[][] columns from postgres, the column is 
inferred as int[] (should be int[][]).

from pyspark.sql import SQLContext

[jira] [Created] (SPARK-20846) Incorrect posgres sql schema inferred from table.

2017-05-22 Thread Stuart Reynolds (JIRA)
Stuart Reynolds created SPARK-20846:
---

 Summary: Incorrect posgres sql schema inferred from table.
 Key: SPARK-20846
 URL: https://issues.apache.org/jira/browse/SPARK-20846
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Stuart Reynolds


When reading a table containing int[][] columns from postgres, the column is 
inferred as int[] (should be int[][]).

from pyspark.sql import SQLContext
import pandas as pd
from dataIngest.util.sqlUtil import asSQLAlchemyEngine


user,password = ..., ...
url = "postgresql://hostname:5432/dbname"
url = 'jdbc:'+url
properties = {'user': user, 'password': password}
engine = ... sql alchemy engine ...

# Create pandas df with int[] and int[][]
df = pd.DataFrame({
'a1': [[1,2,None],[1,2,3], None],
'b2':  [[[1],[None],[3]], [[1],[2],[3]], None]
})

# Store df into postgres as table _dfjunk
with engine.connect().execution_options(autocommit=True) as con:
con.execute("""
DROP TABLE IF EXISTS _dfjunk;

CREATE TABLE _dfjunk (
  a1 int[] NULL,
  b2 int[][] NULL
);
""")
df.to_sql("_dfjunk", con, index=None, if_exists="append")

# Let's access via spark
sc = get_spark_context(master="local")
sqlContext = SQLContext(sc)

print "pandas DF as spark DF:"
df = sqlContext.createDataFrame(df)
df.printSchema()
df.show()
df.registerTempTable("df")
print sqlContext.sql("select * from df").collect()

# Export _dfjunk as table df3
df3 = sqlContext.read.format("jdbc"). \
option("url", url). \
option("driver", "org.postgresql.Driver"). \
option("useUnicode", "true"). \
option("continueBatchOnError","true"). \
option("useSSL", "false"). \
option("user", user). \
option("password", password). \
option("dbtable", "_dfjunk").\
load()
df3.registerTempTable("df3")

print "DF inferred from postgres:"
df3.printSchema()
df3.show()

print "DF queried from postgres:"
df3 = sqlContext.sql("select * from df3")
df3.printSchema()
df3.show()
print df3.collect()


Errors out with:

pandas DF as spark DF:
root
 |-- a1: array (nullable = true)
 ||-- element: long (containsNull = true)
 |-- b2: array (nullable = true)
 ||-- element: array (containsNull = true)
 |||-- element: long (containsNull = true)  <<< ** THIS IS CORRECT 


+++
|  a1|  b2|
+++
|[1, 2, null]|[WrappedArray(1),...|
|   [1, 2, 3]|[WrappedArray(1),...|
|null|null|
+++

[Row(a1=[1, 2, None], b2=[[1], [None], [3]]), Row(a1=[1, 2, 3], b2=[[1], [2], 
[3]]), Row(a1=None, b2=None)]
DF inferred from postgres:
root
 |-- a1: array (nullable = true)
 ||-- element: integer (containsNull = true)
 |-- b2: array (nullable = true)
 ||-- element: integer (containsNull = true)<<< ** THIS IS 
WRONG Is an array of arrays.

17/05/22 15:00:39 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to 
java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at 
org.apache.spark.sql.catalyst.util.GenericArrayData.getInt(GenericArrayData.scala:62)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at