[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-12 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419755#comment-15419755 ]

Apache Spark commented on SPARK-16975:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14627

> Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
> --
>
> Key: SPARK-16975
> URL: https://issues.apache.org/jira/browse/SPARK-16975
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Ubuntu Linux 14.04
>Reporter: immerrr again
>Assignee: Dongjoon Hyun
>  Labels: parquet
> Fix For: 2.0.1, 2.1.0
>
>
> Spark-2.0.0 seems to have some problems reading a parquet dataset generated 
> by 1.6.2. 
> {code}
> In [80]: spark.read.parquet('/path/to/data')
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data. It must be specified manually;'
> {code}
> The dataset is ~150G and partitioned by the _locality_code column. None of
> the partitions are empty. I have narrowed the failing dataset down to the
> first 32 partitions of the data:
> {code}
> In [82]: spark.read.parquet(*subdirs[:32])
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be 
> specified manually;'
> {code}
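> (For reference, {{subdirs}} is the list of partition directories under the
> dataset root; a minimal sketch of building it, assuming direct filesystem
> access and hypothetical paths:)
> {code}
> import os
> 
> base = '/path/to/data'
> # collect the _locality_code=... partition directories, sorted
> subdirs = sorted(os.path.join(base, d) for d in os.listdir(base)
>                  if d.startswith('_locality_code='))
> {code}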
> Interestingly, it works OK if you remove any of the partitions from the list:
> {code}
> In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + 
> subdirs[i+1:32]))
> {code}
> Another strange thing is that the schemas for the first and the last 31 
> partitions of the subset are identical:
> {code}
> In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == 
> spark.read.parquet(*subdirs[1:32]).schema.fields
> Out[84]: True
> {code}
> This got me interested, so I tried the following:
> {code}
> In [87]: spark.read.parquet(*([subdirs[0]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be 
> specified manually;'
> In [88]: spark.read.parquet(*([subdirs[15]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be 
> specified manually;'
> In [89]: spark.read.parquet(*([subdirs[31]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be 
> specified manually;'
> {code}
> If I read the first partition, save it with 2.0 and try to read it in the
> same manner, everything is fine:
> {code}
> In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
> 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
> {code}
> I originally posted this to the user mailing list, but given the latest
> discoveries this clearly seems like a bug.






[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-11 Thread Dongjoon Hyun (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418364#comment-15418364 ]

Dongjoon Hyun commented on SPARK-16975:
---

Hi, [~rxin].
Could you review this PR?




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-11 Thread Dongjoon Hyun (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417240#comment-15417240 ]

Dongjoon Hyun commented on SPARK-16975:
---

Great! Thank you for confirming.




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-11 Thread immerrr again (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417186#comment-15417186 ]

immerrr again commented on SPARK-16975:
---

The figures were:
1.6.2: ~6s
2.0.0: ~12s

Interestingly enough, after restarting the driver, the 2.0 run took far less
than that:
1.6.2: ~5.7s
2.0.0: ~1.4s

Maybe some sort of caching that survives restarts is used internally.
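(A minimal sketch of how such figures can be reproduced in the PySpark shell,
assuming the timings refer to a full {{df.count()}} over the dataset:)
{code}
import time

df = spark.read.parquet('/path/to/data')
start = time.time()
df.count()
print('count took %.1fs' % (time.time() - start))
{code}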




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-11 Thread immerrr again (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417172#comment-15417172 ]

immerrr again commented on SPARK-16975:
---

But it works. I have suppressed the WARN logs and {{df.count()}} returned the
correct value, despite taking 2x the time to finish compared to 1.6.2.
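(A minimal sketch of suppressing the WARN output in the PySpark shell;
{{setLogLevel}} is coarse and raises the threshold for the whole application,
not just these warnings:)
{code}
# raise the log threshold so only ERROR and above reach the console
spark.sparkContext.setLogLevel('ERROR')

df = spark.read.parquet('/path/to/data')
df.count()
{code}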




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-11 Thread immerrr again (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417113#comment-15417113 ]

immerrr again commented on SPARK-16975:
---

I have built the code from the PR and it indeed succeeds in reading the data.

I tried {{df.count()}} and now I'm swarmed with warnings like this one (they
just keep getting printed endlessly in the terminal):
{code}
16/08/11 12:18:51 WARN CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr version 1.6.0
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) )?\(build ?(.*)\)
    at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
    at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
    at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:369)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:343)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:122)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{code}
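(If only this particular warning needs silencing, a more targeted alternative
to muting all WARN output is a per-logger override in conf/log4j.properties;
a sketch, assuming the stock log4j setup:)
{code}
# conf/log4j.properties: hide the per-footer CorruptStatistics warnings;
# the statistics are ignored either way (see PARQUET-251)
log4j.logger.org.apache.parquet.CorruptStatistics=ERROR
{code}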


[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread Dongjoon Hyun (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416038#comment-15416038 ]

Dongjoon Hyun commented on SPARK-16975:
---

Thank you. See you tomorrow! :)




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread immerrr again (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416034#comment-15416034 ]

immerrr again commented on SPARK-16975:
---

Great, thank you! That was so fast!

I'll try to look at it tomorrow, but I can't promise anything, as my
experience with Scala and with building it is next to none.




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread Dongjoon Hyun (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416032#comment-15416032 ]

Dongjoon Hyun commented on SPARK-16975:
---

I made a PR, [~immerrr]. I have only run the `sql` module tests. Once it
passes Jenkins, could you test the PR in your environment if you have some
time? I think PySpark will also benefit from this PR.
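(For anyone who wants to try the PR locally, a rough sequence against a
checkout of apache/spark; the PR number is the one linked later in this
thread, and the build flags are only illustrative:)
{code}
# fetch the PR branch into a local apache/spark checkout and build it
git fetch origin pull/14585/head:SPARK-16975
git checkout SPARK-16975
./build/mvn -DskipTests clean package

# then retry the failing read in the PySpark shell
./bin/pyspark
{code}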




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread immerrr again (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416013#comment-15416013 ]

immerrr again commented on SPARK-16975:
---

Yes, a seemingly similar one.




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416024#comment-15416024 ]

Apache Spark commented on SPARK-16975:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/14585




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread Dongjoon Hyun (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416009#comment-15416009 ]

Dongjoon Hyun commented on SPARK-16975:
---

Yep. And it raised exceptions, right?




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread immerrr again (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416003#comment-15416003 ]

immerrr again commented on SPARK-16975:
---

I mean, this line from the original report
{code}
In [87]: spark.read.parquet(*([subdirs[0]] * 32))
{code}
means passing subdirs[0] 32 times as parameters to spark.read.parquet, i.e.
spark.read.parquet(subdirs[0], subdirs[0], ...).
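(This is plain Python argument unpacking, nothing Spark-specific; a toy
illustration:)
{code}
def count_args(*paths):
    return len(paths)

dirs = ['/path/to/data/_locality_code=AQ']
count_args(*(dirs * 32))  # -> 32; same as count_args(dirs[0], dirs[0], ...)
{code}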




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread Dongjoon Hyun (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416001#comment-15416001 ]

Dongjoon Hyun commented on SPARK-16975:
---

Okay, wait a second. I'll make a PR for you.




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread immerrr again (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415998#comment-15415998 ]

immerrr again commented on SPARK-16975:
---

You mean the one I used to write the data? I don't have the exact string at
hand, but it was a straightforward conversion from JSON, inferring the schema
on the way, something like:
{code}
sqlContext.read.json('/path/to/json-data') \
    .write.partitionBy('_locality_code') \
    .parquet('/path/to/parquet-data', mode='overwrite')
{code}


Oh, another thing I noticed when I tried reading subdir-by-subdir: when you
read a single subdirectory, the _locality_code column is not present (just
like in your example), but for some reason it worked OK when reading just one
such subdirectory and failed when reading that same directory multiple times.
There were other columns starting with underscores, though. I didn't use them
to partition the data, but maybe they still somehow affected schema inference.
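(A toy sketch of the single-subdirectory behaviour, using an ordinary,
non-underscore partition column so it runs on stock 2.0.0; paths are
hypothetical:)
{code}
spark.range(4).selectExpr('id', 'id % 2 AS part') \
    .write.partitionBy('part').parquet('/tmp/part-demo')

spark.read.parquet('/tmp/part-demo').columns         # ['id', 'part']
spark.read.parquet('/tmp/part-demo/part=0').columns  # ['id'] - no partition column
{code}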




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread Dongjoon Hyun (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415966#comment-15415966 ]

Dongjoon Hyun commented on SPARK-16975:
---

In Python, could you give the command string you used to write the parquet
data?

BTW, I have also started working on supporting the `_col=xxx` format.




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread immerrr again (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415961#comment-15415961
 ] 

immerrr again commented on SPARK-16975:
---

Oh, that's unfortunate. Coming from the Python world, an underscore seems like a natural prefix for "internal things".

What bugs me, though, is that Spark 2.0 had no problem reading up to 31 directories starting with underscores and only failed when there were 32 of them.

And I'll try the rename, give me a sec.
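If it helps, a minimal rename sketch for a local filesystem follows; the base path is a placeholder from the report, and data on HDFS would need `hadoop fs -mv` instead.
{code}
import os

# Hypothetical rename of the partition directories on a local filesystem.
base = '/path/to/data'  # placeholder path from the report
for name in os.listdir(base):
    if name.startswith('_locality_code='):
        # Drop the leading underscore so partition discovery no longer
        # skips the directory as hidden metadata.
        os.rename(os.path.join(base, name),
                  os.path.join(base, name[1:]))
{code}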




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415957#comment-15415957
 ] 

Dongjoon Hyun commented on SPARK-16975:
---

Let me dig more. I may be able to find a more general solution that covers both Spark 1.6 and 2.0.




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415941#comment-15415941
 ] 

Dongjoon Hyun commented on SPARK-16975:
---

Ah, [~rxin].
For this issue, we should add a migration document for 1.6.

Spark 2.0 itself has the same problem. We should block such illegal column names at write time. May I make a PR for this?

{code}
scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet20")

scala> spark.read.parquet("/tmp/parquet20")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /tmp/parquet20. It must be specified manually;
{code}
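Until such a check lands in Spark itself, a user-side guard is easy to sketch. This is hypothetical, not the proposed Spark-side fix:
{code}
# Hypothetical user-side guard, not the proposed Spark-side fix: reject
# partition columns whose directories readers treat as hidden metadata.
def check_partition_columns(columns):
    illegal = [c for c in columns if c.startswith(('_', '.'))]
    if illegal:
        raise ValueError(
            'partition columns %s would create hidden directories that '
            'Spark skips during schema inference' % illegal)

check_partition_columns(['_locality_code'])  # raises ValueError
{code}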




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415932#comment-15415932
 ] 

Dongjoon Hyun commented on SPARK-16975:
---

I made a sample case showing similar behavior. I think this is closely related. [~rxin], what do you think about this?

{code}
spark-1.6.2-bin-hadoop2.6$ ls /tmp/parquet16/
_SUCCESS          _locality_code=1  _locality_code=3  _locality_code=5  _locality_code=7  _locality_code=9
_locality_code=0  _locality_code=2  _locality_code=4  _locality_code=6  _locality_code=8
{code}

{code}
scala> spark.read.parquet("/tmp/parquet16").show
org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /tmp/parquet16. It must be specified manually;

scala> spark.read.parquet("/tmp/parquet16/_locality_code=0").show
+---+
| id|
+---+
|  0|
+---+
{code}




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415903#comment-15415903
 ] 

Dongjoon Hyun commented on SPARK-16975:
---

Hi, [~immerrr].
I cannot reproduce your situation, but could you change `_locality_code` to `locality_code`?
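If renaming is not an option, the error message itself points to a workaround: specify the schema manually. A minimal sketch, assuming a single long column plus the partition column (the dataset's real schema is not shown in the report):
{code}
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Hypothetical schema; include the partition column when specifying
# the schema manually instead of relying on inference.
schema = StructType([
    StructField('id', LongType(), True),
    StructField('_locality_code', StringType(), True),
])
df = spark.read.schema(schema).parquet('/path/to/data')
{code}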




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415833#comment-15415833
 ] 

Dongjoon Hyun commented on SPARK-16975:
---

Thank you for pinging me.




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415831#comment-15415831
 ] 

Dongjoon Hyun commented on SPARK-16975:
---

Oh, sure. It's my pleasure. I'll take a look.




[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-10 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415823#comment-15415823
 ] 

Reynold Xin commented on SPARK-16975:
-

cc [~dongjoon] do you have time to look into this?





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org