[jira] [Commented] (PARQUET-268) Build is failing with parquet-scrooge errors.
[ https://issues.apache.org/jira/browse/PARQUET-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520532#comment-14520532 ] Ryan Blue commented on PARQUET-268: --- I'm going to do the downgrade and ignore the failing tests. We know that the library works right as long as Scrooge does, so I think it is reasonable. I'll ping you on the PR for review. Build is failing with parquet-scrooge errors. - Key: PARQUET-268 URL: https://issues.apache.org/jira/browse/PARQUET-268 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.6.0 Reporter: Ryan Blue Fix For: 1.6.1 The build is currently failing for all PRs in Travis CI. According to Alex: bq. . . . one of the scrooge dependencies transitively pulled in a snapshot that has since been purged. Seems like that dependency was improperly published. Upgrading the scrooge plugin should fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-268) Build is failing with parquet-scrooge errors.
[ https://issues.apache.org/jira/browse/PARQUET-268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-268. --- Resolution: Fixed Assignee: Ryan Blue Build is failing with parquet-scrooge errors. - Key: PARQUET-268 URL: https://issues.apache.org/jira/browse/PARQUET-268 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.6.0 Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.6.1 The build is currently failing for all PRs in Travis CI. According to Alex: bq. . . . one of the scrooge dependencies transitively pulled in a snapshot that has since been purged. Seems like that dependency was improperly published. Upgrading the scrooge plugin should fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (PARQUET-270) Add legend to parquet-tools readme.md
[ https://issues.apache.org/jira/browse/PARQUET-270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue reassigned PARQUET-270: - Assignee: Ryan Blue Add legend to parquet-tools readme.md - Key: PARQUET-270 URL: https://issues.apache.org/jira/browse/PARQUET-270 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Brett Stime Assignee: Ryan Blue Priority: Trivial Improve the documentation for parquet-tools by describing the output in more detail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-280) Please create a DOAP file for your TLP
[ https://issues.apache.org/jira/browse/PARQUET-280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-280. --- Resolution: Fixed Assignee: Julien Le Dem Thanks, Julien! Please create a DOAP file for your TLP -- Key: PARQUET-280 URL: https://issues.apache.org/jira/browse/PARQUET-280 Project: Parquet Issue Type: Task Reporter: Sebb Assignee: Julien Le Dem Please can you set up a DOAP for your project and get it added to files.xml? See http://projects.apache.org/create.html Once you have created the DOAP, please submit it for inclusion in the Apache projects listing as per: http://projects.apache.org/create.html#submit Remember, if you ever move or rename the doap file in future, please ensure that files.xml is updated to point to the new location. It is recommended that the DOAP is published with the website, e.g. at http://parquet.apache.org/doap_Parquet.rdf as this URL is unlikely to change. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-253) AvroSchemaConverter has confusing Javadoc
[ https://issues.apache.org/jira/browse/PARQUET-253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-253. --- Resolution: Fixed Merged #173. Thanks! AvroSchemaConverter has confusing Javadoc - Key: PARQUET-253 URL: https://issues.apache.org/jira/browse/PARQUET-253 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.5.0, 1.6.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor Got confused by the original Javadoc at first and didn't realize {{AvroSchemaConverter}} is also capable of converting a Parquet schema to an Avro schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-98) filter2 API performance regression
[ https://issues.apache.org/jira/browse/PARQUET-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551393#comment-14551393 ] Ryan Blue commented on PARQUET-98: -- [~phraktle], to save you some time, the 1.7.0 release will also have this problem. I'll find some time to look into it further. filter2 API performance regression -- Key: PARQUET-98 URL: https://issues.apache.org/jira/browse/PARQUET-98 Project: Parquet Issue Type: Bug Reporter: Viktor Szathmáry The new filter API seems to be much slower (or perhaps I'm using it wrong \:) Code using an UnboundRecordFilter: {code:java} ColumnRecordFilter.column(column, ColumnPredicates.applyFunctionToBinary( input -> Binary.fromString(value).equals(input))); {code} vs. code using FilterPredicate: {code:java} eq(binaryColumn(column), Binary.fromString(value)); {code} The latter performs twice as slowly on the same Parquet file (built using 1.6.0rc2). Note: the reader is constructed using {code:java} ParquetReader.builder(new ProtoReadSupport()).withFilter(filter).build() {code} The new filter API based approach seems to create a whole lot more garbage (perhaps due to reconstructing all the rows?). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL
[ https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574894#comment-14574894 ] Ryan Blue commented on PARQUET-222: --- [~phatak.dev]: the problem is probably the number of files you're trying to write to at once. Each file buffers to the Parquet row group size (set by parquet.block.size, defaults to 128MB). If you have 10 files open for a processor, that's ~1.3GB and Spark already uses quite a bit of memory itself. [~lian cheng], any ideas since you're the most familiar with how Spark writes from data frames? Is it possible to shuffle the data to have only one open file per executor at a time? parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL - Key: PARQUET-222 URL: https://issues.apache.org/jira/browse/PARQUET-222 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.6.0 Reporter: Chaozhong Yang Original Estimate: 336h Remaining Estimate: 336h In Spark SQL, there is a function `saveAsParquetFile` in DataFrame or SchemaRDD. That function calls methods in parquet-mr, and sometimes it will fail due to the OOM error thrown by parquet-mr.
We can see the exception stack trace as follows:
{noformat}
[WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space
	at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
	at parquet.column.values.dictionary.IntList.<init>(IntList.java:83)
	at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85)
	at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549)
	at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
	at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74)
	at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
	at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
	at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
	at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
	at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
	at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
	at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
	at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304)
	at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
	at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
{noformat}
By the way, there is another similar issue, https://issues.apache.org/jira/browse/PARQUET-99, but the reporter has closed it and marked it as resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
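The heap estimate in the comment above (each open file buffers up to one row group of parquet.block.size bytes before flushing) is simple arithmetic. The sketch below is illustrative Python, not parquet-mr code:

```python
# Back-of-the-envelope check of writer-side buffer memory: each open
# Parquet file buffers up to one full row group in memory, sized by
# parquet.block.size (128MB by default), before it is flushed.
ROW_GROUP_BYTES = 128 * 1024 * 1024  # parquet.block.size default

def writer_buffer_bytes(open_files, row_group_bytes=ROW_GROUP_BYTES):
    """Upper bound on row-group buffer memory held by open_files writers."""
    return open_files * row_group_bytes

# 10 concurrently open files can buffer up to 1.25GiB (the ~1.3GB cited
# above), on top of whatever the framework (e.g. Spark) itself uses.
print(writer_buffer_bytes(10) / 1024**3)
```

This is why reducing the number of simultaneously open output files (for example by shuffling so each executor writes one file at a time) directly reduces peak heap usage.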
[jira] [Resolved] (PARQUET-314) Fix broken equals implementation(s)
[ https://issues.apache.org/jira/browse/PARQUET-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-314. --- Resolution: Fixed Fix Version/s: 1.8.0 Merged. Thanks for catching this and fixing it, [~nezihyigitbasi]! Fix broken equals implementation(s) --- Key: PARQUET-314 URL: https://issues.apache.org/jira/browse/PARQUET-314 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.8.0 Reporter: Nezih Yigitbasi Assignee: Nezih Yigitbasi Priority: Minor Fix For: 1.8.0 The equals implementations in the ColumnDescriptor and Statistics classes are broken. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597906#comment-14597906 ] Ryan Blue commented on PARQUET-41: -- Interesting, I hadn't heard about the counting bloom filters. But as I think a bit more about how the Hive ACID stuff works, I don't think it would help. The base file is rewritten periodically to incorporate changes stored in the current set of deltas. That would rewrite the bloom filter from scratch, so there is no need for it to be reversible. Then if you're applying a delta on top of the base file, you only need to apply the filters to your delta because those rows entirely replace rows in the base. In that case, you have a static bloom filter per delta file and static bloom filters in the base file, too. Add bloom filters to parquet statistics --- Key: PARQUET-41 URL: https://issues.apache.org/jira/browse/PARQUET-41 Project: Parquet Issue Type: New Feature Components: parquet-format, parquet-mr Reporter: Alex Levenson Assignee: Ferdinand Xu Labels: filter2 For row groups with no dictionary, we could still produce a bloom filter. This could be very useful in filtering entire row groups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-306) Improve alignment between row groups and HDFS blocks
[ https://issues.apache.org/jira/browse/PARQUET-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-306. --- Resolution: Fixed Fix Version/s: 1.8.0 Merged #211. Thanks for reviewing, Alex! Improve alignment between row groups and HDFS blocks Key: PARQUET-306 URL: https://issues.apache.org/jira/browse/PARQUET-306 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.8.0 Row groups should not span HDFS blocks to avoid remote reads. There are 3 things we can use to avoid this:
1. Set the next row group's size to the remaining bytes in the current HDFS block
2. Use HDFS-3689, variable-length HDFS blocks, when available
3. Pad after row groups close to the block boundary to start the next row group at the start of the next block
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
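The padding strategy in item 3 reduces to simple arithmetic on the file position. The sketch below is illustrative; the function and parameter names are hypothetical, not the parquet-mr implementation:

```python
def pad_for_next_row_group(file_pos, next_row_group_estimate, block_size,
                           max_padding):
    """Bytes of padding to write so the next row group starts on an HDFS
    block boundary when it would otherwise span two blocks.

    Hypothetical helper illustrating the padding idea (item 3 above);
    not the actual parquet-mr code.
    """
    remaining = block_size - (file_pos % block_size)
    if remaining >= next_row_group_estimate:
        return 0          # the next row group fits in the current block
    if remaining <= max_padding:
        return remaining  # pad to the boundary, start in a fresh block
    return 0              # too much waste to pad; fall back to item 1
                          # (shrink the next row group to `remaining`)

# e.g. at position 100 of a 128-byte block, a 64-byte row group would
# span the boundary; with a 32-byte padding budget we pad the 28 bytes.
print(pad_for_next_row_group(100, 64, 128, 32))
```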
[jira] [Resolved] (PARQUET-317) writeMetaDataFile crashes when a relative root Path is used
[ https://issues.apache.org/jira/browse/PARQUET-317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-317. --- Resolution: Fixed Fix Version/s: 1.8.0 Merged #228. Thanks for fixing this, Steven! writeMetaDataFile crashes when a relative root Path is used --- Key: PARQUET-317 URL: https://issues.apache.org/jira/browse/PARQUET-317 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.8.0 Reporter: Steven She Assignee: Steven She Priority: Minor Fix For: 1.8.0 In Spark, I can save an RDD to the local file system using a relative path, e.g.: {noformat} rdd.saveAsNewAPIHadoopFile( relativeRoot, classOf[Void], tag.runtimeClass.asInstanceOf[Class[T]], classOf[ParquetOutputFormat[T]], job.getConfiguration) {noformat} This leads to a crash in the ParquetFileWriter.mergeFooters(..) method since the footer paths are read as fully qualified paths, but the root path is provided as a relative path: {noformat} org.apache.parquet.io.ParquetEncodingException: /Users/stevenshe/schema/relativeRoot/part-r-0.snappy.parquet invalid: all the files must be contained in the root relativeRoot {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-317) writeMetaDataFile crashes when a relative root Path is used
[ https://issues.apache.org/jira/browse/PARQUET-317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-317: -- Assignee: Steven She writeMetaDataFile crashes when a relative root Path is used --- Key: PARQUET-317 URL: https://issues.apache.org/jira/browse/PARQUET-317 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.8.0 Reporter: Steven She Assignee: Steven She Priority: Minor In Spark, I can save an RDD to the local file system using a relative path, e.g.: {noformat} rdd.saveAsNewAPIHadoopFile( relativeRoot, classOf[Void], tag.runtimeClass.asInstanceOf[Class[T]], classOf[ParquetOutputFormat[T]], job.getConfiguration) {noformat} This leads to a crash in the ParquetFileWriter.mergeFooters(..) method since the footer paths are read as fully qualified paths, but the root path is provided as a relative path: {noformat} org.apache.parquet.io.ParquetEncodingException: /Users/stevenshe/schema/relativeRoot/part-r-0.snappy.parquet invalid: all the files must be contained in the root relativeRoot {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-248) Simplify ParquetWriter's constructors
[ https://issues.apache.org/jira/browse/PARQUET-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-248. --- Resolution: Fixed Fix Version/s: 1.8.0 Added a builder class that can be extended by object models. Simplify ParquetWriter's constructors -- Key: PARQUET-248 URL: https://issues.apache.org/jira/browse/PARQUET-248 Project: Parquet Issue Type: Improvement Affects Versions: 1.6.0 Reporter: Konstantin Shaposhnikov Assignee: Ryan Blue Fix For: 1.8.0 ParquetWriter has a lot of constructors. A builder pattern can be used to simplify construction of ParquetWriter objects (similar to ParquetReader, see PARQUET-39). ParquetWriter subclasses (like AvroParquetWriter) should be updated to provide a reasonable builder() static factory method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
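The builder pattern described above replaces a pile of telescoping constructors with chained setters and a single build() call. The sketch below shows the shape of that API in Python; the names (`WriterBuilder`, `with_*`) are illustrative stand-ins, not the parquet-mr API:

```python
class WriterBuilder:
    """Sketch of the builder pattern proposed for ParquetWriter.

    Each with_* method overrides one default and returns self so calls
    chain, mirroring withRowGroupSize(...)/withCompressionCodec(...)
    style setters in Java. Names here are hypothetical.
    """

    def __init__(self, path):
        self._conf = {
            "path": path,
            "row_group_size": 128 * 1024 * 1024,  # sensible default
            "compression": "UNCOMPRESSED",
        }

    def with_row_group_size(self, size):
        self._conf["row_group_size"] = size
        return self

    def with_compression(self, codec):
        self._conf["compression"] = codec
        return self

    def build(self):
        # Stand-in for constructing the actual writer from the config.
        return dict(self._conf)

conf = WriterBuilder("out.parquet").with_compression("SNAPPY").build()
```

Object models (Avro, Thrift, ...) would subclass the builder and expose a static builder() factory, so callers only specify the options they care about.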
[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602532#comment-14602532 ] Ryan Blue commented on PARQUET-41: --

Thanks for working on this, [~Ferd], it's great to be making some good progress on it. This is getting to be a pretty long comment. I don't have all that many conclusions, but I wanted to share some observations to start a discussion around how this feature should be done.

I've mostly been thinking lately about the bloom filter configuration. I like that FPP is a user setting because the query patterns really affect what value you want for it. You can get much better space savings with a high FPP if you know that typical queries will only look for a few items. We can think of FPP as the probability that we will have to read a data page even though it doesn't actually have the item we are looking for. That is multiplied by the number of items in a query, which could be large but I think will generally be less than ~10 elements (for basing a default). That puts a general upper limit on the FPP because if it is something too high, like 10%, a fair number of queries will end up reading unnecessary data with a 50+% probability (anything checking for 5 or more unique items).

I think we should have a way to read the page stats without the filter, since they can be pretty big. I took a look at a real-world dataset with 8-byte timestamps that are ~75% unique, which put the expected filter size for a 2.5% false-positive rate at 9% of the block size. If I'm looking for 32 timestamps at once, I have an 80% chance of reading pages I don't need to read, and end up reading an extra 9% for every page's bloom filter alone.

I don't think we want a setting for the expected number of entries. For one thing, this varies widely across pages. I have a dataset with 20-30 values per page in one column and 131,000 values per page in another. A setting for all columns will definitely be a problem, and I don't think we can trust users to set this correctly for their data on every column. We also don't know much about how many unique values are in a column or how that column will compress with the encodings.

Bloom filters are surprisingly expensive in terms of space considering some of the encoding sizes we can get in Parquet. For example, if we have a column where delta integer encoding is doing a good job, values might be ~2 bytes each. If the column is 75% unique, then even a 10% FPP will create a bloom filter that is ~22.5% of the page size, and a 1% FPP is ~44.9% of the page size. For comparison, with a less effective encoding of 8 bytes per value, the filter ends up being ~11.2% of the page size for a 1% FPP, which is still significant. As encoding gets better, pages have more values and the bloom filter needs to be larger. Without knowing the percentage of unique values or the encoding size, choosing the expected number of values for a page is impossible.

Because of the potential size of the filters compared to the page size, over-estimating the filter size isn't enough: we don't want something 10% of the page size or larger. That means that if we chose an estimate for the number of values, we would still end up overloading filters fairly often. I took a look at the false-positive probability for overloaded filters: if a filter is 125% loaded, then the actual false-positive probability at least doubles, and for an original 1% FPP, it triples. It gets much worse as the overloading increases: 200% loaded results in a 9% actual FPP based on a 1% original FPP. Keep in mind that the expected overloading is probably not as low as 200% given that the number of values per page can vary from tens to tens of thousands.

I think there are 2 approaches to fixing this. First, there's a paper, [Scalable Bloom Filters|http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf], that has a strategy to use a series of bloom filters so you don't have to know the size in advance. It's a good paper, but we would want to change the heuristics for growing the filter because we know when we are getting close to the total number of elements in the page. Another drawback is that it uses a series of filters, so testing for an element has to be done in each filter.

I think a second approach is to keep the data in memory until we have enough to determine the properties of the bloom filter. This would only need to be done for the first few pages, while memory consumption is still small. We could keep the hashed values instead of the actual data to get the size down to a set of integers that will be approximately the number of unique items in the page (minus collisions). I like this option better because it is all on the write side and trades a reasonable amount of memory for a more complicated filter. The read side would be as it is now.

Okay, this is long enough. I'll clean up the
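The size percentages quoted in the comment above follow from the standard bloom filter sizing formula, m = -n·ln(p)/(ln 2)² bits for n items at false-positive probability p. A quick check, assuming 1MB data pages (an assumption for illustration; the comment does not state a page size):

```python
import math

def bloom_bits(n, fpp):
    """Optimal bloom filter size in bits for n distinct items at
    false-positive probability fpp: m = -n * ln(p) / (ln 2)^2."""
    return -n * math.log(fpp) / (math.log(2) ** 2)

def filter_to_page_ratio(value_bytes, unique_fraction, fpp,
                         page_bytes=1024 * 1024):  # assume 1MB data pages
    """Bloom filter size as a fraction of the page it describes."""
    n = (page_bytes // value_bytes) * unique_fraction  # distinct values/page
    return bloom_bits(n, fpp) / 8 / page_bytes

# ~2-byte encoded values, 75% unique (well-compressed column):
print(filter_to_page_ratio(2, 0.75, 0.10))  # ~0.225, the ~22.5% above
print(filter_to_page_ratio(2, 0.75, 0.01))  # ~0.449, the ~44.9% above
# 8-byte values, 75% unique (poorly-compressed column):
print(filter_to_page_ratio(8, 0.75, 0.01))  # ~0.112, the ~11.2% above
```

Note how the ratio depends only on bits-per-filter-entry versus bytes-per-encoded-value, which is why better encodings make the relative cost of the filter worse.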
[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600367#comment-14600367 ] Ryan Blue commented on PARQUET-41: -- I don't think the counting bloom filter idea is worth the increased size or the work to make it happen, when the trade-off is a false-positive. The ACID support will periodically rebuild the bloom filters anyway, so we're only talking about false positives for data in the delta files, which we expect to be small. Add bloom filters to parquet statistics --- Key: PARQUET-41 URL: https://issues.apache.org/jira/browse/PARQUET-41 Project: Parquet Issue Type: New Feature Components: parquet-format, parquet-mr Reporter: Alex Levenson Assignee: Ferdinand Xu Labels: filter2 For row groups with no dictionary, we could still produce a bloom filter. This could be very useful in filtering entire row groups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-152) Encoding issue with fixed length byte arrays
[ https://issues.apache.org/jira/browse/PARQUET-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592474#comment-14592474 ] Ryan Blue commented on PARQUET-152: --- I think the RLE_DICTIONARY behavior is probably because the dictionary is using plain encoding rather than delta byte array. Encoding issue with fixed length byte arrays Key: PARQUET-152 URL: https://issues.apache.org/jira/browse/PARQUET-152 Project: Parquet Issue Type: Bug Reporter: Nezih Yigitbasi Priority: Minor While running some tests against the master branch I hit an encoding issue that seemed like a bug to me. I noticed that when writing a fixed length byte array and the array's size is > dictionaryPageSize (in my test it was 512), the encoding falls back to DELTA_BYTE_ARRAY as seen below:
{noformat}
Dec 17, 2014 3:41:10 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 12,125B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 1,710B raw, 1,710B comp, 5 pages, encodings: [DELTA_BYTE_ARRAY]
{noformat}
But then read fails with the following exception:
{noformat}
Caused by: parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
	at parquet.column.Encoding$7.getValuesReader(Encoding.java:193)
	at parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:534)
	at parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:574)
	at parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:54)
	at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:518)
	at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:510)
	at parquet.column.page.DataPageV2.accept(DataPageV2.java:123)
	at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:510)
	at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:502)
	at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:604)
	at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:348)
	at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
	at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
	at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
	at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
	at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
	at parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
	at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
	at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:129)
	at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
	... 16 more
{noformat}
When the array's size is < dictionaryPageSize, RLE_DICTIONARY encoding is used and read works fine:
{noformat}
Dec 17, 2014 3:39:50 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 50B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 3B raw, 3B comp, 1 pages, encodings: [RLE_DICTIONARY, PLAIN], dic { 1 entries, 8B raw, 1B comp}
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (PARQUET-41) Add bloom filters to parquet statistics
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602532#comment-14602532 ] Ryan Blue edited comment on PARQUET-41 at 6/26/15 7:53 PM: ---

Thanks for working on this, [~Ferd], it's great to be making some good progress on it. This is getting to be a pretty long comment. I don't have all that many conclusions, but I wanted to share some observations to start a discussion around how this feature should be done.

I've mostly been thinking lately about the bloom filter configuration. I like that FPP is a user setting because the query patterns really affect what value you want for it. You can get much better space savings with a high FPP if you know that typical queries will only look for a few items. We can think of FPP as the probability that we will have to read a data page even though it doesn't actually have the item we are looking for. That is multiplied by the number of items in a query, which could be large but I think will generally be less than ~10 elements (for basing a default). That puts a general upper limit on the FPP because if it is something too high, like 10%, a fair number of queries will end up reading unnecessary data with a 50+% probability (anything checking for 7 or more unique items).

I think we should have a way to read the page stats without the filter, since they can be pretty big. I took a look at a real-world dataset with 8-byte timestamps that are ~75% unique, which put the expected filter size for a 2.5% false-positive rate at 9% of the block size. If I'm looking for 32 timestamps at once, I have an 80% chance of reading pages I don't need to read, and end up reading an extra 9% for every page's bloom filter alone.

I don't think we want a setting for the expected number of entries. For one thing, this varies widely across pages. I have a dataset with 20-30 values per page in one column and 131,000 values per page in another. A setting for all columns will definitely be a problem, and I don't think we can trust users to set this correctly for their data on every column. We also don't know much about how many unique values are in a column or how that column will compress with the encodings.

Bloom filters are surprisingly expensive in terms of space considering some of the encoding sizes we can get in Parquet. For example, if we have a column where delta integer encoding is doing a good job, values might be ~2 bytes each. If the column is 75% unique, then even a 10% FPP will create a bloom filter that is ~22.5% of the page size, and a 1% FPP is ~44.9% of the page size. For comparison, with a less effective encoding of 8 bytes per value, the filter ends up being ~11.2% of the page size for a 1% FPP, which is still significant. As encoding gets better, pages have more values and the bloom filter needs to be larger. Without knowing the percentage of unique values or the encoding size, choosing the expected number of values for a page is impossible.

Because of the potential size of the filters compared to the page size, over-estimating the filter size isn't enough: we don't want something 10% of the page size or larger. That means that if we chose an estimate for the number of values, we would still end up overloading filters fairly often. I took a look at the false-positive probability for overloaded filters: if a filter is 125% loaded, then the actual false-positive probability at least doubles, and for an original 1% FPP, it triples. It gets much worse as the overloading increases: 200% loaded results in a 9% actual FPP based on a 1% original FPP. Keep in mind that the expected overloading is probably not as low as 200% given that the number of values per page can vary from tens to tens of thousands.

I think there are 2 approaches to fixing this. First, there's a paper, [Scalable Bloom Filters|http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf], that has a strategy to use a series of bloom filters so you don't have to know the size in advance. It's a good paper, but we would want to change the heuristics for growing the filter because we know when we are getting close to the total number of elements in the page. Another drawback is that it uses a series of filters, so testing for an element has to be done in each filter.

I think a second approach is to keep the data in memory until we have enough to determine the properties of the bloom filter. This would only need to be done for the first few pages, while memory consumption is still small. We could keep the hashed values instead of the actual data to get the size down to a set of integers that will be approximately the number of unique items in the page (minus collisions). I like this option better because it is all on the write side and trades a reasonable amount of memory for a more complicated filter. The read side would be as it is now.
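The second (write-side) approach described above, buffering hashed values until the page is complete and only then sizing the filter, could look roughly like the sketch below. Class and method names are hypothetical, and the simple bit array with double hashing stands in for whatever filter implementation would actually be used:

```python
import math

class DeferredBloomFilter:
    """Buffer 64-bit hashes of incoming values; build the filter only
    when the page is complete, so n (the distinct-value count) is known
    exactly and the filter is never overloaded. Illustrative sketch of
    the approach discussed above, not parquet-mr code."""

    def __init__(self, fpp):
        self.fpp = fpp
        self.hashes = set()  # hashed values stand in for the raw data

    def add(self, value):
        self.hashes.add(hash(value) & 0xFFFFFFFFFFFFFFFF)

    def build(self):
        n = max(len(self.hashes), 1)
        # optimal bits: m = -n * ln(p) / (ln 2)^2; hash count: k = m/n * ln 2
        m = max(int(-n * math.log(self.fpp) / math.log(2) ** 2), 8)
        k = max(int(round(m / n * math.log(2))), 1)
        bits = bytearray((m + 7) // 8)
        for h in self.hashes:
            h1, h2 = h & 0xFFFFFFFF, h >> 32
            for i in range(k):          # double hashing: h1 + i*h2 mod m
                pos = (h1 + i * h2) % m
                bits[pos // 8] |= 1 << (pos % 8)
        return bits, m, k

f = DeferredBloomFilter(0.01)
for v in range(1000):
    f.add(v)
bits, m, k = f.build()
```

Because the hash set holds one integer per distinct value, memory during buffering is proportional to the unique count rather than the raw data size, which is the trade-off the comment argues for.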
[jira] [Resolved] (PARQUET-293) ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame
[ https://issues.apache.org/jira/browse/PARQUET-293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-293. --- Resolution: Duplicate Closing as a duplicate. Please follow SPARK-8288 instead. ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame Key: PARQUET-293 URL: https://issues.apache.org/jira/browse/PARQUET-293 Project: Parquet Issue Type: Bug Components: parquet-format Affects Versions: 1.6.0 Reporter: Tim Chan I get scala.ScalaReflectionException: <none> is not a term when I try to convert an RDD of Scrooge to a DataFrame, e.g. myScroogeRDD.toDF. Has anyone else encountered this problem? I'm using Spark 1.3.1, Scala 2.10.4 and scrooge-sbt-plugin 3.16.3. Here is my thrift IDL:
{code}
namespace scala com.junk
namespace java com.junk

struct Junk {
    10: i64 junkID,
    20: string junkString
}
{code}
from a spark-shell:
{code}
val junks = List( Junk(123L, "junk1"), Junk(567L, "junk2"), Junk(789L, "junk3") )
val junksRDD = sc.parallelize(junks)
junksRDD.toDF
{code}
Exception thrown:
{noformat}
scala.ScalaReflectionException: <none> is not a term
	at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:259)
	at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:73)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:148)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
	at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:316)
	at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:254)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:34)
	at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
	at $iwC$$iwC$$iwC.<init>(<console>:40)
	at $iwC$$iwC.<init>(<console>:42)
	at $iwC.<init>(<console>:44)
	at <init>(<console>:46)
	at .<init>(<console>:50)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at
{noformat}
[jira] [Commented] (PARQUET-293) ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame
[ https://issues.apache.org/jira/browse/PARQUET-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580868#comment-14580868 ] Ryan Blue commented on PARQUET-293: --- Linking to the issue that replaces this. ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame Key: PARQUET-293 URL: https://issues.apache.org/jira/browse/PARQUET-293 Project: Parquet Issue Type: Bug Components: parquet-format Affects Versions: 1.6.0 Reporter: Tim Chan I get scala.ScalaReflectionException: none is not a term when I try to convert an RDD of Scrooge to a DataFrame, e.g. myScroogeRDD.toDF Has anyone else encountered this problem? I'm using Spark 1.3.1, Scala 2.10.4 and scrooge-sbt-plugin 3.16.3 Here is my thrift IDL:
{code}
namespace scala com.junk
namespace java com.junk

struct Junk {
  10: i64 junkID,
  20: string junkString
}
{code}
from a spark-shell:
{code}
val junks = List(
  Junk(123L, "junk1"),
  Junk(567L, "junk2"),
  Junk(789L, "junk3")
)
val junksRDD = sc.parallelize(junks)
junksRDD.toDF
{code}
Exception thrown: {noformat} scala.ScalaReflectionException: none is not a term at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:259) at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:73) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:148) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:316) at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:254) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:27) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:32) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:34) at 
$iwC$$iwC$$iwC$$iwC$$iwC.init(console:36) at $iwC$$iwC$$iwC$$iwC.init(console:38) at $iwC$$iwC$$iwC.init(console:40) at $iwC$$iwC.init(console:42) at $iwC.init(console:44) at init(console:46) at .init(console:50) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944) at 
org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at
[jira] [Commented] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL
[ https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580861#comment-14580861 ] Ryan Blue commented on PARQUET-222: --- Okay, so it sounds like you're talking about writing out data to a single folder without FS partitioning. Then I agree that the solution is to reduce the number of tasks to minimize the number of files. Sounds like you already do the optimization for FS partitioning, which is great. Thanks! parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL - Key: PARQUET-222 URL: https://issues.apache.org/jira/browse/PARQUET-222 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.6.0 Reporter: Chaozhong Yang Original Estimate: 336h Remaining Estimate: 336h In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or {{SchemaRDD}}. That function calls methods in parquet-mr, and sometimes it will fail due to an OOM error thrown by parquet-mr. 
We can see the exception stack trace as follows: {noformat} WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87) at parquet.column.values.dictionary.IntList.init(IntList.java:83) at parquet.column.values.dictionary.DictionaryValuesWriter.init(DictionaryValuesWriter.java:85) at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.init(DictionaryValuesWriter.java:549) at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88) at parquet.column.impl.ColumnWriterImpl.init(ColumnWriterImpl.java:74) at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68) at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56) at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnIO.java:178) at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369) at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108) at parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:94) at parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at 
org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {noformat} By the way, there is another similar issue https://issues.apache.org/jira/browse/PARQUET-99. But the reporter has closed it and marked it as resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
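A rough, hedged model of why reducing the number of tasks helps here: each concurrently open Parquet writer buffers up to roughly a row group of data before flushing, so heap pressure scales with the number of simultaneously open files. The arithmetic below is a deliberate simplification, not parquet-mr's exact memory accounting.

```java
// Sketch: back-of-envelope heap estimate for concurrently open Parquet
// writers. Assumes each open writer buffers ~one row group before flushing;
// this is a simplifying model, not parquet-mr's real MemoryManager logic.
public class WriterMemoryEstimate {
    static long estimateBytes(int openWriters, long rowGroupBytes) {
        return (long) openWriters * rowGroupBytes;
    }

    public static void main(String[] args) {
        // 100 tasks each holding one open writer with a 128 MB row group
        // needs on the order of 12.5 GiB of buffer space -- an easy OOM.
        System.out.println(estimateBytes(100, 128L * 1024 * 1024)); // 13421772800
    }
}
```

Under this model, halving the number of tasks (and thus of open files) roughly halves the buffered bytes, which matches the advice in the comment above.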
[jira] [Resolved] (PARQUET-178) META-INF for slf4j should not be in parquet-format jar
[ https://issues.apache.org/jira/browse/PARQUET-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-178. --- Resolution: Fixed Assignee: Ryan Blue Merged. Thanks for letting us know about this [~koert]! META-INF for slf4j should not be in parquet-format jar -- Key: PARQUET-178 URL: https://issues.apache.org/jira/browse/PARQUET-178 Project: Parquet Issue Type: Bug Components: parquet-format Affects Versions: 1.6.0 Reporter: koert kuipers Assignee: Ryan Blue Priority: Minor {noformat} $ jar tf parquet-format-2.2.0-rc1.jar | grep org\\.slf META-INF/maven/org.slf4j/ META-INF/maven/org.slf4j/slf4j-api/ META-INF/maven/org.slf4j/slf4j-api/pom.xml META-INF/maven/org.slf4j/slf4j-api/pom.properties {noformat} It is not clear to me why these are here. I suspect they should not be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
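If the stray entries were pulled in by shading, one way to keep foreign META-INF/maven directories out of the jar is a maven-shade-plugin filter along these lines (a sketch only; whether parquet-format builds this artifact with the shade plugin should be verified against its pom):

```xml
<!-- maven-shade-plugin filter: drop slf4j's Maven metadata from the shaded jar -->
<filters>
  <filter>
    <artifact>*:*</artifact>
    <excludes>
      <exclude>META-INF/maven/org.slf4j/**</exclude>
    </excludes>
  </filter>
</filters>
```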
[jira] [Created] (PARQUET-308) Add accessor to ParquetWriter to get current data size
Ryan Blue created PARQUET-308: - Summary: Add accessor to ParquetWriter to get current data size Key: PARQUET-308 URL: https://issues.apache.org/jira/browse/PARQUET-308 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.7.0 Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.8.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
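One use case for such an accessor is size-based file rolling: close the current file and open a new one once the buffered data passes a threshold. The sketch below simulates that pattern with a hypothetical {{SizedWriter}} stand-in; it is not parquet-mr API, though the accessor requested here is along the lines of a {{getDataSize()}} method on {{ParquetWriter}}.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of size-based file rolling driven by a data-size accessor.
// SizedWriter and InMemoryWriter are hypothetical stand-ins for a real
// Parquet writer; only the rolling pattern is the point.
public class RollingWriterSketch {
    interface SizedWriter {
        void write(byte[] record);
        long getDataSize(); // the accessor this issue asks for
    }

    static class InMemoryWriter implements SizedWriter {
        private long size = 0;
        public void write(byte[] record) { size += record.length; }
        public long getDataSize() { return size; }
    }

    // Roll to a new writer whenever the next record would exceed the threshold.
    static int filesNeeded(List<byte[]> records, long threshold) {
        int files = 1;
        SizedWriter writer = new InMemoryWriter();
        for (byte[] record : records) {
            if (writer.getDataSize() + record.length > threshold) {
                files++;                     // close current file, start a new one
                writer = new InMemoryWriter();
            }
            writer.write(record);
        }
        return files;
    }

    public static void main(String[] args) {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) records.add(new byte[40]);
        System.out.println(filesNeeded(records, 100)); // 5: two 40-byte records per file
    }
}
```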
[jira] [Commented] (PARQUET-246) ArrayIndexOutOfBoundsException with Parquet write version v2
[ https://issues.apache.org/jira/browse/PARQUET-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590353#comment-14590353 ] Ryan Blue commented on PARQUET-246: --- [~michael] can you answer my questions about this? When does this happen? Whenever you read a file like this? If so, then we need to add support (with a flag) to initialize the delta byte array from the last value in the last page/row group. That would mean we also need to keep it around and throw an exception if it isn't present (if you were reading from the middle of the file, we can't back up to get it right). I think data recovery needs to be part of the solution for this. ArrayIndexOutOfBoundsException with Parquet write version v2 Key: PARQUET-246 URL: https://issues.apache.org/jira/browse/PARQUET-246 Project: Parquet Issue Type: Bug Affects Versions: 1.6.0 Reporter: Konstantin Shaposhnikov Fix For: 2.0.0 I am getting the following exception when reading a parquet file that was created using Avro WriteSupport and Parquet write version v2.0: {noformat} Caused by: parquet.io.ParquetDecodingException: Can't read value in column [colName, rows, array, name] BINARY at value 313601 out of 428260, 1 out of 39200 in currentPage. repetition level: 0, definition level: 2 at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462) at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364) at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209) ... 27 more Caused by: java.lang.ArrayIndexOutOfBoundsException at parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70) at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307) at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458) ... 
30 more {noformat} The file is quite big (500Mb) so I cannot upload it here, but possibly there is enough information in the exception message to understand the cause of error. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-309) Remove unnecessary compile dependency on parquet-generator
[ https://issues.apache.org/jira/browse/PARQUET-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-309: -- Assignee: Konstantin Shaposhnikov Remove unnecessary compile dependency on parquet-generator -- Key: PARQUET-309 URL: https://issues.apache.org/jira/browse/PARQUET-309 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.7.0 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Fix For: 1.8.0 parquet-generator is used during build time only. Other parquet-jars (e.g. parquet-encoding) should not depend on it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-309) Remove unnecessary compile dependency on parquet-generator
[ https://issues.apache.org/jira/browse/PARQUET-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-309. --- Resolution: Fixed Fix Version/s: 1.8.0 Remove unnecessary compile dependency on parquet-generator -- Key: PARQUET-309 URL: https://issues.apache.org/jira/browse/PARQUET-309 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.7.0 Reporter: Konstantin Shaposhnikov Fix For: 1.8.0 parquet-generator is used during build time only. Other parquet-jars (e.g. parquet-encoding) should not depend on it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590073#comment-14590073 ] Ryan Blue commented on PARQUET-41: -- Great, thanks [~Ferd]! Could you also tell us a bit more about how this works and the approach you're taking? At first glance, we need quite a bit more in the format to specify exactly what the structure means and how to use it. It would be good to discuss that here, too. Add bloom filters to parquet statistics --- Key: PARQUET-41 URL: https://issues.apache.org/jira/browse/PARQUET-41 Project: Parquet Issue Type: New Feature Components: parquet-format, parquet-mr Reporter: Alex Levenson Assignee: ferdinand xu Labels: filter2 For row groups with no dictionary, we could still produce a bloom filter. This could be very useful in filtering entire row groups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
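The row-group pruning idea behind this request can be sketched with a toy Bloom filter: if the filter reports a predicate value as definitely absent, the whole row group can be skipped without reading it. The hashing scheme below is illustrative only and is not what the Parquet format specifies.

```java
import java.util.BitSet;

// Toy Bloom filter illustrating row-group pruning: no false negatives,
// so a "definitely absent" answer lets a reader skip the row group.
// The double-hashing scheme here is an illustrative assumption.
public class RowGroupBloomSketch {
    static class Bloom {
        final BitSet bits;
        final int m; // number of bits
        final int k; // number of hash functions

        Bloom(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

        void add(String v) {
            for (int i = 0; i < k; i++) bits.set(index(v, i));
        }

        // False means "definitely not present"; true means "maybe present".
        boolean mightContain(String v) {
            for (int i = 0; i < k; i++) if (!bits.get(index(v, i))) return false;
            return true;
        }

        private int index(String v, int i) {
            int h = v.hashCode() * 31 + i * 0x9E3779B9; // simple mixed hash
            return Math.floorMod(h, m);
        }
    }

    public static void main(String[] args) {
        Bloom rowGroup = new Bloom(1 << 12, 3);
        rowGroup.add("alice");
        rowGroup.add("bob");
        System.out.println(rowGroup.mightContain("alice")); // true (never a false negative)
        // A value that was never added is *probably* reported absent,
        // which is what lets a scan drop the entire row group.
    }
}
```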
[jira] [Commented] (PARQUET-246) ArrayIndexOutOfBoundsException with Parquet write version v2
[ https://issues.apache.org/jira/browse/PARQUET-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590045#comment-14590045 ] Ryan Blue commented on PARQUET-246: --- Should we also update the read side so we can recover data written with this bug? Does this happen when reading the entire file, or just when reading from a middle row group in MR? ArrayIndexOutOfBoundsException with Parquet write version v2 Key: PARQUET-246 URL: https://issues.apache.org/jira/browse/PARQUET-246 Project: Parquet Issue Type: Bug Affects Versions: 1.6.0 Reporter: Konstantin Shaposhnikov Fix For: 2.0.0 I am getting the following exception when reading a parquet file that was created using Avro WriteSupport and Parquet write version v2.0: {noformat} Caused by: parquet.io.ParquetDecodingException: Can't read value in column [colName, rows, array, name] BINARY at value 313601 out of 428260, 1 out of 39200 in currentPage. repetition level: 0, definition level: 2 at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462) at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364) at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209) ... 27 more Caused by: java.lang.ArrayIndexOutOfBoundsException at parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70) at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307) at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458) ... 30 more {noformat} The file is quite big (500Mb) so I cannot upload it here, but possibly there is enough information in the exception message to understand the cause of error. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-39) Simplify ParquetReader's constructors
[ https://issues.apache.org/jira/browse/PARQUET-39?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-39. -- This was added in https://github.com/apache/parquet-mr/commit/ad32bf0fd111ab473ad1080cde11de39e3c5a67f Simplify ParquetReader's constructors - Key: PARQUET-39 URL: https://issues.apache.org/jira/browse/PARQUET-39 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Alex Levenson Assignee: Alex Levenson Priority: Minor Fix For: 1.6.0 ParquetReader has a lot of constructors. Maybe we should use the Builder pattern instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-293) ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame
[ https://issues.apache.org/jira/browse/PARQUET-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14563496#comment-14563496 ] Ryan Blue commented on PARQUET-293: --- [~lian cheng], could you take a look at this? Seems like your area of expertise. Do you think this should be a Spark issue instead of a Parquet issue? ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame Key: PARQUET-293 URL: https://issues.apache.org/jira/browse/PARQUET-293 Project: Parquet Issue Type: Bug Components: parquet-format Affects Versions: 1.6.0 Reporter: Tim Chan I get scala.ScalaReflectionException: none is not a term when I try to convert an RDD of Scrooge to a DataFrame, e.g. myScroogeRDD.toDF Has anyone else encountered this problem? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-151) Null Pointer exception in parquet.hadoop.ParquetFileWriter.mergeFooters
[ https://issues.apache.org/jira/browse/PARQUET-151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-151: -- Assignee: Yash Datta Null Pointer exception in parquet.hadoop.ParquetFileWriter.mergeFooters --- Key: PARQUET-151 URL: https://issues.apache.org/jira/browse/PARQUET-151 Project: Parquet Issue Type: Bug Reporter: Vladislav Kuzemchik Assignee: Yash Datta Hi! I'm getting null pointer exception when I'm trying to write parquet files with spark. {noformat} Dec 13, 2014 3:05:10 AM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for hdfs://phoenix-011.nym1.placeiq.net:8020/user/vkuzemchik/parquet_data/1789 java.lang.NullPointerException at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:426) at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:402) at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51) at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:936) at com.placeiq.spark.KafkaReader$.writeParquetHadoop(KafkaReader.scala:143) at com.placeiq.spark.KafkaReader$$anonfun$3.apply(KafkaReader.scala:165) at com.placeiq.spark.KafkaReader$$anonfun$3.apply(KafkaReader.scala:164) at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527) at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} Here is the function I'm using:
{code:title=Spark.scala|borderStyle=solid}
def writeParquetHadoop(rdd: RDD[(Void, LogMessage)]): Unit = {
  val jobConf = new JobConf(ssc.sparkContext.hadoopConfiguration)
  val job = new Job(jobConf)
  val outputDir = "hdfs://phoenix-011.nym1.placeiq.net:8020/user/vkuzemchik/parquet_data/"
  ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
  ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[LogMessage]])
  AvroParquetInputFormat.setAvroReadSchema(job, LogMessage.SCHEMA$)
  AvroParquetOutputFormat.setSchema(job, LogMessage.SCHEMA$)
  ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY)
  ParquetOutputFormat.setBlockSize(job, 536870912)
  job.setOutputKeyClass(classOf[Void])
  job.setOutputValueClass(classOf[LogMessage])
  job.setOutputFormatClass(classOf[ParquetOutputFormat[LogMessage]])
  job.getConfiguration.set("mapred.output.dir", outputDir + rdd.id)
  rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)
}
{code}
I have this issue on 1.5. Trying to reproduce on newer versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PARQUET-296) Set master branch version back to 1.8.0-SNAPSHOT
Ryan Blue created PARQUET-296: - Summary: Set master branch version back to 1.8.0-SNAPSHOT Key: PARQUET-296 URL: https://issues.apache.org/jira/browse/PARQUET-296 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.8.0 Reporter: Ryan Blue Fix For: 1.8.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-266) Add support for lists of primitives to Pig schema converter
[ https://issues.apache.org/jira/browse/PARQUET-266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561923#comment-14561923 ] Ryan Blue commented on PARQUET-266: --- [~dweeks-netflix] or [~julienledem], you guys are the reviewers for Pig patches, right? Add support for lists of primitives to Pig schema converter --- Key: PARQUET-266 URL: https://issues.apache.org/jira/browse/PARQUET-266 Project: Parquet Issue Type: Improvement Affects Versions: 1.5.0, 1.6.0 Reporter: Christian Rolf Priority: Minor Attachments: PigPrimitiveList-1.8.patch, PigPrimitiveList.patch Right now lists of primitives are not supported in Pig (exception thrown from the PigSchemaConverter.java, line 292 in Parquet 1.6). Patch converts Parquet-arrays of primitives into Pig-bags, the closest representation of an array in Pig. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-292) Release Parquet 1.8.0
[ https://issues.apache.org/jira/browse/PARQUET-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561927#comment-14561927 ] Ryan Blue commented on PARQUET-292: --- Adding PARQUET-265 instead of PARQUET-263. Release Parquet 1.8.0 - Key: PARQUET-292 URL: https://issues.apache.org/jira/browse/PARQUET-292 Project: Parquet Issue Type: Task Reporter: Alex Levenson Assignee: Alex Levenson -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-292) Release Parquet 1.8.0
[ https://issues.apache.org/jira/browse/PARQUET-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561951#comment-14561951 ] Ryan Blue commented on PARQUET-292: --- Adding PARQUET-201, which was a bug fix pushed out for the 1.6.0 release. We don't have a very good reason to push it out this time, so I'm marking it as a blocker. Release Parquet 1.8.0 - Key: PARQUET-292 URL: https://issues.apache.org/jira/browse/PARQUET-292 Project: Parquet Issue Type: Task Reporter: Alex Levenson Assignee: Alex Levenson -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-199) Add a callback when the MemoryManager adjusts row group size
[ https://issues.apache.org/jira/browse/PARQUET-199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-199. --- Resolution: Fixed Fix Version/s: 1.8.0 This was merged a few days ago, just forgot to close. Add a callback when the MemoryManager adjusts row group size Key: PARQUET-199 URL: https://issues.apache.org/jira/browse/PARQUET-199 Project: Parquet Issue Type: Bug Components: parquet-mr Reporter: Ryan Blue Assignee: Dong Chen Fix For: 1.8.0 Parquet Hive would like to increment a counter when the row group size is altered by the memory manager so that Hive can detect when there are memory problems and inform the user. I think the right way to do this is to provide a callback that will be triggered when the memory manager hits its limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-285) Implement nested types write rules in parquet-avro
[ https://issues.apache.org/jira/browse/PARQUET-285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-285. --- Resolution: Fixed Merged #198. Implement nested types write rules in parquet-avro -- Key: PARQUET-285 URL: https://issues.apache.org/jira/browse/PARQUET-285 Project: Parquet Issue Type: Bug Components: parquet-mr Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.8.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-251) Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-251: -- Fix Version/s: (was: 2.0.0) 1.8.0 Binary column statistics error when reuse byte[] among rows --- Key: PARQUET-251 URL: https://issues.apache.org/jira/browse/PARQUET-251 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.6.0 Reporter: Yijie Shen Assignee: Ashish K Singh Priority: Blocker Fix For: 1.8.0 I think it is a common practice when inserting table data as parquet file, one would always reuse the same object among rows, and if a column is byte[] of fixed length, the byte[] would also be reused. If I use ByteArrayBackedBinary for my byte[], the bug occurs: All of the row groups created by a single task would have the same max min binary value, just as the last row's binary content. The reason is BinaryStatistic just keep max min as parquet.io.api.Binary references, since I use ByteArrayBackedBinary for byte[], the real content of max min would always point to the reused byte[], therefore the latest row's content. Does parquet declare somewhere that the user shouldn't reuse byte[] for Binary type? If it doesn't, I think it's a bug and can be reproduced by [Spark SQL's RowWriteSupport |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354] The related Spark JIRA ticket: [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
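The aliasing hazard described in this report can be reproduced without Parquet at all. The sketch below (plain Java, names are illustrative) shows how recording a statistic as a reference to a caller-owned, reused byte[] lets a later row silently rewrite it, while a defensive copy stays correct:

```java
import java.util.Arrays;

// Demonstrates the hazard behind PARQUET-251: keeping min/max as a
// *reference* to a reused byte[] means the next row's write corrupts the
// recorded statistic; copying on update preserves it.
public class ReusedBufferStats {
    static String[] demo() {
        byte[] reused = "aaa".getBytes();

        byte[] maxByReference = reused;                          // buggy: aliases caller buffer
        byte[] maxByCopy = Arrays.copyOf(reused, reused.length); // safe: defensive copy

        // Caller reuses the buffer for the next row, as Spark's write path did.
        reused[0] = 'z';

        return new String[] { new String(maxByReference), new String(maxByCopy) };
    }

    public static void main(String[] args) {
        String[] stats = demo();
        System.out.println(stats[0]); // zaa - statistic silently corrupted
        System.out.println(stats[1]); // aaa - statistic preserved
    }
}
```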
[jira] [Resolved] (PARQUET-324) row count incorrect if data file has more than 2^31 rows
[ https://issues.apache.org/jira/browse/PARQUET-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-324. --- Resolution: Fixed Fix Version/s: 1.8.0 Thanks for contributing the fix, [~tfriedr]! row count incorrect if data file has more than 2^31 rows Key: PARQUET-324 URL: https://issues.apache.org/jira/browse/PARQUET-324 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.7.0, 1.8.0 Reporter: Thomas Friedrich Assignee: Thomas Friedrich Priority: Minor Fix For: 1.8.0 If a parquet file has more than 2^31 rows, the row count written into the file metadata is incorrect. The cause of the problem is the use of an int instead of a long data type for numRows in ParquetMetadataConverter.toParquetMetadata:
{code}
int numRows = 0;
for (BlockMetaData block : blocks) {
  numRows += block.getRowCount();
  addRowGroup(parquetMetadata, rowGroups, block);
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
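The truncation described above can be demonstrated in isolation (the block row counts below are made up for illustration): a compound assignment into an int silently narrows each long addend, so sums past 2^31 wrap around, while a long accumulator stays correct.

```java
// Reproduces the arithmetic bug: accumulating per-block row counts into an
// int wraps past 2^31 rows; a long accumulator does not. The compound
// assignment intTotal += rows compiles because Java inserts a narrowing cast.
public class RowCountOverflow {
    static int intSum(long[] counts) {
        int total = 0;
        for (long c : counts) total += c; // silent narrowing cast to int
        return total;
    }

    static long longSum(long[] counts) {
        long total = 0L;
        for (long c : counts) total += c;
        return total;
    }

    public static void main(String[] args) {
        long[] blockRowCounts = {1_500_000_000L, 1_500_000_000L}; // 3e9 rows total
        System.out.println(intSum(blockRowCounts));  // -1294967296 (wrapped)
        System.out.println(longSum(blockRowCounts)); // 3000000000
    }
}
```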
[jira] [Resolved] (PARQUET-320) Restore semver checks
[ https://issues.apache.org/jira/browse/PARQUET-320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-320. --- Resolution: Fixed Merged #230. Restore semver checks - Key: PARQUET-320 URL: https://issues.apache.org/jira/browse/PARQUET-320 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.7.0 Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.8.0 The exclusion for parquet-format classes was parquet/**, which evidently matches everything, even classes in org.apache.parquet. We need to remove that check and fix any problems that have cropped up since it was added. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-223) Add Map and List builders
[ https://issues.apache.org/jira/browse/PARQUET-223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-223. --- Resolution: Fixed Fix Version/s: 1.8.0 I committed this. Thanks for the contribution [~singhashish]! Add Map and List builders -- Key: PARQUET-223 URL: https://issues.apache.org/jira/browse/PARQUET-223 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Ashish K Singh Assignee: Ashish K Singh Fix For: 1.8.0 As of now, Parquet does not provide builders for Maps and Lists. This leaves room for user error. Having Map and List builders will make it easier for users to build these types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PARQUET-361) Add prerelease logic to semantic versions
Ryan Blue created PARQUET-361: - Summary: Add prerelease logic to semantic versions Key: PARQUET-361 URL: https://issues.apache.org/jira/browse/PARQUET-361 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.8.1 Reporter: Ryan Blue Fix For: 1.9.0 CDH is including fixes for PARQUET-251. That means we need to add the fixed versions to the logic that tests whether the fix is present, which requires proper semver handling of prerelease versions because CDH versions are formatted like this: 1.5.0-cdh5.5.0 / upstream-base-cdhcdh-release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-361) Add prerelease logic to semantic versions
[ https://issues.apache.org/jira/browse/PARQUET-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-361. --- Resolution: Fixed Assignee: Ryan Blue Add prerelease logic to semantic versions - Key: PARQUET-361 URL: https://issues.apache.org/jira/browse/PARQUET-361 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.8.1 Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.9.0 CDH is including fixes for PARQUET-251. That means we need to add the fixed versions to the logic that tests whether the fix is present, which requires proper semver handling of prerelease versions because CDH versions are formatted like this: 1.5.0-cdh5.5.0 / upstream-base-cdhcdh-release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
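A minimal, hypothetical sketch of the precedence rule involved (not parquet-mr's actual SemanticVersion class): per SemVer, a version with a prerelease tag such as 1.5.0-cdh5.5.0 sorts before the plain 1.5.0 release. Real SemVer compares prerelease fields one by one; this sketch simplifies that to a lexicographic comparison:

```java
// Hypothetical helper illustrating SemVer precedence with prerelease tags.
public class SemverSketch {
    static int compare(String a, String b) {
        String[] ap = a.split("-", 2);  // base version, optional prerelease tag
        String[] bp = b.split("-", 2);
        int base = compareBase(ap[0], bp[0]);
        if (base != 0) return base;
        boolean aPre = ap.length > 1, bPre = bp.length > 1;
        if (aPre && !bPre) return -1;   // prerelease sorts before the release
        if (!aPre && bPre) return 1;
        if (aPre && bPre) {
            // Simplification: real SemVer compares dot-separated fields.
            return ap[1].compareTo(bp[1]);
        }
        return 0;
    }

    private static int compareBase(String a, String b) {
        String[] as = a.split("\\.");
        String[] bs = b.split("\\.");
        for (int i = 0; i < Math.min(as.length, bs.length); i++) {
            int c = Integer.compare(Integer.parseInt(as[i]), Integer.parseInt(bs[i]));
            if (c != 0) return c;
        }
        return Integer.compare(as.length, bs.length);
    }

    public static void main(String[] args) {
        System.out.println(compare("1.5.0-cdh5.5.0", "1.5.0")); // negative: prerelease first
        System.out.println(compare("1.8.0", "1.7.0"));          // positive
    }
}
```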
[jira] [Resolved] (PARQUET-316) Run.sh is broken in parquet-benchmarks
[ https://issues.apache.org/jira/browse/PARQUET-316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-316. --- Resolution: Fixed Fix Version/s: 1.8.0 Merged Nezih's PR. Thanks for fixing this! Run.sh is broken in parquet-benchmarks -- Key: PARQUET-316 URL: https://issues.apache.org/jira/browse/PARQUET-316 Project: Parquet Issue Type: Bug Reporter: Nezih Yigitbasi Assignee: Nezih Yigitbasi Fix For: 1.8.0 With the package renaming (to org.apache.parquet) the run.sh script is now broken. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-146) make Parquet compile with java 7 instead of java 6
[ https://issues.apache.org/jira/browse/PARQUET-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608848#comment-14608848 ] Ryan Blue commented on PARQUET-146: --- We should discuss this on the mailing list. We've had recent contributions fixing support for java 6, so we definitely want to build consensus before deprecating support. make Parquet compile with java 7 instead of java 6 -- Key: PARQUET-146 URL: https://issues.apache.org/jira/browse/PARQUET-146 Project: Parquet Issue Type: Improvement Reporter: Julien Le Dem Labels: beginner, noob, pick-me-up currently Parquet is compatible with java 6. we should remove this constraint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PARQUET-320) Restore semver checks
Ryan Blue created PARQUET-320: - Summary: Restore semver checks Key: PARQUET-320 URL: https://issues.apache.org/jira/browse/PARQUET-320 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.7.0 Reporter: Ryan Blue Fix For: 1.8.0 The exclusion for parquet-format classes was parquet/**, which evidently matches everything, even classes in org.apache.parquet. We need to remove that check and fix any problems that have cropped up since it was added. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606674#comment-14606674 ] Ryan Blue commented on PARQUET-41: -- I should also point out there's a table on the first page that calculates the probability of at least one false-positive when querying multiple items. That's pretty useful to apply here. If we query for 10 items and the bloom filter's false-positive probability is 1%, then there is a 9.56% chance of reading a page that has none of the items. But if the actual FPP of that filter is 10% because of overloading, then we get a 65% probability when we were expecting 9.56%. Add bloom filters to parquet statistics --- Key: PARQUET-41 URL: https://issues.apache.org/jira/browse/PARQUET-41 Project: Parquet Issue Type: New Feature Components: parquet-format, parquet-mr Reporter: Alex Levenson Assignee: Ferdinand Xu Labels: filter2 For row groups with no dictionary, we could still produce a bloom filter. This could be very useful in filtering entire row groups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
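The quoted figures follow from the standard formula for the probability of at least one false positive across n independent lookups, P = 1 - (1 - p)^n; a quick check:

```java
// Probability of at least one false positive when querying n items against
// a bloom filter with per-query false-positive probability p.
public class MultiQueryFpp {
    static double atLeastOneFalsePositive(double p, int n) {
        return 1.0 - Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        // p = 1%, 10 queries -> ~9.56%; p = 10%, 10 queries -> ~65.13%
        System.out.printf("p=0.01, n=10 -> %.4f%n", atLeastOneFalsePositive(0.01, 10));
        System.out.printf("p=0.10, n=10 -> %.4f%n", atLeastOneFalsePositive(0.10, 10));
    }
}
```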
[jira] [Updated] (PARQUET-321) Set the HDFS padding default to 8MB
[ https://issues.apache.org/jira/browse/PARQUET-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-321: -- Summary: Set the HDFS padding default to 8MB (was: Set the HDFS padding default to 16MB) Set the HDFS padding default to 8MB --- Key: PARQUET-321 URL: https://issues.apache.org/jira/browse/PARQUET-321 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.8.0 PARQUET-306 added the ability to pad row groups so that they align with HDFS blocks to avoid remote reads. The ParquetFileWriter will now either pad the remaining space in the block or target a row group for the remaining size. The padding maximum controls the threshold of the amount of padding that will be used. If the space left is under this threshold, it is padded. If it is greater than this threshold, then the next row group is fit into the remaining space. The current padding maximum is 0. I think we should change the padding maximum to 8MB. My reasoning is this: we want this number to be small enough that it won't prevent the library from writing reasonable row groups, but larger than the minimum size row group we would want to write. 8MB is 1/16th of the row group default, so I think it is reasonable: we don't want a row group to be smaller than 8 MB. We also want this to be large enough that a few row groups in a block don't cause a tiny row group to be written in the excess space. 8MB accounts for 4 row groups that are 2MB under-size. In addition, it is reasonable to not allow row groups under 8MB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
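The decision described above can be sketched as follows (hypothetical helper names, not ParquetFileWriter's actual code):

```java
// Sketch of the padding decision from PARQUET-306/321: if the space left in
// the HDFS block is under the padding maximum, pad it; otherwise target the
// next row group at the remaining space so it stays block-aligned.
public class PaddingSketch {
    static final long MAX_PADDING = 8 * 1024 * 1024;       // 8MB default proposed here
    static final long ROW_GROUP_SIZE = 128 * 1024 * 1024;  // default row group target

    /** Returns the target size for the next row group, or 0 to pad the remainder. */
    static long nextRowGroupTarget(long remainingInBlock) {
        if (remainingInBlock < MAX_PADDING) {
            return 0;  // pad: too little space left for a reasonable row group
        }
        return Math.min(ROW_GROUP_SIZE, remainingInBlock);
    }

    public static void main(String[] args) {
        System.out.println(nextRowGroupTarget(4L * 1024 * 1024));   // pad (returns 0)
        System.out.println(nextRowGroupTarget(200L * 1024 * 1024)); // capped at row group size
    }
}
```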
[jira] [Created] (PARQUET-321) Set the HDFS padding default to 16MB
Ryan Blue created PARQUET-321: - Summary: Set the HDFS padding default to 16MB Key: PARQUET-321 URL: https://issues.apache.org/jira/browse/PARQUET-321 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.8.0 PARQUET-306 added the ability to pad row groups so that they align with HDFS blocks to avoid remote reads. The ParquetFileWriter will now either pad the remaining space in the block or target a row group for the remaining size. The padding maximum controls the threshold of the amount of padding that will be used. If the space left is under this threshold, it is padded. If it is greater than this threshold, then the next row group is fit into the remaining space. The current padding maximum is 0. I think we should change the padding maximum to 8MB. My reasoning is this: we want this number to be small enough that it won't prevent the library from writing reasonable row groups, but larger than the minimum size row group we would want to write. 8MB is 1/16th of the row group default, so I think it is reasonable: we don't want a row group to be smaller than 8 MB. We also want this to be large enough that a few row groups in a block don't cause a tiny row group to be written in the excess space. 8MB accounts for 4 row groups that are 2MB under-size. In addition, it is reasonable to not allow row groups under 8MB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-144) read a single file outside of mapreduce framework
[ https://issues.apache.org/jira/browse/PARQUET-144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14649334#comment-14649334 ] Ryan Blue commented on PARQUET-144: --- [~hy5446]: you can read files outside of MR using the ParquetReader with Scrooge read support. The constructor you want is here: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L64 The read support is what determines the object model the reader will use. Most object models have a convenience reader, but it looks like scrooge doesn't so you'll have to pass the right ReadSupport to the reader in your code. read a single file outside of mapreduce framework - Key: PARQUET-144 URL: https://issues.apache.org/jira/browse/PARQUET-144 Project: Parquet Issue Type: Test Components: parquet-mr Affects Versions: 1.6.0 Reporter: hy5446 Priority: Critical In my test I would like to read a file that has been written through Parquet + Scrooge. I would like to do it outside of map/reduce or hadoop. Something like this: val bytes = readFile(my file) val objects = deserializeWithParquetScrooge[MyObjectClass](bytes) Is something like this possible? How? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
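A minimal sketch of the approach described in the comment, assuming parquet-hadoop and parquet-scrooge are on the classpath; ScroogeReadSupport and the exact constructor shape should be verified against the linked source, and MyObjectClass stands for the user's Scrooge-generated class:

```java
// Sketch only, not tested: read a Scrooge-written file outside MapReduce by
// passing a ReadSupport directly to ParquetReader.
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.scrooge.ScroogeReadSupport;

ParquetReader<MyObjectClass> reader =
    new ParquetReader<MyObjectClass>(new Path("hdfs://.../my-file.parquet"),
                                     new ScroogeReadSupport<MyObjectClass>());
MyObjectClass record;
while ((record = reader.read()) != null) {
    // process record
}
reader.close();
```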
[jira] [Resolved] (PARQUET-144) read a single file outside of mapreduce framework
[ https://issues.apache.org/jira/browse/PARQUET-144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-144. --- Resolution: Not A Problem I'm resolving this as not a problem because it is a request for information and I think I've covered the question. In the future, you might have better luck getting information from the mailing list (dev@parquet.apache.org) because that's where we typically see this kind of question. read a single file outside of mapreduce framework - Key: PARQUET-144 URL: https://issues.apache.org/jira/browse/PARQUET-144 Project: Parquet Issue Type: Test Components: parquet-mr Affects Versions: 1.6.0 Reporter: hy5446 Priority: Critical In my test I would like to read a file that has been written through Parquet + Scrooge. I would like to do it outside of map/reduce or hadoop. Something like this: val bytes = readFile(my file) val objects = deserializeWithParquetScrooge[MyObjectClass](bytes) Is something like this possible? How? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-344) Limit the number of rows per block and per split
[ https://issues.apache.org/jira/browse/PARQUET-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644711#comment-14644711 ] Ryan Blue commented on PARQUET-344: --- [~QuentinFra], you can currently set the row group size and HDFS block size. That allows you to make smaller row groups and control the parallelism. * {{parquet.block.size}} - the target row group size, which we try to be slightly under * {{dfs.blocksize}} - sets the HDFS block size. Make this a whole-number multiple of the row group size Is that sufficient for your use case, or do you think that a limit in terms of number of rows would be better? We can certainly add that, but I'm not sure it's a good idea. When you set the row group size in bytes, you don't have to know what compression ratio you're going to get. Limit the number of rows per block and per split Key: PARQUET-344 URL: https://issues.apache.org/jira/browse/PARQUET-344 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Quentin Francois Original Estimate: 504h Remaining Estimate: 504h We use Parquet to store raw metrics data and then query this data with Hadoop-Pig. The issue is that sometimes we end up with small Parquet files (~80MB) that contain more than 300,000,000 rows, usually because of a constant metric, which results in very good compression. Too good. As a result we have a very small number of maps that process up to 10x more rows than the other maps, and we lose the benefits of the parallelization. The fix for that has two components I believe: 1. Be able to limit the number of rows per Parquet block (in addition to the size limit). 2. Be able to limit the number of rows per split. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
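For example, the two settings above could be passed to a Pig job on the command line (the values here are illustrative: 64MB row groups and 256MB HDFS blocks, a whole-number multiple):

```
# Illustrative values only
pig -Dparquet.block.size=67108864 -Ddfs.blocksize=268435456 myscript.pig
```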
[jira] [Commented] (PARQUET-347) Thrift projection does not handle new (optional) fields in requestedSchema
[ https://issues.apache.org/jira/browse/PARQUET-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14644955#comment-14644955 ] Ryan Blue commented on PARQUET-347: --- Seems like we should more generally take a look at what schema evolution changes are allowed and have tests for all of them. I'm planning on doing the same for Avro and it would be great to coordinate that so we know we can evolve an Avro schema and still read it in Thrift or vice versa. Thrift projection does not handle new (optional) fields in requestedSchema -- Key: PARQUET-347 URL: https://issues.apache.org/jira/browse/PARQUET-347 Project: Parquet Issue Type: Bug Components: parquet-mr Reporter: Alex Levenson It should be valid to request an optional field that is not present in a file (it should be assumed to be null) but instead this throws eagerly in: https://github.com/apache/parquet-mr/blob/d6f082b9be5d507ff60c6bc83a179cc44015ab97/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/api/ReadSupport.java#L58 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-355) Create Integration tests to validate statistics
[ https://issues.apache.org/jira/browse/PARQUET-355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662000#comment-14662000 ] Ryan Blue commented on PARQUET-355: --- [~sircodesalot], thanks for working on this! Can you describe the approach you're taking in the PR to ensure these are tested? Create Integration tests to validate statistics --- Key: PARQUET-355 URL: https://issues.apache.org/jira/browse/PARQUET-355 Project: Parquet Issue Type: Test Components: parquet-mr Reporter: Reuben Kuhnert Priority: Minor In response to [PARQUET-251|https://issues.apache.org/jira/browse/PARQUET-251] create unit tests that validate the statistics fields for each column type. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-358) Add support for temporal logical types to AVRO/Parquet conversion
[ https://issues.apache.org/jira/browse/PARQUET-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697252#comment-14697252 ] Ryan Blue commented on PARQUET-358: --- Thanks for opening an issue on this one, [~k.shaposhni...@gmail.com]. Avro is currently holding a vote for release 1.8.0, which adds support for date/time types and decimals. I was waiting on that to go through so we can build the parquet-avro support to match its behavior. I would be glad to have your help building this if you're interested! Add support for temporal logical types to AVRO/Parquet conversion - Key: PARQUET-358 URL: https://issues.apache.org/jira/browse/PARQUET-358 Project: Parquet Issue Type: Improvement Components: parquet-avro Affects Versions: 1.8.0 Reporter: Konstantin Shaposhnikov Both [AVRO|https://github.com/apache/avro/blob/trunk/doc/src/content/xdocs/spec.xml] and [Parquet|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] support logical types for dates, times and timestamps, however this information is not transferred from AVRO schema to Parquet schema during conversion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PARQUET-356) Add ElephantBird section to LICENSE file
Ryan Blue created PARQUET-356: - Summary: Add ElephantBird section to LICENSE file Key: PARQUET-356 URL: https://issues.apache.org/jira/browse/PARQUET-356 Project: Parquet Issue Type: Task Components: parquet-mr Affects Versions: 1.8.0, 1.8.1 Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.9.0 Commit [9993450|https://github.com/apache/parquet-mr/commit/9993450] brought in a section of [LzoRecordReader.java|https://github.com/twitter/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/LzoRecordReader.java#L124] from ElephantBird. The license for ElephantBird is ASL 2.0 so the inclusion is fine. We just need to add it to the root LICENSE file because it is included in the source distribution and in the parquet-thrift binary LICENSE file because it is in that binary package. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PARQUET-335) Avro object model should not require MAP_KEY_VALUE
Ryan Blue created PARQUET-335: - Summary: Avro object model should not require MAP_KEY_VALUE Key: PARQUET-335 URL: https://issues.apache.org/jira/browse/PARQUET-335 Project: Parquet Issue Type: Bug Components: parquet-avro Affects Versions: 1.8.0 Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.9.0 The Avro object model currently includes a check that requires maps to use MAP_KEY_VALUE to annotate the repeated key_value group. This is not required by the map type spec and should be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-327) Show statistics in the dump output
[ https://issues.apache.org/jira/browse/PARQUET-327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-327: -- Fix Version/s: (was: 1.8.0) 1.9.0 Show statistics in the dump output -- Key: PARQUET-327 URL: https://issues.apache.org/jira/browse/PARQUET-327 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.7.0 Reporter: Tom White Assignee: Tom White Fix For: 1.9.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-288) Add dictionary support to Avro converters
[ https://issues.apache.org/jira/browse/PARQUET-288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-288: -- Fix Version/s: (was: 1.8.0) Add dictionary support to Avro converters - Key: PARQUET-288 URL: https://issues.apache.org/jira/browse/PARQUET-288 Project: Parquet Issue Type: Improvement Components: parquet-avro Affects Versions: 1.7.0 Reporter: Ryan Blue -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-337) binary fields inside map/set/list are not handled in parquet-scrooge
[ https://issues.apache.org/jira/browse/PARQUET-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-337: -- Assignee: Jake Donham binary fields inside map/set/list are not handled in parquet-scrooge Key: PARQUET-337 URL: https://issues.apache.org/jira/browse/PARQUET-337 Project: Parquet Issue Type: Bug Reporter: Jake Donham Assignee: Jake Donham Binary fields inside map/set/list are not handled; using them produces a ScroogeSchemaConversionException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-339) Add Alex Levenson to KEYS file
[ https://issues.apache.org/jira/browse/PARQUET-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631811#comment-14631811 ] Ryan Blue commented on PARQUET-339: --- I'm fine just pushing changes like this, though we should probably have consensus on it. Add Alex Levenson to KEYS file -- Key: PARQUET-339 URL: https://issues.apache.org/jira/browse/PARQUET-339 Project: Parquet Issue Type: Task Reporter: Alex Levenson Assignee: Alex Levenson Fix For: 1.8.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PARQUET-332) Incompatible changes in o.a.p.thrift.projection
Ryan Blue created PARQUET-332: - Summary: Incompatible changes in o.a.p.thrift.projection Key: PARQUET-332 URL: https://issues.apache.org/jira/browse/PARQUET-332 Project: Parquet Issue Type: Bug Components: parquet-mr Reporter: Ryan Blue Fix For: 1.8.0 There are incompatible changes in o.a.p.thrift.projection that weren't caught because of PARQUET-330: * The return type of [{{FieldsPath#push(ThriftField)}} changed|https://github.com/apache/parquet-mr/commit/ded56ffd598e41e32817f6c1b091595fe7122e8b#diff-e990fead0bb1a6faa5080efba86bc81fL34] ([return type compatibility ref|https://docs.oracle.com/javase/specs/jls/se7/html/jls-13.html#jls-13.4.15]) * [{{FieldProjectionFilter}} changed to an interface|https://github.com/apache/parquet-mr/commit/7fc7998398373a14b4cdc0ce18abdeb221b1ccf9#diff-49628343f8d6daf6cb774b6c6ccab82cL29] Both of these are incompatibilities if {{FieldProjectionFilter}} is part of the public API, which it appears to be because it is used by the ScroogeReadSupport and the ThriftSchemaConverter (public constructor). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-241) ParquetInputFormat.getFooters() should return in the same order as what listStatus() returns
[ https://issues.apache.org/jira/browse/PARQUET-241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-241: -- Assignee: Mingyu Kim > ParquetInputFormat.getFooters() should return in the same order as what > listStatus() returns > > > Key: PARQUET-241 > URL: https://issues.apache.org/jira/browse/PARQUET-241 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Mingyu Kim >Assignee: Mingyu Kim > Fix For: 1.9.0 > > > Because of how the footer cache is implemented, getFooters() returns the > footers in a different order than what listStatus() returns. > When I provided url > "hdfs://.../part-1.parquet,hdfs://.../part-2.parquet,hdfs://.../part-3.parquet", > ParquetInputFormat.getSplits(), which internally calls getFooters(), > returned the splits in a wrong order. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-241) ParquetInputFormat.getFooters() should return in the same order as what listStatus() returns
[ https://issues.apache.org/jira/browse/PARQUET-241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-241. --- Resolution: Fixed Merged #164. Thanks [~mkim] for the contribution! (And sorry this took so long. Next time, feel free to ping the mailing list to remind us!) > ParquetInputFormat.getFooters() should return in the same order as what > listStatus() returns > > > Key: PARQUET-241 > URL: https://issues.apache.org/jira/browse/PARQUET-241 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Mingyu Kim >Assignee: Mingyu Kim > Fix For: 1.9.0 > > > Because of how the footer cache is implemented, getFooters() returns the > footers in a different order than what listStatus() returns. > When I provided url > "hdfs://.../part-1.parquet,hdfs://.../part-2.parquet,hdfs://.../part-3.parquet", > ParquetInputFormat.getSplits(), which internally calls getFooters(), > returned the splits in a wrong order. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-241) ParquetInputFormat.getFooters() should return in the same order as what listStatus() returns
[ https://issues.apache.org/jira/browse/PARQUET-241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-241: -- Fix Version/s: 1.9.0 > ParquetInputFormat.getFooters() should return in the same order as what > listStatus() returns > > > Key: PARQUET-241 > URL: https://issues.apache.org/jira/browse/PARQUET-241 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Mingyu Kim > Fix For: 1.9.0 > > > Because of how the footer cache is implemented, getFooters() returns the > footers in a different order than what listStatus() returns. > When I provided url > "hdfs://.../part-1.parquet,hdfs://.../part-2.parquet,hdfs://.../part-3.parquet", > ParquetInputFormat.getSplits(), which internally calls getFooters(), > returned the splits in a wrong order. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
[ https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-369. --- Resolution: Fixed Assignee: Ryan Blue Fix Version/s: format-2.3.1 > Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder > --- > > Key: PARQUET-369 > URL: https://issues.apache.org/jira/browse/PARQUET-369 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Cheng Lian >Assignee: Ryan Blue > Fix For: format-2.3.1 > > > Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see > [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]). > This also accidentally shades [this > line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207] > {code} > private static String STATIC_LOGGER_BINDER_PATH = > "org/slf4j/impl/StaticLoggerBinder.class"; > {code} > to > {code} > private static String STATIC_LOGGER_BINDER_PATH = > "parquet/org/slf4j/impl/StaticLoggerBinder.class"; > {code} > and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} > implementation even if we provide dependencies like {{slf4j-log4j12}} on the > classpath. > This happens in Spark. Whenever we write a Parquet file, we see the following > famous message and can never get rid of it: > {noformat} > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
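One way to avoid the problem is to stop relocating org.slf4j in the shade configuration, so the string constant "org/slf4j/impl/StaticLoggerBinder.class" is left intact; a hypothetical maven-shade-plugin fragment (the actual fix shipped in format-2.3.1 may differ):

```
<!-- Illustrative only: keep shading other dependencies but leave SLF4J alone. -->
<relocations>
  <relocation>
    <!-- hypothetical dependency that is still shaded -->
    <pattern>com.example.shadedep</pattern>
    <shadedPattern>parquet.com.example.shadedep</shadedPattern>
  </relocation>
  <!-- no <relocation> entry for org.slf4j: SLF4J stays unshaded -->
</relocations>
```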
[jira] [Commented] (PARQUET-241) ParquetInputFormat.getFooters() should return in the same order as what listStatus() returns
[ https://issues.apache.org/jira/browse/PARQUET-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977329#comment-14977329 ] Ryan Blue commented on PARQUET-241: --- [~skonto], I think that most formats are consistent by accident, but that consistency isn't guaranteed. This would probably make the collect result in Spark more consistent. > ParquetInputFormat.getFooters() should return in the same order as what > listStatus() returns > > > Key: PARQUET-241 > URL: https://issues.apache.org/jira/browse/PARQUET-241 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Mingyu Kim > > Because of how the footer cache is implemented, getFooters() returns the > footers in a different order than what listStatus() returns. > When I provided url > "hdfs://.../part-1.parquet,hdfs://.../part-2.parquet,hdfs://.../part-3.parquet", > ParquetInputFormat.getSplits(), which internally calls getFooters(), > returned the splits in a wrong order. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-241) ParquetInputFormat.getFooters() should return in the same order as what listStatus() returns
[ https://issues.apache.org/jira/browse/PARQUET-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978699#comment-14978699 ] Ryan Blue commented on PARQUET-241: --- Building 1.7.0 shouldn't make a difference because this issue is still unresolved. There are specs for Parquet, but nothing that covers this behavior. The order of listStatus probably depends on the order files were created, like most file systems. This would only make it so that the order of footers is the same as the order of the file status array. > ParquetInputFormat.getFooters() should return in the same order as what > listStatus() returns > > > Key: PARQUET-241 > URL: https://issues.apache.org/jira/browse/PARQUET-241 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Mingyu Kim > > Because of how the footer cache is implemented, getFooters() returns the > footers in a different order than what listStatus() returns. > When I provided url > "hdfs://.../part-1.parquet,hdfs://.../part-2.parquet,hdfs://.../part-3.parquet", > ParquetInputFormat.getSplits(), which internally calls getFooters(), > returned the splits in a wrong order. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-389) Filter predicates should work with missing columns
[ https://issues.apache.org/jira/browse/PARQUET-389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978702#comment-14978702 ] Ryan Blue commented on PARQUET-389: --- I agree, assuming that by "merged" you mean resolving the requested schema against different file schemas. > Filter predicates should work with missing columns > -- > > Key: PARQUET-389 > URL: https://issues.apache.org/jira/browse/PARQUET-389 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.6.0, 1.7.0, 1.8.0 >Reporter: Cheng Lian > > This issue originates from SPARK-11103, which contains detailed information > about how to reproduce it. > The major problem here is that, filter predicates pushed down assert that > columns they touch must exist in the target physical files. But this isn't > true in case of schema merging. > Actually this assertion is unnecessary, because if a column is missing in the > filter schema, the column is considered to be filled by nulls, and all the > filters should be able to act accordingly. For example, if we push down {{a = > 1}} but {{a}} is missing in the underlying physical file, all records in this > file should be dropped since {{a}} is always null. On the other hand, if we > push down {{a IS NULL}}, all records should be preserved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
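For illustration, the two predicates discussed map to parquet-mr's filter2 API roughly as follows (a sketch, not tested here; in filter2, eq with a null value expresses "is null"):

```java
// Sketch using parquet-mr's filter2 API. Per this issue, if column "a" is
// missing from a file's schema, eq(a, 1) should drop every record in that
// file (a is always null), while eq(a, null) should keep them all,
// instead of throwing eagerly.
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators.IntColumn;

IntColumn a = FilterApi.intColumn("a");
FilterPredicate eqOne  = FilterApi.eq(a, 1);     // drops all records if "a" is absent
FilterPredicate isNull = FilterApi.eq(a, null);  // keeps all records if "a" is absent
```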
[jira] [Commented] (PARQUET-140) Allow clients to control the GenericData object that is used to read Avro records
[ https://issues.apache.org/jira/browse/PARQUET-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978707#comment-14978707 ] Ryan Blue commented on PARQUET-140: --- [~DeaconDesperado], you are correct. This allows you to use generic classes instead of specific by specifying GenericData instead of SpecificData or ReflectData. > Allow clients to control the GenericData object that is used to read Avro > records > - > > Key: PARQUET-140 > URL: https://issues.apache.org/jira/browse/PARQUET-140 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Josh Wills >Assignee: Josh Wills > Fix For: 1.6.0 > > > Right now, Parquet always uses the default SpecificData instance (retrieved > by SpecificData.get()) to lookup the schemas for SpecificRecord subclasses. > Unfortunately, if the definition of the SpecificRecord subclass is not > available to the classloader used in SpecificData.get(), we will fail to find > the definition of the SpecificRecord subclass and will fall back to returning > a GenericRecord, which will cause a ClassCastException in any client code > that is expecting an instance of the SpecificRecord subclass. > We can fix this limitation by allowing the client code to specify how to > construct a custom instance of SpecificData (or any other subclass of > GenericData) for Parquet to use, including instances of SpecificData that use > alternative classloaders. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
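A sketch of the configuration this enables, assuming parquet-avro's AvroDataSupplier mechanism (class names per parquet-avro; verify against the version in use):

```java
// Sketch only: select the Avro data model used for reading by supplying
// GenericData instead of the default SpecificData, so records come back
// as GenericRecord rather than SpecificRecord subclasses.
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.avro.GenericDataSupplier;

Configuration conf = new Configuration();
AvroReadSupport.setAvroDataSupplier(conf, GenericDataSupplier.class);
// pass conf to AvroParquetReader (or the job) so the reader uses GenericData
```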
[jira] [Commented] (PARQUET-391) Parquet build fails with thrift9 profile
[ https://issues.apache.org/jira/browse/PARQUET-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999133#comment-14999133 ] Ryan Blue commented on PARQUET-391: --- I think this is a duplicate of PARQUET-380. There's a PR with a fix here: https://github.com/apache/parquet-mr/pull/276 Is it okay with you if I close this and track it on the other issue? > Parquet build fails with thrift9 profile > - > > Key: PARQUET-391 > URL: https://issues.apache.org/jira/browse/PARQUET-391 > Project: Parquet > Issue Type: Bug >Reporter: Yash Datta > > compile parquet build using: > mvn clean install -Pthrift9 -DskipTests > build fails in parquet-cascading project : > [INFO] - > [ERROR] COMPILATION ERROR : > [INFO] - > [ERROR] > /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[10,32] > package org.apache.thrift.scheme does not exist > [ERROR] > /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[11,32] > package org.apache.thrift.scheme does not exist > [ERROR] > /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[12,32] > package org.apache.thrift.scheme does not exist > [ERROR] > /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[14,32] > package org.apache.thrift.scheme does not exist > [ERROR] > /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[15,34] > cannot find symbol > symbol: class TTupleProtocol > location: package org.apache.thrift.protocol > [ERROR] > /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[40,44] > cannot find symbol > symbol: class IScheme > location: class parquet.thrift.test.Name > [ERROR] > 
/mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[40,54] > cannot find symbol > symbol: class SchemeFactory > location: class parquet.thrift.test.Name > [ERROR] > /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[395,61] > cannot find symbol > symbol: class SchemeFactory > location: class parquet.thrift.test.Name > [ERROR] > /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[401,51] > cannot find symbol > symbol: class StandardScheme > location: class parquet.thrift.test.Name > [ERROR] > /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[462,58] > cannot find symbol > symbol: class SchemeFactory > location: class parquet.thrift.test.Name > [ERROR] > /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[468,48] > cannot find symbol -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-124) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
[ https://issues.apache.org/jira/browse/PARQUET-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996949#comment-14996949 ] Ryan Blue commented on PARQUET-124: --- [~swethakasireddy], it looks like this wasn't completely addressed by the fix above. [~terrasect] had a problem with it as well. Would one of you be willing to open a new issue for the current problem? Then we can work on getting it fixed. Thanks! > parquet.hadoop.ParquetOutputCommitter.commitJob() throws > parquet.io.ParquetEncodingException > > > Key: PARQUET-124 > URL: https://issues.apache.org/jira/browse/PARQUET-124 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.6.0 >Reporter: Chris Albright >Priority: Minor > Fix For: 1.6.0 > > Attachments: PARQUET-124-test > > > I'm running an example combining Avro, Spark and Parquet > (https://github.com/massie/spark-parquet-example), and in the process of > updating the library versions, am getting the warning below. > The version of Parquet-Hadoop in the original example is 1.0.0. I am using > 1.6.0rc3 > The ParquetFileWriter.mergeFooters(Path, List) method is performing a > check to ensure the footers are all for files in the output directory. The > output directory is supplied by ParquetFileWriter.writeMetadataFile; in > 1.0.0, the output path was converted to a fully qualified output path before > the call to mergeFooters, but in 1.6.0rc[2,3] that conversion happens after > the call to mergeFooters. Because of this, the check within merge footers is > failing (the URI for the footers starts with file:, but the URI for the > root path does not) > Here is the warning message and stacktrace. 
> Oct 30, 2014 9:11:31 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could > not write summary file for /tmp/1414728690018-0/output > parquet.io.ParquetEncodingException: > file:/tmp/1414728690018-0/output/part-r-0.parquet invalid: all the files > must be contained in the root /tmp/1414728690018-0/output > at > parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422) > at > parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398) > at > parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:50) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:936) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:832) > at > com.zenfractal.SparkParquetExample$.main(SparkParquetExample.scala:72) > at com.zenfractal.SparkParquetExample.main(SparkParquetExample.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
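The path-qualification mismatch behind this stacktrace comes down to a scheme mismatch between the footer paths and the root path. The sketch below only illustrates that mismatch with a plain string comparison; it is not Parquet's actual mergeFooters code, and the helper name is hypothetical:

```java
// Illustration (not Parquet's actual code) of why mergeFooters rejects the
// files: the footer paths are fully qualified ("file:/tmp/...") while the
// root path is still scheme-less ("/tmp/..."), so a containment check that
// compares the two can never match.
public class PathPrefixCheck {
    static boolean isContainedIn(String root, String footerPath) {
        return footerPath.startsWith(root + "/");
    }

    public static void main(String[] args) {
        String root = "/tmp/1414728690018-0/output";                          // unqualified
        String footer = "file:/tmp/1414728690018-0/output/part-r-0.parquet";  // qualified

        System.out.println(isContainedIn(root, footer));  // false: scheme mismatch

        // Qualifying the root first, as 1.0.0 did before calling mergeFooters,
        // makes the same check pass.
        System.out.println(isContainedIn("file:" + root, footer));  // true
    }
}
```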
[jira] [Commented] (PARQUET-390) GroupType.union(Type toMerge, boolean strict) does not honor strict parameter
[ https://issues.apache.org/jira/browse/PARQUET-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997083#comment-14997083 ] Ryan Blue commented on PARQUET-390: --- You're right that my suggestion is a much larger issue. For this problem, I'm fine with fixing the union function, though I'd like to see it fixed and tested rather than just tweaked, if that sounds reasonable. > GroupType.union(Type toMerge, boolean strict) does not honor strict parameter > - > > Key: PARQUET-390 > URL: https://issues.apache.org/jira/browse/PARQUET-390 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: Michael Allman > Labels: newbie, parquet > > This is the code as it currently stands in master: > {code} > @Override > protected Type union(Type toMerge, boolean strict) { > if (toMerge.isPrimitive()) { > throw new IncompatibleSchemaModificationException("can not merge > primitive type " + toMerge + " into group type " + this); > } > return new GroupType(toMerge.getRepetition(), getName(), > mergeFields(toMerge.asGroupType())); > } > {code} > Note the call to {{mergeFields}} omits the {{strict}} parameter. I believe > the code should be: > {code} > @Override > protected Type union(Type toMerge, boolean strict) { > if (toMerge.isPrimitive()) { > throw new IncompatibleSchemaModificationException("can not merge > primitive type " + toMerge + " into group type " + this); > } > return new GroupType(toMerge.getRepetition(), getName(), > mergeFields(toMerge.asGroupType(), strict)); > } > {code} > Note the call to {{mergeFields}} includes the {{strict}} parameter. > I would work on this myself, but I'm having considerable trouble working with > the codebase (see e.g. > http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure). > Given the (assumed) simplicity of the fix, can a seasoned Parquet > contributor take this up? Cheers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-380) Cascading and scrooge builds fail when using thrift 0.9.0
[ https://issues.apache.org/jira/browse/PARQUET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009063#comment-15009063 ] Ryan Blue commented on PARQUET-380: --- There are build failures from thrift's SLF4J dependency. I just need to have some time to work through it. > Cascading and scrooge builds fail when using thrift 0.9.0 > - > > Key: PARQUET-380 > URL: https://issues.apache.org/jira/browse/PARQUET-380 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.8.0 >Reporter: Ryan Blue >Assignee: Ryan Blue > Fix For: 1.9.0 > > > This is caused by a transitive dependency on libthrift 0.7.0 from > elephantbird. The solution is to add thrift as an explicit (but provided) > dependency to those projects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009099#comment-15009099 ] Ryan Blue commented on PARQUET-41: -- [~Ferd], I think we need a design doc for this feature and some data about it before building an implementation. There are still some unknowns that I don't think we have worked through in the design yet. I don't think the current approach that mirrors ORC is appropriate because we don't know the number of unique values in pages and the filters are very sensitive to over-filling. > Add bloom filters to parquet statistics > --- > > Key: PARQUET-41 > URL: https://issues.apache.org/jira/browse/PARQUET-41 > Project: Parquet > Issue Type: New Feature > Components: parquet-format, parquet-mr >Reporter: Alex Levenson >Assignee: Ferdinand Xu > Labels: filter2 > > For row groups with no dictionary, we could still produce a bloom filter. > This could be very useful in filtering entire row groups. > Pull request: > https://github.com/apache/parquet-mr/pull/215 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
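The sensitivity to over-filling mentioned in that comment follows directly from the standard false-positive formula for a Bloom filter with m bits, k hash functions, and n inserted values: fpp = (1 - e^(-kn/m))^k. A quick calculation (the sizes below are illustrative, not parameters proposed anywhere in this thread) shows how fast a filter degrades when the real number of unique values exceeds the design estimate:

```java
// Expected false-positive probability of a Bloom filter: (1 - e^(-k*n/m))^k,
// for m bits, k hash functions, and n distinct inserted values. The numbers
// below are illustrative only.
public class BloomFpp {
    static double fpp(long m, int k, long n) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 64 * 1024; // an 8 KiB filter = 65536 bits
        int k = 7;
        // Sized for ~6,000 unique values: well under 1% false positives.
        System.out.printf("n = 6000:  fpp = %.4f%n", fpp(m, k, 6_000));
        // Over-filled 10x: the filter answers "maybe" almost every time.
        System.out.printf("n = 60000: fpp = %.4f%n", fpp(m, k, 60_000));
    }
}
```

Because pages don't carry a unique-value count, n has to be guessed, which is exactly the failure mode the comment describes.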
[jira] [Updated] (PARQUET-390) GroupType.union(Type toMerge, boolean strict) does not honor strict parameter
[ https://issues.apache.org/jira/browse/PARQUET-390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-390: -- Labels: newbie parquet (was: parquet) > GroupType.union(Type toMerge, boolean strict) does not honor strict parameter > - > > Key: PARQUET-390 > URL: https://issues.apache.org/jira/browse/PARQUET-390 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: Michael Allman > Labels: newbie, parquet > > This is the code as it currently stands in master: > {code} > @Override > protected Type union(Type toMerge, boolean strict) { > if (toMerge.isPrimitive()) { > throw new IncompatibleSchemaModificationException("can not merge > primitive type " + toMerge + " into group type " + this); > } > return new GroupType(toMerge.getRepetition(), getName(), > mergeFields(toMerge.asGroupType())); > } > {code} > Note the call to {{mergeFields}} omits the {{strict}} parameter. I believe > the code should be: > {code} > @Override > protected Type union(Type toMerge, boolean strict) { > if (toMerge.isPrimitive()) { > throw new IncompatibleSchemaModificationException("can not merge > primitive type " + toMerge + " into group type " + this); > } > return new GroupType(toMerge.getRepetition(), getName(), > mergeFields(toMerge.asGroupType(), strict)); > } > {code} > Note the call to {{mergeFields}} includes the {{strict}} parameter. > I would work on this myself, but I'm having considerable trouble working with > the codebase (see e.g. > http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure). > Given the (assumed) simplicity of the fix, can a seasoned Parquet > contributor take this up? Cheers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-390) GroupType.union(Type toMerge, boolean strict) does not honor strict parameter
[ https://issues.apache.org/jira/browse/PARQUET-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989939#comment-14989939 ] Ryan Blue commented on PARQUET-390: --- Thanks for the bug report, Michael. I think you're right about this. Could you share with us what you're using this for? This was originally used for building an overall schema for the files in a job, but it isn't necessary to do that and we mostly removed the need to in PARQUET-139. I'd like to see what your use case for it is. Thanks! > GroupType.union(Type toMerge, boolean strict) does not honor strict parameter > - > > Key: PARQUET-390 > URL: https://issues.apache.org/jira/browse/PARQUET-390 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: Michael Allman > Labels: parquet > > This is the code as it currently stands in master: > {code} > @Override > protected Type union(Type toMerge, boolean strict) { > if (toMerge.isPrimitive()) { > throw new IncompatibleSchemaModificationException("can not merge > primitive type " + toMerge + " into group type " + this); > } > return new GroupType(toMerge.getRepetition(), getName(), > mergeFields(toMerge.asGroupType())); > } > {code} > Note the call to {{mergeFields}} omits the {{strict}} parameter. I believe > the code should be: > {code} > @Override > protected Type union(Type toMerge, boolean strict) { > if (toMerge.isPrimitive()) { > throw new IncompatibleSchemaModificationException("can not merge > primitive type " + toMerge + " into group type " + this); > } > return new GroupType(toMerge.getRepetition(), getName(), > mergeFields(toMerge.asGroupType(), strict)); > } > {code} > Note the call to {{mergeFields}} includes the {{strict}} parameter. > I would work on this myself, but I'm having considerable trouble working with > the codebase (see e.g. > http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure). 
> Given the (assumed) simplicity of the fix, can a seasoned Parquet > contributor take this up? Cheers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-373) MemoryManager tests are flaky
[ https://issues.apache.org/jira/browse/PARQUET-373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-373. --- Resolution: Fixed > MemoryManager tests are flaky > - > > Key: PARQUET-373 > URL: https://issues.apache.org/jira/browse/PARQUET-373 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.8.0 >Reporter: Ryan Blue >Assignee: Ryan Blue > Fix For: 1.9.0 > > > The memory manager tests are flaky, depending on the heap allocation for the > JVM they run in. This is caused by over-specific tests that assert the memory > allocation down to the byte and the fact that some assertions implicitly cast > long values to doubles to use the "within" form of assertEquals. > The tests should not validate a specific allocation strategy, but should > instead assert that: > 1. The allocation for a file is the row group size until room runs out > 2. When scaling row groups, the total allocation does not exceed the pool size -- This message was sent by Atlassian JIRA (v6.3.4#6332)
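The implicit cast the description mentions is enough to make two different allocations compare equal: the "within" form of assertEquals takes doubles, and a double cannot represent every long above 2^53. A plain-Java illustration of the mechanism (not the actual test code):

```java
// Why casting longs to double in assertEquals(double, double, delta) can
// hide real differences: doubles have a 53-bit significand, so distinct
// longs above 2^53 can round to the same double value.
public class LongDoublePrecision {
    public static void main(String[] args) {
        long a = (1L << 53) + 1; // 9007199254740993
        long b = 1L << 53;       // 9007199254740992

        System.out.println(a == b);                   // false: the longs differ
        System.out.println((double) a == (double) b); // true: both round to 2^53
    }
}
```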
[jira] [Updated] (PARQUET-246) ArrayIndexOutOfBoundsException with Parquet write version v2
[ https://issues.apache.org/jira/browse/PARQUET-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-246: -- Fix Version/s: (was: 2.0.0) 1.8.0 ArrayIndexOutOfBoundsException with Parquet write version v2 Key: PARQUET-246 URL: https://issues.apache.org/jira/browse/PARQUET-246 Project: Parquet Issue Type: Bug Affects Versions: 1.6.0 Reporter: Konstantin Shaposhnikov Fix For: 1.8.0 I am getting the following exception when reading a parquet file that was created using Avro WriteSupport and Parquet write version v2.0: {noformat} Caused by: parquet.io.ParquetDecodingException: Can't read value in column [colName, rows, array, name] BINARY at value 313601 out of 428260, 1 out of 39200 in currentPage. repetition level: 0, definition level: 2 at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462) at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364) at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209) ... 27 more Caused by: java.lang.ArrayIndexOutOfBoundsException at parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70) at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307) at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458) ... 30 more {noformat} The file is quite big (500Mb) so I cannot upload it here, but possibly there is enough information in the exception message to understand the cause of error. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-246) ArrayIndexOutOfBoundsException with Parquet write version v2
[ https://issues.apache.org/jira/browse/PARQUET-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14620919#comment-14620919 ] Ryan Blue commented on PARQUET-246: --- The {{parquet.split.files}} option will read all files sequentially. You'll get one task per file instead of one task per input split (HDFS block). The reason is that we can't detect this situation while calculating splits without reading the file metadata to determine what version of Parquet wrote the file and whether it uses the delta byte array encoding. That would mean reading the footers on the task side, which is a bottleneck that we just fixed in PARQUET-139. Basically, reading the footers to plan splits doesn't scale well enough. So the compromise is to detect when a job would read corrupt data and fail those tasks with a message that tells you how to avoid the problem. It isn't ideal, but luckily this encoding wasn't very widely used. ArrayIndexOutOfBoundsException with Parquet write version v2 Key: PARQUET-246 URL: https://issues.apache.org/jira/browse/PARQUET-246 Project: Parquet Issue Type: Bug Affects Versions: 1.6.0 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Fix For: 1.8.0 I am getting the following exception when reading a parquet file that was created using Avro WriteSupport and Parquet write version v2.0: {noformat} Caused by: parquet.io.ParquetDecodingException: Can't read value in column [colName, rows, array, name] BINARY at value 313601 out of 428260, 1 out of 39200 in currentPage. repetition level: 0, definition level: 2 at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462) at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364) at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209) ... 
27 more Caused by: java.lang.ArrayIndexOutOfBoundsException at parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70) at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307) at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458) ... 30 more {noformat} The file is quite big (500Mb) so I cannot upload it here, but possibly there is enough information in the exception message to understand the cause of error. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-246) ArrayIndexOutOfBoundsException with Parquet write version v2
[ https://issues.apache.org/jira/browse/PARQUET-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-246. --- Resolution: Fixed Assignee: Konstantin Shaposhnikov Closing this now that the read side has a fix. Thanks Konstantin, Sergio, Alex, and Tianshuo for all your work getting this resolved! ArrayIndexOutOfBoundsException with Parquet write version v2 Key: PARQUET-246 URL: https://issues.apache.org/jira/browse/PARQUET-246 Project: Parquet Issue Type: Bug Affects Versions: 1.6.0 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Fix For: 1.8.0 I am getting the following exception when reading a parquet file that was created using Avro WriteSupport and Parquet write version v2.0: {noformat} Caused by: parquet.io.ParquetDecodingException: Can't read value in column [colName, rows, array, name] BINARY at value 313601 out of 428260, 1 out of 39200 in currentPage. repetition level: 0, definition level: 2 at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462) at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364) at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209) ... 27 more Caused by: java.lang.ArrayIndexOutOfBoundsException at parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70) at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307) at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458) ... 30 more {noformat} The file is quite big (500Mb) so I cannot upload it here, but possibly there is enough information in the exception message to understand the cause of error. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-380) Cascading and scrooge builds fail when using thrift 0.9.0
[ https://issues.apache.org/jira/browse/PARQUET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009838#comment-15009838 ] Ryan Blue commented on PARQUET-380: --- When I add the dependency for libthrift, I get an error somewhere in cascading that there is no StaticLoggerBinder for SLF4J. That's an easy fix: add a binder like slf4j-nop or slf4j-simple. But, when I add slf4j-simple:1.7.5, I get: {code} SLF4J: The requested version 1.6.99 by your slf4j binding is not compatible with [1.5.5, 1.5.6, 1.5.7, 1.5.8] . . . java.lang.NoSuchMethodError: org.slf4j.helpers.MessageFormatter.format(Ljava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)Lorg/slf4j/helpers/FormattingTuple; at org.slf4j.impl.SimpleLogger.formatAndLog(SimpleLogger.java:414) at org.slf4j.impl.SimpleLogger.info(SimpleLogger.java:546) {code} The version of SLF4J that is pulled in by libthrift is too old to work with a new binding. But, using a slf4j-simple version that works with the older version of thrift causes failures in the hadoop-2 profile because Hadoop pulls in a version of SLF4J that isn't compatible with the older slf4j-simple. So the fix is to pull in the new version of both slf4j-api and slf4j-simple that matches the hadoop-2 version. In the default profile, it overrides the transitive SLF4J dependency from libthrift and everything works. This is only needed for test dependencies, allowing downstream projects to use whatever version of the SLF4J API they need, which will override the old one in libthrift. I've pushed a new version that should work, I'll commit it after CI tests pass. > Cascading and scrooge builds fail when using thrift 0.9.0 > - > > Key: PARQUET-380 > URL: https://issues.apache.org/jira/browse/PARQUET-380 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.8.0 >Reporter: Ryan Blue >Assignee: Ryan Blue > Fix For: 1.9.0 > > > This is caused by a transitive dependency on libthrift 0.7.0 from > elephantbird. 
The solution is to add thrift as an explicit (but provided) > dependency to those projects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
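The dependency override described in the comment above would look roughly like this in the affected modules' POMs. This is a hedged sketch: the artifact coordinates are real SLF4J artifacts, but the version is a placeholder chosen to represent a hadoop-2-era SLF4J and may not match what was actually committed.

```xml
<!-- Sketch of the SLF4J override described above; the version shown is a
     placeholder, not necessarily the one committed to parquet-mr. Test
     scope keeps the pinned versions out of the dependencies seen by
     downstream projects. -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <version>1.7.5</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-simple</artifactId>
  <version>1.7.5</version>
  <scope>test</scope>
</dependency>
```

Declaring the dependency directly takes precedence over the old transitive version from libthrift in Maven's nearest-wins resolution, which is why this works in both the default and hadoop-2 profiles.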
[jira] [Resolved] (PARQUET-380) Cascading and scrooge builds fail when using thrift 0.9.0
[ https://issues.apache.org/jira/browse/PARQUET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-380. --- Resolution: Fixed Fixed. Thanks for the push, [~saucam]! > Cascading and scrooge builds fail when using thrift 0.9.0 > - > > Key: PARQUET-380 > URL: https://issues.apache.org/jira/browse/PARQUET-380 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.8.0 >Reporter: Ryan Blue >Assignee: Ryan Blue > Fix For: 1.9.0 > > > This is caused by a transitive dependency on libthrift 0.7.0 from > elephantbird. The solution is to add thrift as an explicit (but provided) > dependency to those projects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-344) Limit the number of rows per block and per split
[ https://issues.apache.org/jira/browse/PARQUET-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712108#comment-14712108 ] Ryan Blue commented on PARQUET-344: --- Thanks Quentin! I like Dan's idea of limiting the raw data size as a way to control this that isn't exposed to users. If you are willing to build a patch for that, thank you! Limit the number of rows per block and per split Key: PARQUET-344 URL: https://issues.apache.org/jira/browse/PARQUET-344 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Quentin Francois We use Parquet to store raw metrics data and then query this data with Hadoop-Pig. The issue is that sometimes we end up with small Parquet files (~80 MB) that contain more than 300 000 000 rows, usually because of a constant metric which results in very good compression. Too good. As a result we have very few maps that process up to 10x more rows than the other maps and we lose the benefits of parallelization. The fix for that has two components I believe: 1. Be able to limit the number of rows per Parquet block (in addition to the size limit). 2. Be able to limit the number of rows per split. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PARQUET-373) MemoryManager tests are flaky
Ryan Blue created PARQUET-373: - Summary: MemoryManager tests are flaky Key: PARQUET-373 URL: https://issues.apache.org/jira/browse/PARQUET-373 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.8.0 Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.9.0 The memory manager tests are flaky, depending on the heap allocation for the JVM they run in. This is caused by over-specific tests that assert the memory allocation down to the byte and the fact that some assertions implicitly cast long values to doubles to use the "within" form of assertEquals. The tests should not validate a specific allocation strategy, but should instead assert that: 1. The allocation for a file is the row group size until room runs out 2. When scaling row groups, the total allocation does not exceed the pool size -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-335) Avro object model should not require MAP_KEY_VALUE
[ https://issues.apache.org/jira/browse/PARQUET-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-335. --- Resolution: Fixed > Avro object model should not require MAP_KEY_VALUE > -- > > Key: PARQUET-335 > URL: https://issues.apache.org/jira/browse/PARQUET-335 > Project: Parquet > Issue Type: Bug > Components: parquet-avro >Affects Versions: 1.8.0 >Reporter: Ryan Blue >Assignee: Ryan Blue > Fix For: 1.9.0 > > > The Avro object model currently includes a check that requires maps to use > MAP_KEY_VALUE to annotate the repeated key_value group. This is not required > by the map type spec and should be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PARQUET-372) Parquet stats can have awkwardly large values
Ryan Blue created PARQUET-372: - Summary: Parquet stats can have awkwardly large values Key: PARQUET-372 URL: https://issues.apache.org/jira/browse/PARQUET-372 Project: Parquet Issue Type: Bug Components: parquet-format, parquet-mr Reporter: Ryan Blue If a column is storing very large values, say 2-4 MB, then the page header's min and max values can also be this large. It is wasteful to keep that much data in a page header, so we should look at options for decreasing the size required in these cases. One idea is to truncate the size of binary data and change the last byte to 0xFF (max) or 0x00 (min) to get a roughly equivalent min and max that isn't huge. This probably has some problems when the data stores multi-byte characters in UTF8 so we have to be careful and look into byte-wise sorting and UTF8. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
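The truncation idea in the description can be sketched in a few lines. This is an illustration of the concept only, not a proposed implementation, and (as the description itself warns) it ignores the multi-byte UTF-8 complications:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of the idea above: cap a large max value at `len` bytes and force
// the last kept byte to 0xFF so the result still sorts at or above the
// original in unsigned byte order (unless that prefix byte was already 0xFF).
// Illustration only; multi-byte UTF-8 needs more care, as noted above.
public class StatsTruncation {
    static byte[] truncateMax(byte[] value, int len) {
        if (value.length <= len) return value;
        byte[] out = Arrays.copyOf(value, len);
        out[len - 1] = (byte) 0xFF;
        return out;
    }

    // Lexicographic comparison on unsigned byte values.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] huge = "a-very-long-maximum-value".getBytes(StandardCharsets.US_ASCII);
        byte[] bounded = truncateMax(huge, 4);

        System.out.println(bounded.length);                       // 4: small enough for a header
        System.out.println(compareUnsigned(bounded, huge) >= 0);  // true: still an upper bound
    }
}
```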
[jira] [Commented] (PARQUET-379) PrimitiveType.union erases original type
[ https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933846#comment-14933846 ] Ryan Blue commented on PARQUET-379: --- I think this is part of a larger issue of handling schema evolution. The main use case I know of for union is merging file schemas into a metadata summary file. Those are no longer really needed because each schema is resolved against the requested schema individually on the reader, which eliminates the bottle-neck that the metadata file was intended to avoid. And as you note, union doesn't really create a union as one might expect: a schema that can be used to read both of the input schemas. > PrimitiveType.union erases original type > > > Key: PARQUET-379 > URL: https://issues.apache.org/jira/browse/PARQUET-379 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Cheng Lian > > The following ScalaTest test case > {code} > test("merge primitive types") { > val expected = > Types.buildMessage() > .addField( > Types > .required(INT32) > .as(DECIMAL) > .precision(7) > .scale(2) > .named("f")) > .named("root") > assert(expected.union(expected) === expected) > } > {code} > produces the following assertion error > {noformat} > message root { > required int32 f; > } > did not equal message root { > required int32 f (DECIMAL(9,0)); > } > {noformat} > This is because {{PrimitiveType.union}} doesn't handle original type > properly. An open question is that, can two primitive types with the same > primitive type name but different original types be unioned? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
[ https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906920#comment-14906920 ] Ryan Blue commented on PARQUET-369: --- I should also note: I've verified that there are no org.slf4j.* classes in the shaded parquet-format jar (they are now shaded.parquet.org.slf4j) and I decompiled LoggerFactory and verified that the reference to StaticLoggerBinder.class is unmodified. > Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder > --- > > Key: PARQUET-369 > URL: https://issues.apache.org/jira/browse/PARQUET-369 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Cheng Lian > > Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see > [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]). > This also accidentally shades [this > line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207] > {code} > private static String STATIC_LOGGER_BINDER_PATH = > "org/slf4j/impl/StaticLoggerBinder.class"; > {code} > to > {code} > private static String STATIC_LOGGER_BINDER_PATH = > "parquet/org/slf4j/impl/StaticLoggerBinder.class"; > {code} > and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} > implementation even if we provide dependencies like {{slf4j-log4j12}} on the > classpath. > This happens in Spark. Whenever we write a Parquet file, we see the following > famous message and can never get rid of it: > {noformat} > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
[ https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906993#comment-14906993 ] Ryan Blue commented on PARQUET-369: --- I've updated the PR to shade slf4j-nop and confirmed that everything still works, but the warning is gone. > Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder > --- > > Key: PARQUET-369 > URL: https://issues.apache.org/jira/browse/PARQUET-369 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Cheng Lian > > Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see > [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]). > This also accidentally shades [this > line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207] > {code} > private static String STATIC_LOGGER_BINDER_PATH = > "org/slf4j/impl/StaticLoggerBinder.class"; > {code} > to > {code} > private static String STATIC_LOGGER_BINDER_PATH = > "parquet/org/slf4j/impl/StaticLoggerBinder.class"; > {code} > and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} > implementation even if we provide dependencies like {{slf4j-log4j12}} on the > classpath. > This happens in Spark. Whenever we write a Parquet file, we see the following > famous message and can never get rid of it: > {noformat} > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-383) ParquetOutputCommitter should propagate errors when writing metadata files
[ https://issues.apache.org/jira/browse/PARQUET-383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907123#comment-14907123 ] Ryan Blue commented on PARQUET-383: --- I think this is a good idea. I'd make the error fatal only if the user opted to use the metadata file. I'd suggest that we not write the metadata file by default, but I don't think that's an option without a major version bump because it could break users. > ParquetOutputCommitter should propagate errors when writing metadata files > -- > > Key: PARQUET-383 > URL: https://issues.apache.org/jira/browse/PARQUET-383 > Project: Parquet > Issue Type: Improvement > Reporter: Alex Levenson > Priority: Minor > > There are a lot of different ways the output committer can fail, or fail to > roll back after failing to write metadata files. We should decide whether > metadata files are required and fatal (I think that's reasonable if the user > asked for them), and propagate exceptions without squashing them.
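The propagate-vs-squash choice being discussed can be sketched as a small guard. Everything here is hypothetical, not the real ParquetOutputCommitter code: the method shape and the opt-in boolean are invented for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class MetadataCommitSketch {
    // Hypothetical sketch: a metadata-file failure is fatal only when the
    // user asked for the metadata file; otherwise it is logged and the job
    // still succeeds, since the data files themselves are intact.
    static boolean commitMetadata(boolean userRequestedMetadata) {
        try {
            // Stand-in for the real write, which can fail in many ways.
            throw new IOException("failed to write _metadata");
        } catch (IOException e) {
            if (userRequestedMetadata) {
                // Propagate instead of swallowing the exception.
                throw new UncheckedIOException(e);
            }
            System.out.println("warn: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Metadata not requested: the failure is logged, not fatal.
        System.out.println(commitMetadata(false));
    }
}
```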
[jira] [Commented] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
[ https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906986#comment-14906986 ] Ryan Blue commented on PARQUET-369: --- Ignore my comment above; I just tested the partial relocation and it doesn't work because of references back to some of the moved classes. It looks like we can either ship parquet-format with an slf4j-api dependency or bundle it with a logger implementation, like slf4j-nop. I don't think there is much interesting information being logged by Thrift, plus those messages have been suppressed for at least the last release without complaints. I suggest we add slf4j-nop and shade that to avoid the warning.
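Bundling a no-op binding as proposed amounts to adding one dependency and letting the existing relocation carry it into the shaded jar. A hedged sketch of the pom fragment (the version shown matches the slf4j release referenced in the issue description, but is illustrative):

```xml
<!-- Sketch: bundle and shade a no-op binding alongside the relocated
     slf4j-api so the relocated LoggerFactory finds a StaticLoggerBinder
     and the startup warning disappears. -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-nop</artifactId>
  <version>1.7.2</version>
</dependency>
```

Because the relocation rewrites both the api and the nop binding consistently, the relocated lookup path and the relocated binder resource match again, which is why this silences the warning instead of fixing the lookup string.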
[jira] [Created] (PARQUET-382) Add a way to append encoded blocks in ParquetFileWriter
Ryan Blue created PARQUET-382: - Summary: Add a way to append encoded blocks in ParquetFileWriter Key: PARQUET-382 URL: https://issues.apache.org/jira/browse/PARQUET-382 Project: Parquet Issue Type: New Feature Components: parquet-mr Affects Versions: 1.8.0 Reporter: Ryan Blue Assignee: Ryan Blue Concatenating two files together currently requires reading the source files and rewriting the content from scratch. This ends up taking a lot of memory, even if the data is already encoded correctly and blocks just need to be appended and have their metadata updated. Merging two files should be fast and not take much memory.
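The intended usage can be sketched against ParquetFileWriter. The `appendFile` call shown reflects the API this issue's fix added to parquet-mr, but the signatures here are from memory and unverified; treat this as a sketch to check against the javadoc, not the final API.

```java
// Sketch: fast concatenation by appending already-encoded row groups,
// assuming a ParquetFileWriter.appendFile(Configuration, Path) method.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.schema.MessageType;

public class ConcatParquet {
    public static void concat(Configuration conf, MessageType schema,
                              Path out, Path... inputs) throws Exception {
        ParquetFileWriter writer = new ParquetFileWriter(conf, schema, out);
        writer.start();
        for (Path input : inputs) {
            // Copies the encoded row groups byte-for-byte; only the footer
            // metadata is rewritten, so memory use stays small.
            writer.appendFile(conf, input);
        }
        writer.end(java.util.Collections.emptyMap());
    }
}
```

All input files must share the schema passed to the writer, since the row groups are copied without re-encoding.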
[jira] [Assigned] (PARQUET-372) Parquet stats can have awkwardly large values
[ https://issues.apache.org/jira/browse/PARQUET-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue reassigned PARQUET-372: - Assignee: Ryan Blue > Parquet stats can have awkwardly large values > - > > Key: PARQUET-372 > URL: https://issues.apache.org/jira/browse/PARQUET-372 > Project: Parquet > Issue Type: Bug > Components: parquet-format, parquet-mr > Reporter: Ryan Blue > Assignee: Ryan Blue > > If a column is storing very large values, say 2-4 MB, then the page header's > min and max values can also be this large.
[jira] [Updated] (PARQUET-372) Parquet stats can have awkwardly large values
[ https://issues.apache.org/jira/browse/PARQUET-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-372: -- Description: If a column is storing very large values, say 2-4 MB, then the page header's min and max values can also be this large. (was: If a column is storing very large values, say 2-4 MB, then the page header's min and max values can also be this large. It is wasteful to keep that much data in a page header, so we should look at options for decreasing the size required in these cases. One idea is to truncate the size of binary data and change the last byte to 0xFF (max) or 0x00 (min) to get a roughly equivalent min and max that isn't huge. This probably has some problems when the data stores multi-byte characters in UTF8 so we have to be careful and look into byte-wise sorting and UTF8.)
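The truncation idea from the original description (shorten the binary value and bump the last kept byte to 0xFF for a max) can be sketched as follows. `truncateMax` is a hypothetical helper, not parquet-mr code, and a real implementation must handle the caveats the description raises:

```java
import java.util.Arrays;

public class StatsTruncation {
    // Hypothetical sketch: shorten a large max statistic while keeping it an
    // upper bound under unsigned byte-wise comparison. Real code must handle
    // a kept prefix that is already all 0xFF (bumping then cannot help) and
    // must avoid splitting UTF-8 multi-byte characters.
    static byte[] truncateMax(byte[] max, int keep) {
        if (max.length <= keep) {
            return max; // already small enough, keep the exact value
        }
        byte[] bound = Arrays.copyOf(max, keep);
        // 0xFF in the last kept position dominates any continuation of the
        // true max whose byte at this position was below 0xFF.
        bound[keep - 1] = (byte) 0xFF;
        return bound;
    }

    public static void main(String[] args) {
        // A 2 MB value shrinks to a 16-byte upper bound for the page header.
        byte[] bound = truncateMax(new byte[2 * 1024 * 1024], 16);
        System.out.println(bound.length);
    }
}
```

The min side is simpler: a plain prefix of the true min is already a lower bound byte-wise, so no padding byte is needed.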
[jira] [Commented] (PARQUET-34) Add support for repeated columns in the filter2 API
[ https://issues.apache.org/jira/browse/PARQUET-34?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036944#comment-15036944 ] Ryan Blue commented on PARQUET-34: -- [~f.pompermaier], I don't think anyone has extra cycles to spend implementing this right now, but if you are interested in building it, we'll work with you to get it reviewed and included. I think the next step is to write up what you think needs to be done so we can look at it and help you in the right direction. It may be that the disconnect between Alex's comment that this would be easy and your assessment comes down to a different level of support. > Add support for repeated columns in the filter2 API > --- > > Key: PARQUET-34 > URL: https://issues.apache.org/jira/browse/PARQUET-34 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr > Reporter: Alex Levenson > Priority: Minor > Labels: filter2 > > They are currently not supported. They would need their own set of operators, > like contains() and size(), etc.
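The semantics the issue suggests for repeated columns can be illustrated with plain predicates. These operators are NOT part of the real filter2 API; they are hypothetical stand-ins showing what contains() and size() would mean over a repeated (list-valued) column:

```java
import java.util.List;
import java.util.function.Predicate;

public class RepeatedColumnPredicates {
    // Hypothetical contains(): matches records whose repeated column holds
    // at least one occurrence of the given value.
    static <T> Predicate<List<T>> contains(T value) {
        return column -> column.contains(value);
    }

    // Hypothetical size(): matches records whose repeated column has exactly
    // the given number of elements.
    static Predicate<List<?>> size(int expected) {
        return column -> column.size() == expected;
    }

    public static void main(String[] args) {
        List<String> tags = List.of("a", "b", "c");
        System.out.println(contains("b").test(tags) && size(3).test(tags));
    }
}
```

The real work in filter2 would be pushing these down to the record-assembly and statistics layers rather than evaluating them on materialized lists, which is why the change is larger than these one-liners suggest.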
[jira] [Resolved] (PARQUET-382) Add a way to append encoded blocks in ParquetFileWriter
[ https://issues.apache.org/jira/browse/PARQUET-382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-382. --- Resolution: Fixed Fix Version/s: 1.9.0 Merged #278. Thanks for reviewing, Sergio!
[jira] [Commented] (PARQUET-402) Apache Pig cannot store Map data type into Parquet format
[ https://issues.apache.org/jira/browse/PARQUET-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051521#comment-15051521 ] Ryan Blue commented on PARQUET-402: --- Is there anything we can do about it? Maybe we should at least throw an exception when the map type passed to Parquet can't be converted to a valid Parquet schema because the key/value types are missing. > Apache Pig cannot store Map data type into Parquet format > - > > Key: PARQUET-402 > URL: https://issues.apache.org/jira/browse/PARQUET-402 > Project: Parquet > Issue Type: Bug > Components: parquet-pig > Affects Versions: 1.6.0, 1.8.1 > Reporter: Jerry Ylilammi > > Trying to store a simple map with two entries gives me the following exception: > {code}table_with_map_data: {my_map: map[]} > 2015-12-10 11:58:54,478 [main] INFO > org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is > deprecated. Instead, use fs.defaultFS > 2015-12-10 11:58:54,498 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2999: Unexpected internal error. Invalid map Schema, schema should contain > exactly one field: my_map: map{code} > For example, taking any input and doing this gives me the exception: > {code}table_with_map_data = FOREACH random_data GENERATE TOMAP('123', > 'hello', '456', 'world') as (my_map); > DESCRIBE table_with_map_data; > STORE table_with_map_data INTO '...' USING ParquetStorer();{code} > I'm using the latest version of Pig: Apache Pig version 0.15.0 (r1682971), > compiled Jun 01 2015, 11:44:35, > and Parquet: parquet-pig-bundle-1.6.0.jar. > EDIT: I noticed Parquet 1.8.1 is out. I switched to it and was forced to > update the Pig script to use the full path for ParquetStorer. However, this gives > me the same error as 1.6.0. > {code}STORE table_with_map_data INTO > '/Users/jerry/tmp/parquet/output/parquet' USING > org.apache.parquet.pig.ParquetStorer();{code}
[jira] [Updated] (PARQUET-393) release parquet-format 2.3.1
[ https://issues.apache.org/jira/browse/PARQUET-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-393: -- Summary: release parquet-format 2.3.1 (was: release parquet-format 2.4.0) > release parquet-format 2.3.1 > > > Key: PARQUET-393 > URL: https://issues.apache.org/jira/browse/PARQUET-393 > Project: Parquet > Issue Type: Task > Reporter: Julien Le Dem
[jira] [Updated] (PARQUET-346) ThriftSchemaConverter throws for unknown struct or union type
[ https://issues.apache.org/jira/browse/PARQUET-346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-346: -- Fix Version/s: (was: 2.0.0) 1.9.0 > ThriftSchemaConverter throws for unknown struct or union type > - > > Key: PARQUET-346 > URL: https://issues.apache.org/jira/browse/PARQUET-346 > Project: Parquet > Issue Type: Bug > Components: parquet-mr > Reporter: Alex Levenson > Assignee: Alex Levenson > Fix For: 1.9.0 > > > ThriftSchemaConverter should either only be called on ThriftStructs that > have populated structOrUnionType metadata, or should support a mode where > this data is unknown without throwing an exception. > Currently it is called using the file's metadata here: > https://github.com/apache/parquet-mr/blob/d6f082b9be5d507ff60c6bc83a179cc44015ab97/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftRecordConverter.java#L797 > One workaround is to not use the file metadata here but rather the schema > from the Thrift class. The other is to support unknown struct or union types.
[jira] [Commented] (PARQUET-405) Backwards-incompatible change to thrift metadata
[ https://issues.apache.org/jira/browse/PARQUET-405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057013#comment-15057013 ] Ryan Blue commented on PARQUET-405: --- Thanks, Ben! Both for reporting the issue and for helping us keep the issues organized. > Backwards-incompatible change to thrift metadata > > > Key: PARQUET-405 > URL: https://issues.apache.org/jira/browse/PARQUET-405 > Project: Parquet > Issue Type: Bug > Affects Versions: 1.8.0 > Reporter: Ben Kirwin > > Sometime in the last few versions, a {{structOrUnionType}} field has been added > to the {{thrift.descriptor}} written to the Parquet header: > {code} > { > "children": [ ... ], > "id": "STRUCT", > "structOrUnionType": "STRUCT" > } > {code} > The current release now throws an exception when that field is missing (or > {{UNKNOWN}}). This makes it impossible to read back thrift data written using > a previous release.