[jira] [Commented] (PARQUET-268) Build is failing with parquet-scrooge errors.

2015-04-29 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520532#comment-14520532
 ] 

Ryan Blue commented on PARQUET-268:
---

I'm going to do the downgrade and ignore the failing tests. We know that the 
library works right as long as Scrooge does, so I think it is reasonable. I'll 
ping you on the PR for review.

 Build is failing with parquet-scrooge errors.
 -

 Key: PARQUET-268
 URL: https://issues.apache.org/jira/browse/PARQUET-268
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.6.0
Reporter: Ryan Blue
 Fix For: 1.6.1


 The build is currently failing for all PRs in Travis CI. According to Alex:
 bq. . . . one of the scrooge dependencies transitively pulled in a snapshot 
 that has since been purged. Seems like that dependency was improperly 
 published. Upgrading the scrooge plugin should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-268) Build is failing with parquet-scrooge errors.

2015-04-29 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-268.
---
Resolution: Fixed
  Assignee: Ryan Blue

 Build is failing with parquet-scrooge errors.
 -

 Key: PARQUET-268
 URL: https://issues.apache.org/jira/browse/PARQUET-268
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.6.0
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.6.1


 The build is currently failing for all PRs in Travis CI. According to Alex:
 bq. . . . one of the scrooge dependencies transitively pulled in a snapshot 
 that has since been purged. Seems like that dependency was improperly 
 published. Upgrading the scrooge plugin should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PARQUET-270) Add legend to parquet-tools readme.md

2015-04-29 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue reassigned PARQUET-270:
-

Assignee: Ryan Blue

 Add legend to parquet-tools readme.md
 -

 Key: PARQUET-270
 URL: https://issues.apache.org/jira/browse/PARQUET-270
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Brett Stime
Assignee: Ryan Blue
Priority: Trivial

 Improve the documentation for parquet-tools by describing the output in more 
 detail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-280) Please create a DOAP file for your TLP

2015-05-14 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-280.
---
Resolution: Fixed
  Assignee: Julien Le Dem

Thanks, Julien!

 Please create a DOAP file for your TLP
 --

 Key: PARQUET-280
 URL: https://issues.apache.org/jira/browse/PARQUET-280
 Project: Parquet
  Issue Type: Task
Reporter: Sebb
Assignee: Julien Le Dem

 Please can you set up a DOAP for your project and get it added to files.xml?
 See http://projects.apache.org/create.html
 Once you have created the DOAP, please submit it for inclusion in the Apache 
 projects listing as per:
 http://projects.apache.org/create.html#submit
 Remember, if you ever move or rename the doap file in future, please
 ensure that files.xml is updated to point to the new location.
 It is recommended that the DOAP is published with the website, e.g. at
 http://parquet.apache.org/doap_Parquet.rdf
 as this URL is unlikely to change.
 Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-253) AvroSchemaConverter has confusing Javadoc

2015-05-15 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-253.
---
Resolution: Fixed

Merged #173. Thanks!

 AvroSchemaConverter has confusing Javadoc
 -

 Key: PARQUET-253
 URL: https://issues.apache.org/jira/browse/PARQUET-253
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.5.0, 1.6.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor

 Got confused by the original Javadoc at first and didn't realize 
 {{AvroSchemaConverter}} is also capable of converting a Parquet schema to an 
 Avro schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-98) filter2 API performance regression

2015-05-19 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551393#comment-14551393
 ] 

Ryan Blue commented on PARQUET-98:
--

[~phraktle], to save you some time, the 1.7.0 release will also have this 
problem. I'll find some time to look into it further.

 filter2 API performance regression
 --

 Key: PARQUET-98
 URL: https://issues.apache.org/jira/browse/PARQUET-98
 Project: Parquet
  Issue Type: Bug
Reporter: Viktor Szathmáry

 The new filter API seems to be much slower (or perhaps I'm using it wrong \:)
 Code using an UnboundRecordFilter:
 {code:java}
 ColumnRecordFilter.column(column,
 ColumnPredicates.applyFunctionToBinary(
 input -> Binary.fromString(value).equals(input)));
 {code}
 vs. code using FilterPredicate:
 {code:java}
 eq(binaryColumn(column), Binary.fromString(value));
 {code}
 The latter performs twice as slow on the same Parquet file (built using 
 1.6.0rc2).
 Note: the reader is constructed using
 {code:java}
 ParquetReader.builder(new ProtoReadSupport().withFilter(filter).build()
 {code}
 The new filter API based approach seems to create a whole lot more garbage 
 (perhaps due to reconstructing all the rows?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL

2015-06-05 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574894#comment-14574894
 ] 

Ryan Blue commented on PARQUET-222:
---

[~phatak.dev]: the problem is probably the number of files you're trying to 
write to at once. Each file buffers to the Parquet row group size (set by 
parquet.block.size, defaults to 128MB). If you have 10 files open for a 
processor, that's ~1.3GB and Spark already uses quite a bit of memory itself.
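For reference, a minimal sketch of lowering the row group size for a write job (using the pre-1.7.0 package name, parquet.hadoop, to match the version in this report; the 64MB figure is only an example):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import parquet.hadoop.ParquetOutputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// parquet.block.size is the number of bytes buffered per open file before a row
// group is flushed; halving the 128MB default roughly halves the write-side
// memory needed per open file, at some cost to scan efficiency.
ParquetOutputFormat.setBlockSize(job, 64 * 1024 * 1024);
{code}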

[~lian cheng], any ideas since you're the most familiar with how Spark writes 
from data frames? Is it possible to shuffle the data to have only one open file 
per executor at a time?

 parquet writer runs into OOM during writing when calling 
 DataFrame.saveAsParquetFile in Spark SQL
 -

 Key: PARQUET-222
 URL: https://issues.apache.org/jira/browse/PARQUET-222
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.6.0
Reporter: Chaozhong Yang
   Original Estimate: 336h
  Remaining Estimate: 336h

 In Spark SQL, there is a function `saveAsParquetFile` in DataFrame or 
 SchemaRDD. That function calls into parquet-mr, and sometimes it will 
 fail due to an OOM error thrown by parquet-mr. We can see the exception 
 stack trace as follows:
 {noformat}
 WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 
 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: 
 Java heap space
 at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
 at parquet.column.values.dictionary.IntList.init(IntList.java:83)
 at parquet.column.values.dictionary.DictionaryValuesWriter.init(DictionaryValuesWriter.java:85)
 at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.init(DictionaryValuesWriter.java:549)
 at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
 at parquet.column.impl.ColumnWriterImpl.init(ColumnWriterImpl.java:74)
 at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
 at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
 at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnIO.java:178)
 at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
 at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
 at parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:94)
 at parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64)
 at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
 at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
 at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304)
 at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
 at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 {noformat}
 By the way, there is another similar issue 
 https://issues.apache.org/jira/browse/PARQUET-99. But the reporter has closed 
 it and marked it as resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-314) Fix broken equals implementation(s)

2015-06-22 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-314.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

Merged. Thanks for catching this and fixing it, [~nezihyigitbasi]!

 Fix broken equals implementation(s)
 ---

 Key: PARQUET-314
 URL: https://issues.apache.org/jira/browse/PARQUET-314
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.0
Reporter: Nezih Yigitbasi
Assignee: Nezih Yigitbasi
Priority: Minor
 Fix For: 1.8.0


 The equals implementation in ColumnDescriptor and Statistics classes are 
 broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2015-06-23 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597906#comment-14597906
 ] 

Ryan Blue commented on PARQUET-41:
--

Interesting, I hadn't heard about the counting bloom filters. But as I think a 
bit more about how the Hive ACID stuff works, I don't think it would help.

The base file is rewritten periodically to incorporate changes stored in the 
current set of deltas. That would rewrite the bloom filter from scratch, so 
there is no need for it to be reversible. Then if you're applying a delta on 
top of the base file, you only need to apply the filters to your delta because 
those rows entirely replace rows in the base. In that case, you have a static 
bloom filter per delta file and static bloom filters in the base file, too.

 Add bloom filters to parquet statistics
 ---

 Key: PARQUET-41
 URL: https://issues.apache.org/jira/browse/PARQUET-41
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-format, parquet-mr
Reporter: Alex Levenson
Assignee: Ferdinand Xu
  Labels: filter2

 For row groups with no dictionary, we could still produce a bloom filter. 
 This could be very useful in filtering entire row groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-306) Improve alignment between row groups and HDFS blocks

2015-06-22 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-306.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

Merged #211. Thanks for reviewing, Alex!

 Improve alignment between row groups and HDFS blocks
 

 Key: PARQUET-306
 URL: https://issues.apache.org/jira/browse/PARQUET-306
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.8.0


 Row groups should not span HDFS blocks to avoid remote reads. There are 3 
 things we can use to avoid this:
 1. Set the next row group's size to the remaining bytes in the current HDFS 
 block
 2. Use HDFS-3689, variable-length HDFS blocks, when available
 3. Pad after row groups close to the block boundary to start the next row 
 group at the start of the next block
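 A rough sketch of how options 1 and 3 could combine; the names below are 
 illustrative, not the actual parquet-mr code:
 {code:java}
 // Decide how large the next row group may be, or whether to pad first.
 static long nextRowGroupTarget(long bytesWritten, long hdfsBlockSize,
                                long defaultRowGroupSize, long maxPaddingBytes) {
   long remaining = hdfsBlockSize - (bytesWritten % hdfsBlockSize);
   if (remaining <= maxPaddingBytes) {
     // option 3: the caller pads 'remaining' bytes, so the next row group
     // starts on a block boundary and can use the full default size
     return defaultRowGroupSize;
   }
   // option 1: cap the next row group at the bytes left in the current block
   return Math.min(defaultRowGroupSize, remaining);
 }
 {code}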



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-317) writeMetaDataFile crashes when a relative root Path is used

2015-06-25 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-317.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

Merged #228. Thanks for fixing this, Steven!

 writeMetaDataFile crashes when a relative root Path is used
 ---

 Key: PARQUET-317
 URL: https://issues.apache.org/jira/browse/PARQUET-317
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.0
Reporter: Steven She
Assignee: Steven She
Priority: Minor
 Fix For: 1.8.0


 In Spark, I can save an RDD to the local file system using a relative path, 
 e.g.:
 {noformat}
 rdd.saveAsNewAPIHadoopFile(
 relativeRoot,
 classOf[Void],
 tag.runtimeClass.asInstanceOf[Class[T]],
 classOf[ParquetOutputFormat[T]],
 job.getConfiguration)
 {noformat}
 This leads to a crash in the ParquetFileWriter.mergeFooters(..) method since 
 the footer paths are read as fully qualified paths, but the root path is 
 provided as a relative path:
 {noformat}
 org.apache.parquet.io.ParquetEncodingException: 
 /Users/stevenshe/schema/relativeRoot/part-r-0.snappy.parquet invalid: all 
 the files must be contained in the root relativeRoot
 {noformat}
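 One plausible way to avoid the mismatch (an assumed approach, not necessarily the 
 merged fix) is to qualify the root path against its FileSystem before comparing it 
 with the footer paths:
 {code:java}
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;

 Configuration conf = new Configuration();
 Path root = new Path("relativeRoot");
 // Resolve the relative root to a fully qualified path (e.g. file:/current/dir/relativeRoot)
 // so it compares cleanly against the fully qualified footer paths.
 Path qualifiedRoot = root.getFileSystem(conf).makeQualified(root);
 {code}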



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-317) writeMetaDataFile crashes when a relative root Path is used

2015-06-25 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-317:
--
Assignee: Steven She

 writeMetaDataFile crashes when a relative root Path is used
 ---

 Key: PARQUET-317
 URL: https://issues.apache.org/jira/browse/PARQUET-317
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.0
Reporter: Steven She
Assignee: Steven She
Priority: Minor

 In Spark, I can save an RDD to the local file system using a relative path, 
 e.g.:
 {noformat}
 rdd.saveAsNewAPIHadoopFile(
 relativeRoot,
 classOf[Void],
 tag.runtimeClass.asInstanceOf[Class[T]],
 classOf[ParquetOutputFormat[T]],
 job.getConfiguration)
 {noformat}
 This leads to a crash in the ParquetFileWriter.mergeFooters(..) method since 
 the footer paths are read as fully qualified paths, but the root path is 
 provided as a relative path:
 {noformat}
 org.apache.parquet.io.ParquetEncodingException: 
 /Users/stevenshe/schema/relativeRoot/part-r-0.snappy.parquet invalid: all 
 the files must be contained in the root relativeRoot
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-248) Simplify ParquetWriters's constructors

2015-06-25 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-248.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

Added a builder class that can be extended by object models.

 Simplify ParquetWriters's constructors
 --

 Key: PARQUET-248
 URL: https://issues.apache.org/jira/browse/PARQUET-248
 Project: Parquet
  Issue Type: Improvement
Affects Versions: 1.6.0
Reporter: Konstantin Shaposhnikov
Assignee: Ryan Blue
 Fix For: 1.8.0


 ParquetWriter has a lot of constructors. A builder pattern can be used to 
 simplify construction of ParquetWriter objects (similar to ParquetReader, see 
 PARQUET-39).
 ParquetWriter subclasses (like AvroParquetWriter) should be updated to 
 provide a reasonable builder() static factory method.
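 For illustration, the kind of usage such a builder enables; the method names here 
 are assumed from the parquet-avro 1.8.0 API:
 {code:java}
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericRecord;
 import org.apache.hadoop.fs.Path;
 import org.apache.parquet.avro.AvroParquetWriter;
 import org.apache.parquet.hadoop.ParquetWriter;
 import org.apache.parquet.hadoop.metadata.CompressionCodecName;

 Schema schema = new Schema.Parser().parse(
     "{\"type\": \"record\", \"name\": \"Example\", \"fields\": ["
     + "{\"name\": \"id\", \"type\": \"long\"}]}");
 ParquetWriter<GenericRecord> writer = AvroParquetWriter
     .<GenericRecord>builder(new Path("target/example.parquet"))
     .withSchema(schema)
     .withCompressionCodec(CompressionCodecName.SNAPPY)
     .build();
 {code}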



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2015-06-26 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602532#comment-14602532
 ] 

Ryan Blue commented on PARQUET-41:
--

Thanks for working on this, [~Ferd], it's great to be making some good progress 
on it. This is getting to be a pretty long comment. I don't have all that many 
conclusions, but I wanted to share some observations to start a discussion 
around how this feature should be done.

I've mostly been thinking lately about the bloom filter configuration. I like 
that FPP is a user setting because the query patterns really affect what value 
you want for it. You can get much better space savings with a high FPP if you 
know that typical queries will only look for a few items.

We can think of FPP as the probability that we will have to read a data page 
even though it doesn't actually have the item we are looking for. That is 
multiplied by the number of items in a query, which could be large but I think 
will generally be less than ~10 elements (for basing a default). That puts a 
general upper limit on the FPP because if it is something too high, like 10%, a 
fair number of queries will end up reading unnecessary data with a 50+% 
probability (anything checking for 5 or more unique items).
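To make the arithmetic concrete (assuming independent lookups, which is a simplification):

{code:java}
// Chance that a query checking k distinct values reads at least one unnecessary
// page, given a per-lookup false-positive probability p.
double p = 0.10;
for (int k = 1; k <= 10; k++) {
  double atLeastOne = 1 - Math.pow(1 - p, k);
  System.out.printf("k=%2d  P(>=1 false positive) = %.2f%n", k, atLeastOne);
}
{code}

With p = 10%, that probability reaches about 41% at 5 looked-up values and crosses 50% around 7.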

I think we should have a way to read the page stats without the filter, since 
they can be pretty big. I took a look at a real-world dataset with 8-byte 
timestamps that are ~75% unique, which put the expected filter size for a 2.5% 
false-positive rate at 9% of the block size. If I'm looking for 32 timestamps 
at once, I have an 80% chance of reading pages I don't need to read, and end up 
reading an extra 9% for every page's bloom filter alone.

I don't think we want a setting for the expected number of entries. For one 
thing, this varies widely across pages. I have a dataset with 20-30 values per 
page in one column and 131,000 values per page in another. A setting for all 
columns will definitely be a problem and I don't think we can trust users to 
set this correctly for their data on every column.

We also don't know much about how many unique values are in a column or how 
that column will compress with the encodings. Bloom filters are surprisingly 
expensive in terms of space considering some of the encoding sizes we can get 
in Parquet. For example, if we have a column where delta integer encoding is 
doing a good job, values might be ~2 bytes each. If the column is 75% unique, 
then even a 10% FPP will create a bloom filter that is ~22.5% of the page size, 
and a 1% FPP is ~44.9% of the page size. To compare to not as good encoding, 
8-bytes per value ends up being ~11.2% of the page size for a 1% FPP, which is 
still significant. As encoding gets better, pages have more values and the 
bloom filter needs to be larger.
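Those percentages can be reproduced from the standard sizing formula, m = -n * ln(p) / (ln 2)^2 bits for n distinct values at false-positive probability p, assuming a 1MB data page:

{code:java}
long pageBytes = 1L << 20;       // assume a 1MB data page
double uniqueFraction = 0.75;    // ~75% of values are distinct
for (int bytesPerValue : new int[] {2, 8}) {
  long n = (long) ((pageBytes / bytesPerValue) * uniqueFraction);    // distinct values per page
  for (double fpp : new double[] {0.10, 0.01}) {
    double bits = -n * Math.log(fpp) / (Math.log(2) * Math.log(2));  // optimal filter size in bits
    System.out.printf("%dB values, %.0f%% FPP -> filter is %.1f%% of the page%n",
        bytesPerValue, fpp * 100, 100.0 * bits / 8 / pageBytes);
  }
}
{code}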

Without knowing the percentage of unique values or the encoding size, choosing 
the expected number of values for a page is impossible. Because of the 
potential size of the filters compared to the page size, over-estimating the 
filter size isn't enough: we don't want something 10% of the page size or 
larger. That means that if we chose an estimate for the number of values, we 
would still end up overloading filters fairly often. I took a look at the 
false-positive probability for overloaded filters: if a filter is 125% loaded, 
then the actual false-positive probability at least doubles, and for an 
original 1% FPP, it triples. It gets much worse as the overloading increases: 
200% loaded results in a 9% actual FPP based on a 1% original FPP. Keep in mind 
that the expected overloading is probably not as low as 200% given that the 
number of values per page can vary from tens to tens of thousands.

I think there are 2 approaches to fixing this. First, there's a paper, 
[Scalable Bloom Filters|http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf], 
that has a strategy to use a series of bloom filters so you don't have to know 
the size in advance. It's a good paper, but we would want to change the 
heuristics for growing the filter because we know when we are getting close to 
the total number of elements in the page. Another draw-back is that it uses a 
series of filters, so testing for an element has to be done in each filter.
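A compact sketch of the series-of-filters idea, here built on Guava's BloomFilter; the growth and FPP-tightening policy from the paper is simplified to plain doubling:

{code:java}
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.util.ArrayList;
import java.util.List;

class ScalableLongFilter {
  private final List<BloomFilter<Long>> filters = new ArrayList<>();
  private final double fpp;
  private int capacity;   // capacity of the current (last) filter
  private int count;      // items added to the current filter

  ScalableLongFilter(int initialCapacity, double fpp) {
    this.fpp = fpp;
    this.capacity = initialCapacity;
    filters.add(BloomFilter.create(Funnels.longFunnel(), initialCapacity, fpp));
  }

  void add(long value) {
    if (count >= capacity) {   // grow: start a new, larger filter
      capacity *= 2;
      count = 0;
      filters.add(BloomFilter.create(Funnels.longFunnel(), capacity, fpp));
    }
    filters.get(filters.size() - 1).put(value);
    count += 1;
  }

  // The drawback noted above: a lookup must consult every filter in the series.
  boolean mightContain(long value) {
    for (BloomFilter<Long> f : filters) {
      if (f.mightContain(value)) {
        return true;
      }
    }
    return false;
  }
}
{code}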

I think a second approach is to keep the data in memory until we have enough to 
determine the properties of the bloom filter. This would only need to be done 
for the first few pages, while memory consumption is still small. We could keep 
the hashed values instead of the actual data to get the size down to a set of 
integers that will be approximately the number of unique items in the page 
(minus collisions). I like this option better because it is all on the write 
side and trades a reasonable amount of memory for a more complicated filter. 
The read side would be as it is now.
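A sketch of that write-side buffering with hypothetical names (none of these are parquet-mr APIs): keep 32-bit hashes of the first page's values, then size the filter from the number of distinct hashes observed:

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Returns a bloom filter size in bits, derived from the buffered values of the first page(s).
static long bloomFilterBits(Iterable<byte[]> bufferedValues, double fpp) {
  Set<Integer> hashes = new HashSet<>();
  for (byte[] value : bufferedValues) {
    hashes.add(Arrays.hashCode(value));   // stand-in for the filter's real hash function
  }
  int n = hashes.size();                  // ~distinct values, minus hash collisions
  return (long) Math.ceil(-n * Math.log(fpp) / (Math.log(2) * Math.log(2)));
}
{code}

The buffered hashes would then be replayed into a filter of that size before the page is written.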

Okay, this is long enough. I'll clean up the 

[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2015-06-24 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600367#comment-14600367
 ] 

Ryan Blue commented on PARQUET-41:
--

I don't think the counting bloom filter idea is worth the increased size or the 
work to make it happen, when the trade-off is a false-positive. The ACID 
support will periodically rebuild the bloom filters anyway, so we're only 
talking about false positives for data in the delta files, which we expect to 
be small.

 Add bloom filters to parquet statistics
 ---

 Key: PARQUET-41
 URL: https://issues.apache.org/jira/browse/PARQUET-41
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-format, parquet-mr
Reporter: Alex Levenson
Assignee: Ferdinand Xu
  Labels: filter2

 For row groups with no dictionary, we could still produce a bloom filter. 
 This could be very useful in filtering entire row groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-152) Encoding issue with fixed length byte arrays

2015-06-18 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592474#comment-14592474
 ] 

Ryan Blue commented on PARQUET-152:
---

I think the RLE_DICTIONARY behavior is probably because the dictionary is using 
plain encoding rather than delta byte array.

 Encoding issue with fixed length byte arrays
 

 Key: PARQUET-152
 URL: https://issues.apache.org/jira/browse/PARQUET-152
 Project: Parquet
  Issue Type: Bug
Reporter: Nezih Yigitbasi
Priority: Minor

 While running some tests against the master branch I hit an encoding issue 
 that seemed like a bug to me.
 I noticed that when writing a fixed length byte array and the array's size is 
 > dictionaryPageSize (in my test it was 512), the encoding falls back to 
 DELTA_BYTE_ARRAY as seen below:
 {noformat}
 Dec 17, 2014 3:41:10 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
 written 12,125B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 1,710B 
 raw, 1,710B comp, 5 pages, encodings: [DELTA_BYTE_ARRAY]
 {noformat}
 But then read fails with the following exception:
 {noformat}
 Caused by: parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is 
 only supported for type BINARY
   at parquet.column.Encoding$7.getValuesReader(Encoding.java:193)
   at 
 parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:534)
   at 
 parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:574)
   at 
 parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:54)
   at 
 parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:518)
   at 
 parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:510)
   at parquet.column.page.DataPageV2.accept(DataPageV2.java:123)
   at 
 parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:510)
   at 
 parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:502)
   at 
 parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:604)
   at 
 parquet.column.impl.ColumnReaderImpl.init(ColumnReaderImpl.java:348)
   at 
 parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
   at 
 parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
   at 
 parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:267)
   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
   at 
 parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
   at 
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:129)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
   ... 16 more
 {noformat}
 When the array's size is < dictionaryPageSize, RLE_DICTIONARY encoding is 
 used and read works fine:
 {noformat}
 Dec 17, 2014 3:39:50 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
 written 50B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 3B raw, 3B 
 comp, 1 pages, encodings: [RLE_DICTIONARY, PLAIN], dic { 1 entries, 8B raw, 
 1B comp}
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (PARQUET-41) Add bloom filters to parquet statistics

2015-06-26 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602532#comment-14602532
 ] 

Ryan Blue edited comment on PARQUET-41 at 6/26/15 7:53 PM:
---

Thanks for working on this, [~Ferd], it's great to be making some good progress 
on it. This is getting to be a pretty long comment. I don't have all that many 
conclusions, but I wanted to share some observations to start a discussion 
around how this feature should be done.

I've mostly been thinking lately about the bloom filter configuration. I like 
that FPP is a user setting because the query patterns really affect what value 
you want for it. You can get much better space savings with a high FPP if you 
know that typical queries will only look for a few items.

We can think of FPP as the probability that we will have to read a data page 
even though it doesn't actually have the item we are looking for. That is 
multiplied by the number of items in a query, which could be large but I think 
will generally be less than ~10 elements (for basing a default). That puts a 
general upper limit on the FPP because if it is something too high, like 10%, a 
fair number of queries will end up reading unnecessary data with a 50+% 
probability (anything checking for 7 or more unique items).

I think we should have a way to read the page stats without the filter, since 
they can be pretty big. I took a look at a real-world dataset with 8-byte 
timestamps that are ~75% unique, which put the expected filter size for a 2.5% 
false-positive rate at 9% of the block size. If I'm looking for 32 timestamps 
at once, I have an 80% chance of reading pages I don't need to read, and end up 
reading an extra 9% for every page's bloom filter alone.

I don't think we want a setting for the expected number of entries. For one 
thing, this varies widely across pages. I have a dataset with 20-30 values per 
page in one column and 131,000 values per page in another. A setting for all 
columns will definitely be a problem and I don't think we can trust users to 
set this correctly for their data on every column.

We also don't know much about how many unique values are in a column or how 
that column will compress with the encodings. Bloom filters are surprisingly 
expensive in terms of space considering some of the encoding sizes we can get 
in Parquet. For example, if we have a column where delta integer encoding is 
doing a good job, values might be ~2 bytes each. If the column is 75% unique, 
then even a 10% FPP will create a bloom filter that is ~22.5% of the page size, 
and a 1% FPP is ~44.9% of the page size. To compare to not as good encoding, 
8-bytes per value ends up being ~11.2% of the page size for a 1% FPP, which is 
still significant. As encoding gets better, pages have more values and the 
bloom filter needs to be larger.

Without knowing the percentage of unique values or the encoding size, choosing 
the expected number of values for a page is impossible. Because of the 
potential size of the filters compared to the page size, over-estimating the 
filter size isn't enough: we don't want something 10% of the page size or 
larger. That means that if we chose an estimate for the number of values, we 
would still end up overloading filters fairly often. I took a look at the 
false-positive probability for overloaded filters: if a filter is 125% loaded, 
then the actual false-positive probability at least doubles, and for an 
original 1% FPP, it triples. It gets much worse as the overloading increases: 
200% loaded results in a 9% actual FPP based on a 1% original FPP. Keep in mind 
that the expected overloading is probably not as low as 200% given that the 
number of values per page can vary from tens to tens of thousands.

I think there are 2 approaches to fixing this. First, there's a paper, 
[Scalable Bloom Filters|http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf], 
that has a strategy to use a series of bloom filters so you don't have to know 
the size in advance. It's a good paper, but we would want to change the 
heuristics for growing the filter because we know when we are getting close to 
the total number of elements in the page. Another draw-back is that it uses a 
series of filters, so testing for an element has to be done in each filter.

I think a second approach is to keep the data in memory until we have enough to 
determine the properties of the bloom filter. This would only need to be done 
for the first few pages, while memory consumption is still small. We could keep 
the hashed values instead of the actual data to get the size down to a set of 
integers that will be approximately the number of unique items in the page 
(minus collisions). I like this option better because it is all on the write 
side and trades a reasonable amount of memory for a more complicated filter. 
The read side would be as it is now.


[jira] [Resolved] (PARQUET-293) ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame

2015-06-10 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-293.
---
Resolution: Duplicate

Closing as a duplicate. Please follow SPARK-8288 instead.

 ScalaReflectionException when trying to convert an RDD of Scrooge to a 
 DataFrame
 

 Key: PARQUET-293
 URL: https://issues.apache.org/jira/browse/PARQUET-293
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Affects Versions: 1.6.0
Reporter: Tim Chan

 I get scala.ScalaReflectionException: <none> is not a term when I try to 
 convert an RDD of Scrooge to a DataFrame, e.g. myScroogeRDD.toDF
 Has anyone else encountered this problem? 
 I'm using Spark 1.3.1, Scala 2.10.4 and scrooge-sbt-plugin 3.16.3
 Here is my thrift IDL:
 {code}
 namespace scala com.junk
 namespace java com.junk
 struct Junk {
 10: i64 junkID,
 20: string junkString
 }
 {code}
 from a spark-shell: 
 {code}
 val junks = List( Junk(123L, "junk1"), Junk(567L, "junk2"), Junk(789L, 
 "junk3") )
 val junksRDD = sc.parallelize(junks)
 junksRDD.toDF
 {code}
 Exception thrown:
 {noformat}
 scala.ScalaReflectionException: <none> is not a term
   at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:259)
   at 
 scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:73)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:148)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:316)
   at 
 org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:254)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:27)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:32)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:34)
   at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:36)
   at $iwC$$iwC$$iwC$$iwC.init(console:38)
   at $iwC$$iwC$$iwC.init(console:40)
   at $iwC$$iwC.init(console:42)
   at $iwC.init(console:44)
   at init(console:46)
   at .init(console:50)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 

[jira] [Commented] (PARQUET-293) ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame

2015-06-10 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580868#comment-14580868
 ] 

Ryan Blue commented on PARQUET-293:
---

Linking to the issue that replaces this.

 ScalaReflectionException when trying to convert an RDD of Scrooge to a 
 DataFrame
 

 Key: PARQUET-293
 URL: https://issues.apache.org/jira/browse/PARQUET-293
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Affects Versions: 1.6.0
Reporter: Tim Chan

 I get scala.ScalaReflectionException: <none> is not a term when I try to 
 convert an RDD of Scrooge to a DataFrame, e.g. myScroogeRDD.toDF
 Has anyone else encountered this problem? 
 I'm using Spark 1.3.1, Scala 2.10.4 and scrooge-sbt-plugin 3.16.3
 Here is my thrift IDL:
 {code}
 namespace scala com.junk
 namespace java com.junk
 struct Junk {
 10: i64 junkID,
 20: string junkString
 }
 {code}
 from a spark-shell: 
 {code}
 val junks = List( Junk(123L, "junk1"), Junk(567L, "junk2"), Junk(789L, 
 "junk3") )
 val junksRDD = sc.parallelize(junks)
 junksRDD.toDF
 {code}
 Exception thrown:
 {noformat}
 scala.ScalaReflectionException: <none> is not a term
   at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:259)
   at 
 scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:73)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:148)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:316)
   at 
 org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:254)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:27)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:32)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:34)
   at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:36)
   at $iwC$$iwC$$iwC$$iwC.init(console:38)
   at $iwC$$iwC$$iwC.init(console:40)
   at $iwC$$iwC.init(console:42)
   at $iwC.init(console:44)
   at init(console:46)
   at .init(console:50)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at 

[jira] [Commented] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL

2015-06-10 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580861#comment-14580861
 ] 

Ryan Blue commented on PARQUET-222:
---

Okay, so it sounds like you're talking about writing out data to a single 
folder without FS partitioning. Then I agree that the solution is to reduce the 
number of tasks to minimize the number of files. Sounds like you 
already do the optimization for FS partitioning, which is great. Thanks!

 parquet writer runs into OOM during writing when calling 
 DataFrame.saveAsParquetFile in Spark SQL
 -

 Key: PARQUET-222
 URL: https://issues.apache.org/jira/browse/PARQUET-222
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.6.0
Reporter: Chaozhong Yang
   Original Estimate: 336h
  Remaining Estimate: 336h

 In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or 
 {{SchemaRDD}}. That function calls into parquet-mr, and sometimes it 
 will fail due to an OOM error thrown by parquet-mr. We can see the exception 
 stack trace as follows:
 {noformat}
 WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 
 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: 
 Java heap space
 at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
 at parquet.column.values.dictionary.IntList.init(IntList.java:83)
 at 
 parquet.column.values.dictionary.DictionaryValuesWriter.init(DictionaryValuesWriter.java:85)
 at 
 parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.init(DictionaryValuesWriter.java:549)
 at 
 parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
 at 
 parquet.column.impl.ColumnWriterImpl.init(ColumnWriterImpl.java:74)
 at 
 parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
 at 
 parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
 at 
 parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnIO.java:178)
 at 
 parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
 at 
 parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
 at 
 parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:94)
 at 
 parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64)
 at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
 at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
 at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304)
 at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
 at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 {noformat}
 By the way, there is another similar issue 
 https://issues.apache.org/jira/browse/PARQUET-99. But the reporter has closed 
 it and marked it as resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-178) META-INF for slf4j should not be in parquet-format jar

2015-06-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-178.
---
Resolution: Fixed
  Assignee: Ryan Blue

Merged. Thanks for letting us know about this [~koert]!

 META-INF for slf4j should not be in parquet-format jar
 --

 Key: PARQUET-178
 URL: https://issues.apache.org/jira/browse/PARQUET-178
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Affects Versions: 1.6.0
Reporter: koert kuipers
Assignee: Ryan Blue
Priority: Minor

 {noformat}
 $ jar tf parquet-format-2.2.0-rc1.jar  | grep org\\.slf
 META-INF/maven/org.slf4j/
 META-INF/maven/org.slf4j/slf4j-api/
 META-INF/maven/org.slf4j/slf4j-api/pom.xml
 META-INF/maven/org.slf4j/slf4j-api/pom.properties
 {noformat}
 It is not clear to me why these are here. I suspect they should not be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-308) Add accessor to ParquetWriter to get current data size

2015-06-16 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-308:
-

 Summary: Add accessor to ParquetWriter to get current data size
 Key: PARQUET-308
 URL: https://issues.apache.org/jira/browse/PARQUET-308
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.7.0
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.8.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-246) ArrayIndexOutOfBoundsException with Parquet write version v2

2015-06-17 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590353#comment-14590353
 ] 

Ryan Blue commented on PARQUET-246:
---

[~michael] can you answer my questions about this? When does this happen? 
Whenever you read a file like this? If so, then we need to add support (with a 
flag) to initialize the delta byte array from the last value in the last 
page/row group. That would mean we also need to keep it around and throw an 
exception if it isn't present (if you were reading from the middle of the file, 
we can't back up to get it right). I think data recovery needs to be part of 
the solution for this.

 ArrayIndexOutOfBoundsException with Parquet write version v2
 

 Key: PARQUET-246
 URL: https://issues.apache.org/jira/browse/PARQUET-246
 Project: Parquet
  Issue Type: Bug
Affects Versions: 1.6.0
Reporter: Konstantin Shaposhnikov
 Fix For: 2.0.0


 I am getting the following exception when reading a parquet file that was 
 created using Avro WriteSupport and Parquet write version v2.0:
 {noformat}
 Caused by: parquet.io.ParquetDecodingException: Can't read value in column 
 [colName, rows, array, name] BINARY at value 313601 out of 428260, 1 out of 
 39200 in currentPage. repetition level: 0, definition level: 2
   at 
 parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
   at 
 parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364)
   at 
 parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
   ... 27 more
 Caused by: java.lang.ArrayIndexOutOfBoundsException
   at 
 parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70)
   at 
 parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307)
   at 
 parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458)
   ... 30 more
 {noformat}
 The file is quite big (500Mb) so I cannot upload it here, but possibly there 
 is enough information in the exception message to understand the cause of 
 error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-309) Remove unnecessary compile dependency on parquet-generator

2015-06-17 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-309:
--
Assignee: Konstantin Shaposhnikov

 Remove unnecessary compile dependency on parquet-generator
 --

 Key: PARQUET-309
 URL: https://issues.apache.org/jira/browse/PARQUET-309
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.7.0
Reporter: Konstantin Shaposhnikov
Assignee: Konstantin Shaposhnikov
 Fix For: 1.8.0


 parquet-generator is used during build time only. Other parquet-jars (e.g. 
 parquet-encoding) should not depend on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-309) Remove unnecessary compile dependency on parquet-generator

2015-06-17 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-309.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

 Remove unnecessary compile dependency on parquet-generator
 --

 Key: PARQUET-309
 URL: https://issues.apache.org/jira/browse/PARQUET-309
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.7.0
Reporter: Konstantin Shaposhnikov
 Fix For: 1.8.0


 parquet-generator is used during build time only. Other parquet-jars (e.g. 
 parquet-encoding) should not depend on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2015-06-17 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590073#comment-14590073
 ] 

Ryan Blue commented on PARQUET-41:
--

Great, thanks [~Ferd]! Could you also tell us a bit more about how this works 
and the approach you're taking? At first glance, we need quite a bit more in 
the format to specify exactly what the structure means and how to use it. It 
would be good to discuss that here, too.

 Add bloom filters to parquet statistics
 ---

 Key: PARQUET-41
 URL: https://issues.apache.org/jira/browse/PARQUET-41
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-format, parquet-mr
Reporter: Alex Levenson
Assignee: ferdinand xu
  Labels: filter2

 For row groups with no dictionary, we could still produce a bloom filter. 
 This could be very useful in filtering entire row groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-246) ArrayIndexOutOfBoundsException with Parquet write version v2

2015-06-17 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590045#comment-14590045
 ] 

Ryan Blue commented on PARQUET-246:
---

Should we also update the read side so we can recover data written with this 
bug? Does this happen when reading the entire file, or just when reading from a 
middle row group in MR?

 ArrayIndexOutOfBoundsException with Parquet write version v2
 

 Key: PARQUET-246
 URL: https://issues.apache.org/jira/browse/PARQUET-246
 Project: Parquet
  Issue Type: Bug
Affects Versions: 1.6.0
Reporter: Konstantin Shaposhnikov
 Fix For: 2.0.0


 I am getting the following exception when reading a parquet file that was 
 created using Avro WriteSupport and Parquet write version v2.0:
 {noformat}
 Caused by: parquet.io.ParquetDecodingException: Can't read value in column 
 [colName, rows, array, name] BINARY at value 313601 out of 428260, 1 out of 
 39200 in currentPage. repetition level: 0, definition level: 2
   at 
 parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
   at 
 parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364)
   at 
 parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
   ... 27 more
 Caused by: java.lang.ArrayIndexOutOfBoundsException
   at 
 parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70)
   at 
 parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307)
   at 
 parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458)
   ... 30 more
 {noformat}
 The file is quite big (500Mb) so I cannot upload it here, but possibly there 
 is enough information in the exception message to understand the cause of 
 error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-39) Simplify ParquetReader's constructors

2015-05-28 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-39?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-39.
--

This was added in 
https://github.com/apache/parquet-mr/commit/ad32bf0fd111ab473ad1080cde11de39e3c5a67f

 Simplify ParquetReader's constructors
 -

 Key: PARQUET-39
 URL: https://issues.apache.org/jira/browse/PARQUET-39
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Alex Levenson
Assignee: Alex Levenson
Priority: Minor
 Fix For: 1.6.0


 ParquetReader has a lot of constructors. Maybe we should use the Builder 
 pattern instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-293) ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame

2015-05-28 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563496#comment-14563496
 ] 

Ryan Blue commented on PARQUET-293:
---

[~lian cheng], could you take a look at this? Seems like your area of 
expertise. Do you think this should be a Spark issue instead of a Parquet issue?

 ScalaReflectionException when trying to convert an RDD of Scrooge to a 
 DataFrame
 

 Key: PARQUET-293
 URL: https://issues.apache.org/jira/browse/PARQUET-293
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Affects Versions: 1.6.0
Reporter: Tim Chan

 I get scala.ScalaReflectionException: <none> is not a term when I try to 
 convert an RDD of Scrooge to a DataFrame, e.g. myScroogeRDD.toDF
 Has anyone else encountered this problem? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-151) Null Pointer exception in parquet.hadoop.ParquetFileWriter.mergeFooters

2015-06-01 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-151:
--
Assignee: Yash Datta

 Null Pointer exception in parquet.hadoop.ParquetFileWriter.mergeFooters
 ---

 Key: PARQUET-151
 URL: https://issues.apache.org/jira/browse/PARQUET-151
 Project: Parquet
  Issue Type: Bug
Reporter: Vladislav Kuzemchik
Assignee: Yash Datta

 Hi!
 I'm getting null pointer exception when I'm trying to write parquet files 
 with spark.
 {noformat}
 Dec 13, 2014 3:05:10 AM WARNING: parquet.hadoop.ParquetOutputCommitter: could 
 not write summary file for 
 hdfs://phoenix-011.nym1.placeiq.net:8020/user/vkuzemchik/parquet_data/1789
 java.lang.NullPointerException
   at 
 parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:426)
   at 
 parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:402)
   at 
 parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
   at 
 org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:936)
   at 
 com.placeiq.spark.KafkaReader$.writeParquetHadoop(KafkaReader.scala:143)
   at com.placeiq.spark.KafkaReader$$anonfun$3.apply(KafkaReader.scala:165)
   at com.placeiq.spark.KafkaReader$$anonfun$3.apply(KafkaReader.scala:164)
   at 
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
   at 
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
   at 
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
   at 
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
   at 
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
   at scala.util.Try$.apply(Try.scala:161)
   at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
   at 
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {noformat}
 Here is the function I'm using:
 {code:title=Spark.scala|borderStyle=solid}
 def writeParquetHadoop(rdd: RDD[(Void, LogMessage)]): Unit = {
   val jobConf = new JobConf(ssc.sparkContext.hadoopConfiguration)
   val job = new Job(jobConf)
   val outputDir =
     "hdfs://phoenix-011.nym1.placeiq.net:8020/user/vkuzemchik/parquet_data/"
   ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
   ParquetInputFormat.setReadSupportClass(job,
     classOf[AvroReadSupport[LogMessage]])
   AvroParquetInputFormat.setAvroReadSchema(job, LogMessage.SCHEMA$)
   AvroParquetOutputFormat.setSchema(job, LogMessage.SCHEMA$)
   ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY)
   ParquetOutputFormat.setBlockSize(job, 536870912)
   job.setOutputKeyClass(classOf[Void])
   job.setOutputValueClass(classOf[LogMessage])
   job.setOutputFormatClass(classOf[ParquetOutputFormat[LogMessage]])
   job.getConfiguration.set("mapred.output.dir", outputDir + rdd.id)
   rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)
 }
 {code}
 I have this issue on 1.5. Trying to reproduce on newer versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-296) Set master branch version back to 1.8.0-SNAPSHOT

2015-06-01 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-296:
-

 Summary: Set master branch version back to 1.8.0-SNAPSHOT
 Key: PARQUET-296
 URL: https://issues.apache.org/jira/browse/PARQUET-296
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.0
Reporter: Ryan Blue
 Fix For: 1.8.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-266) Add support for lists of primitives to Pig schema converter

2015-05-27 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561923#comment-14561923
 ] 

Ryan Blue commented on PARQUET-266:
---

[~dweeks-netflix] or [~julienledem], you guys are the reviewers for Pig 
patches, right?

 Add support for lists of primitives to Pig schema converter
 ---

 Key: PARQUET-266
 URL: https://issues.apache.org/jira/browse/PARQUET-266
 Project: Parquet
  Issue Type: Improvement
Affects Versions: 1.5.0, 1.6.0
Reporter: Christian Rolf
Priority: Minor
 Attachments: PigPrimitiveList-1.8.patch, PigPrimitiveList.patch


 Right now lists of primitives are not supported in Pig (an exception is thrown 
 from PigSchemaConverter.java, line 292 in Parquet 1.6). 
 The patch converts Parquet arrays of primitives into Pig bags, the closest 
 representation of an array in Pig.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-292) Release Parquet 1.8.0

2015-05-27 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561927#comment-14561927
 ] 

Ryan Blue commented on PARQUET-292:
---

Adding PARQUET-265 instead of PARQUET-263.

 Release Parquet 1.8.0
 -

 Key: PARQUET-292
 URL: https://issues.apache.org/jira/browse/PARQUET-292
 Project: Parquet
  Issue Type: Task
Reporter: Alex Levenson
Assignee: Alex Levenson





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-292) Release Parquet 1.8.0

2015-05-27 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561951#comment-14561951
 ] 

Ryan Blue commented on PARQUET-292:
---

Adding PARQUET-201, which was a bug fix pushed out for the 1.6.0 release. We 
don't have a very good reason to push it out this time, so I'm marking it as a 
blocker.

 Release Parquet 1.8.0
 -

 Key: PARQUET-292
 URL: https://issues.apache.org/jira/browse/PARQUET-292
 Project: Parquet
  Issue Type: Task
Reporter: Alex Levenson
Assignee: Alex Levenson





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-199) Add a callback when the MemoryManager adjusts row group size

2015-05-27 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-199.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

This was merged a few days ago, just forgot to close.

 Add a callback when the MemoryManager adjusts row group size
 

 Key: PARQUET-199
 URL: https://issues.apache.org/jira/browse/PARQUET-199
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Ryan Blue
Assignee: Dong Chen
 Fix For: 1.8.0


 Parquet Hive would like to increment a counter when the row group size is 
 altered by the memory manager so that Hive can detect when there are memory 
 problems and inform the user. I think the right way to do this is to provide 
 a callback that will be triggered when the memory manager hits its limit.
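
A minimal sketch of the kind of hook this asks for, assuming the memory manager 
exposes a way to register a {{Runnable}} that fires when row group sizes are 
scaled down (the registration method name below is an assumption):

{code}
import java.util.concurrent.atomic.AtomicLong;

public class ScaleDownCounter {
  // Counter that Hive (or any caller) could expose as a job counter.
  private static final AtomicLong SCALE_DOWN_EVENTS = new AtomicLong();

  // The callback itself: invoked by the memory manager whenever it reduces
  // the configured row group size to stay within its memory pool.
  public static final Runnable ON_SCALE_DOWN = new Runnable() {
    @Override
    public void run() {
      SCALE_DOWN_EVENTS.incrementAndGet();
    }
  };

  // Assumed registration point; the exact method name/signature in the merged
  // MemoryManager API may differ:
  //   memoryManager.registerScaleCallBack("hive-counter", ON_SCALE_DOWN);
}
{code}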



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-285) Implement nested types write rules in parquet-avro

2015-06-01 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-285.
---
Resolution: Fixed

Merged #198.

 Implement nested types write rules in parquet-avro
 --

 Key: PARQUET-285
 URL: https://issues.apache.org/jira/browse/PARQUET-285
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.8.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-251) Binary column statistics error when reuse byte[] among rows

2015-07-01 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-251:
--
Fix Version/s: (was: 2.0.0)
   1.8.0

 Binary column statistics error when reuse byte[] among rows
 ---

 Key: PARQUET-251
 URL: https://issues.apache.org/jira/browse/PARQUET-251
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.6.0
Reporter: Yijie Shen
Assignee: Ashish K Singh
Priority: Blocker
 Fix For: 1.8.0


 I think it is a common practice when inserting table data as a Parquet file 
 that one would reuse the same object among rows, and if a column is a byte[] 
 of fixed length, the byte[] would also be reused. 
 If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row 
 groups created by a single task have the same max & min binary value, which is 
 just the last row's binary content.
 The reason is that BinaryStatistics just keeps max & min as parquet.io.api.Binary 
 references; since I use ByteArrayBackedBinary for my byte[], the real content of 
 max & min always points to the reused byte[], and therefore to the latest row's 
 content.
 Does Parquet declare somewhere that the user shouldn't reuse byte[] for the 
 Binary type? If it doesn't, I think it's a bug and can be reproduced by 
 [Spark SQL's RowWriteSupport 
 |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]
 The related Spark JIRA ticket: 
 [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]
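
A minimal sketch of the aliasing problem being described (the row iteration and 
key-copying names are placeholders, not Parquet APIs; the point is that a 
reference-holding Binary over a reused buffer means any retained min/max sees 
only the last contents):

{code}
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.RecordConsumer;

public class ReusedBufferSketch {
  // 'rows' is a placeholder for the caller's row iteration; assume fixed
  // 16-byte keys so the shared buffer is overwritten in place each time.
  static void writeColumn(Iterable<byte[]> rows, RecordConsumer recordConsumer) {
    byte[] reused = new byte[16];
    for (byte[] key : rows) {
      System.arraycopy(key, 0, reused, 0, reused.length); // overwrite shared buffer
      // Wrapping by reference (no copy): any Binary kept by the statistics as
      // min/max still points at 'reused' and later sees the last row's bytes.
      Binary value = Binary.fromByteArray(reused);
      recordConsumer.addBinary(value);
    }
    // Copying the buffer for each row (or using a copying Binary factory)
    // avoids the aliasing.
  }
}
{code}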



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-324) row count incorrect if data file has more than 2^31 rows

2015-07-03 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-324.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

Thanks for contributing the fix, [~tfriedr]!

 row count incorrect if data file has more than 2^31 rows
 

 Key: PARQUET-324
 URL: https://issues.apache.org/jira/browse/PARQUET-324
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.7.0, 1.8.0
Reporter: Thomas Friedrich
Assignee: Thomas Friedrich
Priority: Minor
 Fix For: 1.8.0


 If a parquet file has more than 2^31 rows, the row count written into the 
 file metadata is incorrect. 
 The cause of the problem is the use of an int instead of a long for numRows in 
 ParquetMetadataConverter.toParquetMetadata:
 {code}
 int numRows = 0;
 for (BlockMetaData block : blocks) {
   numRows += block.getRowCount();
   addRowGroup(parquetMetadata, rowGroups, block);
 }
 {code}
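
The straightforward fix is to accumulate into a {{long}} (a sketch following the 
snippet above; {{BlockMetaData.getRowCount()}} already returns a long):

{code}
long numRows = 0;
for (BlockMetaData block : blocks) {
  numRows += block.getRowCount();      // no longer truncated to 32 bits
  addRowGroup(parquetMetadata, rowGroups, block);
}
{code}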



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-320) Restore semver checks

2015-07-01 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-320.
---
Resolution: Fixed

Merged #230

 Restore semver checks
 -

 Key: PARQUET-320
 URL: https://issues.apache.org/jira/browse/PARQUET-320
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.7.0
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.8.0


 The exclusion for parquet-format classes was {{parquet/**}}, which evidently 
 matches everything, even classes in org.apache.parquet. We need to remove that 
 exclusion and fix any problems that have cropped up since it was added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-223) Add Map and List builders

2015-05-26 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-223.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

I committed this. Thanks for the contribution [~singhashish]!

 Add Map and List builders
 --

 Key: PARQUET-223
 URL: https://issues.apache.org/jira/browse/PARQUET-223
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Ashish K Singh
Assignee: Ashish K Singh
 Fix For: 1.8.0


 As of now, Parquet does not provide builders for Maps and Lists. This leaves 
 room for user error. Having Map and List builders will make it easier for 
 users to build these types.
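
To illustrate the idea, a sketch of builder-style construction for a map and a 
list (method names here only illustrate the style and are not necessarily the 
exact API that was merged):

{code}
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.Types;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;

public class BuilderSketch {
  // A map and a list built without hand-wiring the repeated key_value /
  // element groups. Exact builder method names may differ from this sketch.
  static final GroupType COUNTS = Types.requiredMap()
      .key(BINARY)
      .requiredValue(INT64)
      .named("counts");

  static final GroupType TAGS = Types.requiredList()
      .requiredElement(BINARY)
      .named("tags");
}
{code}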



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-361) Add prerelease logic to semantic versions

2015-08-19 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-361:
-

 Summary: Add prerelease logic to semantic versions
 Key: PARQUET-361
 URL: https://issues.apache.org/jira/browse/PARQUET-361
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.8.1
Reporter: Ryan Blue
 Fix For: 1.9.0


CDH is including fixes for PARQUET-251. That means we need to add the 
fixed versions to the logic that tests whether the fix is present, and that 
requires appropriate semver logic for prerelease versions because CDH 
versions are formatted like this: 1.5.0-cdh5.5.0, i.e. 
<upstream-base>-cdh<cdh-release>.
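
A small sketch of the parsing this implies, assuming the usual semver convention 
that a prerelease version sorts before its base release and that prerelease 
identifiers are compared field by field:

{code}
// Split a CDH-style version into its base and prerelease parts so the
// "has the fix" check can reason about builds like 1.5.0-cdh5.5.0.
String version = "1.5.0-cdh5.5.0";
int dash = version.indexOf('-');
String base = dash < 0 ? version : version.substring(0, dash);      // "1.5.0"
String prerelease = dash < 0 ? null : version.substring(dash + 1);  // "cdh5.5.0"
// Per semver: 1.5.0-cdh5.5.0 < 1.5.0, and two prereleases on the same base are
// compared identifier by identifier (numeric parts numerically, others lexically).
{code}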



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-361) Add prerelease logic to semantic versions

2015-08-20 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-361.
---
Resolution: Fixed
  Assignee: Ryan Blue

 Add prerelease logic to semantic versions
 -

 Key: PARQUET-361
 URL: https://issues.apache.org/jira/browse/PARQUET-361
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.8.1
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.9.0


 CDH is including fixes for PARQUET-251. That means we need to add the 
 fixed versions to the logic that tests whether the fix is present, and that 
 requires appropriate semver logic for prerelease versions because CDH 
 versions are formatted like this: 1.5.0-cdh5.5.0, i.e. 
 <upstream-base>-cdh<cdh-release>.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-316) Run.sh is broken in parquet-benchmarks

2015-06-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-316.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

Merged Nezih's PR. Thanks for fixing this!

 Run.sh is broken in parquet-benchmarks
 --

 Key: PARQUET-316
 URL: https://issues.apache.org/jira/browse/PARQUET-316
 Project: Parquet
  Issue Type: Bug
Reporter: Nezih Yigitbasi
Assignee: Nezih Yigitbasi
 Fix For: 1.8.0


 With the package renaming (to org.apache.parquet) the run.sh script is now 
 broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-146) make Parquet compile with java 7 instead of java 6

2015-06-30 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608848#comment-14608848
 ] 

Ryan Blue commented on PARQUET-146:
---

We should discuss this on the mailing list. We've had recent contributions 
fixing support for Java 6, so we definitely want to build consensus before 
deprecating support.

 make Parquet compile with java 7 instead of java 6
 --

 Key: PARQUET-146
 URL: https://issues.apache.org/jira/browse/PARQUET-146
 Project: Parquet
  Issue Type: Improvement
Reporter: Julien Le Dem
  Labels: beginner, noob, pick-me-up

 Currently Parquet is compatible with Java 6; we should remove this constraint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-320) Restore semver checks

2015-06-29 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-320:
-

 Summary: Restore semver checks
 Key: PARQUET-320
 URL: https://issues.apache.org/jira/browse/PARQUET-320
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.7.0
Reporter: Ryan Blue
 Fix For: 1.8.0


The exclusion for parquet-format classes was {{parquet/**}}, which evidently 
matches everything, even classes in org.apache.parquet. We need to remove that 
exclusion and fix any problems that have cropped up since it was added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2015-06-29 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606674#comment-14606674
 ] 

Ryan Blue commented on PARQUET-41:
--

I should also point out there's a table on the first page that calculates the 
probability of at least one false positive when querying multiple items. That's 
pretty useful to apply here. If we are querying for 10 items and the bloom 
filter's false-positive probability is 1%, then there is a 9.56% chance of 
reading a page when it has none of the items. But if the actual FPP of that 
filter is 10% because of overloading, then we get a 65% probability when we were 
expecting 9.56%.
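
For reference, those numbers follow from the probability of at least one false 
positive over k independent lookups, 1 - (1 - p)^k:

{code}
// 10 lookups against a 1% filter vs. an overloaded 10% filter.
double atOnePercent = 1 - Math.pow(1 - 0.01, 10);   // ~0.0956 -> ~9.56%
double atTenPercent = 1 - Math.pow(1 - 0.10, 10);   // ~0.6513 -> ~65%
{code}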

 Add bloom filters to parquet statistics
 ---

 Key: PARQUET-41
 URL: https://issues.apache.org/jira/browse/PARQUET-41
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-format, parquet-mr
Reporter: Alex Levenson
Assignee: Ferdinand Xu
  Labels: filter2

 For row groups with no dictionary, we could still produce a bloom filter. 
 This could be very useful in filtering entire row groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-321) Set the HDFS padding default to 8MB

2015-06-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-321:
--
Summary: Set the HDFS padding default to 8MB  (was: Set the HDFS padding 
default to 16MB)

 Set the HDFS padding default to 8MB
 ---

 Key: PARQUET-321
 URL: https://issues.apache.org/jira/browse/PARQUET-321
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.8.0


 PARQUET-306 added the ability to pad row groups so that they align with HDFS 
 blocks to avoid remote reads. The ParquetFileWriter will now either pad the 
 remaining space in the block or target a row group for the remaining size.
 The padding maximum controls the threshold for the amount of padding that will 
 be used. If the space left in the block is under this threshold, it is padded. 
 If it is greater than this threshold, then the next row group is sized to fit 
 the remaining space. The current padding maximum is 0.
 I think we should change the padding maximum to 8MB. My reasoning is this: we 
 want this number to be small enough that it won't prevent the library from 
 writing reasonable row groups, but larger than the minimum size row group we 
 would want to write. 8MB is 1/16th of the default row group size, so I think it 
 is reasonable: we don't want a row group to be smaller than 8MB.
 We also want this to be large enough that a few under-size row groups in a 
 block don't cause a tiny row group to be written in the excess space. 8MB 
 accounts for 4 row groups that are 2MB under-size. In addition, it is 
 reasonable to not allow row groups under 8MB.
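
For anyone tuning this, a sketch of what setting the threshold would look like 
in job configuration (the property key below is an assumption; check 
ParquetOutputFormat for the exact name):

{code}
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Assumed key for the maximum padding discussed above; 8 MB = 8 * 1024 * 1024 bytes.
conf.setInt("parquet.writer.max-padding", 8 * 1024 * 1024);
{code}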



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-321) Set the HDFS padding default to 16MB

2015-06-30 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-321:
-

 Summary: Set the HDFS padding default to 16MB
 Key: PARQUET-321
 URL: https://issues.apache.org/jira/browse/PARQUET-321
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.8.0


PARQUET-306 added the ability to pad row groups so that they align with HDFS 
blocks to avoid remote reads. The ParquetFileWriter will now either pad the 
remaining space in the block or target a row group for the remaining size.

The padding maximum controls the threshold for the amount of padding that will 
be used. If the space left in the block is under this threshold, it is padded. 
If it is greater than this threshold, then the next row group is sized to fit 
the remaining space. The current padding maximum is 0.

I think we should change the padding maximum to 8MB. My reasoning is this: we 
want this number to be small enough that it won't prevent the library from 
writing reasonable row groups, but larger than the minimum size row group we 
would want to write. 8MB is 1/16th of the default row group size, so I think it 
is reasonable: we don't want a row group to be smaller than 8MB.

We also want this to be large enough that a few under-size row groups in a 
block don't cause a tiny row group to be written in the excess space. 8MB 
accounts for 4 row groups that are 2MB under-size. In addition, it is 
reasonable to not allow row groups under 8MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-144) read a single file outside of mapreduce framework

2015-07-31 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14649334#comment-14649334
 ] 

Ryan Blue commented on PARQUET-144:
---

[~hy5446]: you can read files outside of MR using the ParquetReader with 
Scrooge read support. The constructor you want is here: 
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L64

The read support is what determines the object model the reader will use. Most 
object models have a convenience reader, but it looks like Scrooge doesn't, so 
you'll have to pass the right ReadSupport to the reader in your code.
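
A rough sketch of what that looks like in user code (the Scrooge read support 
class and its constructor details are assumptions based on parquet-scrooge, and 
MyObjectClass stands in for your generated Scrooge class):

{code}
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetReader;
// Assumed read support from parquet-scrooge; the exact class/constructor may differ.
import org.apache.parquet.scrooge.ScroogeReadSupport;

ParquetReader<MyObjectClass> reader =
    ParquetReader.builder(new ScroogeReadSupport<MyObjectClass>(),
                          new Path("/path/to/file.parquet"))
        .build();

MyObjectClass obj;
while ((obj = reader.read()) != null) {
  // process obj
}
reader.close();
{code}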

 read a single file outside of mapreduce framework
 -

 Key: PARQUET-144
 URL: https://issues.apache.org/jira/browse/PARQUET-144
 Project: Parquet
  Issue Type: Test
  Components: parquet-mr
Affects Versions: 1.6.0
Reporter: hy5446
Priority: Critical

 In my test I would like to read a file that has been written through Parquet 
 + Scrooge. I would like to do it outside of map/reduce or hadoop. Something 
 like this:
 val bytes = readFile("my file")
 val objects = deserializeWithParquetScrooge[MyObjectClass](bytes)
 Is something like this possible? How?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-144) read a single file outside of mapreduce framework

2015-07-31 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-144.
---
Resolution: Not A Problem

I'm resolving this as not a problem because it is a request for information 
and I think I've covered the question. In the future, you might have better 
luck getting information from the mailing list (dev@parquet.apache.org) because 
that's where we typically see this kind of question.

 read a single file outside of mapreduce framework
 -

 Key: PARQUET-144
 URL: https://issues.apache.org/jira/browse/PARQUET-144
 Project: Parquet
  Issue Type: Test
  Components: parquet-mr
Affects Versions: 1.6.0
Reporter: hy5446
Priority: Critical

 In my test I would like to read a file that has been written through Parquet 
 + Scrooge. I would like to do it outside of map/reduce or hadoop. Something 
 like this:
 val bytes = readFile("my file")
 val objects = deserializeWithParquetScrooge[MyObjectClass](bytes)
 Is something like this possible? How?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-344) Limit the number of rows per block and per split

2015-07-28 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14644711#comment-14644711
 ] 

Ryan Blue commented on PARQUET-344:
---

[~QuentinFra], you can currently set the row group size and HDFS block size. 
That allows you to make smaller row groups and control the parallelism.

* {{parquet.block.size}} - the target row group size, which we try to be 
slightly under
* {{dfs.blocksize}} - sets the HDFS block size. Make this a whole-number 
multiple of the row group size

Is that sufficient for your use case, or do you think that a limit in terms of 
the number of rows would be better? We can certainly add that, but I'm not sure 
it's a good idea. When you set the target row group size in bytes, you don't 
have to know what compression ratio you're going to get.
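
Concretely, a sketch of those two settings in the job configuration (the values 
here are just an example of keeping the HDFS block a whole-number multiple of 
the row group size):

{code}
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Target ~128 MB row groups and 256 MB HDFS blocks: exactly 2 row groups per block.
conf.setLong("parquet.block.size", 128L * 1024 * 1024);
conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
{code}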

 Limit the number of rows per block and per split
 

 Key: PARQUET-344
 URL: https://issues.apache.org/jira/browse/PARQUET-344
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Quentin Francois
   Original Estimate: 504h
  Remaining Estimate: 504h

 We use Parquet to store raw metrics data and then query this data with 
 Hadoop-Pig. 
 The issue is that sometimes we end up with small Parquet files (~80 MB) that 
 contain more than 300,000,000 rows, usually because of a constant metric 
 which results in very good compression. Too good. As a result we have a 
 very small number of maps that process up to 10x more rows than the other maps, 
 and we lose the benefits of parallelization. 
 The fix for that has two components I believe:
 1. Be able to limit the number of rows per Parquet block (in addition to the 
 size limit).
 2. Be able to limit the number of rows per split.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-347) Thrift projection does not handle new (optional) fields in requestedSchema

2015-07-28 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14644955#comment-14644955
 ] 

Ryan Blue commented on PARQUET-347:
---

Seems like we should more generally take a look at what schema evolution 
changes are allowed and have tests for all of them. I'm planning on doing the 
same for Avro and it would be great to coordinate that so we know we can evolve 
an Avro schema and still read it in Thrift or vice versa.

 Thrift projection does not handle new (optional) fields in requestedSchema
 --

 Key: PARQUET-347
 URL: https://issues.apache.org/jira/browse/PARQUET-347
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Alex Levenson

 It should be valid to request an optional field that is not present in a file 
 (it should be assumed to be null) but instead this throws eagerly in:
 https://github.com/apache/parquet-mr/blob/d6f082b9be5d507ff60c6bc83a179cc44015ab97/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/api/ReadSupport.java#L58



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-355) Create Integration tests to validate statistics

2015-08-07 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662000#comment-14662000
 ] 

Ryan Blue commented on PARQUET-355:
---

[~sircodesalot], thanks for working on this! Can you describe the approach 
you're taking in the PR to ensure these are tested?

 Create Integration tests to validate statistics
 ---

 Key: PARQUET-355
 URL: https://issues.apache.org/jira/browse/PARQUET-355
 Project: Parquet
  Issue Type: Test
  Components: parquet-mr
Reporter: Reuben Kuhnert
Priority: Minor

 In response to 
 [PARQUET-251|https://issues.apache.org/jira/browse/PARQUET-251] create unit 
 tests that validate the statistics fields for each column type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-358) Add support for temporal logical types to AVRO/Parquet conversion

2015-08-14 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697252#comment-14697252
 ] 

Ryan Blue commented on PARQUET-358:
---

Thanks for opening an issue on this one, [~k.shaposhni...@gmail.com]. Avro is 
currently holding a vote for release 1.8.0, which adds support for date/time 
types and decimals. I was waiting on that to go through so we can build the 
parquet-avro support to match its behavior. I would be glad to have your help 
building this if you're interested!

 Add support for temporal logical types to AVRO/Parquet conversion
 -

 Key: PARQUET-358
 URL: https://issues.apache.org/jira/browse/PARQUET-358
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-avro
Affects Versions: 1.8.0
Reporter: Konstantin Shaposhnikov

 Both 
 [AVRO|https://github.com/apache/avro/blob/trunk/doc/src/content/xdocs/spec.xml]
  and 
 [Parquet|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]
  support logical types for dates, times and timestamps; however, this 
 information is not transferred from the Avro schema to the Parquet schema 
 during conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-356) Add ElephantBird section to LICENSE file

2015-08-12 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-356:
-

 Summary: Add ElephantBird section to LICENSE file
 Key: PARQUET-356
 URL: https://issues.apache.org/jira/browse/PARQUET-356
 Project: Parquet
  Issue Type: Task
  Components: parquet-mr
Affects Versions: 1.8.0, 1.8.1
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.9.0


Commit [9993450|https://github.com/apache/parquet-mr/commit/9993450] brought in 
a section of 
[LzoRecordReader.java|https://github.com/twitter/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/LzoRecordReader.java#L124]
 from ElephantBird. The license for ElephantBird is ASL 2.0, so the inclusion is 
fine. We just need to add it to the root LICENSE file (because it is included in 
the source distribution) and to the parquet-thrift binary LICENSE file (because 
it is in that binary package).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-335) Avro object model should not require MAP_KEY_VALUE

2015-07-15 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-335:
-

 Summary: Avro object model should not require MAP_KEY_VALUE
 Key: PARQUET-335
 URL: https://issues.apache.org/jira/browse/PARQUET-335
 Project: Parquet
  Issue Type: Bug
  Components: parquet-avro
Affects Versions: 1.8.0
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.9.0


The Avro object model currently includes a check that requires maps to use 
MAP_KEY_VALUE to annotate the repeated key_value group. This is not required by 
the map type spec and should be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-327) Show statistics in the dump output

2015-07-15 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-327:
--
Fix Version/s: (was: 1.8.0)
   1.9.0

 Show statistics in the dump output
 --

 Key: PARQUET-327
 URL: https://issues.apache.org/jira/browse/PARQUET-327
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.7.0
Reporter: Tom White
Assignee: Tom White
 Fix For: 1.9.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-288) Add dictionary support to Avro converters

2015-07-15 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-288:
--
Fix Version/s: (was: 1.8.0)

 Add dictionary support to Avro converters
 -

 Key: PARQUET-288
 URL: https://issues.apache.org/jira/browse/PARQUET-288
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-avro
Affects Versions: 1.7.0
Reporter: Ryan Blue





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-337) binary fields inside map/set/list are not handled in parquet-scrooge

2015-07-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-337:
--
Assignee: Jake Donham

 binary fields inside map/set/list are not handled in parquet-scrooge
 

 Key: PARQUET-337
 URL: https://issues.apache.org/jira/browse/PARQUET-337
 Project: Parquet
  Issue Type: Bug
Reporter: Jake Donham
Assignee: Jake Donham

 Binary fields inside map/set/list are not handled; using them produces a 
 ScroogeSchemaConversionException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-339) Add Alex Levenson to KEYS file

2015-07-17 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631811#comment-14631811
 ] 

Ryan Blue commented on PARQUET-339:
---

I'm fine just pushing changes like this, though we should probably have 
consensus on it.

 Add Alex Levenson to KEYS file
 --

 Key: PARQUET-339
 URL: https://issues.apache.org/jira/browse/PARQUET-339
 Project: Parquet
  Issue Type: Task
Reporter: Alex Levenson
Assignee: Alex Levenson
 Fix For: 1.8.1






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-332) Incompatible changes in o.a.p.thrift.projection

2015-07-13 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-332:
-

 Summary: Incompatible changes in o.a.p.thrift.projection
 Key: PARQUET-332
 URL: https://issues.apache.org/jira/browse/PARQUET-332
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Ryan Blue
 Fix For: 1.8.0


There are incompatible changes in o.a.p.thrift.projection that weren't caught 
because of PARQUET-330:
* The return type of [{{FieldsPath#push(ThriftField)}} 
changed|https://github.com/apache/parquet-mr/commit/ded56ffd598e41e32817f6c1b091595fe7122e8b#diff-e990fead0bb1a6faa5080efba86bc81fL34]
 ([return type compatibility 
ref|https://docs.oracle.com/javase/specs/jls/se7/html/jls-13.html#jls-13.4.15])
* [{{FieldProjectionFilter}} changed to an 
interface|https://github.com/apache/parquet-mr/commit/7fc7998398373a14b4cdc0ce18abdeb221b1ccf9#diff-49628343f8d6daf6cb774b6c6ccab82cL29]

Both of these are incompatibilities if {{FieldProjectionFilter}} is part of the 
public API, which it appears to be because it is used by the ScroogeReadSupport 
and the ThriftSchemaConverter (public constructor).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-241) ParquetInputFormat.getFooters() should return in the same order as what listStatus() returns

2015-10-29 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-241:
--
Assignee: Mingyu Kim

> ParquetInputFormat.getFooters() should return in the same order as what 
> listStatus() returns
> 
>
> Key: PARQUET-241
> URL: https://issues.apache.org/jira/browse/PARQUET-241
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Mingyu Kim
>Assignee: Mingyu Kim
> Fix For: 1.9.0
>
>
> Because of how the footer cache is implemented, getFooters() returns the 
> footers in a different order than what listStatus() returns.
> When I provided the URL 
> "hdfs://.../part-1.parquet,hdfs://.../part-2.parquet,hdfs://.../part-3.parquet",
>  ParquetInputFormat.getSplits(), which internally calls getFooters(), 
> returned the splits in the wrong order.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-241) ParquetInputFormat.getFooters() should return in the same order as what listStatus() returns

2015-10-29 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-241.
---
Resolution: Fixed

Merged #164. Thanks [~mkim] for the contribution!

(And sorry this took so long. Next time, feel free to ping the mailing list to 
remind us!)

> ParquetInputFormat.getFooters() should return in the same order as what 
> listStatus() returns
> 
>
> Key: PARQUET-241
> URL: https://issues.apache.org/jira/browse/PARQUET-241
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Mingyu Kim
>Assignee: Mingyu Kim
> Fix For: 1.9.0
>
>
> Because of how the footer cache is implemented, getFooters() returns the 
> footers in a different order than what listStatus() returns.
> When I provided the URL 
> "hdfs://.../part-1.parquet,hdfs://.../part-2.parquet,hdfs://.../part-3.parquet",
>  ParquetInputFormat.getSplits(), which internally calls getFooters(), 
> returned the splits in the wrong order.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-241) ParquetInputFormat.getFooters() should return in the same order as what listStatus() returns

2015-10-29 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-241:
--
Fix Version/s: 1.9.0

> ParquetInputFormat.getFooters() should return in the same order as what 
> listStatus() returns
> 
>
> Key: PARQUET-241
> URL: https://issues.apache.org/jira/browse/PARQUET-241
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Mingyu Kim
> Fix For: 1.9.0
>
>
> Because of how the footer cache is implemented, getFooters() returns the 
> footers in a different order than what listStatus() returns.
> When I provided the URL 
> "hdfs://.../part-1.parquet,hdfs://.../part-2.parquet,hdfs://.../part-3.parquet",
>  ParquetInputFormat.getSplits(), which internally calls getFooters(), 
> returned the splits in the wrong order.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder

2015-10-27 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-369.
---
   Resolution: Fixed
 Assignee: Ryan Blue
Fix Version/s: format-2.3.1

> Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
> ---
>
> Key: PARQUET-369
> URL: https://issues.apache.org/jira/browse/PARQUET-369
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Ryan Blue
> Fix For: format-2.3.1
>
>
> Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see 
> [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]).
>  This also accidentally shades [this 
> line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207]
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> to
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "parquet/org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} 
> implementation even if we provide dependencies like {{slf4j-log4j12}} on the 
> classpath.
> This happens in Spark. Whenever we write a Parquet file, we see the following 
> famous message and can never get rid of it:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-241) ParquetInputFormat.getFooters() should return in the same order as what listStatus() returns

2015-10-27 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977329#comment-14977329
 ] 

Ryan Blue commented on PARQUET-241:
---

[~skonto], I think that most formats are consistent by accident, but that 
consistency isn't guaranteed. This would probably make the collect result in 
Spark more consistent.

> ParquetInputFormat.getFooters() should return in the same order as what 
> listStatus() returns
> 
>
> Key: PARQUET-241
> URL: https://issues.apache.org/jira/browse/PARQUET-241
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Mingyu Kim
>
> Because of how the footer cache is implemented, getFooters() returns the 
> footers in a different order than what listStatus() returns.
> When I provided the URL 
> "hdfs://.../part-1.parquet,hdfs://.../part-2.parquet,hdfs://.../part-3.parquet",
>  ParquetInputFormat.getSplits(), which internally calls getFooters(), 
> returned the splits in the wrong order.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-241) ParquetInputFormat.getFooters() should return in the same order as what listStatus() returns

2015-10-28 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978699#comment-14978699
 ] 

Ryan Blue commented on PARQUET-241:
---

Building 1.7.0 shouldn't make a difference because this issue is still 
unresolved. There are specs for Parquet, but nothing that covers this behavior. 
The order of listStatus probably depends on the order files were created, like 
most file systems. This would only make it so that the order of footers is the 
same as the order of the file status array.

> ParquetInputFormat.getFooters() should return in the same order as what 
> listStatus() returns
> 
>
> Key: PARQUET-241
> URL: https://issues.apache.org/jira/browse/PARQUET-241
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Mingyu Kim
>
> Because of how the footer cache is implemented, getFooters() returns the 
> footers in a different order than what listStatus() returns.
> When I provided the URL 
> "hdfs://.../part-1.parquet,hdfs://.../part-2.parquet,hdfs://.../part-3.parquet",
>  ParquetInputFormat.getSplits(), which internally calls getFooters(), 
> returned the splits in the wrong order.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-389) Filter predicates should work with missing columns

2015-10-28 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978702#comment-14978702
 ] 

Ryan Blue commented on PARQUET-389:
---

I agree, assuming that by "merged" you mean resolving the requested schema 
against different file schemas.

> Filter predicates should work with missing columns
> --
>
> Key: PARQUET-389
> URL: https://issues.apache.org/jira/browse/PARQUET-389
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.6.0, 1.7.0, 1.8.0
>Reporter: Cheng Lian
>
> This issue originates from SPARK-11103, which contains detailed information 
> about how to reproduce it.
> The major problem here is that pushed-down filter predicates assert that the 
> columns they touch must exist in the target physical files. But this isn't 
> true in the case of schema merging.
> Actually this assertion is unnecessary, because if a column referenced by the 
> filter is missing from the file, the column can be considered to be filled 
> with nulls, and all the filters should be able to act accordingly. For 
> example, if we push down {{a = 1}} but {{a}} is missing in the underlying 
> physical file, all records in this file should be dropped since {{a}} is 
> always null. On the other hand, if we push down {{a IS NULL}}, all records 
> should be preserved.
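
For example, with the filter2 API the two cases above look roughly like this (a 
sketch; {{eq(column, null)}} is the usual way to express the IS NULL test there, 
but treat the exact calls as an assumption):

{code}
import static org.apache.parquet.filter2.predicate.FilterApi.eq;
import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;

import org.apache.parquet.filter2.predicate.FilterPredicate;

// If "a" is missing from a file's schema it is effectively always null, so the
// first predicate should drop every record in that file and the second should
// keep every record.
FilterPredicate dropsAll = eq(intColumn("a"), 1);
FilterPredicate keepsAll = eq(intColumn("a"), null);
{code}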



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-140) Allow clients to control the GenericData object that is used to read Avro records

2015-10-28 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978707#comment-14978707
 ] 

Ryan Blue commented on PARQUET-140:
---

[~DeaconDesperado], you are correct. This allows you to use generic classes 
instead of specific ones by specifying GenericData instead of SpecificData or 
ReflectData.

> Allow clients to control the GenericData object that is used to read Avro 
> records
> -
>
> Key: PARQUET-140
> URL: https://issues.apache.org/jira/browse/PARQUET-140
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Josh Wills
>Assignee: Josh Wills
> Fix For: 1.6.0
>
>
> Right now, Parquet always uses the default SpecificData instance (retrieved 
> by SpecificData.get()) to lookup the schemas for SpecificRecord subclasses. 
> Unfortunately, if the definition of the SpecificRecord subclass is not 
> available to the classloader used in SpecificData.get(), we will fail to find 
> the definition of the SpecificRecord subclass and will fall back to returning 
> a GenericRecord, which will cause a ClassCastException in any client code 
> that is expecting an instance of the SpecificRecord subclass.
> We can fix this limitation by allowing the client code to specify how to 
> construct a custom instance of SpecificData (or any other subclass of 
> GenericData) for Parquet to use, including instances of SpecificData that use 
> alternative classloaders.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-391) Parquet build fails with thrift9 profile

2015-11-10 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999133#comment-14999133
 ] 

Ryan Blue commented on PARQUET-391:
---

I think this is a duplicate of PARQUET-380. There's a PR with a fix here: 
https://github.com/apache/parquet-mr/pull/276

Is it okay with you if I close this and track it on the other issue?

> Parquet build fails with thrift9 profile 
> -
>
> Key: PARQUET-391
> URL: https://issues.apache.org/jira/browse/PARQUET-391
> Project: Parquet
>  Issue Type: Bug
>Reporter: Yash Datta
>
> compile parquet build using:
> mvn clean install -Pthrift9 -DskipTests
> build fails in parquet-cascading project :
> [INFO] -
> [ERROR] COMPILATION ERROR :
> [INFO] -
> [ERROR] 
> /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[10,32]
>  package org.apache.thrift.scheme does not exist
> [ERROR] 
> /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[11,32]
>  package org.apache.thrift.scheme does not exist
> [ERROR] 
> /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[12,32]
>  package org.apache.thrift.scheme does not exist
> [ERROR] 
> /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[14,32]
>  package org.apache.thrift.scheme does not exist
> [ERROR] 
> /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[15,34]
>  cannot find symbol
>   symbol:   class TTupleProtocol
>   location: package org.apache.thrift.protocol
> [ERROR] 
> /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[40,44]
>  cannot find symbol
>   symbol:   class IScheme
>   location: class parquet.thrift.test.Name
> [ERROR] 
> /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[40,54]
>  cannot find symbol
>   symbol:   class SchemeFactory
>   location: class parquet.thrift.test.Name
> [ERROR] 
> /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[395,61]
>  cannot find symbol
>   symbol:   class SchemeFactory
>   location: class parquet.thrift.test.Name
> [ERROR] 
> /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[401,51]
>  cannot find symbol
>   symbol:   class StandardScheme
>   location: class parquet.thrift.test.Name
> [ERROR] 
> /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[462,58]
>  cannot find symbol
>   symbol:   class SchemeFactory
>   location: class parquet.thrift.test.Name
> [ERROR] 
> /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[468,48]
>  cannot find symbol



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-124) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-09 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996949#comment-14996949
 ] 

Ryan Blue commented on PARQUET-124:
---

[~swethakasireddy], it looks like this wasn't completely addressed by the fix 
above. [~terrasect] had a problem with it as well. Would one of you be willing 
to open a new issue for the current problem? Then we can work on getting it 
fixed. Thanks!

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
> parquet.io.ParquetEncodingException
> 
>
> Key: PARQUET-124
> URL: https://issues.apache.org/jira/browse/PARQUET-124
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.6.0
>Reporter: Chris Albright
>Priority: Minor
> Fix For: 1.6.0
>
> Attachments: PARQUET-124-test
>
>
> I'm running an example combining Avro, Spark and Parquet 
> (https://github.com/massie/spark-parquet-example), and in the process of 
> updating the library versions, am getting the warning below.
> The version of Parquet-Hadoop in the original example is 1.0.0. I am using 
> 1.6.0rc3
> The ParquetFileWriter.mergeFooters(Path, List<Footer>) method performs a 
> check to ensure the footers are all for files in the output directory. The 
> output directory is supplied by ParquetFileWriter.writeMetadataFile; in 
> 1.0.0, the output path was converted to a fully qualified output path before 
> the call to mergeFooters, but in 1.6.0rc[2,3] that conversion happens after 
> the call to mergeFooters. Because of this, the check within mergeFooters is 
> failing (the URI for the footers starts with file:, but the URI for the 
> root path does not).
> Here is the warning message and stacktrace.
> Oct 30, 2014 9:11:31 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could 
> not write summary file for /tmp/1414728690018-0/output
> parquet.io.ParquetEncodingException: 
> file:/tmp/1414728690018-0/output/part-r-0.parquet invalid: all the files 
> must be contained in the root /tmp/1414728690018-0/output
>   at 
> parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
>   at 
> parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
>   at 
> parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:50)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:936)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:832)
>   at 
> com.zenfractal.SparkParquetExample$.main(SparkParquetExample.scala:72)
>   at com.zenfractal.SparkParquetExample.main(SparkParquetExample.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-390) GroupType.union(Type toMerge, boolean strict) does not honor strict parameter

2015-11-09 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997083#comment-14997083
 ] 

Ryan Blue commented on PARQUET-390:
---

You're right that my suggestion is a much larger issue. For this problem, I'm 
fine with fixing the union function, though I'd like to see it fixed and tested 
rather than just tweaked, if that sounds reasonable.

> GroupType.union(Type toMerge, boolean strict) does not honor strict parameter
> -
>
> Key: PARQUET-390
> URL: https://issues.apache.org/jira/browse/PARQUET-390
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Michael Allman
>  Labels: newbie, parquet
>
> This is the code as it currently stands in master:
> {code}
> @Override
> protected Type union(Type toMerge, boolean strict) {
>   if (toMerge.isPrimitive()) {
> throw new IncompatibleSchemaModificationException("can not merge 
> primitive type " + toMerge + " into group type " + this);
>   }
>   return new GroupType(toMerge.getRepetition(), getName(), 
> mergeFields(toMerge.asGroupType()));
> }
> {code}
> Note the call to {{mergeFields}} omits the {{strict}} parameter. I believe 
> the code should be:
> {code}
> @Override
> protected Type union(Type toMerge, boolean strict) {
>   if (toMerge.isPrimitive()) {
> throw new IncompatibleSchemaModificationException("can not merge 
> primitive type " + toMerge + " into group type " + this);
>   }
>   return new GroupType(toMerge.getRepetition(), getName(), 
> mergeFields(toMerge.asGroupType(), strict));
> }
> {code}
> Note the call to {{mergeFields}} includes the {{strict}} parameter.
> I would work on this myself, but I'm having considerable trouble working with 
> the codebase (see e.g. 
> http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure).
>  Given the (assumed) simplicity of the fix, can a seasoned Parquet 
> contributor take this up? Cheers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-380) Cascading and scrooge builds fail when using thrift 0.9.0

2015-11-17 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009063#comment-15009063
 ] 

Ryan Blue commented on PARQUET-380:
---

There are build failures from thrift's SLF4J dependency. I just need to have 
some time to work through it.

> Cascading and scrooge builds fail when using thrift 0.9.0
> -
>
> Key: PARQUET-380
> URL: https://issues.apache.org/jira/browse/PARQUET-380
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 1.9.0
>
>
> This is caused by a transitive dependency on libthrift 0.7.0 from 
> elephantbird. The solution is to add thrift as an explicit (but provided) 
> dependency to those projects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2015-11-17 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009099#comment-15009099
 ] 

Ryan Blue commented on PARQUET-41:
--

[~Ferd], I think we need a design doc for this feature and some data about it 
before building an implementation. There are still some unknowns that I don't 
think we have worked through enough. I don't think the current approach that 
mirrors ORC is appropriate because we don't know the number of unique values in 
pages, and the filters are very sensitive to over-filling.

> Add bloom filters to parquet statistics
> ---
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format, parquet-mr
>Reporter: Alex Levenson
>Assignee: Ferdinand Xu
>  Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-390) GroupType.union(Type toMerge, boolean strict) does not honor strict parameter

2015-11-04 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-390:
--
Labels: newbie parquet  (was: parquet)

> GroupType.union(Type toMerge, boolean strict) does not honor strict parameter
> -
>
> Key: PARQUET-390
> URL: https://issues.apache.org/jira/browse/PARQUET-390
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Michael Allman
>  Labels: newbie, parquet
>
> This is the code as it currently stands in master:
> {code}
> @Override
> protected Type union(Type toMerge, boolean strict) {
>   if (toMerge.isPrimitive()) {
> throw new IncompatibleSchemaModificationException("can not merge 
> primitive type " + toMerge + " into group type " + this);
>   }
>   return new GroupType(toMerge.getRepetition(), getName(), 
> mergeFields(toMerge.asGroupType()));
> }
> {code}
> Note the call to {{mergeFields}} omits the {{strict}} parameter. I believe 
> the code should be:
> {code}
> @Override
> protected Type union(Type toMerge, boolean strict) {
>   if (toMerge.isPrimitive()) {
> throw new IncompatibleSchemaModificationException("can not merge 
> primitive type " + toMerge + " into group type " + this);
>   }
>   return new GroupType(toMerge.getRepetition(), getName(), 
> mergeFields(toMerge.asGroupType(), strict));
> }
> {code}
> Note the call to {{mergeFields}} includes the {{strict}} parameter.
> I would work on this myself, but I'm having considerable trouble working with 
> the codebase (see e.g. 
> http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure).
>  Given the (assumed) simplicity of the fix, can a seasoned Parquet 
> contributor take this up? Cheers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-390) GroupType.union(Type toMerge, boolean strict) does not honor strict parameter

2015-11-04 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989939#comment-14989939
 ] 

Ryan Blue commented on PARQUET-390:
---

Thanks for the bug report, Michael. I think you're right about this.

Could you share with us what you're using this for? This was originally used to 
build an overall schema for the files in a job, but that is no longer necessary 
and we mostly removed the need for it in PARQUET-139. I'd like to understand 
your use case. Thanks!

> GroupType.union(Type toMerge, boolean strict) does not honor strict parameter
> -
>
> Key: PARQUET-390
> URL: https://issues.apache.org/jira/browse/PARQUET-390
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Michael Allman
>  Labels: parquet
>
> This is the code as it currently stands in master:
> {code}
> @Override
> protected Type union(Type toMerge, boolean strict) {
>   if (toMerge.isPrimitive()) {
> throw new IncompatibleSchemaModificationException("can not merge 
> primitive type " + toMerge + " into group type " + this);
>   }
>   return new GroupType(toMerge.getRepetition(), getName(), 
> mergeFields(toMerge.asGroupType()));
> }
> {code}
> Note the call to {{mergeFields}} omits the {{strict}} parameter. I believe 
> the code should be:
> {code}
> @Override
> protected Type union(Type toMerge, boolean strict) {
>   if (toMerge.isPrimitive()) {
> throw new IncompatibleSchemaModificationException("can not merge 
> primitive type " + toMerge + " into group type " + this);
>   }
>   return new GroupType(toMerge.getRepetition(), getName(), 
> mergeFields(toMerge.asGroupType(), strict));
> }
> {code}
> Note the call to {{mergeFields}} includes the {{strict}} parameter.
> I would work on this myself, but I'm having considerable trouble working with 
> the codebase (see e.g. 
> http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure).
>  Given the (assumed) simplicity of the fix, can a seasoned Parquet 
> contributor take this up? Cheers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-373) MemoryManager tests are flaky

2015-10-19 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-373.
---
Resolution: Fixed

> MemoryManager tests are flaky
> -
>
> Key: PARQUET-373
> URL: https://issues.apache.org/jira/browse/PARQUET-373
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 1.9.0
>
>
> The memory manager tests are flaky, depending on the heap allocation for the 
> JVM they run in. This is caused by over-specific tests that assert the memory 
> allocation down to the byte and the fact that some assertions implicitly cast 
> long values to doubles to use the "within" form of assertEquals.
> The tests should not validate a specific allocation strategy, but should 
> instead assert that:
> 1. The allocation for a file is the row group size until room runs out
> 2. When scaling row groups, the total allocation does not exceed the pool size



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-246) ArrayIndexOutOfBoundsException with Parquet write version v2

2015-07-09 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-246:
--
Fix Version/s: (was: 2.0.0)
   1.8.0

 ArrayIndexOutOfBoundsException with Parquet write version v2
 

 Key: PARQUET-246
 URL: https://issues.apache.org/jira/browse/PARQUET-246
 Project: Parquet
  Issue Type: Bug
Affects Versions: 1.6.0
Reporter: Konstantin Shaposhnikov
 Fix For: 1.8.0


 I am getting the following exception when reading a parquet file that was 
 created using Avro WriteSupport and Parquet write version v2.0:
 {noformat}
 Caused by: parquet.io.ParquetDecodingException: Can't read value in column 
 [colName, rows, array, name] BINARY at value 313601 out of 428260, 1 out of 
 39200 in currentPage. repetition level: 0, definition level: 2
   at 
 parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
   at 
 parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364)
   at 
 parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
   ... 27 more
 Caused by: java.lang.ArrayIndexOutOfBoundsException
   at 
 parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70)
   at 
 parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307)
   at 
 parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458)
   ... 30 more
 {noformat}
 The file is quite big (500Mb) so I cannot upload it here, but possibly there 
 is enough information in the exception message to understand the cause of 
 error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-246) ArrayIndexOutOfBoundsException with Parquet write version v2

2015-07-09 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14620919#comment-14620919
 ] 

Ryan Blue commented on PARQUET-246:
---

The {{parquet.split.files}} option will read all files sequentially. You'll get 
one task per file instead of one task per input split (HDFS block). The reason 
is that we can't detect this situation while calculating splits without reading 
the file metadata to determine what version of Parquet wrote the file and 
whether it uses the delta byte array encoding. That would mean reading the 
footers on the task side, which is a bottleneck that we just fixed in 
PARQUET-139. Basically, reading the footers to plan splits doesn't scale well 
enough. So the compromise is to detect when a job would read corrupt data and 
fail those tasks with a message that tells you how to avoid the problem. It 
isn't ideal, but luckily this encoding wasn't very widely used.
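
For anyone hitting this, a minimal sketch of setting the option from a Hadoop 
job driver follows. The property name comes from the comment above; the 
assumption that setting it to false is what disables split generation, and the 
DisableParquetSplits driver class itself, are illustrative only:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DisableParquetSplits {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Read each affected file sequentially in a single task instead of one
    // task per HDFS block, as described in the comment above.
    conf.setBoolean("parquet.split.files", false);
    Job job = Job.getInstance(conf, "read-v2-delta-files");
    // ... set input/output formats and paths as usual ...
  }
}
{code}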

 ArrayIndexOutOfBoundsException with Parquet write version v2
 

 Key: PARQUET-246
 URL: https://issues.apache.org/jira/browse/PARQUET-246
 Project: Parquet
  Issue Type: Bug
Affects Versions: 1.6.0
Reporter: Konstantin Shaposhnikov
Assignee: Konstantin Shaposhnikov
 Fix For: 1.8.0


 I am getting the following exception when reading a parquet file that was 
 created using Avro WriteSupport and Parquet write version v2.0:
 {noformat}
 Caused by: parquet.io.ParquetDecodingException: Can't read value in column 
 [colName, rows, array, name] BINARY at value 313601 out of 428260, 1 out of 
 39200 in currentPage. repetition level: 0, definition level: 2
   at 
 parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
   at 
 parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364)
   at 
 parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
   ... 27 more
 Caused by: java.lang.ArrayIndexOutOfBoundsException
   at 
 parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70)
   at 
 parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307)
   at 
 parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458)
   ... 30 more
 {noformat}
 The file is quite big (500Mb) so I cannot upload it here, but possibly there 
 is enough information in the exception message to understand the cause of 
 error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-246) ArrayIndexOutOfBoundsException with Parquet write version v2

2015-07-09 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-246.
---
Resolution: Fixed
  Assignee: Konstantin Shaposhnikov

Closing this now that the read side has a fix. Thanks Konstantin, Sergio, Alex, 
and Tianshuo for all your work getting this resolved!

 ArrayIndexOutOfBoundsException with Parquet write version v2
 

 Key: PARQUET-246
 URL: https://issues.apache.org/jira/browse/PARQUET-246
 Project: Parquet
  Issue Type: Bug
Affects Versions: 1.6.0
Reporter: Konstantin Shaposhnikov
Assignee: Konstantin Shaposhnikov
 Fix For: 1.8.0


 I am getting the following exception when reading a parquet file that was 
 created using Avro WriteSupport and Parquet write version v2.0:
 {noformat}
 Caused by: parquet.io.ParquetDecodingException: Can't read value in column 
 [colName, rows, array, name] BINARY at value 313601 out of 428260, 1 out of 
 39200 in currentPage. repetition level: 0, definition level: 2
   at 
 parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
   at 
 parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364)
   at 
 parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
   ... 27 more
 Caused by: java.lang.ArrayIndexOutOfBoundsException
   at 
 parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70)
   at 
 parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307)
   at 
 parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458)
   ... 30 more
 {noformat}
 The file is quite big (500Mb) so I cannot upload it here, but possibly there 
 is enough information in the exception message to understand the cause of 
 error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-380) Cascading and scrooge builds fail when using thrift 0.9.0

2015-11-17 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009838#comment-15009838
 ] 

Ryan Blue commented on PARQUET-380:
---

When I add the dependency for libthrift, I get an error somewhere in cascading 
that there is no StaticLoggerBinder for SLF4J. That's an easy fix: add a binder 
like slf4j-nop or slf4j-simple. But, when I add slf4j-simple:1.7.5, I get:

{code}
SLF4J: The requested version 1.6.99 by your slf4j binding is not compatible 
with [1.5.5, 1.5.6, 1.5.7, 1.5.8]
. . .
java.lang.NoSuchMethodError: 
org.slf4j.helpers.MessageFormatter.format(Ljava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)Lorg/slf4j/helpers/FormattingTuple;
  at org.slf4j.impl.SimpleLogger.formatAndLog(SimpleLogger.java:414)
  at org.slf4j.impl.SimpleLogger.info(SimpleLogger.java:546)
{code}

The version of SLF4J that is pulled in by libthrift is too old to work with a 
new binding. But, using a slf4j-simple version that works with the older 
version of thrift causes failures in the hadoop-2 profile because Hadoop pulls 
in a version of SLF4J that isn't compatible with the older slf4j-simple. So the 
fix is to pull in the new version of both slf4j-api and slf4j-simple that 
matches the hadoop-2 version. In the default profile, it overrides the 
transitive SLF4J dependency from libthrift and everything works. This is only 
needed for test dependencies, allowing downstream projects to use whatever 
version of the SLF4J API they need, which will override the old one in 
libthrift.

I've pushed a new version that should work, I'll commit it after CI tests pass.
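
Sketched below is roughly what that looks like in the affected modules' POMs. 
The slf4j.version property is a stand-in for whatever version the hadoop-2 
profile uses, so treat the coordinates and scope as illustrative rather than 
the committed change:

{code}
<!-- Test-only SLF4J pinned to the same version the hadoop-2 profile uses,
     overriding the old transitive slf4j-api from libthrift 0.7.0. -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <version>${slf4j.version}</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-simple</artifactId>
  <version>${slf4j.version}</version>
  <scope>test</scope>
</dependency>
{code}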

> Cascading and scrooge builds fail when using thrift 0.9.0
> -
>
> Key: PARQUET-380
> URL: https://issues.apache.org/jira/browse/PARQUET-380
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 1.9.0
>
>
> This is caused by a transitive dependency on libthrift 0.7.0 from 
> elephantbird. The solution is to add thrift as an explicit (but provided) 
> dependency to those projects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-380) Cascading and scrooge builds fail when using thrift 0.9.0

2015-11-17 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-380.
---
Resolution: Fixed

Fixed. Thanks for the push, [~saucam]!

> Cascading and scrooge builds fail when using thrift 0.9.0
> -
>
> Key: PARQUET-380
> URL: https://issues.apache.org/jira/browse/PARQUET-380
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 1.9.0
>
>
> This is caused by a transitive dependency on libthrift 0.7.0 from 
> elephantbird. The solution is to add thrift as an explicit (but provided) 
> dependency to those projects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-344) Limit the number of rows per block and per split

2015-08-25 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14712108#comment-14712108
 ] 

Ryan Blue commented on PARQUET-344:
---

Thanks Quentin! I like Dan's idea of limiting the raw data size as a way to 
control this without exposing a new setting to users. If you are willing to 
build a patch for that, thank you!

 Limit the number of rows per block and per split
 

 Key: PARQUET-344
 URL: https://issues.apache.org/jira/browse/PARQUET-344
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Quentin Francois

 We use Parquet to store raw metrics data and then query this data with 
 Hadoop-Pig. 
 The issue is that sometimes we end up with small Parquet files (~80 MB) that 
 contain more than 300,000,000 rows, usually because of a constant metric 
 which results in very good compression. Too good. As a result we have very 
 few maps that each process up to 10x more rows than the other maps, and we 
 lose the benefits of parallelization. 
 The fix for that has two components I believe:
 1. Be able to limit the number of rows per Parquet block (in addition to the 
 size limit).
 2. Be able to limit the number of rows per split.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-373) MemoryManager tests are flaky

2015-09-11 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-373:
-

 Summary: MemoryManager tests are flaky
 Key: PARQUET-373
 URL: https://issues.apache.org/jira/browse/PARQUET-373
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.0
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.9.0


The memory manager tests are flaky, depending on the heap allocation for the 
JVM they run in. This is caused by over-specific tests that assert the memory 
allocation down to the byte and the fact that some assertions implicitly cast 
long values to doubles to use the "within" form of assertEquals.

The tests should not validate a specific allocation strategy, but should 
instead assert that:
1. The allocation for a file is the row group size until room runs out
2. When scaling row groups, the total allocation does not exceed the pool size
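
A property-style check along those lines might look like the sketch below. This 
is an editor's illustration over plain numbers; the AllocationInvariants class 
and check() helper are invented and do not use the real MemoryManager API:

{code}
import static org.junit.Assert.assertTrue;

import java.util.List;

public class AllocationInvariants {
  // Assert the two invariants above instead of exact byte counts: no writer is
  // given more than the requested row group size, and the scaled total never
  // exceeds the memory pool.
  static void check(long poolSize, long rowGroupSize, List<Long> allocations) {
    long total = 0;
    for (long allocation : allocations) {
      assertTrue("allocation exceeds requested row group size",
          allocation <= rowGroupSize);
      total += allocation;
    }
    assertTrue("total allocation exceeds the memory pool", total <= poolSize);
  }
}
{code}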



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-335) Avro object model should not require MAP_KEY_VALUE

2015-09-11 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-335.
---
Resolution: Fixed

> Avro object model should not require MAP_KEY_VALUE
> --
>
> Key: PARQUET-335
> URL: https://issues.apache.org/jira/browse/PARQUET-335
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.8.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 1.9.0
>
>
> The Avro object model currently includes a check that requires maps to use 
> MAP_KEY_VALUE to annotate the repeated key_value group. This is not required 
> by the map type spec and should be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-372) Parquet stats can have awkwardly large values

2015-09-10 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-372:
-

 Summary: Parquet stats can have awkwardly large values
 Key: PARQUET-372
 URL: https://issues.apache.org/jira/browse/PARQUET-372
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format, parquet-mr
Reporter: Ryan Blue


If a column is storing very large values, say 2-4 MB, then the page header's 
min and max values can also be this large. It is wasteful to keep that much 
data in a page header, so we should look at options for decreasing the size 
required in these cases.

One idea is to truncate the size of binary data and change the last byte to 
0xFF (max) or 0x00 (min) to get a roughly equivalent min and max that isn't 
huge. This probably has some problems when the data stores multi-byte 
characters in UTF8 so we have to be careful and look into byte-wise sorting and 
UTF8.
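
A rough sketch of that idea for the max side follows (editor's illustration, 
not a committed design; TruncatedBounds and truncateMax() are invented names, 
and the 0x00 variant for the min side would be analogous):

{code}
import java.util.Arrays;

public class TruncatedBounds {
  // Shorten a large max value and bump the last kept byte to 0xFF so the result
  // usually still sorts at or above the original under unsigned byte-wise
  // comparison. If the kept prefix already ends in 0xFF this can under-estimate,
  // and multi-byte UTF-8 values need extra care, as noted above.
  static byte[] truncateMax(byte[] value, int maxLength) {
    if (value.length <= maxLength) {
      return value;
    }
    byte[] truncated = Arrays.copyOf(value, maxLength);
    truncated[maxLength - 1] = (byte) 0xFF;
    return truncated;
  }
}
{code}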



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-379) PrimitiveType.union erases original type

2015-09-28 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933846#comment-14933846
 ] 

Ryan Blue commented on PARQUET-379:
---

I think this is part of a larger issue of handling schema evolution. The main 
use case I know of for union is merging file schemas into a metadata summary 
file. Those are no longer really needed because each file schema is resolved 
against the requested schema individually by the reader, which eliminates the 
bottleneck that the metadata file was intended to avoid. And as you note, 
union doesn't really create a union as one might expect: a schema that can be 
used to read both of the input schemas.

> PrimitiveType.union erases original type
> 
>
> Key: PARQUET-379
> URL: https://issues.apache.org/jira/browse/PARQUET-379
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Cheng Lian
>
> The following ScalaTest test case
> {code}
>   test("merge primitive types") {
> val expected =
>   Types.buildMessage()
> .addField(
>   Types
> .required(INT32)
> .as(DECIMAL)
> .precision(7)
> .scale(2)
> .named("f"))
> .named("root")
> assert(expected.union(expected) === expected)
>   }
> {code}
> produces the following assertion error
> {noformat}
> message root {
>   required int32 f;
> }
>  did not equal message root {
>   required int32 f (DECIMAL(9,0));
> }
> {noformat}
> This is because {{PrimitiveType.union}} doesn't handle original type 
> properly. An open question is that, can two primitive types with the same 
> primitive type name but different original types be unioned?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder

2015-09-24 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906920#comment-14906920
 ] 

Ryan Blue commented on PARQUET-369:
---

I should also note: I've verified that there are no org.slf4j.* classes in the 
shaded parquet-format jar (they are now shaded.parquet.org.slf4j) and I 
decompiled LoggerFactory and verified that the reference to 
StaticLoggerBinder.class is unmodified.

> Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
> ---
>
> Key: PARQUET-369
> URL: https://issues.apache.org/jira/browse/PARQUET-369
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>
> Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see 
> [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]).
>  This also accidentally shades [this 
> line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207]
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> to
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "parquet/org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} 
> implementation even if we provide dependencies like {{slf4j-log4j12}} on the 
> classpath.
> This happens in Spark. Whenever we write a Parquet file, we see the following 
> famous message and can never get rid of it:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder

2015-09-24 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906993#comment-14906993
 ] 

Ryan Blue commented on PARQUET-369:
---

I've updated the PR to shade slf4j-nop and confirmed that everything still 
works, but the warning is gone.

> Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
> ---
>
> Key: PARQUET-369
> URL: https://issues.apache.org/jira/browse/PARQUET-369
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>
> Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see 
> [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]).
>  This also accidentally shades [this 
> line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207]
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> to
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "parquet/org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} 
> implementation even if we provide dependencies like {{slf4j-log4j12}} on the 
> classpath.
> This happens in Spark. Whenever we write a Parquet file, we see the following 
> famous message and can never get rid of it:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-383) ParquetOutputCommitter should propagate errors when writing metadata files

2015-09-24 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907123#comment-14907123
 ] 

Ryan Blue commented on PARQUET-383:
---

I think this is a good idea. I'd make the error fatal only if the user opted to 
use the metadata file. I'd suggest that we not write the metadata file by 
default, but I don't think that's an option without a major version bump 
because it could break users.
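
A sketch of that behavior is below. The class, commitSummary(), and 
writeSummaryFile() are placeholders, not the committer's real API; the point is 
only that the exception is propagated when the user opted in:

{code}
import java.io.IOException;
import java.io.UncheckedIOException;

public class SummaryCommitSketch {
  static void commitSummary(boolean summaryRequested) {
    try {
      writeSummaryFile();
    } catch (IOException e) {
      if (summaryRequested) {
        // The user asked for the metadata file, so failing to write it is fatal.
        throw new UncheckedIOException(e);
      }
      // Otherwise log and continue; the data files themselves are already committed.
      System.err.println("Could not write summary metadata file: " + e);
    }
  }

  private static void writeSummaryFile() throws IOException {
    // placeholder for the real metadata write
  }
}
{code}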

> ParquetOutputCommitter should propagate errors when writing metadata files
> --
>
> Key: PARQUET-383
> URL: https://issues.apache.org/jira/browse/PARQUET-383
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Alex Levenson
>Priority: Minor
>
> There's a lot of different ways the output committer can fail, or fail to 
> rollback after failing to write metadata files. We should decide whether 
> metadata files are required, and fatal (I think that's reasonable if the user 
> asked for them), and propagate without squashing exceptions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-369) Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder

2015-09-24 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906986#comment-14906986
 ] 

Ryan Blue commented on PARQUET-369:
---

Ignore my comment above; I just tested the partial relocation and it doesn't 
work because of references back to some of the moved classes. It looks like we 
can either ship parquet-format with an slf4j-api dependency or bundle it with a 
logger implementation, like slf4j-nop.

I don't think there is much interesting information being logged by thrift, and 
those messages have been missing for at least the last release without any 
complaints. I suggest we add slf4j-nop and shade it to avoid the warning.

> Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder
> ---
>
> Key: PARQUET-369
> URL: https://issues.apache.org/jira/browse/PARQUET-369
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>
> Parquet-format shades SLF4J to {{parquet.org.slf4j}} (see 
> [here|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0/pom.xml#L162]).
>  This also accidentally shades [this 
> line|https://github.com/qos-ch/slf4j/blob/v_1.7.2/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L207]
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> to
> {code}
> private static String STATIC_LOGGER_BINDER_PATH = 
> "parquet/org/slf4j/impl/StaticLoggerBinder.class";
> {code}
> and thus {{LoggerFactory}} can never find the correct {{StaticLoggerBinder}} 
> implementation even if we provide dependencies like {{slf4j-log4j12}} on the 
> classpath.
> This happens in Spark. Whenever we write a Parquet file, we see the following 
> famous message and can never get rid of it:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-382) Add a way to append encoded blocks in ParquetFileWriter

2015-09-24 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-382:
-

 Summary: Add a way to append encoded blocks in ParquetFileWriter
 Key: PARQUET-382
 URL: https://issues.apache.org/jira/browse/PARQUET-382
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-mr
Affects Versions: 1.8.0
Reporter: Ryan Blue
Assignee: Ryan Blue


Concatenating two files together currently requires reading the source files 
and rewriting the content from scratch. This ends up taking a lot of memory, 
even if the data is already encoded correctly and blocks just need to be 
appended and have their metadata updated. Merging two files should be fast and 
not take much memory.
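
Conceptually, the fast path looks like the toy sketch below (editor's 
illustration over plain byte arrays; AppendSketch and appendEncoded() are 
invented names, not the parquet-mr API): copy the already-encoded row group 
bytes as-is and only record new offsets for the footer metadata.

{code}
import java.util.List;

public class AppendSketch {
  // Toy model: row groups are opaque byte blocks; appending means copying the
  // bytes unchanged and recording new offsets for the footer, with no
  // decode/re-encode of the records.
  static void appendEncoded(List<byte[]> target, List<Long> targetOffsets,
                            List<byte[]> source, long firstFreeOffset) {
    long offset = firstFreeOffset;
    for (byte[] rowGroup : source) {
      target.add(rowGroup);          // raw encoded bytes, copied unchanged
      targetOffsets.add(offset);     // footer metadata only needs updated offsets
      offset += rowGroup.length;
    }
  }
}
{code}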



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PARQUET-372) Parquet stats can have awkwardly large values

2015-09-25 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue reassigned PARQUET-372:
-

Assignee: Ryan Blue

> Parquet stats can have awkwardly large values
> -
>
> Key: PARQUET-372
> URL: https://issues.apache.org/jira/browse/PARQUET-372
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format, parquet-mr
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>
> If a column is storing very large values, say 2-4 MB, then the page header's 
> min and max values can also be this large.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-372) Parquet stats can have awkwardly large values

2015-09-25 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-372:
--
Description: If a column is storing very large values, say 2-4 MB, then the 
page header's min and max values can also be this large.  (was: If a column is 
storing very large values, say 2-4 MB, then the page header's min and max 
values can also be this large. It is wasteful to keep that much data in a page 
header, so we should look at options for decreasing the size required in these 
cases.

One idea is to truncate the size of binary data and change the last byte to 
0xFF (max) or 0x00 (min) to get a roughly equivalent min and max that isn't 
huge. This probably has some problems when the data stores multi-byte 
characters in UTF8 so we have to be careful and look into byte-wise sorting and 
UTF8.)

> Parquet stats can have awkwardly large values
> -
>
> Key: PARQUET-372
> URL: https://issues.apache.org/jira/browse/PARQUET-372
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format, parquet-mr
>Reporter: Ryan Blue
>
> If a column is storing very large values, say 2-4 MB, then the page header's 
> min and max values can also be this large.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-34) Add support for repeated columns in the filter2 API

2015-12-02 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-34?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036944#comment-15036944
 ] 

Ryan Blue commented on PARQUET-34:
--

[~f.pompermaier], I don't think anyone has extra cycles to spend implementing 
this right now, but if you are interested in building it, we'll work with you 
to get it reviewed and included.

I think the next step is to write up what you think needs to be done so we can 
look at it and help point you in the right direction. It may be that the 
disconnect between Alex's comment about this being easy and your assessment 
comes down to a different expected level of support.

> Add support for repeated columns in the filter2 API
> ---
>
> Key: PARQUET-34
> URL: https://issues.apache.org/jira/browse/PARQUET-34
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Alex Levenson
>Priority: Minor
>  Labels: filter2
>
> They currently are not supported. They would need their own set of operators, 
> like contains() and size() etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-382) Add a way to append encoded blocks in ParquetFileWriter

2015-12-08 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-382.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Merged #278. Thanks for reviewing, Sergio!

> Add a way to append encoded blocks in ParquetFileWriter
> ---
>
> Key: PARQUET-382
> URL: https://issues.apache.org/jira/browse/PARQUET-382
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.8.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 1.9.0
>
>
> Concatenating two files together currently requires reading the source files 
> and rewriting the content from scratch. This ends up taking a lot of memory, 
> even if the data is already encoded correctly and blocks just need to be 
> appended and have their metadata updated. Merging two files should be fast 
> and not take much memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-402) Apache Pig cannot store Map data type into Parquet format

2015-12-10 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051521#comment-15051521
 ] 

Ryan Blue commented on PARQUET-402:
---

Is there anything we can do about it? Maybe we should at least throw an 
exception when the map type passed to Parquet can't be converted to a valid 
Parquet schema because the KV types are missing.

> Apache Pig cannot store Map data type into Parquet format
> -
>
> Key: PARQUET-402
> URL: https://issues.apache.org/jira/browse/PARQUET-402
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-pig
>Affects Versions: 1.6.0, 1.8.1
>Reporter: Jerry Ylilammi
>
> Trying to store a simple map with two entries gives me the following exception:
> {code}table_with_map_data: {my_map: map[]}
> 2015-12-10 11:58:54,478 [main] INFO  
> org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is 
> deprecated. Instead, use fs.defaultFS
> 2015-12-10 11:58:54,498 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2999: Unexpected internal error. Invalid map Schema, schema should contain 
> exactly one field: my_map: map{code}
> For example taking any input and doing this gives me the exception:
> {code}table_with_map_data = FOREACH random_data GENERATE TOMAP('123', 
> 'hello', '456', 'world') as (my_map);
> DESCRIBE table_with_map_data;
> STORE table_with_map_data INTO '...' USING ParquetStorer();{code}
> I'm using the latest version of Pig: Apache Pig version 0.15.0 (r1682971), 
> compiled Jun 01 2015, 11:44:35, 
> and Parquet: parquet-pig-bundle-1.6.0.jar.
> EDIT: I noticed Parquet 1.8.1 is out. I switched to it and was forced to 
> update the Pig script to use the full path for ParquetStorer. However, this 
> gives me the same error as with 1.6.0.
> {code}STORE table_with_map_data INTO 
> '/Users/jerry/tmp/parquet/output/parquet' USING 
> org.apache.parquet.pig.ParquetStorer();{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-393) release parquet-format 2.3.1

2015-12-10 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-393:
--
Summary: release parquet-format 2.3.1  (was: release parquet-format 2.4.0)

> release parquet-format 2.3.1
> 
>
> Key: PARQUET-393
> URL: https://issues.apache.org/jira/browse/PARQUET-393
> Project: Parquet
>  Issue Type: Task
>Reporter: Julien Le Dem
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-346) ThriftSchemaConverter throws for unknown struct or union type

2015-12-14 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-346:
--
Fix Version/s: (was: 2.0.0)
   1.9.0

> ThriftSchemaConverter throws for unknown struct or union type
> -
>
> Key: PARQUET-346
> URL: https://issues.apache.org/jira/browse/PARQUET-346
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Alex Levenson
>Assignee: Alex Levenson
> Fix For: 1.9.0
>
>
> ThriftSchemaConverter should either only be called on ThriftStructs that 
> have populated structOrUnionType metadata, or should support a mode where 
> this data is unknown without throwing an exception.
> Currently it is called using the file's metadata here:
> https://github.com/apache/parquet-mr/blob/d6f082b9be5d507ff60c6bc83a179cc44015ab97/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftRecordConverter.java#L797
> One workaround is to not use the file metadata here but rather the schema 
> from the thrift class. The other is to support unknown struct or union types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-405) Backwards-incompatible change to thrift metadata

2015-12-14 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057013#comment-15057013
 ] 

Ryan Blue commented on PARQUET-405:
---

Thanks, Ben! Both for reporting the issue and for helping us keep the issues 
organized.

> Backwards-incompatible change to thrift metadata
> 
>
> Key: PARQUET-405
> URL: https://issues.apache.org/jira/browse/PARQUET-405
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: Ben Kirwin
>
> Sometime in the last few versions, an {{isStructOrUnion}} field has been added 
> to the {{thrift.descriptor}} written to the Parquet header:
> {code}
> {
> "children": [ ... ],
> "id": "STRUCT", 
> "structOrUnionType": "STRUCT"
> }
> {code}
> The current release now throws an exception when that field is missing or 
> {{UNKNOWN}}. This makes it impossible to read back thrift data written using 
> a previous release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

