Re: Parquet sync starting now

2018-03-13 Thread Julien Le Dem
Notes: Attendees: - Julien (WeWork): proto, release - Marcel: Iceberg - Zoltan, Gabor, Anna (Cloudera): null values bug - https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-1222

Re: parquet sync starting now

2017-10-11 Thread Julien Le Dem
Attendees/agenda: Santlal Deepak (Vertica): deprecation of older compression. Lars (Cloudera, Impala): Column indexes Marcel: Column indexes Ryan (Netflix): release parquet-format 2.4.0, needs help on the Java side. Parquet-related table format (id-based column projection) Jim (Cloudera) Zoltan

Re: Parquet sync starting now

2017-08-16 Thread Lars Volker
Here are the notes I took: Pooja (CMU, Cloudera): Present her work on Parquet indices Yaliang (Presto), Zoltan (Cloudera), Anna (Cloudera), Marcel, Deepak (Vertica): Interested in Parquet index work Ryan (Netflix): Parquet indices, compression Junjie (Intel): Bloom filter proposal Parquet Indices

Re: Parquet sync starting now

2017-08-14 Thread Wes McKinney
I have not taken a look at the performance of different compression algorithms yet. Are there any example datasets that anyone would like to see statistics for? Otherwise I will generate some high and low entropy datasets with dictionary encoding disabled (so that the compression is handled more by
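A minimal sketch of the kind of size comparison described above, assuming pyarrow as the writer; the column name, data shapes, and codec list are illustrative, not from the thread:

import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n = 1_000_000
low_entropy = np.random.randint(0, 10, n)        # few distinct values, compresses well
high_entropy = np.random.randint(0, 2**31, n)    # mostly unique values, compresses poorly

for name, data in [("low", low_entropy), ("high", high_entropy)]:
    table = pa.table({"v": data})
    for codec in ["none", "snappy", "gzip"]:
        path = f"{name}_{codec}.parquet"
        # Dictionary encoding disabled so the codec does most of the work.
        pq.write_table(table, path, use_dictionary=False, compression=codec)
        print(name, codec, os.path.getsize(path))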

Re: Parquet sync starting now

2017-08-11 Thread Julien Le Dem
Sorry for the delay. See notes below. I'm on vacation next week and Lars will send an invitation for the next sync, August 16th. Pooja will talk about her work on page indices. Here are the notes from last sync: Parquet Sync Aug 2 2017 Anna (Cloudera): Deepak (Vertica): timestamp format Jim (

Re: Parquet sync starting now

2017-08-04 Thread Jeff Knupp
Thanks! Good to know :) -Jeff On Fri, Aug 4, 2017 at 9:50 AM, Uwe L. Korn wrote: > Hello Jeff, > > they are open for anyone and everyone is appreciated! We use these syncs > to exchange and discuss things about the Parquet project as well as the > Parquet format. It is also a good point to star

Re: Parquet sync starting now

2017-08-04 Thread Uwe L. Korn
Hello Jeff, they are open for anyone and everyone is appreciated! We use these syncs to exchange and discuss things about the Parquet project as well as the Parquet format. It is also a good point to start if you want to know what the current "hot topics" in Parquet are and how you could get invol

Re: Parquet sync starting now

2017-08-04 Thread Jeff Knupp
Just out of curiosity, are these sync meetings restricted to committers and higher or can anyone listen in? Cheers, Jeff Knupp On Wed, Aug 2, 2017 at 7:28 PM, 俊杰陈 wrote: > Hi Julien > Do we have meeting minutes for sync up? I can't hear clearly from the hangout > due to a VPN issue from home. > > 20

Re: Parquet sync starting now

2017-08-02 Thread 俊杰陈
Hi Julien Do we have meeting minutes for sync up? I can't hear clearly from the hangout due to a VPN issue from home. 2017-08-03 0:01 GMT+08:00 Julien Le Dem : > on hangout: > https://hangouts.google.com/hangouts/_/calendar/ > anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04 > -- Thanks &

Re: Parquet sync starting now

2017-07-19 Thread Julien Le Dem
Notes: Parquet Sync Jul 19 2017 Intros, Agenda: Anna, Zoltan (Cloudera Budapest): Column Chunk deprecation (PARQUET-291), type dependent sort orderings Cheng (Intel Shanghai): Parquet Bloom Filter Jim (Cloudera): Bloom Filter Lars (Cloudera Impala): Marcel: Column index design Ryan (Netflix): Blo

Re: Parquet sync starting now

2017-07-19 Thread Wes McKinney
The video call is full On Wed, Jul 19, 2017 at 12:29 PM, Julien Le Dem wrote: > https://plus.google.com/hangouts/_/calendar/anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.vtfomsfgpbvjqd8d3kb8hte3j8

Re: parquet sync starting now

2017-05-24 Thread Julien Le Dem
Notes Ryan (Netflix): - Parquet bloom filters Julien (Dremio): - timestamp logical type - timestamp unknown ordering - pig decimal Deepak (Vertica): - timestamp - bloom filter Bloom filters: - Intel came back with good numbers on their bloom filters Pull Request - TODO: define the spec
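For readers following along, a generic Bloom filter sketch in Python, purely illustrative; this is not the spec that the TODO above refers to:

import hashlib

class BloomFilter:
    """Bit array plus k hash functions; false positives possible, false negatives are not."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + value).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))

bf = BloomFilter()
bf.add(b"alice")
print(bf.might_contain(b"alice"), bf.might_contain(b"bob"))  # True, (almost certainly) False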

Re: Parquet sync starting now

2017-05-10 Thread Julien Le Dem
Notes: Attendees and agenda building: Ryan (Netflix): - new logical types representation - index proposal Deepak (Vertica): - logical types for timestamps Lars (Impala): - dummy ordering to test unknown ordering - implement new ordering in parquet-mr Marcel (Impala): - index proposal

Re: Parquet sync starting now on hangout

2017-03-10 Thread Julien Le Dem
It requires extra conversion when using code expecting millis timestamps. That's probably not a strong argument against it except we now have data stored in that format. Those types were added a while ago: https://issues.apache.org/jira/browse/PARQUET-12 On Thu, Mar 9, 2017 at 6:15 PM, Marcel Korn

Re: Parquet sync starting now on hangout

2017-03-09 Thread Marcel Kornacker
Timestamp_millis seems like a subset of Timestamp_micros, unless I'm missing something: both need 8 bytes of storage, and you can obviously pad the former by multiplying with 1000 to arrive at the latter. Postgres supports timestamp_micros with a range of 4713BC/294276AD, and while dropping to a mi
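A quick worked example of the padding argument above (values are illustrative): both annotations sit on an int64, so a millisecond timestamp converts losslessly to microseconds by multiplying by 1000.

millis = 1_488_931_200_123        # an instant as milliseconds since the Unix epoch
micros = millis * 1000            # the same instant at microsecond precision
assert micros // 1000 == millis   # the conversion round-trips exactly
print(micros)                     # 1488931200123000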

Re: Parquet sync starting now on hangout

2017-03-08 Thread Ryan Blue
TIMESTAMP_MILLIS is a common format for applications that aren't SQL engines and is intended as a way for those apps to mark timestamps. SQL engines would ideally recognize those values and be able to read them. rb On Wed, Mar 8, 2017 at 2:08 PM, Marcel Kornacker wrote: > One thing I forgot to

Re: Parquet sync starting now on hangout

2017-03-08 Thread Marcel Kornacker
One thing I forgot to bring up: do we care about TIMESTAMP_MILLIS in addition to TIMESTAMP_MICROS? From SQL perspective, only the latter is needed. On Wed, Mar 8, 2017 at 1:54 PM, Julien Le Dem wrote: > 2. The other thing to look into is HyperLogLog for approximate distinct > value count. Simila

Re: Parquet sync starting now on hangout

2017-03-08 Thread Julien Le Dem
2. The other thing to look into is HyperLogLog for approximate distinct value count. Similar concept to Bloom filters On Wed, Mar 8, 2017 at 1:39 PM, Ryan Blue wrote: > To follow up on the bloom filter discussion: The discussion on PARQUET-41 >
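As an illustration of the HyperLogLog idea (using the third-party datasketch package, which is an assumption for this example and not something proposed in the thread):

from datasketch import HyperLogLog

hll = HyperLogLog(p=12)                  # 2^12 registers, roughly 1.6% relative error
for i in range(100_000):
    hll.update(str(i).encode("utf8"))    # feed each value's bytes into the sketch

print(int(hll.count()))                  # approximate distinct count, close to 100000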

Re: Parquet sync starting now on hangout

2017-03-08 Thread Ryan Blue
To follow up on the bloom filter discussion: The discussion on PARQUET-41 has a lot of information and context for the bloom filter spreadsheet I

Re: Parquet sync starting now on hangout

2017-03-08 Thread Julien Le Dem
Notes: Attendees/Agenda: Zoltan (Cloudera, file formats): - timestamp types Ryan (Netflix): - timestamp types - fix for sorting metadata (min-max) Deepak (Vertica, parquet-cpp): - timestamp Emily (IBM Spark Technology center) Greg (Cloudera): - timestamp Lars (Cloudera impala): - min-max

Re: parquet sync starting now

2017-02-28 Thread Deepak Majeti
I am in favor of the two timestamp type solution as well. We also have a choice between nanosecond and microsecond/millisecond precision. Not all tools require nanosecond precision. I propose the following. - Add two logical types for nanosecond precision (TIMESTAMP, TIMESTAMP_TZ). The underlyin
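One concrete trade-off behind the precision choice, assuming an int64 physical value (a back-of-the-envelope illustration, not part of the proposal text):

SECONDS_PER_YEAR = 60 * 60 * 24 * 365.25
for unit, ticks_per_second in [("millis", 10**3), ("micros", 10**6), ("nanos", 10**9)]:
    years = (2**63) / (ticks_per_second * SECONDS_PER_YEAR)
    print(f"{unit}: ~+/-{years:,.0f} years around the epoch")
# nanos cover only about +/-292 years (roughly years 1677..2262); millis and micros cover far more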

Re: parquet sync starting now

2017-02-27 Thread Greg Rahn
I think the decision comes down to how many TIMESTAMP types Parquet (and the systems that use it as a format) wants to support, or which use cases are being targeted. If the answer is two, then it makes sense to follow the ANSI standard and what Postgres et al. have done: - timestamp [ without t

Re: parquet sync starting now

2017-02-27 Thread Marcel Kornacker
Greg, thanks for this writeup. Going back to "timestamp with timezone" in Parquet: does anything speak *against* following the SQL standard and storing UTC without an attached timezone (and leaving it to the client to do the conversion correctly for timestamp literals)? On Mon, Feb 27, 2017 at 4:
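An illustration of the "store UTC, let the client convert" semantics being discussed (example values only; not a statement of what the standard requires):

from datetime import datetime, timezone, timedelta

# A literal written with an explicit offset, e.g. 2017-02-27 10:00:00-08:00 ...
local = datetime(2017, 2, 27, 10, 0, 0, tzinfo=timezone(timedelta(hours=-8)))

# ... is normalized to UTC before storage; the original offset is not kept.
stored_utc = local.astimezone(timezone.utc)
print(stored_utc.isoformat())            # 2017-02-27T18:00:00+00:00

# A reader in a different session time zone converts back from UTC for display.
display = stored_utc.astimezone(timezone(timedelta(hours=1)))
print(display.isoformat())               # 2017-02-27T19:00:00+01:00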

Re: parquet sync starting now

2017-02-27 Thread Marcel Kornacker
On Mon, Feb 27, 2017 at 10:43 AM, Zoltan Ivanfi wrote: > What you describe (storing in UTC and adjusting to local time) is the > implicit timezone that is associated with the plain TIMESTAMP type of ANSI > SQL. Excerpts: Postgres allows explicit timezone offsets in timestamp literals. When these a

Re: parquet sync starting now

2017-02-27 Thread Zoltan Ivanfi
What you describe (storing in UTC and adjusting to local time) is the implicit timezone that is associated with the plain TIMESTAMP type of ANSI SQL. Excerpts: Datetime data types that contain time fields (TIME and TIMESTAMP) are maintained in Universal Coordinated Time (UTC), with an explicit

Re: parquet sync starting now

2017-02-27 Thread Marcel Kornacker
On Mon, Feb 27, 2017 at 8:47 AM, Zoltan Ivanfi wrote: > Hi, > > Although the draft of SQL-92[1] does not explicitly state that the time zone > offset has to be stored, the following excerpts strongly suggest that the > time zone has to be stored with each individual value of TIMESTAMP WITH TIME >

Re: parquet sync starting now

2017-02-27 Thread Zoltan Ivanfi
Hi, Although the draft of SQL-92[1] does not explicitly state that the time zone offset has to be stored, the following excerpts strongly suggest that the time zone has to be stored with each individual value of TIMESTAMP WITH TIME ZONE: The length of a TIMESTAMP is 19 positions [...] The len

Re: parquet sync starting now

2017-02-23 Thread Marcel Kornacker
Yes, that sounds like a good idea. On Thu, Feb 23, 2017 at 2:16 PM, Wes McKinney wrote: > I made some comments about sharing C++ code more generally amongst > Impala, Kudu, Parquet, and Arrow. > > There's a significant amount of byte and bit processing code that > should have little coupling to t

Re: parquet sync starting now

2017-02-23 Thread Wes McKinney
I made some comments about sharing C++ code more generally amongst Impala, Kudu, Parquet, and Arrow. There's a significant amount of byte and bit processing code that should have little coupling to the Impala or Kudu runtime: - SIMD algorithms for hashing - RLE encoding - Dictionary encoding - Bi
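As a toy illustration of one of the routines listed above (plain run-length encoding, written in Python here; Parquet's actual encoding is a hybrid of RLE and bit-packing, so treat this as a simplification only):

def rle_encode(values):
    """Encode a flat list as (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the flat list."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

data = [0, 0, 0, 1, 1, 2, 2, 2, 2]
assert rle_decode(rle_encode(data)) == data
print(rle_encode(data))   # [(0, 3), (1, 2), (2, 4)]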

Re: parquet sync starting now

2017-02-23 Thread Marcel Kornacker
Regarding timestamp with timezone: I'm not sure whether the SQL standard requires the timezone to be stored along with the timestamp for 'timestamp with timezone' (at least Oracle and Postgres diverge on that topic). Cc'ing Greg Rahn to shed some more light on that. Regarding 'make Impala depend

Re: parquet sync starting now

2017-02-23 Thread Lars Volker
Thank you Julien for writing up the notes! Here is the Impala JIRA I mentioned that tracks swapping the fields of TimestampValue: IMPALA-4825 A change is out for review here: https://gerrit.cloudera.org/#/c/6048/ Cheers, Lars On Thu, Feb 23, 2017

Re: parquet sync starting now

2017-02-23 Thread Julien Le Dem
Attendees/agenda: - Nandor, Zoltan (Cloudera/file formats) - Lars (Cloudera/Impala): Statistics progress - Uwe (Blue Yonder): Parquet cpp RC. Int96 timestamps - Wes (twosigma): parquet cpp rc. 1.0 Release - Julien (Dremio): parquet metadata. Statistics. - Deepak (HP/Vertica): Parquet-cpp - Kazuaki

Re: Parquet sync starting now

2016-10-06 Thread Julien Le Dem
Attendees/Agenda Julien (Dremio): - 1.9.0 release Dan, Ryan (Netflix): - new statistics discussion (ordering) - new encodings. - IOManager discussion - time wasted in GC in Hive Parquet serde Piyush (Twitter): - better thrift integration in Scala Sergio (Cloudera): - presented new people wo