Re: [DISCUSS] Write-audit-publish support

2019-11-11 Thread Miao Wang
From a timeline perspective, we can’t work on implementing this feature in the next couple of months. As a short-term workaround, we chose a lock mechanism at the application level.

@Anton Okolnychyi If you can pick up this feature, that would be great!

Thanks!

Miao

From: Ryan Blue 
Reply-To: "dev@iceberg.apache.org" , 
"rb...@netflix.com" 
Date: Monday, November 11, 2019 at 11:54 AM
To: Anton Okolnychyi 
Cc: Iceberg Dev List , Ashish Mehta 

Subject: Re: [DISCUSS] Write-audit-publish support

I just had a direct request for this over the weekend, too. I opened #629 (Add cherry-pick operation) to track this.

On Mon, Nov 11, 2019 at 1:43 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
We would be interested in this functionality as well. We have a use case with 
multiple concurrent writers where we wanted to use WAP but couldn’t.


On 9 Nov 2019, at 01:32, Ryan Blue <rb...@netflix.com.INVALID> wrote:

Right now, there isn't a good way to manage multiple pending writes. Snapshots 
from each write are created based on the current table state, so simply moving 
to one of two pending commits would mean you ignore the changes in the other 
pending commit. We've considered adding a "cherry-pick" operation that can take 
the changes from one snapshot and apply them on top of another to solve that 
problem. If you'd like to implement that, I'd be happy to review it!
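
For reference, a minimal sketch of that cherry-pick flow as it later landed in Iceberg's ManageSnapshots API (the table and staged snapshot id are assumed to be already known; both are hypothetical here):

import org.apache.iceberg.Table;

// Re-apply the staged snapshot's changes on top of the current table state
// and commit the result as the new current snapshot.
void publish(Table table, long stagedSnapshotId) {
    table.manageSnapshots()
        .cherrypick(stagedSnapshotId)
        .commit();
}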

On Fri, Nov 8, 2019 at 3:29 PM Ashish Mehta <mehta.ashis...@gmail.com> wrote:
Thanks Ryan, that worked out. Since it's a rollback, I wonder how a user can stage multiple WAP snapshots and commit them in any order, depending on how the audit process works out?
I wonder whether this expectation goes against the underlying principles of Iceberg.

Thanks,
Ashish

On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

Ashish, you can use the rollback table operation to set a particular snapshot 
as the current table state. Like this:

Table table = hiveCatalog.loadTable(name);

table.rollback().toSnapshotId(id).commit();
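
A minimal sketch tying this to the wap.id lookup discussed below in this thread: find the staged snapshot whose summary carries a given WAP id, then make it current via rollback (the wapId value is hypothetical):

import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

void publishByWapId(Table table, String wapId) {
    // Scan the table's snapshots for one staged with the given "wap.id".
    for (Snapshot snapshot : table.snapshots()) {
        if (wapId.equals(snapshot.summary().get("wap.id"))) {
            // Make the staged snapshot the current table state.
            table.rollback().toSnapshotId(snapshot.snapshotId()).commit();
            return;
        }
    }
    throw new IllegalArgumentException("No staged snapshot for wap.id=" + wapId);
}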

On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <mehta.ashis...@gmail.com> wrote:
Hi Ryan,

Can you please point me to a doc where I can find how to publish a WAP snapshot? I am able to find the snapshot based on the wap.id in the snapshot summary, but I'm unclear on the official recommendation for committing that snapshot. I can think of cherry-picking appended/deleted files, but I don't know whether I'd be missing something important with this.

Thanks,
-Ashish

-- Forwarded message -
From: Ryan Blue <rb...@netflix.com.invalid>
Date: Wed, Jul 31, 2019 at 4:41 PM
Subject: Re: [DISCUSS] Write-audit-publish support
To: Edgar Rodriguez <edgar.rodrig...@airbnb.com>
Cc: Iceberg Dev List <dev@iceberg.apache.org>, Anton Okolnychyi <aokolnyc...@apple.com>

Hi everyone, I've added PR #342 to the Iceberg repository with our WAP changes. Please have a look if you are interested in this.

On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <edgar.rodrig...@airbnb.com> wrote:
I think this use case is pretty helpful in most data environments; we do the same sort of stage-check-publish pattern to run quality checks.
One question: if the audit part fails, is there a way to expire the snapshot, or what would the workflow be that follows?

Best,
Edgar

On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <moulimukher...@gmail.com> wrote:
This would be super helpful. We have a similar workflow where we do some 
validation before letting the downstream consume the changes.

Best,
Mouli

On Mon, Jul 22, 2019 at 9:18 AM Filip <filip@gmail.com> wrote:
This definitely sounds interesting. Quick question: does this have an impact on the current Upserts spec? Or are we looking to associate this support with append-only commits?

On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

Audits run on the snapshot by setting the snapshot-id read option to read the 
WAP snapshot, even though it has not (yet) been the 
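
A minimal sketch of that audit read, assuming Spark's Java API and the Iceberg snapshot-id read option (the table identifier and snapshot id are hypothetical):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Read the staged WAP snapshot directly by id, before it is published.
Dataset<Row> readStaged(SparkSession spark, long stagedSnapshotId) {
    return spark.read()
        .format("iceberg")
        .option("snapshot-id", Long.toString(stagedSnapshotId))
        .load("db.table"); // hypothetical table identifier
}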

Re: [ANNOUNCE] Apache Iceberg release 0.7.0-incubating

2019-10-28 Thread Miao Wang
Congratulations!

Miao

From: Xabriel Collazo Mojica 
Reply-To: "dev@iceberg.apache.org" 
Date: Monday, October 28, 2019 at 9:08 AM
To: "dev@iceberg.apache.org" 
Subject: Re: [ANNOUNCE] Apache Iceberg release 0.7.0-incubating

This is great news! Thanks to all involved in development as well as testing 
the release!

Xabriel J Collazo Mojica  |  Sr Software Engineer  |  Adobe

From: John Zhuge 
Reply-To: "dev@iceberg.apache.org" 
Date: Monday, October 28, 2019 at 8:40 AM
To: "dev@iceberg.apache.org" 
Subject: Re: [ANNOUNCE] Apache Iceberg release 0.7.0-incubating

Congratulations !

On Mon, Oct 28, 2019 at 6:37 AM Anton Okolnychyi 
 wrote:
This is great! Thanks everyone who contributed and to Ryan for preparing the 
release!

- Anton



On 28 Oct 2019, at 13:30, Arina Yelchiyeva <arina.yelchiy...@gmail.com> wrote:

Congratulations on the first release!


Kind regards,
Arina



On Oct 28, 2019, at 3:16 PM, Anjali Norwood <anorw...@netflix.com.INVALID> wrote:

Congratulations!!

Regards
Anjali

On Sun, Oct 27, 2019 at 2:56 PM Ryan Blue <b...@apache.org> wrote:
Here's the release announcement that I just sent out. Thanks to everyone that 
contributed to this release!
-- Forwarded message -
From: Ryan Blue <b...@apache.org>
Date: Sun, Oct 27, 2019 at 2:55 PM
Subject: [ANNOUNCE] Apache Iceberg release 0.7.0-incubating
To: <annou...@apache.org>

I'm pleased to announce the release of Apache Iceberg 0.7.0-incubating!

Apache Iceberg is an open table format for huge analytic datasets. Iceberg delivers high query performance for tables with tens of petabytes of data, as well as atomic commits with concurrent writers and side-effect-free schema evolution.

The source release can be downloaded from:
https://www.apache.org/dyn/closer.cgi/incubator/iceberg/apache-iceberg-0.7.0-incubating/apache-iceberg-0.7.0-incubating.tar.gz
– signature – sha512

Java artifacts are available from Maven Central, including an all-in-one Spark 2.4 runtime Jar. To use Iceberg in Spark 2.4, add the runtime Jar to the jars folder of your Spark install.

Additional information is available at 
http://iceberg.apache.org/releases/

Thanks to everyone that contributed to this release! This is the first Apache 
release of Iceberg!


--
Ryan Blue


--
Ryan Blue




--
John Zhuge


Re: Iceberg tombstone?

2020-01-27 Thread Miao Wang
Hi Ryan,

Just found your comment in my junk mail box.

`It sounds like a tombstone is independent of the table snapshot, like blacklisting a row.` -> It goes into the snapshot. It was first developed for an Adobe use case, which is not a row-level delete marker. However, it is extensible/generic enough to support row-level deletes.

@Filip Bocse <fbo...@adobe.com> can provide more details on the tombstone design and use case.

Miao

From: Ryan Blue 
Reply-To: "dev@iceberg.apache.org" , 
"rb...@netflix.com" 
Date: Tuesday, January 21, 2020 at 5:40 PM
To: Iceberg Dev List 
Subject: Re: Iceberg tombstone?

Hi everyone,

Thanks for bringing this up, Filip. I've been thinking about it over the last few days. One thing that I'm not clear about is how exactly tombstones differ from the row-level deletes that we're already working on. It sounds like a tombstone is independent of the table snapshot, like blacklisting a row. So if I tombstone user = 'rdblue', it doesn't just apply to all the data currently in the table; it also deletes any future additions of that user. Is that correct?

If that's right, then I think that could be added fairly easily using the 
existing filter code and our Table interface to wrap existing readers. On the 
other hand, if that's not what you want then I'm not sure how it differs from 
the existing plan to add row-level deletes. Can you clarify?
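
A minimal sketch of the first option, assuming the tombstone predicate is applied as a scan filter with Iceberg's expression API (engines would still need to apply the residual row filter while reading):

import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expressions;

// Exclude rows matching the tombstone predicate (user = 'rdblue'),
// regardless of when the data was written.
TableScan scanWithoutTombstonedUser(Table table) {
    return table.newScan()
        .filter(Expressions.notEqual("user", "rdblue"));
}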

rb

On Wed, Jan 15, 2020 at 11:04 AM Miao Wang  wrote:
Hi Filip,

I vote for this feature because it is extensible for [1].

Besides OpenInx’s comments, can you provide a high-level description of how tombstones align with milestone [1]?

[1]. https://github.com/apache/incubator-iceberg/milestone/4

Thanks!

Miao

From: OpenInx <open...@gmail.com>
Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Date: Tuesday, January 14, 2020 at 6:34 PM
To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Subject: Re: Iceberg tombstone?

Hi Filip

So your team has implemented the tombstone feature in your internal branch. From my understanding, the tombstone you mean is similar to the delete marker in HBase, so I think you're trying to implement the update/delete feature. For this part, Anton and Miguel have a design doc [1]; IMO your work should intersect with it.

Another question: one rule we shouldn't break (my personal view) is the open file format rule, meaning data is encoded in an open format and can be viewed using non-Iceberg tools. Will your implementation follow that rule? Will you encode the tombstones in your own format, or just put them into a separate file and mark it as a tombstone file?

For the column predicate filter or row-level filter, I'm not familiar with this part. Mind providing more details? :-)

Thanks for your information :-)

[1]. https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/edit#


On Tue, Jan 14, 2020 at 10:20 PM Filip <filip@gmail.com> wrote:
Hi everyone,

I was wondering if it would be of interest to start a thread on a proposal for 
a tombstone feature in Iceberg.
Well, maybe it could be less strict than a general case implementation of 
tombstone [1].
I would be looking for at least a couple of things to be covered on this thread:

  1.  [Opportunity] I was wondering if others on this list would find such a feature useful and/or if folks would support such a feature being provided by Iceberg
  2.  [Feasibility] Also wondering if there's any intersection between a tombstone feature (i.e. filter by column predicate) and the upcoming upsert spec/implementation, or if these two may very well serve different use cases, so it's wise they shouldn't be mixed up, only indirectly I guess, by the sheer implications of accidental complexity :)
The current Iceberg codebase is quite generous with respect to the Open/Closed principle, and we've been doing some spikes to implement such a feature in a new datasource, so I thought I'd share some touchpoints of our work so far (would gladly share more if the community is interested):
> [extension] implementing tombstoning as a column/values predicate should be 
> associated w/ some specific metadata

Re: Welcome new committer and PPMC member Ratandeep Ratti

2020-02-17 Thread Miao Wang
Congratulations!

Sent from my iPhone

On Feb 17, 2020, at 3:38 PM, Gara Walid  wrote:


Congratulations!

Cheers,
Walid

On Mon, Feb 17, 2020, 10:04 PM John Zhuge <jzh...@apache.org> wrote:
Congratulations !

On Mon, Feb 17, 2020 at 12:47 PM Rohan Agarwal <rohan.agarwal...@gmail.com> wrote:
Congrats Ratandeep!!

On Mon, Feb 17, 2020 at 3:13 PM Anton Okolnychyi 
 wrote:
Thanks for the contributions and welcome!

- Anton

On 17 Feb 2020, at 12:09, Driesprong, Fokko <fo...@driesprong.frl> wrote:

Well deserved Ratandeep!

Cheers, Fokko

On Mon, 17 Feb 2020 at 20:43, Sud <sudssf2...@gmail.com> wrote:
Congratulations Ratandeep!! Keep up the good work!

On Mon, Feb 17, 2020 at 6:26 AM Anjali Norwood <anorw...@netflix.com.invalid> wrote:
Congratulations Ratandeep!!

regards.
Anjali.

On Mon, Feb 17, 2020 at 12:19 AM Manish Malhotra <manish.malhotra.w...@gmail.com> wrote:
Congratulations !!

On Sun, Feb 16, 2020 at 8:37 PM RD <rdsr...@gmail.com> wrote:
Thanks everyone!

-Best,
R.

On Sun, Feb 16, 2020 at 7:39 PM David Christle <dchris...@linkedin.com.invalid> wrote:
Congrats!!!

From: Jacques Nadeau <jacq...@dremio.com>
Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Date: Sunday, February 16, 2020 at 7:20 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Welcome new committer and PPMC member Ratandeep Ratti

Congrats!

On Sun, Feb 16, 2020, 7:06 PM xiaokun ding <ding.xiao...@gmail.com> wrote:
CONGRATULATIONS

李响 <wate...@gmail.com> wrote on Mon, Feb 17, 2020 at 11:05 AM:
CONGRATULATIONS!!!

On Mon, Feb 17, 2020 at 9:50 AM Junjie Chen <chenjunjied...@gmail.com> wrote:
Congratulations!

On Mon, Feb 17, 2020 at 5:48 AM Ryan Blue <b...@apache.org> wrote:
Hi everyone,

I'd like to congratulate Ratandeep Ratti, who was just invited to join the 
Iceberg committers and PPMC!

Thanks for your contributions and reviews, Ratandeep!

rb

--
Ryan Blue


--
Best Regards


--

   李响 Xiang Li

cellphone: +86-136-8113-8972
e-mail: wate...@gmail.com



--
John Zhuge


Re: Iceberg tombstone?

2020-01-15 Thread Miao Wang
Hi Filip,

I vote for this feature because it is extensible for [1].

Besides OpenInx’s comments, can you provide a high-level description of how tombstones align with milestone [1]?

[1]. https://github.com/apache/incubator-iceberg/milestone/4

Thanks!

Miao

From: OpenInx 
Reply-To: "dev@iceberg.apache.org" 
Date: Tuesday, January 14, 2020 at 6:34 PM
To: "dev@iceberg.apache.org" 
Subject: Re: Iceberg tombstone?

Hi Filip

So your team has implemented the tombstone feature in your internal branch. From my understanding, the tombstone you mean is similar to the delete marker in HBase, so I think you're trying to implement the update/delete feature. For this part, Anton and Miguel have a design doc [1]; IMO your work should intersect with it.

Another question: one rule we shouldn't break (my personal view) is the open file format rule, meaning data is encoded in an open format and can be viewed using non-Iceberg tools. Will your implementation follow that rule? Will you encode the tombstones in your own format, or just put them into a separate file and mark it as a tombstone file?

For the column predicate filter or row-level filter, I'm not familiar with this part. Mind providing more details? :-)

Thanks for your information :-)

[1]. 
https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/edit#


On Tue, Jan 14, 2020 at 10:20 PM Filip <filip@gmail.com> wrote:
Hi everyone,

I was wondering if it would be of interest to start a thread on a proposal for 
a tombstone feature in Iceberg.
Well, maybe it could be less strict than a general case implementation of 
tombstone [1].
I would be looking for at least a couple of things to be covered on this thread:

  1.  [Opportunity] I was wondering if others on this list would find such a feature useful and/or if folks would support such a feature being provided by Iceberg
  2.  [Feasibility] Also wondering if there's any intersection between a tombstone feature (i.e. filter by column predicate) and the upcoming upsert spec/implementation, or if these two may very well serve different use cases, so it's wise they shouldn't be mixed up, only indirectly I guess, by the sheer implications of accidental complexity :)
The current Iceberg codebase is quite generous with respect to the Open/Closed principle, and we've been doing some spikes to implement such a feature in a new datasource, so I thought I'd share some touchpoints of our work so far (would gladly share more if the community is interested):
> [extension] implementing tombstoning as a column/values predicate should be 
> associated w/ some specific metadata (snapshot id, version?) and basic 
> metrics (i.e. count, basic histograms) - mostly thinking that any tombstone 
> operation feature is accompanied by a compaction task so metadata and metrics 
> would help with building generic solutions for optimal scheduling of these 
> maintenance tasks - tombstones could be modeled/ programmed against the 
> org.apache.iceberg.Table interface
> [atomic guarantee] a simple solution is to make the tombstone operation
> atomic by having a new snapshot summary property point to a reference of an
> immutable file holding the tombstone predicates/expressions
> [new API] append tombstones
> [new API] remove tombstones
> [new API] append files and add tombstones
> [new API] append files and remove tombstones
> [new API] vacuum tombstones - a task as in clean up tombstoned rows and evict 
> the associated tombstone metadata as well, oh and maybe not `vacuum` (I 
> remember reading about this on this list in a different context and it's 
> probably reserved for a different Iceberg feature, right?)
> [extend] extend the Spark reader/writer to account for tombstone-based
> filtering and tombstone committing respectively - writing may prove easier to
> implement than reading; reading comes in many flavours, so applying a filter
> expression may not be as accessible at the moment as one would prefer when
> extending on top of Iceberg [2]
> [extend] removing snapshots would also drop their associated tombstone files
> (where available)

[1] I believe that in a general tombstone implementation the data filtering is applied only to data that was added prior to the tombstone setting/assignment operation, but that might prove quite difficult to implement with Iceberg, and it could be considered a specialization of a more generic and basic use case: adding tombstone support as filtering by column values regardless of the order of operations (data vs. tombstone).
[2] We could benefit from adding like an 
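
To make the shape of the [new API] items in the list above concrete, a hypothetical sketch (none of these names exist in Iceberg; they are invented for illustration):

import org.apache.iceberg.expressions.Expression;

// Hypothetical interface mirroring the proposed tombstone operations.
interface TombstoneOperations {
    // register predicates that hide matching rows from readers
    void appendTombstones(Expression... predicates);

    // drop previously registered tombstone predicates
    void removeTombstones(Expression... predicates);

    // rewrite data files to physically remove tombstoned rows, then evict
    // the now-redundant tombstone metadata
    void vacuumTombstones();
}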

Re: Open a new branch for row-delete feature ?

2020-03-31 Thread Miao Wang
+1 on not creating a branch now. Rebasing and maintenance are too expensive compared to fast development. Some additional thoughts below.

The row-delete feature should be behind a feature flag, which implies that it should have minimal impact on the master branch when turned off. Working on master avoids the pain of breaking master at branch merge, and it works in a fail-fast, fail-early mode.

Working on the master branch will not prevent splitting the feature into small items. Instead, it will encourage more people to work on it and help the community stay focused on the master roadmap.

Finally, if we think about rebasing, it either ends up too expensive or easy. In the former case, we should not create a branch because it is hard to keep it in sync with master. In the latter case, the work has little impact on master and there is no need for a branch.

Thanks!

Miao

From: Ryan Blue 
Reply-To: "dev@iceberg.apache.org" , 
"rb...@netflix.com" 
Date: Tuesday, March 31, 2020 at 10:08 AM
To: OpenInx 
Cc: Iceberg Dev List 
Subject: Re: Open a new branch for row-delete feature ?

I'm fine starting a branch later if we do run into those issues, but I don't 
think it is a good idea to do it now in anticipation. All of the work that we 
can do on master we should try to do on master. We can start a branch when we 
need one.

On Mon, Mar 30, 2020 at 7:44 PM OpenInx <open...@gmail.com> wrote:
Hi Ryan

The reason I suggest opening a new dev branch for row-delete development is: we will split the whole feature into many small issues, and each issue will have a pull request with an appropriate length of code so the contributors/reviewers can discuss one point at a time and iterate on this feature faster. During implementation, we will ensure that v1 works for every separate PR, but the branch may not be ready for cutting a release; for example, when releasing 0.8.0, I'm sure we won't want the release version to contain only part of the v2 spec (such as providing the sequence_number but no file_type). The Spark reader/writer and data/delete manifests may also need some code refactoring, and it's possible to put that into several PRs. Splitting into multiple pull requests may block the release of the new version for a certain period of time, and that's not what we want to see.

About the new branch's maintenance: in my experience, we could rebase the new branch on master periodically (such as every three days), so that new pull requests for row-delete are designed against the newest changes. That should work for a master branch that does not have too many new changes, which is in line with our current situation.

In this case, I weighed the maintenance cost of the new branch against the delay to row-delete. I think we should let row-delete go a little faster (almost all community users are looking forward to this feature), and I think the current maintenance cost is acceptable.

Thanks

On Tue, Mar 31, 2020 at 5:52 AM Ryan Blue  wrote:
Sorry, I didn't address the suggestion to add a Flink branch as well. The work 
needed for the Flink sink is to remove parts that are specific to Netflix, so 
I'm not sure what the rationale for a branch would be. Is there a reason why 
this can't be done in master, but requires a shared branch? If multiple people 
want to contribute, why not contribute to the same PR?

A shared PR branch makes the most sense to me for this because it is regularly 
tested against master.

On Mon, Mar 30, 2020 at 2:48 PM Ryan Blue <rb...@netflix.com> wrote:
I think we may eventually want a branch, but it is too early to create one now.

Branches are expensive. They require maintenance to stay in sync with master, usually by copying changes from master into the branch. Porting the branch's changes back to master is more difficult because it is usually not the original contributor or reviewer doing the porting. And it is better to catch problems between changes in master and the branch early.

I'm not against branches, but I don't want to create them unless they are 
valuable. In this case, I don't see the value. We plan to add v2 in parallel so 
you can still write v1 tables for compatibility, and most of the work that 
needs to be done -- like creating readers and writers for diff formats -- can 
be done in master.

rb

On Mon, Mar 30, 2020 at 9:00 AM Gautam <gautamkows...@gmail.com> wrote:
Thanks for bringing this up, OpenInx. That's a great idea: to open a separate branch for row-level deletes.

I would like to help support/contribute/review this as well. If there are 
sub-tasks you guys have identified that can be added to 

Re: Timestamp Based Incremental Reading in Iceberg ...

2020-09-08 Thread Miao Wang
+1 on OpenInx’s comment on implementation.

Unless we have an external timing synchronization service and enforce that all clients use it, timestamps from different clients are not comparable.

So, there are two asks: 1) whether to have a timestamp-based API for delta reading; 2) how to enforce and implement a service/protocol for timestamp sync among all clients.

1) +1 to have it, as Jingsong and Gautam suggested. The snapshot ID could be the source of truth in any case.

2) IMO, it should be a package external to Iceberg.

Miao

From: OpenInx 
Reply-To: "dev@iceberg.apache.org" 
Date: Tuesday, September 8, 2020 at 7:55 PM
To: Iceberg Dev List 
Subject: Re: Timestamp Based Incremental Reading in Iceberg ...

I agree that it's helpful to allow users to read incremental deltas based on a timestamp; as Jingsong said, a timestamp is more user-friendly.

My question is how to implement this ?

If we just attach the client's timestamp to the Iceberg table when committing, then different clients may have different timestamp values because of clock skew. In theory, these time values are not strictly comparable, and can only be compared within a margin of error.


On Wed, Sep 9, 2020 at 10:06 AM Jingsong Li <jingsongl...@gmail.com> wrote:
+1 for keeping timestamps linear; in the implementation, maybe the writer only needs to look at the previous snapshot's timestamp.

We're trying to think of Iceberg as a message queue. Let's take the popular queue Kafka as an example. Iceberg has a snapshotId and timestamp; correspondingly, Kafka has an offset and timestamp:
- offset: used for incremental reads, such as the state of a checkpoint in a computing system.
- timestamp: explicitly specified by the user to define the scope of consumption, as the start_timestamp for reading. A timestamp is a better user-facing interface, while an offset/snapshotId is not human-readable or friendly.

So there are scenarios where a timestamp is used for incremental reads.
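
A minimal sketch of resolving a user-supplied timestamp to a snapshot id via the table history, assuming the history timestamps are linear (exactly the caveat discussed in this thread):

import org.apache.iceberg.HistoryEntry;
import org.apache.iceberg.Table;

// Return the id of the first snapshot at or after the given time,
// or null if no snapshot qualifies.
Long firstSnapshotAtOrAfter(Table table, long timestampMillis) {
    for (HistoryEntry entry : table.history()) {
        if (entry.timestampMillis() >= timestampMillis) {
            return entry.snapshotId();
        }
    }
    return null;
}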

Best,
Jingsong


On Wed, Sep 9, 2020 at 12:45 AM Sud <sudssf2...@gmail.com> wrote:

We are using incremental reads for Iceberg tables that get quite a few appends (~500-1000 per hour), but instead of using timestamps we use snapshot ids and track the state of the last-read snapshot id.
We use the timestamp as a fallback when the state is incorrect, but as you mentioned, if timestamps are linear then it works as expected.
We also found that the incremental reader can be slow when dealing with > 2k snapshots in a range; we are currently testing a manifest-based incremental reader that looks at manifest entries instead of scanning the snapshot history and accessing each snapshot.

Is there any reason you can't use snapshot-based incremental reads?
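
A minimal sketch of the snapshot-id-based incremental read described here, using Iceberg's Spark read options; the snapshot ids would come from the reader's own checkpoint state (hypothetical here):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Read appends after lastReadId, up to and including latestId.
Dataset<Row> readDelta(SparkSession spark, long lastReadId, long latestId) {
    return spark.read()
        .format("iceberg")
        .option("start-snapshot-id", Long.toString(lastReadId))
        .option("end-snapshot-id", Long.toString(latestId))
        .load("db.table"); // hypothetical table identifier
}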

On Tue, Sep 8, 2020 at 9:06 AM Gautam <gautamkows...@gmail.com> wrote:
Hello Devs,
   We are looking into adding workflows that read data 
incrementally based on commit time. The ability to read deltas between start / 
end commit timestamps on a table and ability to resume reading from last read 
end timestamp. In that regard, we need the timestamps to be linear in the 
current active snapshot history (newer versions always have higher timestamps). 
Although Iceberg commit flow ensures the versions are newer, there isn't a 
check to ensure timestamps are linear.

Example flow: if two clients (clientA and clientB), whose time-clocks are slightly off (say by a couple of seconds), are committing frequently, clientB might get to commit after clientA even if its new snapshot timestamp is out of order. I might be wrong, but I haven't found a check in HadoopTableOperations.commit() to ensure this case does not happen.

On the other hand, restricting commits due to out-of-order timestamps can hurt commit throughput, so I can see why this isn't something Iceberg might want to enforce based on System.currentTimeMillis(). Although if clients had a way to define their own globally synchronized timestamps (using an external service or some monotonically increasing UUID), then Iceberg could allow an API to set that on the snapshot, or use that instead of System.currentTimeMillis(). Iceberg exposes something similar using sequence numbers in the v2 format to track deletes and appends.
Is this a concern others have? If so, how are folks handling this today, or are they not exposing such a feature at all due to the inherent distributed timing problem? I would like to hear how others are thinking/going about this. Thoughts?

Cheers,

-Gautam.


--
Best, Jingsong Lee


Re: New committer: Shardul Mahadik

2020-07-22 Thread Miao Wang
Congratulations!

Miao

Sent from my iPhone

> On Jul 22, 2020, at 8:08 PM, 俊杰陈  wrote:
> 
> Congrats! Good job!
> 
>> On Thu, Jul 23, 2020 at 11:01 AM Saisai Shao  wrote:
>> 
>> Congrats!
>> 
>> Thanks
>> Saisai
>> 
>>> OpenInx wrote on Thu, Jul 23, 2020 at 10:06 AM:
>>> 
>>> Congratulations !
>>> 
>>> On Thu, Jul 23, 2020 at 9:31 AM Jingsong Li  wrote:
 
 Congratulations Shardul! Well deserved!
 
 Best,
 Jingsong
 
 On Thu, Jul 23, 2020 at 7:27 AM Anton Okolnychyi 
  wrote:
> 
> Congrats and welcome! Keep up the good work!
> 
> - Anton
> 
> On 22 Jul 2020, at 16:02, RD  wrote:
> 
> Congratulations Shardul! Well deserved!
> 
> -Best,
> R.
> 
> On Wed, Jul 22, 2020 at 2:24 PM Ryan Blue  wrote:
>> 
>> Hi everyone,
>> 
>> I'd like to congratulate Shardul Mahadik, who was just invited to join 
>> the Iceberg committers!
>> 
>> Thanks for all your contributions, Shardul!
>> 
>> rb
>> 
>> 
>> --
>> Ryan Blue
> 
> 
 
 
 --
 Best, Jingsong Lee
> 
> 
> 
> -- 
> Thanks & Best Regards


Iceberg At Adobe

2020-12-03 Thread Miao Wang
Hi,

Our team posted a blog about the Iceberg use case at Adobe.

https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866

There will be a series of blog posts with more details.

Miao


Secondary Index Support Draft

2020-12-02 Thread Miao Wang
Hi,

As we discussed in the sync-up meeting, I have come up with a draft for secondary index support: Draft. It is WIP; the detailed API design is not included in this version.

Thanks!

Miao



Re: Sync to discuss secondary index proposal

2021-01-28 Thread Miao Wang
Hi @OpenInx <open...@gmail.com>,

The code change is based on our internal fork. We need to do some refactoring before sending out an open source PR. In addition, since there is no spec defined in Iceberg, the implementation is coupled closely to our code base. That is one of the major reasons I put my thoughts on making the design data format and compute engine agnostic into the draft.

Miao

From: OpenInx 
Date: Thursday, January 28, 2021 at 6:31 PM
To: Iceberg Dev List , Miao Wang 
Cc: Ryan Blue 
Subject: Re: Sync to discuss secondary index proposal
Sorry, I sent the wrong link; the secondary index document link is:
https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit

On Fri, Jan 29, 2021 at 10:31 AM OpenInx <open...@gmail.com> wrote:
Hi

@Miao Wang <miw...@adobe.com>, would you mind sharing your current PoC code or PR for this document [1], if possible? I'd like to understand more details before I get involved in this discussion.

Thanks.

[1]. https://docs.google.com/document/d/1q6xaBxUPFwYsW9aXWxYUh7die6O7rDeAPFQcTAMQ0GM/edit?ts=601316b0#

On Fri, Jan 29, 2021 at 10:16 AM 李响 <wate...@gmail.com> wrote:
+1, my colleagues and I are at UTC+8

On Fri, Jan 29, 2021 at 9:50 AM OpenInx <open...@gmail.com> wrote:
+1,  my time zone is CST.

On Fri, Jan 29, 2021 at 6:57 AM Xinli shang  wrote:
I had some earlier discussion with Miao on this. I am still interested in it. 
My time zone is PST.

On Thu, Jan 28, 2021 at 2:50 PM Jack Ye <yezhao...@gmail.com> wrote:
+1, looking forward to the discussion; please include me and Yan (yyany...@gmail.com), also in PST.
-Jack

On Thu, Jan 28, 2021 at 2:16 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
CST Please :) But I don’t mind waking up early or staying up late as required


On Jan 28, 2021, at 4:14 PM, Ryan Blue <rb...@netflix.com.INVALID> wrote:

Hi everyone,

The proposal that Miao wrote about secondary indexes has come up a lot lately. 
I think it would be a good time to have a discussion about the proposal and set 
some initial goals for what we want to do next. Since there hasn't been much 
discussion on the dev list, I'll schedule a sync so that everyone has a 
deadline to read the proposal and be ready with questions. Then we can have a 
quick summary to start with and a productive discussion.

First, who is interested in attending? I think that Miao and I are in the US in 
PST (UTC-8). I think Paula from IBM in Israel is interested. Anyone else in a 
time zone that we should try to include? We can always have two discussions if 
we need to include more zones.

Please reply if you're interested so we can get something set up. Thanks!

rb

--
Ryan Blue
Software Engineer
Netflix



--
Xinli Shang


--

   李响 Xiang Li

cellphone: +86-136-8113-8972
e-mail: wate...@gmail.com


Re: Welcoming OpenInx as a new PMC member!

2021-06-29 Thread Miao Wang
Congratulations!

Sent from my iPhone

On Jun 29, 2021, at 4:57 PM, Steven Wu  wrote:


Congrats!

On Tue, Jun 29, 2021 at 2:12 PM Huadong Liu <huadong...@gmail.com> wrote:

Congrats Zheng!


On Tue, Jun 29, 2021 at 1:52 PM Ryan Blue <b...@apache.org> wrote:
Hi everyone,

I'd like to welcome OpenInx (Zheng Hu) as a new Iceberg PMC member.

Thanks for all your contributions and commitment to the project, OpenInx!


Ryan

--
Ryan Blue


Re: Welcoming Jack Ye as a new committer!

2021-07-06 Thread Miao Wang
Congratulations!

Miao

Sent from my iPhone

On Jul 5, 2021, at 4:14 PM, Daniel Weeks  wrote:


Great work Jack, Congratulations!

On Mon, Jul 5, 2021 at 1:21 PM karuppayya <karuppayya1...@gmail.com> wrote:
Congratulations Jack!

On Mon, Jul 5, 2021 at 1:14 PM Yufei Gu <flyrain...@gmail.com> wrote:
Congratulations, Jack! Thanks for the contribution!

Best,

Yufei


On Mon, Jul 5, 2021 at 1:09 PM John Zhuge <jzh...@apache.org> wrote:
Congratulations Jack!

On Mon, Jul 5, 2021 at 12:57 PM Marton Bod  wrote:
Congrats Jack!

On Mon, Jul 5, 2021 at 9:54 PM Wing Yew Poon  
wrote:
Congratulations Jack!


On Mon, Jul 5, 2021 at 11:35 AM Ryan Blue <b...@apache.org> wrote:
Hi everyone,

I'd like to welcome Jack Ye as a new Iceberg committer.

Thanks for all your contributions, Jack!

Ryan

--
Ryan Blue
--
John Zhuge


Re: Welcoming Russell Spitzer as a new committer

2021-03-29 Thread Miao Wang
Congratulations Russell!

Miao

From: Szehon Ho 
Reply-To: "dev@iceberg.apache.org" 
Date: Monday, March 29, 2021 at 9:12 AM
To: "dev@iceberg.apache.org" 
Subject: Re: Welcoming Russell Spitzer as a new committer

Awesome, well-deserved, Russell!

Szehon


On 29 Mar 2021, at 18:10, Holden Karau <hol...@pigscanfly.ca> wrote:

Congratulations Russel!

On Mon, Mar 29, 2021 at 9:10 AM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
Hey folks,

I’d like to welcome Russell Spitzer as a new committer to the project!

Thanks for all your contributions, Russell!

- Anton
--
Twitter: 
https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: 
https://www.youtube.com/user/holdenkarau



Re: Welcoming Yan Yan as a new committer!

2021-03-23 Thread Miao Wang
Congrats @Yan Yan!

Miao

From: Ryan Blue 
Reply-To: "dev@iceberg.apache.org" 
Date: Tuesday, March 23, 2021 at 3:43 PM
To: Iceberg Dev List 
Subject: Welcoming Yan Yan as a new committer!

Hi everyone,

I'd like to welcome Yan Yan as a new Iceberg committer.

Thanks for all your contributions, Yan!

rb

--
Ryan Blue


Re: Secondary Indexes - Pluggable File Filter interface for Apache Iceberg

2021-03-03 Thread Miao Wang
It works for me.

At first thought, there may be a few concerns about consolidated-fashion storage.

1). Maintaining the consolidated storage may be a bit more complex;
2). It may make collecting indexes while writing data files (i.e., online index building) more complex (e.g., we need to consider multiple writers writing to the same consolidated index file in parallel);
3). We need some auxiliary structure in the index file to quickly locate the relevant index given some key (e.g., a data file name);

However, I do think consolidated-fashion storage is a meaningful on-disk optimization. If we properly design a splittable and mergeable index file format, the consolidated fashion and 1-data-file-1-index (1:1 index files) are not mutually exclusive. Therefore, 1:1 index files can be the building block for larger consolidated index files and indexes at different levels, like a partition-level index.

Our team member went through one pass of the design and shared some thoughts 
with me. I will complete my pass.

Thanks!

Miao


From: Ryan Blue 
Date: Wednesday, March 3, 2021 at 6:08 PM
To: OpenInx 
Cc: Iceberg Dev List 
Subject: Re: Secondary Indexes - Pluggable File Filter interface for Apache 
Iceberg
Great, thank you for planning to join! I definitely want to get your input on 
this as well.

On Wed, Mar 3, 2021 at 6:06 PM OpenInx 
mailto:open...@gmail.com>> wrote:
It will be 1:00 AM (China Standard Time) on 18 March, and that works for us in Asia. I'd love to attend this discussion. Thanks.

On Thu, Mar 4, 2021 at 9:50 AM Ryan Blue  wrote:
Thanks for putting this together, Guy! I just did a pass over the doc and it 
looks like a really reasonable proposal for being able to inject custom file 
filter implementations.

One of the main things we need to think about is how to store and track the 
index data. There's a comment in the doc about storing them in a "consolidated 
fashion" and I'd like to hear more about what you're thinking there. The 
index-per-file approach that Adobe is working on is a good way to track index 
data because we get a clear lifecycle for index data because it is tied to a 
data file that is immutable. On the other hand, the drawback is that we have a 
lot of index files -- one per data file.

Let's set up a time to go talk through the options. Would 9AM PST (17:00 UTC) 
on 17 March work for everyone? I'm thinking in the morning so everyone from IBM 
can attend. We can do a second discussion at a time that works more for people 
in Asia later on as well.

If that day works, then I'll send out an invite.

On Fri, Feb 19, 2021 at 8:49 AM Guy Khazma <guyk...@gmail.com> wrote:
Hi All,

Following up on our discussion from Wednesday's sync, attached is a proposal to enhance Iceberg with a pluggable interface for data-skipping indexes to enable the use of existing indexes in job planning.

https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit?usp=sharing

We will be glad to get your feedback.

Thanks,
Guy


--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix


Re: Secondary Indexes - Pluggable File Filter interface for Apache Iceberg

2021-03-04 Thread Miao Wang
@rb...@netflix.com can you add @Andrei Taleanu <tale...@adobe.com> to the sync-up?

He mainly works on indexing on our team.

His PR to Hyperspace has been merged, 
https://github.com/microsoft/hyperspace/pull/358

Iceberg is now supported in Hyperspace for covering indexes.

Thanks!

Miao

From: Ryan Blue 
Reply-To: "dev@iceberg.apache.org" , 
"rb...@netflix.com" 
Date: Thursday, March 4, 2021 at 9:20 AM
To: Paula Ta-Shma 
Cc: Iceberg Dev List , OpenInx 
Subject: Re: Secondary Indexes - Pluggable File Filter interface for Apache 
Iceberg

Great, I'm glad everyone can make it. I've sent out an invite to the list of 
people on the regular syncs. If you need to be added, please let me know.

On Thu, Mar 4, 2021 at 3:01 AM Paula Ta-Shma <pa...@il.ibm.com> wrote:
Hi all,

This time works for Guy, Gal, and me; looking forward.

thanks!
Paula

Paula Ta-Shma, Ph.D.
Cloud Storage and Analytics
IBM Research - Haifa
Phone: +972.74.7929402
Email: pa...@il.ibm.com




From: Miao Wang
To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>, "rb...@netflix.com" <rb...@netflix.com>, OpenInx <open...@gmail.com>
Cc: Iceberg Dev List <dev@iceberg.apache.org>
Date: 04/03/2021 06:31
Subject: [EXTERNAL] Re: Secondary Indexes - Pluggable File Filter interface for Apache Iceberg





It works for me.

At first thought, there may be a few concerns about consolidated-fashion storage.

1). Maintaining the consolidated storage may be a bit more complex;
2). It may make collecting indexes while writing data files (i.e., online index building) more complex (e.g., we need to consider multiple writers writing to the same consolidated index file in parallel);
3). We need some auxiliary structure in the index file to quickly locate the relevant index given some key (e.g., a data file name);

However, I do think consolidated-fashion storage is a meaningful on-disk optimization. If we properly design a splittable and mergeable index file format, the consolidated fashion and 1-data-file-1-index (1:1 index files) are not mutually exclusive. Therefore, 1:1 index files can be the building block for larger consolidated index files and indexes at different levels, like a partition-level index.

Our team member went through one pass of the design and shared some thoughts with me. I will complete my pass.

Thanks!

Miao





From: Ryan Blue
Date: Wednesday, March 3, 2021 at 6:08 PM
To: OpenInx <open...@gmail.com>
Cc: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Secondary Indexes - Pluggable File Filter interface for Apache Iceberg

Great, thank you for planning to join! I definitely want to get your input on 
this as well.



On Wed, Mar 3, 2021 at 6:06 PM OpenInx <open...@gmail.com> wrote:

It will be 1:00 AM (China Standard Time) on 18 March, and that works for us in Asia. I'd love to attend this discussion. Thanks.



On Thu, Mar 4, 2021 at 9:50 AM Ryan Blue  wrote:

Thanks for putting this together, Guy! I just did a pass over the doc and it looks like a really reasonable proposal for being able to inject custom file filter implementations.

One of the main things we need to think about is how to store and track the index data. There's a comment in the doc about storing them in a "consolidated fashion" and I'd like to hear more about what you're thinking there. The index-per-file approach that Adobe is working on is a good way to track index data because the index data gets a clear lifecycle: it is tied to a data file that is immutable. On the other hand, the drawback is that we have a lot of index files -- one per data file.

Let's set up a time to go talk through the options. Would 9AM PST (17:00 UTC) on 17 March work for everyone? I'm thinking in the morning so everyone from IBM can attend. We can do a second discussion at a time that works more for people in Asia later on as well.

If that day works, then I'll send out an invite.



On Fri, Feb 19, 2021 at 8:49 AM Guy Khazma <guyk...@gmail.com> wrote:

Hi All,

Following up on our discussion from Wednesday's sync, here attached is a proposal to enhance Iceberg with a pluggable interface for data-skipping indexes to enable the use of existing indexes in job planning.

https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit?usp=sharing

Re: Secondary Indexes - Pluggable File Filter interface for Apache Iceberg

2021-02-19 Thread Miao Wang
Hi Guy,

I will read the doc over the weekend.

Thanks!

Miao

From: Guy Khazma 
Reply-To: "dev@iceberg.apache.org" 
Date: Friday, February 19, 2021 at 8:49 AM
To: "dev@iceberg.apache.org" 
Subject: Secondary Indexes - Pluggable File Filter interface for Apache Iceberg


Hi All,

Following up on our discussion from Wednesday's sync, attached is a proposal to enhance Iceberg with a pluggable interface for data-skipping indexes to enable the use of existing indexes in job planning.

https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit#

We will be glad to get your feedback.

Thanks,
Guy
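
A hypothetical sketch of the kind of pluggable hook the proposal describes (the interface name and shape are invented for illustration; see the linked document for the actual design):

import org.apache.iceberg.DataFile;
import org.apache.iceberg.expressions.Expression;

// Consulted during job planning; returning false lets planning skip the
// file because an external index proves no rows can match the query filter.
interface FileFilter {
    boolean mightContainMatches(DataFile file, Expression queryFilter);
}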


Re: Continuing the Secondary Index Discussion

2022-01-25 Thread Miao Wang
Thanks, Jack, for resuming the discussion. Zaicheng from ByteDance created a Slack channel for the index work. I suggested he add Anton and you to the channel.

I still remember some conclusions from previous discussions.

1). Index type support: we planned to support skipping indexes first. Iceberg metadata exposes hints about whether the tracked data files have an index, which reduces index-reading overhead. The index file can be applied when generating the scan task.

2). As Ryan mentioned, the sequence number will be used to indicate whether an index is valid. Sequence numbers can link data evolution with index evolution.

3). Storage: we planned to have a simple file format that includes the column name/ID, index type (string), index content length, and binary content. It is not necessary to use Parquet to store the index. The initial thought was 1 data file mapping to 1 index file; that can be merged into 1 partition mapping to 1 index file. As Ryan said, the file-level implementation could be a stepping stone for the partition-level implementation.

4). How to build the index: we want to keep the index reading and writing interfaces in Iceberg and leave the actual building logic engine-specific (i.e., we can use different compute to build the index without changing anything inside Iceberg).
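
A hypothetical sketch of the simple index file entry described in item 3 above (this format does not exist in Iceberg; the class is invented to make the layout concrete):

import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// One entry: column id, index type (string), content length, binary content.
class IndexEntry {
    int columnId;      // Iceberg column id the index covers
    String indexType;  // e.g. "bloom-filter" (illustrative value)
    byte[] content;    // opaque, engine-built index bytes

    void writeTo(DataOutputStream out) throws IOException {
        byte[] type = indexType.getBytes(StandardCharsets.UTF_8);
        out.writeInt(columnId);
        out.writeInt(type.length);
        out.write(type);
        out.writeInt(content.length); // index content length
        out.write(content);
    }
}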

Misc:
Huaxin implemented the index support API for DSv2 in the Spark 3.x code base.
Design doc: https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
The PR should have been merged.
Guy from IBM did a partial PoC and provided a private doc. I will ask if he can make it public.

We can continue the discussion and break down the big tasks into tickets.

Thanks!

Miao
From: Ryan Blue 
Date: Tuesday, January 25, 2022 at 5:08 PM
To: Iceberg Dev List 
Subject: Re: Continuing the Secondary Index Discussion
Thanks for raising this for discussion, Jack! It would be great to start adding 
more indexes.

> Scope of native index support

The way I think about it, the biggest challenge here is how to know when you 
can use an index. For example, if you have a partition index that is up to date 
as of snapshot 13764091836784, but the current snapshot is 97613097151667, then 
you basically have no idea what files are covered or not and can't use it. On 
the other hand, if you know that the index was up to date as of sequence number 
11 and you're reading sequence number 12, then you just have to read any data 
file that was written at sequence number 12.

The problem of where you can use an index makes me think that it is best to 
maintain index metadata within Iceberg. An alternative is to try to always keep 
the index up-to-date, but I don't think that's necessarily possible -- you'd 
have to support index updates in every writer that touches table data. You 
would have to spend the time updating indexes at write time, but there are 
competing priorities like making data available. So I think you want 
asynchronous index updates and that leads to integration with the table format.
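
A minimal sketch of the sequence-number rule described here (the helper is hypothetical; Iceberg's metadata tracks sequence numbers per file):

// An index current as of sequence number N covers only files written at or
// before N; newer files must be read without consulting the index.
boolean mustReadWithoutIndex(long fileSequenceNumber, long indexAsOfSequenceNumber) {
    return fileSequenceNumber > indexAsOfSequenceNumber;
}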

> Index levels

I think that partition-level indexes are better for job planning (eliminate 
whole partitions!) but file-level are still useful for skipping files at the 
task level. I would probably focus on partition-level, but I'm not strongly 
opinionated here. File-level is probably a stepping stone to partition-level, 
given that we would be able to track index data in the same format.

> Index storage

Do you mean putting indexes in Parquet, or using Parquet for indexes? I think 
that bloom filters would probably exceed the amount of data we'd want to put 
into a Parquet binary column, probably at the file level and almost certainly 
at the partition level, since the size depends on the number of distinct values 
and the primary use is for identifiers.

> Indexing process

Synchronous is nice, but as I said above, I think we have to support async 
because it is too complicated to update every writer that touches a table and 
you may not want to pay the price at write time.

> Index validation

I think this is pretty much what I talked about for question 1. I think that we 
have a good plan around using sequence numbers, if we want to do this.

Ryan

On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <yezhao...@gmail.com> wrote:
Hi everyone,

Based on the conversation in the last community sync and the Iceberg Slack 
channel, it seems like multiple parties have interest in continuing the effort 
related to the secondary index in Iceberg, so I would like to restart the 
thread to continue the discussion.

So far most people refer to the document authored by Miao Wang:
https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-12 Thread Miao Wang
+1 on the format! It looks great!

Thanks for materializing the initial design idea.

Miao
From: Kyle Bendickson 
Date: Sunday, June 12, 2022 at 1:55 PM
To: dev@iceberg.apache.org 
Subject: Re: [VOTE] Adopt Puffin format as a file format for statistics and 
indexes

+1 [non-binding]

Thank you Piotr for all of the work you’ve put into this.

This should greatly benefit not only Iceberg on Trino, but hopefully it can be used in many novel ways thanks to its well-thought-out generic design and the ability to extend it with new sketches.
Looking forward to the improvements this will bring.

- Kyle

On Fri, Jun 10, 2022 at 1:47 PM Alexander Jo <alex...@starburstdata.com> wrote:
+1, let's do it!

On Fri, Jun 10, 2022 at 2:47 PM John Zhuge <jzh...@apache.org> wrote:
+1  Looking forward to the features it enables.

On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu <flyrain...@gmail.com> wrote:
+1. Looking forward to the partition stats.
Best,

Yufei


On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks <dwe...@apache.org> wrote:
+1 as well.  Excited about the progress here.

-Dan
On Thu, Jun 9, 2022, 6:25 PM Junjie Chen <chenjunjied...@gmail.com> wrote:
+1, really nice! Indexes are coming!

On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
+1, it's an exciting step for Iceberg, look forward to all the new statistics 
and secondary indices it will allow.

I had a few questions about what the reference to Puffin file(s) will be in the Iceberg spec, but that's orthogonal to the Puffin file format itself.

Thanks,
Szehon

On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue <b...@tabular.io> wrote:
+1 from me!

There may also be people that haven't followed the design discussions and we 
can start a DISCUSS thread if needed. But if everyone is comfortable with the 
design and implementation, I think it's ready for a vote as well.

Huge thanks to Piotr for getting this ready! I think the format is going to be 
really useful for both stats and indexes in Iceberg.

On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
Hi Everyone,

I propose that we adopt Puffin file format as a file format for statistics and 
indexes in Iceberg tables.

Puffin file format specification:
https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
(previous discussions: https://github.com/apache/iceberg/pull/4944, https://github.com/apache/iceberg-docs/pull/69)

Intended use:
* statistics in Iceberg tables (see https://github.com/apache/iceberg/pull/4945 and the associated proposed implementation https://github.com/apache/iceberg/pull/4741)
* in the future: storage for secondary indexes

Puffin file reader and writer implementation: