Re: Apache Arrow file format

2023-10-22 Thread Yue Ni
> Looks like projecting columns isn't available by default.
> One of the benefits of Parquet file format is column projection, where the IO is limited to just the columns projected.
> Unfortunately, you are correct that it doesn't allow for easy column projecting (you're going to read all the

Re: Apache Arrow file format

2023-10-22 Thread wish maple
IMO, Facebook has mentioned a format for ML in its paper, section 3.3.2 [1]. It mentions that
> ML tables are also typically much wider, and tend to have tens of thousands of features usually stored as large maps.
>
> The most pressing issue with the DWRF format was metadata overhead; our ML

Re: Apache Arrow file format

2023-10-21 Thread Weston Pace
> Of course, what I'm really asking for is to see how Lance would compare ;-)
> P.S. The second paper [2] also talks about ML workloads (in Section 5.8) and GPU performance (in Section 5.9). It also cites Lance as one of the future formats (in Section 5.6.2).
Disclaimer: I work for LanceDb

Re: Apache Arrow file format

2023-10-19 Thread Aldrin
And the first paper's reference to Arrow (in the references section) lists 2022 as the date of last access.
On Thu, Oct 19, 2023 at 18:51, Aldrin wrote:
> For context, that second referenced paper has Wes McKinney as a co-author, so

Re: Apache Arrow file format

2023-10-19 Thread Aldrin
For context, that second referenced paper has Wes McKinney as a co-author, so they were much better positioned to say "the right things."
On Thu, Oct 19, 2023 at 18:38, Jin Shang wrote:
> Honestly I don't understand why this VLDB paper [1]

Re: Apache Arrow file format

2023-10-19 Thread Jin Shang
Honestly I don't understand why this VLDB paper [1] chooses to include Feather in their evaluations. This paper studies OLAP DBMS file formats. Feather is clearly not optimized for the workload and performs badly in most of their benchmarks. This paper also has several inaccurate or outdated

Re: Apache Arrow file format

2023-10-19 Thread Roman Shaposhnik
On Wed, Oct 18, 2023 at 11:20 PM Andrew Lamb wrote: > > If you are looking for a more formal discussion and empirical analysis of > the differences, I suggest reading "A Deep Dive into Common Open Formats > for Analytical DBMSs" [1], a VLDB 2023 (runner up best paper!) that > compares and

Re: Apache Arrow file format

2023-10-19 Thread Jacek Pliszka
There is a note there explaining what they understand by it, but further down the line they do not make such a distinction. The fact that Parquet can be a better in-memory format than Arrow for certain common uses is something I haven't thought of and is eye-opening for me, admittedly so because I am

Re: Apache Arrow file format

2023-10-18 Thread Antoine Pitrou
The fact that they describe Arrow and Feather as distinct formats (they're not!) with different characteristics is a bit of a bummer.
On 18/10/2023 at 22:20, Andrew Lamb wrote:
> If you are looking for a more formal discussion and empirical analysis of the differences, I suggest reading "A

Re: Apache Arrow file format

2023-10-18 Thread Andrew Lamb
If you are looking for a more formal discussion and empirical analysis of the differences, I suggest reading "A Deep Dive into Common Open Formats for Analytical DBMSs" [1], a VLDB 2023 (runner up best paper!) that compares and contrasts Arrow, Parquet, ORC and Feather file formats. [1]

Re: Apache Arrow file format

2023-10-18 Thread Raphael Taylor-Davies
To further what others have already mentioned, the IPC file format is primarily optimised for IPC use-cases, that is exchanging the entire contents between processes. It is relatively inexpensive to encode and decode, and supports all arrow datatypes, making it ideal for things like

Re: Apache Arrow file format

2023-10-18 Thread Dewey Dunnington
Plenty of opinions here already, but I happen to think that IPC streams and/or Arrow File/Feather are wildly underutilized. For the use-case where you're mostly just going to read an entire file into R or Python it's a bit faster (and far superior to a CSV or pickling or .rds files in R). >

Re: Apache Arrow file format

2023-10-17 Thread wish maple
The Arrow IPC file format is great; it focuses on in-memory representation and direct computation. Basically, it supports compression and dictionary encoding, and the file can be deserialized zero-copy into the in-memory Arrow format. Parquet provides some strong functionality, like Statistics, which could help

Re: Apache Arrow file format

2023-10-17 Thread Adam Lippai
Also there is https://github.com/lancedb/lance, which sits between the two formats. Depending on the use case it can be a great choice.
Best regards
Adam Lippai
On Tue, Oct 17, 2023 at 22:44 Matt Topol wrote:
> One benefit of the feather format (i.e. Arrow IPC file format) is the ability to mmap the

Re: Apache Arrow file format

2023-10-17 Thread Matt Topol
One benefit of the feather format (i.e. Arrow IPC file format) is the ability to mmap the file to easily handle reading sections of a larger than memory file of data. Since, as Felipe mentioned, the format is focused on in-memory representation, you can easily and simply mmap the file and use the

Re: Apache Arrow file format

2023-10-17 Thread Felipe Oliveira Carvalho
It's not the best since the format is really focused on in-memory representation and direct computation, but you can do it: https://arrow.apache.org/docs/python/feather.html
-- Felipe
On Tue, 17 Oct 2023 at 23:26 Nara wrote:
> Hi,
>
> Is it a good idea to use Apache Arrow as a file format?

Apache Arrow file format

2023-10-17 Thread Nara
Hi,
Is it a good idea to use Apache Arrow as a file format? Looks like projecting columns isn't available by default. One of the benefits of Parquet file format is column projection, where the IO is limited to just the columns projected.
Regards,
Nara