Re: Arrow Dataset API on Ceph

2021-06-22 Thread Jayjeet Chakraborty
Hi Yibo, Thanks a lot for taking the time to set up and run SkyhookDM. For the performance evaluation, you can use the script here [1]. This script measures query execution at different row selectivities, which also determine the amount of data moved around. Please let me know if you
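The link between row selectivity and data movement mentioned above can be illustrated with a rough back-of-the-envelope sketch. It is not the benchmark script referenced in the message; the function name `bytes_moved`, the fixed row width, and the `pushdown` flag are all assumptions made for illustration.

```python
def bytes_moved(total_rows: int, row_bytes: int, selectivity: float, pushdown: bool) -> int:
    """Estimate bytes sent from storage to the client for one scan.

    Assumes fixed-width rows (hypothetical simplification). With pushdown,
    only the selected fraction of rows leaves the storage node; without it,
    every row is shipped and filtered on the client.
    """
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be between 0 and 1")
    if pushdown:
        return round(total_rows * selectivity) * row_bytes
    return total_rows * row_bytes


# Compare the two modes for a few selectivities (1%, 10%, 100%).
for s in (0.01, 0.1, 1.0):
    client_side = bytes_moved(1_000_000, 100, s, pushdown=False)
    storage_side = bytes_moved(1_000_000, 100, s, pushdown=True)
    print(f"selectivity={s}: client-side={client_side} B, pushdown={storage_side} B")
```

At 100% selectivity the two modes move the same amount of data, so any benefit from filtering on the storage side would be expected to show up at low selectivities.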

Re: Arrow Dataset API on Ceph

2021-06-21 Thread Yibo Cai
Hi Jayjeet, I've successfully validated basic functions based on the links you provided, on both Arm64 and x86, with binaries built from your PR. Everything looks fine. From perf, I can see Arrow code is running actively on the Ceph OSD nodes. So far, I have deployed and tested on 4 VMs. For

Re: Arrow Dataset API on Ceph

2021-06-08 Thread Jayjeet Chakraborty
Hi Yibo, Thanks a lot for your interest in our work. Please refer to this [1] guide to deploy a complete environment on a cluster of nodes. Regarding your comment about a Ceph patch, the arrow object class that we implement is actually a plugin and does not require the Ceph source tree for

Re: Arrow Dataset API on Ceph

2021-06-07 Thread Yibo Cai
Hi Jayjeet, It is exciting to see a real-world computational storage solution built upon Arrow and Ceph. Amazing work! We are interested in this project (I'm from the Arm open source software team focusing on storage and big data OSS), and would like to reproduce your work first, then evaluate

Re: Arrow Dataset API on Ceph

2021-06-01 Thread Wes McKinney
Folks who are looking for some context about this work might like to read a recently published paper from this research group: https://arxiv.org/pdf/2105.09894.pdf
On Tue, Jun 1, 2021 at 3:12 PM Jayjeet Chakraborty <jayjeetchakrabort...@gmail.com> wrote:
> Dear Arrow Community,
>
> In our

Re: Arrow Dataset API on Ceph

2021-06-01 Thread Jayjeet Chakraborty
Dear Arrow Community, In our previous discussion, we planned on implementing a new Dataset API like InMemoryDataset to interact with objects containing IPC data stored in Ceph/RADOS. We had implemented this design and raised a PR. But

Re: Arrow Dataset API on Ceph

2020-09-16 Thread Ivo Jimenez
Thanks a lot for your replies Micah and Antoine, really appreciate it!
On 2020/09/15 18:06:56, Micah Kornfield wrote:
> gmock is already a dependency. We haven't upgraded gmock/gtest in a while, we might want to consider doing that (but this is orthogonal).
>
> On Tue, Sep 15, 2020 at 10:16

Re: Arrow Dataset API on Ceph

2020-09-15 Thread Micah Kornfield
gmock is already a dependency. We haven't upgraded gmock/gtest in a while, we might want to consider doing that (but this is orthogonal).
On Tue, Sep 15, 2020 at 10:16 AM Antoine Pitrou wrote:
>
> Hi Ivo,
>
> You can open a JIRA once you've got a PR ready. No need to do it before you think

Re: Arrow Dataset API on Ceph

2020-09-15 Thread Antoine Pitrou
Hi Ivo, You can open a JIRA once you've got a PR ready. No need to do it before you think you're ready for submission. AFAIK, gmock is already a dependency. Regards Antoine.
On 15/09/2020 at 18:49, Ivo Jimenez wrote:
> Hi again,
>
> We noticed in the contribution guidelines that there

Re: Arrow Dataset API on Ceph

2020-09-15 Thread Ivo Jimenez
Hi again, We noticed in the contribution guidelines that there needs to be an issue for every PR in JIRA. Should we open one for the eventual PR for the work we're doing on implementing the dataset on Ceph's RADOS? Also, on a related note, we would like to mock the RADOS client so that we can

Re: Arrow Dataset API on Ceph

2020-09-02 Thread Ivo Jimenez
Hi Ben,
> > Our main concern is that this new arrow::dataset::RadosFormat class will be deriving from the arrow::dataset::FileFormat class, which seems to raise a conceptual mismatch as there isn’t really a RADOS format
>
> IIUC RADOS doesn't interact with a filesystem directly, so

Re: Arrow Dataset API on Ceph

2020-08-31 Thread Ben Kietzman
> as far as we can tell, this filesystem layer is unaware of expressions, record batches, etc
You're correct that the filesystem layer doesn't directly support Expressions. However, the datasets API includes the Partitioning classes, which embed expressions in paths. Depending on what expressions
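To illustrate the idea of embedding expressions in paths, here is a minimal pure-Python sketch of Hive-style `key=value` directory segments being read back as equality constraints and used to skip files. It does not use the Arrow Partitioning classes; `partition_fields`, `may_match`, and the example paths are hypothetical names chosen for illustration, and all values are kept as strings with no type inference.

```python
from pathlib import PurePosixPath
from typing import Dict


def partition_fields(path: str) -> Dict[str, str]:
    """Collect key=value directory segments from a Hive-style path."""
    fields = {}
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            fields[key] = value
    return fields


def may_match(path: str, equals: Dict[str, str]) -> bool:
    """Return False only if the path's partition fields contradict the filter."""
    fields = partition_fields(path)
    # A key missing from the path cannot rule the file out, so it passes.
    return all(fields.get(k, v) == v for k, v in equals.items())


# Hypothetical file layout; only files consistent with the filter are kept.
paths = [
    "year=2020/month=08/part-0.arrow",
    "year=2020/month=09/part-0.arrow",
    "year=2021/month=08/part-0.arrow",
]
selected = [p for p in paths if may_match(p, {"year": "2020", "month": "08"})]
print(selected)
```

The point of the sketch is that a file can be excluded from a scan using only its path, before any of its contents are read.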

Re: Arrow Dataset API on Ceph

2020-08-28 Thread Ivo Jimenez
Hi Antoine,
> > Yes, that is our plan. Since this is going to be done on the storage-, server-side, this would be transparent to the client. So our main concern is whether this be OK from the design perspective, and could this eventually be merged upstream?
>
> Arrow datasets have no

Re: Arrow Dataset API on Ceph

2020-08-28 Thread Antoine Pitrou
On 27/08/2020 at 21:55, Ivo Jimenez wrote:
> Hi Antoine,
>
>>> Our main concern is that this new arrow::dataset::RadosFormat class will be deriving from the arrow::dataset::FileFormat class, which seems to raise a conceptual mismatch as there isn’t really a RADOS format but

Re: Arrow Dataset API on Ceph

2020-08-27 Thread Ivo Jimenez
Hi Antoine,
> > Our main concern is that this new arrow::dataset::RadosFormat class will be deriving from the arrow::dataset::FileFormat class, which seems to raise a conceptual mismatch as there isn’t really a RADOS format but rather a formatting/serialization deferral that will be

Re: Arrow Dataset API on Ceph

2020-08-27 Thread Antoine Pitrou
Hello Ivo,
On 27/08/2020 at 07:02, Ivo Jimenez wrote:
>
> Our main concern is that this new arrow::dataset::RadosFormat class will be deriving from the arrow::dataset::FileFormat class, which seems to raise a conceptual mismatch as there isn’t really a RADOS format but rather a

Arrow Dataset API on Ceph

2020-08-26 Thread Ivo Jimenez
Dear Arrow community, We are writing to share our thoughts about designing an Apache Arrow-native storage system leveraging Ceph’s extensibility mechanism as part of the SkyhookDM project and aim for a design that leverages Arrow as much as possible, both on the client API