Re: Introducing Cylon

2020-08-18 Thread Niranda Perera
Hi Uwe,
I opened a PR against the arrow-site repo: https://github.com/apache/arrow-site/pull/72
Best
On Wed, Jul 22, 2020 at 10:38 AM Uwe L. Korn wrote:
> Hello Niranda,
>
> cool to see this. Feel free to open a PR to add it to the Powered By list
> on https://arrow.apache.org/powered_by/
>
> Cheers

Re: Introducing Cylon

2020-07-27 Thread Niranda Perera
Hi Micah, Thank you very much for raising these questions. We are still analyzing the reasons for Cylon's performance improvement. We believe the main reason is the use of Arrow and its columnar format, which helps our shuffleByIndex-compute-recreateData approach (more like BSP). And we are getting

Re: Introducing Cylon

2020-07-26 Thread Micah Kornfield
Hi Niranda, Interesting results. Did you do any analysis to understand what the main contributor to the performance differences was? Along these lines, did you try joins on any real-world datasets? Are you using Spark SQL for comparisons? Also, why not use Parquet as a starting point? Thanks,

Re: Introducing Cylon

2020-07-22 Thread Uwe L. Korn
Hello Niranda,
cool to see this. Feel free to open a PR to add it to the Powered By list on https://arrow.apache.org/powered_by/
Cheers
Uwe
On Tue, Jul 21, 2020, at 8:03 PM, Niranda Perera wrote:
> Hi all,
>
> We would like to introduce Cylon to the Arrow community. It is an
> open-source,

Introducing Cylon

2020-07-21 Thread Niranda Perera
Hi all, We would like to introduce Cylon to the Arrow community. It is an open-source, lean distributed data processing library that uses the Arrow data format underneath. It is developed in C++ with bindings for Java and Python. It has an in-memory Table API that integrates with the PyArrow Table API.