Re: [Rust] [DISCUSS] Donate DataFusion to Arrow project

2019-01-06 Thread Ted Dunning
Cool! On Sun, Jan 6, 2019 at 1:52 PM Andy Grove wrote: > I'm starting a new thread for this discussion (this was previously > discussed in the Rust Roadmap thread). > > The reason I got involved with Arrow is that I have been working on > DataFusion[1] which is currently an in-process SQL

Re: Algorithmic explorations of bitmaps vs. sentinel values

2018-10-17 Thread Ted Dunning
Memory binding can be viewed as an opportunity for melding multiple aggregators. For instance, any additional aggregation comes nearly for free. Sum and count (non zero) will cost about the same as either alone. Or sum and sum of squares. On Wed, Oct 17, 2018, 06:21 Francois Saint-Jacques <
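
To illustrate the point about melding aggregators, here is a minimal Python sketch of a single pass that computes several aggregates at once. The function name `fused_stats` and the choice of aggregates are assumptions made for illustration and do not come from the thread. The idea is that each value is loaded from memory once and feeds every accumulator, so in a memory-bound loop the extra arithmetic adds little cost.

```python
def fused_stats(values):
    """Compute non-zero count, sum, and sum of squares in one pass.

    Hypothetical helper: each value is read once and feeds all three
    accumulators, instead of scanning the data three separate times.
    """
    count_nonzero = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        count_nonzero += (x != 0)  # bool adds as 0 or 1
        total += x
        total_sq += x * x
    return count_nonzero, total, total_sq


count_nonzero, total, total_sq = fused_stats([1.0, 0.0, 2.0, 3.0])
```

Pure Python will not show the memory-bandwidth effect itself; the sketch only shows the shape of the fused loop.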

Re: Algorithmic explorations of bitmaps vs. sentinel values

2018-10-16 Thread Ted Dunning
It should be possible to unroll the sentinel version in many cases. For instance, sum += (data[i] != SENTINEL) * data[i] This doesn't work with NaN as a sentinel because 0 * NaN => NaN, but it can work with other values. On Tue, Oct 16, 2018 at 9:38 AM Antoine Pitrou wrote: > > Hi Wes,
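
A small Python sketch of the branch-free sentinel sum may help here. It assumes the mask should be 1 for valid values and 0 for the sentinel, and the sentinel value `-1` and the name `masked_sum` are hypothetical. It also checks the NaN caveat: multiplying NaN by zero still yields NaN, so the masking trick cannot cancel a NaN sentinel.

```python
import math

SENTINEL = -1  # hypothetical sentinel value chosen for illustration


def masked_sum(data, sentinel=SENTINEL):
    """Sum values while skipping the sentinel without an if-branch.

    The comparison yields 0 or 1, so a sentinel entry contributes
    0 * sentinel == 0 to the total.
    """
    total = 0
    for x in data:
        total += (x != sentinel) * x
    return total


total = masked_sum([1, 2, SENTINEL, 4])

# The NaN caveat: zero times NaN is still NaN, so a NaN sentinel
# would poison the running sum even when its mask is 0.
nan_product = 0 * float("nan")
nan_poisons_sum = math.isnan(nan_product)
```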

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Ted Dunning
The effect of rename can be had by handling a small inventory file that is updated atomically. Having real file semantics is sooo much nicer, though. On Wed, Sep 19, 2018 at 1:51 PM Bill Glennon wrote: > Also, may want to take a look at https://aws.amazon.com/athena/. > > Thanks, > Bill > >
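
The inventory-file idea can be sketched as follows. This is only an illustration on a local filesystem, where `os.replace` provides the atomic swap; the file name `inventory.json` and both function names are assumptions. On an object store without rename, overwriting the single inventory object would presumably play the same role, with data files written first under unique names and only becoming visible once the inventory lists them.

```python
import json
import os
import tempfile


def publish_inventory(directory, data_files):
    """Atomically replace the inventory that lists the visible data files."""
    inventory_path = os.path.join(directory, "inventory.json")
    # Write the new inventory to a temporary file in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"files": sorted(data_files)}, f)
        f.flush()
        os.fsync(f.fileno())
    # Readers see either the old inventory or the new one, never a mix.
    os.replace(tmp_path, inventory_path)
    return inventory_path


def read_inventory(directory):
    """Return the list of data files that are currently published."""
    with open(os.path.join(directory, "inventory.json")) as f:
        return json.load(f)["files"]
```

Readers that consult only the inventory never observe a half-written data file, which is the effect a rename would otherwise provide.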

Re: [DISCUSS] Dropping support for CentOS 5 / RHEL5 in Python packages

2018-09-04 Thread Ted Dunning
Just as a point of reference, I don't think that we get any pushback at MapR for not supporting RHEL 5 and that has been our policy for a few years now. That experience should be pretty similar for Arrow, except that I would expect that new adoptions might be even more canted towards current

Re: Increasing transparency of corporate support for Apache Arrow development

2018-08-16 Thread Ted Dunning
Yes, there are several such examples. And it turned into a monstrous mess with companies bragging over lines of code changed. Oddly, the guys who did lots of reformatting did really well. There is also the problem of the very strong Apache tradition that it is individuals who contribute to

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-30 Thread Ted Dunning
On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney wrote: > > > The community will be less willing to accept large > > changes that require multiple rounds of patches for stability and API > > convergence. Our contributions to Libhdfs++ in the HDFS community took a > > significantly long time for the

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread Ted Dunning
Simba, nice summary. I think that there may be some issues with your tests. In particular, you are storing essentially uniform random values. That might be a viable test in some situations, but there are many where there is considerably less entropy in the data being stored. For instance, if you store
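
The effect of entropy on a compression benchmark can be seen with a short Python sketch. It uses `zlib` from the standard library as a stand-in codec (not Snappy, Brotli, or Blosc, which the thread compares), and the data generators, seed, and sizes are assumptions chosen for illustration.

```python
import random
import zlib


def compression_ratio(payload):
    """Return compressed size divided by original size."""
    return len(zlib.compress(payload)) / len(payload)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 100_000

# Essentially uniform random bytes: close to 8 bits of entropy per byte.
uniform = bytes(rng.getrandbits(8) for _ in range(n))

# Only four distinct byte values: about 2 bits of entropy per byte.
low_entropy = bytes(rng.choice(b"\x00\x01\x02\x03") for _ in range(n))

ratio_uniform = compression_ratio(uniform)
ratio_low = compression_ratio(low_entropy)
```

The uniform data barely compresses while the low-entropy data shrinks substantially, so a benchmark built only on uniform random values says little about how a codec behaves on more typical data.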

Re: Java-C++ integration tests -- on the home stretch

2016-11-21 Thread Ted Dunning
Wes, This is awesome. Does it, however, imply that to run the tests a C programmer will need a working Java environment and a Java programmer will need a C environment? Is there any way around that? Possibly by storing golden bits for the in-memory images somewhere? On Mon, Nov 21,

Re: Can someone help me how should I start using Arrow Java Jars ?

2016-08-01 Thread Ted Dunning
Cloning the git repository gives you the full source code. On Mon, Aug 1, 2016 at 8:11 AM, Sanjay Rao wrote: > Hi Kiril, > Thanks a lot for your reply, Can I have the full source code ? It would > help me, also could you help me with Java doc link if any as such. >

Re: Code review tools for Arrow patches

2016-04-24 Thread Ted Dunning
Just for the record, Apex had some issues getting Gerrit reviews reflected in a coherent fashion into the Apache record. I presume that you guys will have that handled or will check with the Apex devs to learn their resolution. On Sun, Apr 24, 2016 at 5:30 PM, Wes McKinney

Re: Understanding "shared" memory implications

2016-03-18 Thread Ted Dunning
On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau wrote: > How do others feel of my redefinition of IPC to mean the same memory space > communication (either via shared memory or rdma) versus RPC as socket based > communication? > IPC already has a strong definition which is