On Mon, Jun 23, 2014 at 8:50 AM, Aaron Davidson <ilike...@gmail.com> wrote:
> Note that regarding a "long load time", data format means a whole lot in
> terms of query performance. If you load all your data into compressed,
> columnar Parquet files on local hardware, Spark SQL would also perform far,
> far better than it would reading from gzipped S3 files.

Yes. We're comparing our particular use cases; if we used Spark, we'd want
to run it against gzipped files on S3 for the sheer convenience of it.
Having to pre-process the data (the equivalent of the load phase in newSQL)
is a pain. One of the reasons for using post-Hadoop (rather than newSQL)
systems is to avoid that step.

> You must also be careful about your queries; certain queries can be
> answered much more efficiently due to specific optimizations implemented in
> the query engine. For instance, Parquet keeps statistics, so you could
> theoretically do a count(*) over petabytes of data in less than a second,
> blowing away any competition that resorts to actually reading data.

Yes. I posted the query just now. The Redshift table was ordered only by
timestamp, so in all cases the database should perform a single full table
scan.
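To make the count(*) point concrete: a minimal toy sketch of how a columnar
format can answer count(*) from file metadata alone, without scanning rows.
This is only an illustration of the idea, not Parquet's actual API; the
`ColumnarFile` class and its footer dict are hypothetical stand-ins for the
per-row-group statistics that Parquet stores.

```python
class ColumnarFile:
    """Stand-in for a Parquet file: row data plus a precomputed footer stat."""

    def __init__(self, rows):
        self.rows = rows                      # the actual column data
        self.stats = {"num_rows": len(rows)}  # written once, at load time

def count_star(files):
    # Answer count(*) by summing footer statistics; no row is ever read.
    return sum(f.stats["num_rows"] for f in files)

files = [ColumnarFile(list(range(1000))) for _ in range(5)]
print(count_star(files))  # 5000, without touching any row data
```

A gzipped-CSV-on-S3 layout has no such metadata, so the same query must
decompress and scan everything, which is where much of the gap comes from.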