Yes, local storage volumes on each machine.

> On May 5, 2017, at 3:25 PM, daemeon reiydelle <daeme...@gmail.com> wrote:
> 
> These numbers do not match e.g. AWS, so guessing you are using local storage?
> 
> 
> .......
> Making a billion dollar startup is easy: "take a human desire, preferably one 
> that has been around for a really long time … Identify that desire and use 
> modern technology to take out steps."
> .......
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
> 
> On Fri, May 5, 2017 at 12:19 PM, Jonathan Guberman <j...@tineye.com> wrote:
> Hello,
> 
> We’re currently testing Cassandra for use as a pure key-object store for data 
> blobs of around 10 kB - 60 kB each. Our use case is storing on the order of 10 
> billion objects, with about 5-20 million new writes per day. A written object 
> will never be updated or deleted. Each object will be read at least once, 
> sometime within 10 days of being written. This will generally happen as a 
> batch; that is, all of the objects written on a particular day will be read 
> together at the same time. This batch read happens only once; future reads 
> will be of individual objects, with no grouping, and will follow a long-tail 
> distribution: popular objects are read thousands of times per year, while 
> most are read rarely or never.
> 
> I’ve set up a small four node test cluster and have written test scripts to 
> benchmark writing and reading our data. The table I’ve set up is very simple: 
> an ascii primary key column with the object ID and a blob column for the 
> data. All other settings were left at their defaults.
> 
> I’ve found write speeds to be very fast most of the time. Periodically, 
> however, writes slow to a crawl for anywhere from half an hour to two hours, 
> after which speeds recover to their previous levels. I assume this is some 
> sort of compaction or flushing to disk, but I haven’t been able to pin down 
> the exact cause.
> 
> Read speeds have been more disappointing. Cached reads are very fast, but 
> random read speed averages about 2 MB/sec, which is too slow when we need to 
> read out a batch of several million objects. I don’t think it’s reasonable to 
> assume that these rows will all still be cached by the time we need to read 
> them for that first large batch read.
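
For scale, a rough back-of-envelope sketch of what 2 MB/sec implies for the batch read described above. The average object size (~35 kB, the midpoint of the stated 10-60 kB range) and the batch size (5 million objects) are illustrative assumptions, not figures from the original post:

```python
# Rough estimate of batch-read time at the observed random-read throughput.
# avg_object_kb and batch_objects are assumed, illustrative values.
throughput_mb_s = 2.0           # observed random-read throughput, MB/sec
avg_object_kb = 35.0            # assumed average blob size, kB
batch_objects = 5_000_000       # assumed size of one daily batch

objects_per_sec = throughput_mb_s * 1024 / avg_object_kb
batch_hours = batch_objects / objects_per_sec / 3600
print(f"~{objects_per_sec:.0f} objects/sec; batch read takes ~{batch_hours:.0f} hours")
```

Under these assumptions a full batch read at random-read speed would take roughly a day, which illustrates why the sequential/batch path needs a different access pattern than the long-tail point reads.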
> 
> My general question is whether anyone has any suggestions for how to improve 
> performance for our use case. More specifically:
> 
> - Is there a way to mitigate or eliminate the huge slowdowns I see when 
> writing millions of rows?
> - Are there settings I should be using in order to maximize read speeds for 
> random reads?
> - Is there a way to design our tables to improve read speeds for the initial 
> large batched reads? I was thinking of adding a batch ID column that could be 
> used to retrieve the data for the initial block. However, future reads would 
> need to be done by object ID, not batch ID, so it seems I’d need to duplicate 
> the data: once in an “objects by batch” table and again in a simple “objects” 
> table. Is there a better approach than this?
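
The dual-table idea in the last question can be sketched as follows. This is a minimal in-memory simulation using plain dicts, not Cassandra itself; in Cassandra it would correspond to two tables, one partitioned by batch ID and one keyed by object ID, with every write going to both. All names here are illustrative:

```python
# Simulation of the proposed dual-table layout: the same blob is written
# twice, once per access pattern (one-time batch read vs. long-tail
# point reads). Table and function names are hypothetical.
from collections import defaultdict

objects = {}                           # object_id -> blob (point lookups)
objects_by_batch = defaultdict(dict)   # batch_id -> {object_id: blob}

def write(batch_id: str, object_id: str, blob: bytes) -> None:
    """Duplicate the write into both stores."""
    objects[object_id] = blob
    objects_by_batch[batch_id][object_id] = blob

def read_batch(batch_id: str) -> dict:
    """One-time read of a whole day's batch."""
    return objects_by_batch[batch_id]

def read_object(object_id: str) -> bytes:
    """Long-tail random read by object ID."""
    return objects[object_id]

write("2017-05-05", "img-001", b"\x00" * 10)
write("2017-05-05", "img-002", b"\x01" * 10)
assert len(read_batch("2017-05-05")) == 2
assert read_object("img-001") == b"\x00" * 10
```

Since the batch read happens only once, one possible refinement is to expire the "objects by batch" copy after it has been consumed (Cassandra supports per-write TTLs), so the duplication cost is temporary; whether that fits the workload is an open design question.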
> 
> Thank you!
> 
> 
> 
> 
