Re: Bad read performances: 'few rows of many columns' vs 'many rows of few columns'

2010-03-10 Thread Jonathan Ellis
For the record, I note that "no row cache" is the default on user-defined CFs; we include it in the sample configuration file as an example only.

On Wed, Mar 10, 2010 at 9:58 AM, Sylvain Lebresne wrote:
>> So did you disable the row cache entirely?
>
> Yes (getting back reasonable performance).

Re: Bad read performances: 'few rows of many columns' vs 'many rows of few columns'

2010-03-10 Thread Sylvain Lebresne
> So did you disable the row cache entirely?

Yes (getting back reasonable performance).

>> From: Sylvain Lebresne
>>
>> Well, I've found the reason.
>> The default Cassandra configuration uses a 10% row cache.
>> And the row cache reads the whole row each time. So it was indeed reading the
>> fu

RE: Bad read performances: 'few rows of many columns' vs 'many rows of few columns'

2010-03-10 Thread David Dabbs
So did you disable the row cache entirely?

> From: Sylvain Lebresne
>
> Well, I've found the reason.
> The default Cassandra configuration uses a 10% row cache.
> And the row cache reads the whole row each time. So it was indeed reading the
> full row each time even though the request was asking

Re: Bad read performances: 'few rows of many columns' vs 'many rows of few columns'

2010-03-10 Thread Sylvain Lebresne
Well, I've found the reason. The default Cassandra configuration uses a 10% row cache, and the row cache reads the whole row each time. So it was indeed reading the full row each time, even though the request was asking for only one column. My bad (at least I learned something).

--
Sylvain

On Tue, M
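To put rough numbers on the effect described above, here is a back-of-the-envelope sketch using figures mentioned elsewhere in this thread (100 columns per row, columns of 50K bytes, one column requested). The function name and the cost model are assumptions made for illustration only: it counts bytes read on a request that misses every cache, takes 50K to mean 50,000 bytes, and ignores indexes, bloom filters, and the OS page cache. It is not how Cassandra itself accounts for I/O.

```python
def bytes_read_per_miss(columns_per_row, column_size, columns_requested, row_cache_enabled):
    """Approximate bytes read from disk for one request that misses every cache.

    Hypothetical cost model, for illustration only.
    """
    if row_cache_enabled:
        # The row cache holds whole rows, so a miss loads every column of the row.
        return columns_per_row * column_size
    # Without the row cache, only the requested columns need to be read.
    return columns_requested * column_size


# Assumed figures: 100 columns per row, 50,000-byte columns, 1 column requested.
with_cache = bytes_read_per_miss(100, 50_000, 1, row_cache_enabled=True)
without_cache = bytes_read_per_miss(100, 50_000, 1, row_cache_enabled=False)
print(with_cache, without_cache, with_cache // without_cache)
```

Under these assumptions, a single-column read with the row cache enabled pulls in about 100 times more data per miss than the same read with the row cache disabled, which is consistent with the slowdown reported in this thread.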

Re: Bad read performances: 'few rows of many columns' vs 'many rows of few columns'

2010-03-09 Thread Brandon Williams
On Tue, Mar 9, 2010 at 2:28 PM, Sylvain Lebresne wrote:
> > A row causes a disk seek while columns are contiguous. So if the row isn't
> > in the cache, you're being impaired by the seeks. In general, fatter rows
> > should be more performant than skinny ones.
>
> Sure, I understand that. S

Re: Bad read performances: 'few rows of many columns' vs 'many rows of few columns'

2010-03-09 Thread Sylvain Lebresne
> A row causes a disk seek while columns are contiguous. So if the row isn't
> in the cache, you're being impaired by the seeks. In general, fatter rows
> should be more performant than skinny ones.

Sure, I understand that. Still, I get 400 columns per second (i.e., 400 seeks per second) when the

Re: Bad read performances: 'few rows of many columns' vs 'many rows of few columns'

2010-03-09 Thread Brandon Williams
On Tue, Mar 9, 2010 at 1:14 PM, Sylvain Lebresne wrote:
> I've inserted 1000 rows of 100 columns each
> (python stress.py -t 2 -n 1000 -c 100 -i 5).
> If I read, I get roughly the same number of rows whether I read the whole row
> (python stress.py -t 10 -n 1000 -o read -r -c 100) or only the f

Re: Bad read performances: 'few rows of many columns' vs 'many rows of few columns'

2010-03-09 Thread Sylvain Lebresne
Alright, what I'm observing shows up better with bigger columns, so I've slightly modified the stress.py test so that it inserts columns of 50K bytes (I attach the modified stress.py for info, but it really just reads 5 bytes from /dev/null and uses that as data. I also added a sleep to the insert ot

Re: Bad read performances: 'few rows of many columns' vs 'many rows of few columns'

2010-03-09 Thread Jonathan Ellis
On Tue, Mar 9, 2010 at 8:31 AM, Sylvain Lebresne wrote:
> Well, unless I'm mistaken, that's the same in my example, as in both cases I give
> stress.py the option '-c 1', which tells it to retrieve only one column each time,
> even in the case where I have 100 columns per row.

Oh. Why would y

Re: Bad read performances: 'few rows of many columns' vs 'many rows of few columns'

2010-03-09 Thread Jonathan Ellis
On Tue, Mar 9, 2010 at 7:15 AM, Sylvain Lebresne wrote:
> 1) stress.py -t 10 -o read -n 5000 -c 1 -r
> 2) stress.py -t 10 -o read -n 50 -c 1 -r
>
> In case 1) I get around 200 reads/second and that's pretty stable. The
> disk is spinning like crazy (~25% io_wait), very little CPU or me

Re: Bad read performances: 'few rows of many columns' vs 'many rows of few columns'

2010-03-09 Thread Sylvain Lebresne
On Tue, Mar 9, 2010 at 2:52 PM, Jonathan Ellis wrote:
> By "reads" do you mean what stress.py counts (rows) or rows * columns?
> If it is rows, then you are still actually reading more columns/s in
> case 2.

Well, unless I'm mistaken, that's the same in my example, as in both cases I give stre

Re: Bad read performances: 'few rows of many columns' vs 'many rows of few columns'

2010-03-09 Thread Jesse McConnell
In my experience #2 will work well up to a point, where it will trigger a limitation of Cassandra (slated to be resolved in 0.7 \o/): all of the columns under a given key must be able to fit into memory. For things like indexes of data I have opted to shard the keys for really large data sets t
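As a rough illustration of the key-sharding idea mentioned above, the sketch below spreads the columns of one logical row over a fixed number of physical row keys, so that no single row has to hold every column. The helper names, the `base:bucket` key format, the shard count, and the use of CRC32 are all assumptions made for this example; the message does not say how the sharding was actually done.

```python
import zlib


def sharded_row_key(base_key, column_name, num_shards=16):
    """Pick the physical row key that stores column_name (hypothetical scheme)."""
    # CRC32 is deterministic across runs, unlike Python's built-in hash() for str.
    bucket = zlib.crc32(column_name.encode("utf-8")) % num_shards
    return f"{base_key}:{bucket}"


def all_shard_keys(base_key, num_shards=16):
    """List every physical row key that must be read to see the whole logical row."""
    return [f"{base_key}:{bucket}" for bucket in range(num_shards)]


key = sharded_row_key("user_index", "some_column")
print(key in all_shard_keys("user_index"))
```

The trade-off in a scheme like this is that a single-column lookup still touches one row, while reading the whole logical row requires fetching every shard.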

Bad read performances: 'few rows of many columns' vs 'many rows of few columns'

2010-03-09 Thread Sylvain Lebresne
Hello, I've done some tests, and it seems that having many rows with few columns is better than having few rows with many columns, at least as far as read performance is concerned. Using stress.py, on a quad-core 2.27 GHz machine with 4 GB RAM and the out-of-the-box Cassandra configuration, I in