Re: Cassandra search performance

2012-05-07 Thread David Jeske
On Sun, Apr 29, 2012 at 4:32 PM, Maxim Potekhin wrote: > Looking at your example, as I think you understand, you forgo indexes by > combining two conditions in one query, thinking along the lines of what is > often done in RDBMS. A scan is expected in this case, and there is no > magic to avoid it

Re: Data Model Design for Login Servie

2011-11-18 Thread David Jeske
On Thu, Nov 17, 2011 at 1:08 PM, Maciej Miklas wrote: > A) Skinny rows > - row key contains login name - this is the main search criteria > - login data is replicated - each possible login is stored as single row > which contains all user data - 10 logins for single customer create 10 rows, whe

Re: data model for unique users in a time period

2011-11-03 Thread David Jeske
On Wed, Nov 2, 2011 at 7:26 PM, David Jeske wrote: > - make sure the summarizer does try to do its job for a batch of counters > until they are fully replicated and 'static' (no new increments will appear) > Apologies. make the summarizer ( doesn't ) try to do its job...

Re: data model for unique users in a time period

2011-11-02 Thread David Jeske
I understand what you are thinking, Daniel, but this approach has at least one big wrinkle. You would be introducing dependencies between compaction and replication. The 'unique' idempotent records are required for Cassandra to read repair properly. Therefore, if a compaction (or even a memtable f

Re: What does a cluster throttled by the network look like ?

2011-10-30 Thread David Jeske
You are answering your own question here. If you are running at 80% of network bandwidth, you are saturating your network. AFAIK, most distributed databases are running on gigabit, not 100 Mbit. I recommend you upgrade your switch (and NICs if necessary). Gigabit is insanely cheap now. In the extrem

Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

2011-10-30 Thread David Jeske
If your summary data is frequently accessed, you will probably be best off storing the two sets of data separately (either in separate column families or with different key-prefixes). This will give you the greatest cache-locality for your summary data, which you say is popular. If your summary dat

Re: Read Performance / Schema Design

2011-10-26 Thread David Jeske
On Wed, Oct 26, 2011 at 7:35 PM, Ben Gambley wrote: > Our requirement is to store per user, many unique results (which is > basically an attempt at some questions ..) so I had thought of having the > userid as the row key and the result id as columns. > > The keys for the result ids are maintaine

Re: Cassandra cluster HW spec (commit log directory vs data file directory)

2011-10-25 Thread David Jeske
On Tue, Oct 25, 2011 at 5:23 AM, Alexandru Sicoe wrote: > At the moment I am partitioning the data in Cassandra in 75 CFs You might consider not using so many column families. I am not a Cassandra expert, but from what I've seen floated around, there is currently a unique memtable, commit log,

Re: Storing pre-sorted data

2011-10-20 Thread David Jeske
> > > 2) If a single key, would adding a file/block/record-level encryption to >> Cassandra solve this problem? If not, why not? Is there something >> special about your encryption methods? >> > > There is nothing special about our encryption methods but will never be > able to encrypt or decrypt

Re: Storing pre-sorted data

2011-10-18 Thread David Jeske
On Tue, Oct 18, 2011 at 12:14 AM, Matthias Pfau wrote: > we want to sort completely on the client-side (where the data is > encrypted). But that requires an "insert at offset X" operation. We would > always use CL QUORUM and client side synchronisation. > You can do "insert at offset X"... just

Re: Storing pre-sorted data

2011-10-17 Thread David Jeske
On Mon, Oct 17, 2011 at 2:39 AM, Matthias Pfau wrote: > We would be very happy if cassandra would give us an option to maintain the > sort order on our own (application logic). That is why it would be > interesting to hear from any of the developers if it would be easily > possible to add such a

Re: Storing pre-sorted data

2011-10-15 Thread David Jeske
Logically, whether you use Cassandra or not, there is some "physics" of sorted-order structures which you should understand and which dictates what is possible. In order to keep data sorted, a database needs to be able to see the proper sort-order of the data "all the time", not just at insertion or query

Re: Not all data structures need timestamps (and don't require wasted memory).

2011-09-12 Thread David Jeske
After writing my message, I recognized a scenario you might be referring to, Kevin. If I understand correctly, you're not referring to set-membership in the general sense, where one could add and remove entries. General set-membership, in the context of eventual-consistency, requires timestamps. Th

Re: Not all data structures need timestamps (and don't require wasted memory).

2011-09-12 Thread David Jeske
On Sat, Sep 3, 2011 at 8:26 PM, Kevin Burton wrote: > The point is that replication in Cassandra only needs timestamps to handle > out of order writes … for values that are idempotent, this isn't necessary. > The order doesn't matter. > I believe this is a misunderstanding of how idempotency a

Re: cassandra vs hbase summary (was facebook messaging)

2010-11-27 Thread David Jeske
Thanks for all the great answers last week about Cassandra. I have an additional question about Cassandra and columns/supercolumns. I had naively assumed that columns and super-columns map to an internal row-key (like how in Bigtable the indexed map is row/column-key/timestamp to data), but some pe

Re: cassandra vs hbase summary (was facebook messaging)

2010-11-22 Thread David Jeske
> My point still applies though. Caching HFile blocks on a single node >> vs individual "dataums" on N nodes may not be more efficient. Thus >> terms like "Slower" and "Less Efficient" could be very misleading. >> > I seem to have missed this the first time around. Next time I correct the summary I

Re: cassandra vs hbase summary (was facebook messaging)

2010-11-22 Thread David Jeske
This is my second attempt at a summary of Cassandra vs HBase consistency and performance for an HBase-acceptable workload. I think these tricky subtleties are hard to understand, yet it's helpful for the community to understand them. I'm not trying to state my own facts (or opinion) but merely summa

Re: cassandra vs hbase summary (was facebook messaging)

2010-11-22 Thread David Jeske
On Mon, Nov 22, 2010 at 2:44 PM, David Jeske wrote: > On Mon, Nov 22, 2010 at 2:39 PM, Edward Capriolo wrote: > >> Return messages such as "your data was written to at least 1 node but >> not enough to make your write-consistency count" do not help the >> si

Re: cassandra vs hbase summary (was facebook messaging)

2010-11-22 Thread David Jeske
On Mon, Nov 22, 2010 at 2:39 PM, Edward Capriolo wrote: > Return messages such as "your data was written to at least 1 node but > not enough to make your write-consistency count" do not help the > situation. As the client that writes the data would be aware of the > inconsistency, but the other c

Re: cassandra vs hbase summary (was facebook messaging)

2010-11-22 Thread David Jeske
On Mon, Nov 22, 2010 at 11:52 AM, Todd Lipcon wrote: > Not quite. The replica synchronization code is pretty messy, but basically > it will take the longest replica that may have been synced, not a quorum. > > i.e., the guarantee is that "if you successfully sync() data, it will be > present after

Re: cassandra vs hbase summary (was facebook messaging)

2010-11-22 Thread David Jeske
On Mon, Nov 22, 2010 at 1:26 PM, Edward Capriolo wrote: > For cassandra all writes must be transmitted to all replicas. > I thought that was only true if you set the number of replicas required for the write to the same as the number of replicas. Further, we've established in this thread that ev

Re: cassandra vs hbase summary (was facebook messaging)

2010-11-22 Thread David Jeske
> > 2) Cassandra has a less efficient memory footprint for data pinned in > memory (or cached). With 3 replicas on Cassandra, each element of data > pinned in-memory is kept in memory on 3 servers, whereas in HBase only > region masters keep the data in memory, so there is only one copy of > each data e

Re: cassandra vs hbase summary (was facebook messaging)

2010-11-22 Thread David Jeske
I already noticed a mistake in my own facts... On Mon, Nov 22, 2010 at 10:01 AM, David Jeske wrote: > *4) Cassandra (N3/W3/R1) takes longer to allow data to become writable > again in the face of a node-failure than HBase/HDFS.* Cassandra must > repair the keyrange to bring N from

cassandra vs hbase summary (was facebook messaging)

2010-11-22 Thread David Jeske
I haven't used either Cassandra or HBase, so please don't take any part of this message as me attempting to state facts about either system. However, I'm very familiar with data-storage design details, and I've worked extensively optimizing applications running on MySQL, Oracle, BerkeleyDB (including