I've been playing with Cassandra and have a few questions that I've been stuck on for a while; Googling around didn't seem to help much:
1. What's the quickest way to import a bunch of data from PostgreSQL? I have ~20M rows of mostly text (some long text with newlines, plus blobs). I tried exporting to CSV but had issues with newlines and escaped characters. I also tried writing an ETL tool in Go, roughly the shape of the first sketch below, but it was taking a long time to get through the records.

2. How would I create a "versioned" schema with CQL? AFAIK Cassandra's cell versions are only for conflict resolution. I envision a wide row, with timestamped column names representing fields of data through time. For example, for a CF of web page contents (inspired by Google's Bigtable paper):

    Key            1379649588:body   1379649522:body   1379649123:title
    a.com/1.html   "<html>"                            "A"
    a.com/2.html   "<html>"                            "B"
    b.com/1.html   "<html>"          "<html>"          "C"

But CQL doesn't seem to support this. (Yes, I've read http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows.) Once upon a time it seems Thrift and supercolumns might have worked for this? I'd want to efficiently iterate through the "history" of a particular row (in other words, read all the columns for that row), or efficiently iterate through the latest values for the whole CF (not reading entire rows, just a column slice). In the example above, I'd want to return the latest 'body' entries, with timestamps, for every page ("row"/"key") in the database. Some have talked of having two CFs, one for versioned data and one for current values? (See the second sketch below for how I imagine that might look.)
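For reference, here's a minimal sketch of the kind of ETL tool I mean, reworked with a pool of goroutines so inserts overlap network round trips. It assumes the lib/pq and gocql drivers, and the connection string, keyspace, and table/column names are all made up:

    package main

    import (
        "database/sql"
        "log"

        "github.com/gocql/gocql"
        _ "github.com/lib/pq"
    )

    type page struct{ url, title, body string }

    func main() {
        // Source: PostgreSQL (connection string is a placeholder).
        pg, err := sql.Open("postgres", "postgres://localhost/crawl?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer pg.Close()

        // Destination: Cassandra via gocql.
        cluster := gocql.NewCluster("127.0.0.1")
        cluster.Keyspace = "crawl"
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // Stream rows out of Postgres instead of loading them all at once.
        rows, err := pg.Query(`SELECT url, title, body FROM pages`)
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()

        // A pool of writers; a single goroutine doing one synchronous
        // INSERT at a time is what made my first attempt slow.
        ch := make(chan page, 256)
        done := make(chan struct{})
        const workers = 16
        for i := 0; i < workers; i++ {
            go func() {
                defer func() { done <- struct{}{} }()
                for p := range ch {
                    if err := session.Query(
                        `INSERT INTO pages (url, title, body) VALUES (?, ?, ?)`,
                        p.url, p.title, p.body,
                    ).Exec(); err != nil {
                        log.Println("insert:", err)
                    }
                }
            }()
        }

        for rows.Next() {
            var p page
            if err := rows.Scan(&p.url, &p.title, &p.body); err != nil {
                log.Fatal(err)
            }
            ch <- p
        }
        close(ch)
        for i := 0; i < workers; i++ {
            <-done
        }
    }

(The worker count and channel buffer are guesses; I haven't tuned anything.)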
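And here's roughly how I imagine the versioned schema could look in CQL 3, with the version timestamp as a clustering column plus a second table for current values (again, the keyspace and all table/column names are invented; the CQL is in the comments and query strings):

    package main

    import (
        "log"
        "time"

        "github.com/gocql/gocql"
    )

    // Schema sketch (CQL 3). The version timestamp is a clustering column,
    // so each url's partition stays "wide" and is stored sorted newest-first:
    //
    //   CREATE TABLE page_versions (
    //       url   text,
    //       ts    timestamp,
    //       title text,
    //       body  text,
    //       PRIMARY KEY (url, ts)
    //   ) WITH CLUSTERING ORDER BY (ts DESC);
    //
    //   -- The "current values" CF: one row per url, overwritten on every
    //   -- write, so scanning the latest values never touches history.
    //   CREATE TABLE page_latest (
    //       url   text PRIMARY KEY,
    //       ts    timestamp,
    //       title text,
    //       body  text
    //   );

    func main() {
        cluster := gocql.NewCluster("127.0.0.1")
        cluster.Keyspace = "crawl"
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // Write each new version to both tables.
        now := time.Now()
        for _, stmt := range []string{
            `INSERT INTO page_versions (url, ts, title, body) VALUES (?, ?, ?, ?)`,
            `INSERT INTO page_latest (url, ts, title, body) VALUES (?, ?, ?, ?)`,
        } {
            if err := session.Query(stmt, "a.com/1.html", now, "A", "<html>").Exec(); err != nil {
                log.Fatal(err)
            }
        }

        // History of one page, newest first: a slice of a single wide row.
        var ts time.Time
        var url, body string
        iter := session.Query(
            `SELECT ts, body FROM page_versions WHERE url = ? LIMIT 10`,
            "a.com/1.html",
        ).Iter()
        for iter.Scan(&ts, &body) {
            log.Println(ts, len(body))
        }
        if err := iter.Close(); err != nil {
            log.Fatal(err)
        }

        // Latest values for every page come from the second table.
        iter = session.Query(`SELECT url, ts, body FROM page_latest`).Iter()
        for iter.Scan(&url, &ts, &body) {
            log.Println(url, ts, len(body))
        }
        if err := iter.Close(); err != nil {
            log.Fatal(err)
        }
    }

No idea whether double-writing like this is the idiomatic answer; it's just my reading of the two-CF suggestion.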

I've been struggling because most of the documentation revolves around Java; I'm most comfortable with Ruby and (increasingly) Go. I'd appreciate any insights. I'd really like to get Cassandra going for real. It's been such a pleasure to set up vs. HBase and whatnot.

Keith