I've been playing with Cassandra and have a few questions that I've been stuck on for a while; Googling around didn't seem to help much:
1. What's the quickest way to import a bunch of data from PostgreSQL? I have ~20M rows of mostly text (some long text with newlines, plus blobs). I tried exporting to CSV but had issues with newlines and escaped characters. I also tried writing an ETL tool in Go, roughly the shape of the first sketch below, but it was taking a long time to get through the records.

2. How would I create a "versioned" schema with CQL? AFAIK Cassandra's cell versions are only for conflict resolution. I envision a wide row, with timestamped column names representing fields of data through time. For example, for a CF of web page contents (inspired by Google's Bigtable paper):

    Key            1379649588:body   1379649522:body   1379649123:title
    a.com/1.html   "<html>"                            "A"
    a.com/2.html   "<html>"                            "B"
    b.com/1.html   "<html>"          "<html>"          "C"

But CQL doesn't seem to support this. (Yes, I've read http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows.) Once upon a time it seems Thrift and supercolumns might have worked for this? I'd want to efficiently iterate through the "history" of a particular row (in other words, read all the columns for that row), or efficiently iterate through the latest values for the whole CF (not reading entire rows, just a column slice). In the example above, I'd want to return the latest 'body' entries, with timestamps, for every page ("row"/"key") in the database. Some have talked of having two CFs, one for versioned data and one for current values? (See the second sketch below for how I imagine that might look.)
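For reference, here's a minimal sketch of the kind of ETL tool I mean, reworked with a pool of goroutines so inserts overlap network round trips. It assumes the lib/pq and gocql drivers, and the connection string, keyspace, and table/column names are all made up:

    package main

    import (
        "database/sql"
        "log"

        "github.com/gocql/gocql"
        _ "github.com/lib/pq"
    )

    type page struct{ url, title, body string }

    func main() {
        // Source: PostgreSQL (connection string is a placeholder).
        pg, err := sql.Open("postgres", "postgres://localhost/crawl?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer pg.Close()

        // Destination: Cassandra via gocql.
        cluster := gocql.NewCluster("127.0.0.1")
        cluster.Keyspace = "crawl"
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // Stream rows out of Postgres instead of loading them all at once.
        rows, err := pg.Query(`SELECT url, title, body FROM pages`)
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()

        // A pool of writers; a single goroutine doing one synchronous
        // INSERT at a time is what made my first attempt slow.
        ch := make(chan page, 256)
        done := make(chan struct{})
        const workers = 16
        for i := 0; i < workers; i++ {
            go func() {
                defer func() { done <- struct{}{} }()
                for p := range ch {
                    if err := session.Query(
                        `INSERT INTO pages (url, title, body) VALUES (?, ?, ?)`,
                        p.url, p.title, p.body,
                    ).Exec(); err != nil {
                        log.Println("insert:", err)
                    }
                }
            }()
        }

        for rows.Next() {
            var p page
            if err := rows.Scan(&p.url, &p.title, &p.body); err != nil {
                log.Fatal(err)
            }
            ch <- p
        }
        close(ch)
        for i := 0; i < workers; i++ {
            <-done
        }
    }

(The worker count and channel buffer are guesses; I haven't tuned anything.)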
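And here's roughly how I imagine the versioned schema could look in CQL 3, with the version timestamp as a clustering column plus a second table for current values (again, the keyspace and all table/column names are invented; the CQL is in the comments and query strings):

    package main

    import (
        "log"
        "time"

        "github.com/gocql/gocql"
    )

    // Schema sketch (CQL 3). The version timestamp is a clustering column,
    // so each url's partition stays "wide" and is stored sorted newest-first:
    //
    //   CREATE TABLE page_versions (
    //       url   text,
    //       ts    timestamp,
    //       title text,
    //       body  text,
    //       PRIMARY KEY (url, ts)
    //   ) WITH CLUSTERING ORDER BY (ts DESC);
    //
    //   -- The "current values" CF: one row per url, overwritten on every
    //   -- write, so scanning the latest values never touches history.
    //   CREATE TABLE page_latest (
    //       url   text PRIMARY KEY,
    //       ts    timestamp,
    //       title text,
    //       body  text
    //   );

    func main() {
        cluster := gocql.NewCluster("127.0.0.1")
        cluster.Keyspace = "crawl"
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // Write each new version to both tables.
        now := time.Now()
        for _, stmt := range []string{
            `INSERT INTO page_versions (url, ts, title, body) VALUES (?, ?, ?, ?)`,
            `INSERT INTO page_latest (url, ts, title, body) VALUES (?, ?, ?, ?)`,
        } {
            if err := session.Query(stmt, "a.com/1.html", now, "A", "<html>").Exec(); err != nil {
                log.Fatal(err)
            }
        }

        // History of one page, newest first: a slice of a single wide row.
        var ts time.Time
        var url, body string
        iter := session.Query(
            `SELECT ts, body FROM page_versions WHERE url = ? LIMIT 10`,
            "a.com/1.html",
        ).Iter()
        for iter.Scan(&ts, &body) {
            log.Println(ts, len(body))
        }
        if err := iter.Close(); err != nil {
            log.Fatal(err)
        }

        // Latest values for every page come from the second table.
        iter = session.Query(`SELECT url, ts, body FROM page_latest`).Iter()
        for iter.Scan(&url, &ts, &body) {
            log.Println(url, ts, len(body))
        }
        if err := iter.Close(); err != nil {
            log.Fatal(err)
        }
    }

No idea whether double-writing like this is the idiomatic answer; it's just my reading of the two-CF suggestion.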

I've been struggling because most of the documentation revolves around Java; I'm most comfortable with Ruby and (increasingly) Go. I'd appreciate any insights. I'd really like to get Cassandra going for real. It's been such a pleasure to set up vs. HBase and whatnot.

Keith