I'm transitioning my job from the embedded space to the Hadoop space, and I 
was wondering if it is possible to come up with a SQLite cluster adaptation.

I will give you a crash course in Hadoop. Basically, we get a very large CSV 
file, which is chopped up into 64MB chunks and distributed to a number of 
nodes. The file is actually replicated twice, for a total of three copies of 
every chunk on the cluster (no two copies of a chunk are stored on the same 
node). Then MapReduce logic is run, and the results are combined. Instrumental 
to this is that the keys are delivered to the reducers in sorted order.
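
To make that concrete, here is a minimal sketch of the kind of MapReduce 
logic that runs over those chunks. It is a simple key-count job using the 
standard org.apache.hadoop.mapreduce classes; the class names and the CSV 
layout are my own invention:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KeyCount {

        // Runs locally on each 64MB chunk; emits (key, 1) per CSV line.
        public static class KeyMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Hypothetical layout: the first CSV field is the key.
                String key = line.toString().split(",", 2)[0];
                ctx.write(new Text(key), ONE);
            }
        }

        // The framework sorts between map and reduce, so each reducer
        // sees its keys in sorted order (the property I called
        // instrumental above).
        public static class KeyReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> counts,
                    Context ctx) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }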

All of this is done in Java (70% slower than C, on average, and with some 
non-trivial start-up cost). Everyone is clamoring for SQL to be run on the 
nodes. Hive attempts to leverage SQL, and is successful to some degree, but 
being able to use full SQL would be a huge improvement. Akin to Hadoop is 
HBase.

HBase is similar to Hadoop, but it approaches things in a more conventional 
columnar format. It is a copy of Google's "BigTable". Here, the notion of 
"column families" is important, because column families are files. A row is 
made up of a key and at least one column family. There is an implied join 
between the key and each column family. When the table is viewed as a whole, 
though, it is viewed as a join between the key and all column families. What 
belongs in a column family (cf) is not specified; the idea, however, is to 
group columns into cfs by usage. That is, cf1 holds your most commonly needed 
data, and cfN the least often needed.
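
To ground that, here is roughly what writing and reading against column 
families looks like in the classic HBase client API. The table, row, and 
column names are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CfDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");  // hypothetical table

            // A write names the column family for every cell it touches.
            Put p = new Put(Bytes.toBytes("row-42"));  // the row key
            p.add(Bytes.toBytes("cf1"), Bytes.toBytes("name"),
                  Bytes.toBytes("Alice"));
            p.add(Bytes.toBytes("cfN"), Bytes.toBytes("history"),
                  Bytes.toBytes("..."));
            table.put(p);

            // A read that asks only for cf1 only touches cf1's files;
            // that is the payoff of grouping columns by usage.
            Get g = new Get(Bytes.toBytes("row-42"));
            g.addFamily(Bytes.toBytes("cf1"));
            Result r = table.get(g);
            byte[] name = r.getValue(Bytes.toBytes("cf1"),
                                     Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));

            table.close();
        }
    }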

HBase is queried through a specialized API, like the one shown above. This 
API is written to work over very large datasets, operating directly on the 
data. However, not all uses of HBase need this. The majority of queries are 
distributed just because they run over a huge dataset, with a modest number 
of rows returned. Distribution allows for much more parallel disk reading. 
For that case, a SQLite cluster makes perfect sense.
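
For reference, that common case in HBase is a range scan: a modest key 
interval out of a huge table, with every region that holds part of the range 
reading its piece in parallel. A sketch, reusing the hypothetical table from 
above:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeQuery {
        // Scan a modest key range out of a huge table; each region
        // holding part of the range reads its piece from local disk.
        static void printRange(HTable table, String from, String to)
                throws IOException {
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("cf1"));  // only the hot family
            scan.setStartRow(Bytes.toBytes(from));
            scan.setStopRow(Bytes.toBytes(to));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            } finally {
                scanner.close();
            }
        }
    }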

Mapping all of this to SQLite, I can see that a bit of work could go a long 
way. Column families can be implemented as separate database files, which are 
ATTACHed and joined as needed (sketched below). The most complicated 
operation is a join, where we have to send the list of distinct values of the 
join key to all other nodes for join matching. We then have to move all of 
the matching data to the same node to perform the join.
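
The per-node piece is nearly free in SQLite already. A minimal sketch, 
assuming a JDBC SQLite driver (e.g. xerial's sqlite-jdbc) and assuming each 
column-family file holds a single table t keyed by key:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class CfJoin {
        public static void main(String[] args) throws Exception {
            // cf1.db is this node's chunk of the hottest column family.
            Connection conn =
                DriverManager.getConnection("jdbc:sqlite:cf1.db");
            Statement st = conn.createStatement();

            // Pull in a second column family only because this query
            // happens to need it.
            st.execute("ATTACH DATABASE 'cf2.db' AS cf2");

            // The implied join between the key and each column family,
            // made explicit.
            ResultSet rs = st.executeQuery(
                "SELECT a.key, a.name, b.history" +
                "  FROM t AS a JOIN cf2.t AS b ON a.key = b.key");
            while (rs.next()) {
                System.out.println(rs.getString("key") + ": "
                                   + rs.getString("name"));
            }
            conn.close();
        }
    }

The distinct-value list for the distributed join is then just a SELECT 
DISTINCT key run on each node before any rows have to move.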

The non-data input is a traditional SQL statement, but we will have to parse 
and restructure the statement to add the joins for the needed column 
families. Also needed is a way to ship a row to another server for 
processing.
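
Row shipping could start out embarrassingly simple. A hypothetical sketch; 
the wire format here is invented:

    import java.io.DataOutputStream;
    import java.net.Socket;

    public class RowShipper {
        // Send one row (key plus column values) to a peer node for
        // join matching.
        public static void ship(String host, int port, String key,
                String[] cols) throws Exception {
            Socket sock = new Socket(host, port);
            try {
                DataOutputStream out =
                    new DataOutputStream(sock.getOutputStream());
                out.writeUTF(key);          // the row key
                out.writeInt(cols.length);  // number of values to follow
                for (String col : cols) {
                    out.writeUTF(col);
                }
                out.flush();
            } finally {
                sock.close();
            }
        }
    }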

I'm just putting this out there as me thinking out loud. I wonder how it would 
turn out. Comments?