I'm transitioning from the embedded space to the Hadoop space, and I was wondering whether it would be possible to come up with a SQLite cluster adaptation.
I will give you a crash course in Hadoop. Basically we get a very large CSV, which is chopped into 64MB chunks and distributed to a number of nodes. The file is replicated two more times, for a total of three copies of every chunk on the cluster (no chunk is stored twice on the same node). Then MapReduce logic is run, and the results are combined. Instrumental to this is that keys arrive at the reducers in sorted order. All of this is done in Java (roughly 70% slower than C, on average, and with some non-trivial start-up cost). Everyone is clamoring for SQL to be run on the nodes. Hive attempts to leverage SQL and is successful to some degree, but being able to use full SQL would be a huge improvement.

Akin to Hadoop is HBase. HBase is similar to Hadoop, but it approaches things in a more conventional columnar format; it is a copy of Google's "BigTable". Here the notion of "column families" is important, because column families are files. A row is made up of a key and at least one column family. There is an implied join between the key and each column family; as the table is viewed, though, it is presented as a join between the key and all column families. What goes into a column family (cf) is not prescribed; the idea is to group columns into cfs by usage, so that cf1 holds your most commonly needed data and cfN the least often needed.

HBase is queried through a specialized API, written to work over very large datasets, directly against the data. However, not all uses of HBase need this. The majority of queries are distributed only because they run over a huge dataset, with a modest number of rows returned; distribution allows for much more parallel disk reading. For this case, a SQLite cluster makes perfect sense.

Mapping all of this to SQLite, I can see how a bit of work could go a long way. Column families can be implemented as separate files, which are ATTACHed and joined as needed. The most complicated operation is a join, where we have to send the list of distinct values of the join column to all other nodes for join matching, and then move all of the matching data to the same node for the join itself. The non-data input is a traditional SQL statement, but we will have to parse and restructure the statement to ATTACH and join the needed column families. Also needed is a way to ship a row to another server for processing.

I'm just putting this out there as me thinking out loud; a few rough sketches of what I have in mind are at the end of this mail. I wonder how it would turn out. Comments?
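To make the crash course concrete, here is roughly what a job like this looks like in the Java MapReduce API. Everything specific (the class names, the idea of counting rows per key, the CSV's first column being the key) is made up for illustration; the point is the map/sort/reduce shape, with keys arriving at the reducer in sorted order.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical job: count rows per key in the big CSV. Each mapper
    // sees one 64MB split; the framework sorts keys before the reduce.
    public class CsvKeyCount {
      public static class KeyMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          // Assume the first CSV column is the key.
          outKey.set(line.toString().split(",", 2)[0]);
          ctx.write(outKey, ONE);
        }
      }
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : vals) sum += v.get();
          // Keys show up here in sorted order.
          ctx.write(key, new IntWritable(sum));
        }
      }
    }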
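For contrast, this is roughly what fetching a single row through the specialized HBase client API looks like when you only need one column family (older HTable-style client; the table, row key, and column names are made up). Because cf1 is its own file, only that file gets read:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // Fetch just cf1 for one row -- the payoff of grouping columns
    // into families by usage.
    public class CfFetch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // hypothetical table
        Get get = new Get(Bytes.toBytes("row-42"));   // hypothetical row key
        get.addFamily(Bytes.toBytes("cf1"));          // restrict to one family
        Result r = table.get(get);
        byte[] v = r.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(v));
        table.close();
      }
    }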
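The column-families-as-files idea maps directly onto ATTACH. A minimal sketch, assuming the xerial sqlite-jdbc driver and made-up names: cf1.db holds table cf1, cf2.db holds table cf2, and both are keyed on rowkey.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Each column family is its own SQLite file on the node. A query
    // ATTACHes the families it needs and joins them on the row key --
    // the "implied join" made explicit.
    public class ColumnFamilyJoin {
      public static void main(String[] args) throws Exception {
        Class.forName("org.sqlite.JDBC"); // xerial driver (assumed)
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:cf1.db");
             Statement s = c.createStatement()) {
          s.execute("ATTACH DATABASE 'cf2.db' AS cf2");
          ResultSet rs = s.executeQuery(
              "SELECT cf1.rowkey, cf1.name, f2.payload " +
              "FROM cf1 JOIN cf2.cf2 AS f2 ON f2.rowkey = cf1.rowkey");
          while (rs.next()) {
            System.out.println(rs.getString(1) + " | " + rs.getString(2)
                + " | " + rs.getString(3));
          }
        }
      }
    }

The query rewriter described above would do this mechanically: take the incoming SQL, figure out which column families the referenced columns live in, emit the ATTACH statements, and restructure the FROM clause into the join.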
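And the join coordination itself, in outline. Everything here (NodeClient, Row, both methods) is a hypothetical placeholder for whatever RPC and row-shipping layer the cluster would actually need; the point is just the three-step shape: exchange distinct join values, filter the local rows, ship the matches to one node so SQLite can run the join there.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    public class DistributedJoin {

      // Hypothetical RPC surface to a peer node.
      interface NodeClient {
        Set<String> distinctJoinValues(String column); // step 1: exchange keys
        void shipRows(List<Row> rows);                 // step 3: move the data
      }

      // Hypothetical serialized row: its join value plus an opaque payload.
      static class Row {
        final String joinValue;
        final byte[] payload;
        Row(String joinValue, byte[] payload) {
          this.joinValue = joinValue;
          this.payload = payload;
        }
      }

      static void joinWith(NodeClient peer, List<Row> localRows, String column) {
        // 1. Ask the peer for the distinct values of its join column.
        Set<String> remote = peer.distinctJoinValues(column);
        // 2. Keep only the local rows that can possibly match.
        List<Row> matching = new ArrayList<Row>();
        for (Row r : localRows) {
          if (remote.contains(r.joinValue)) matching.add(r);
        }
        // 3. Ship them over; the receiving node runs the actual SQL JOIN.
        peer.shipRows(matching);
      }
    }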