Paul,

Thanks for your helpful comments.
I am increasingly thinking that a distributed database might work out better for my purposes.

On Mar 14, 2020, 12:44 AM -0700, Paul Rogers <par0...@yahoo.com.invalid>, wrote:
> Hi Dobes,
>
> Updating data in this way does seem to be a challenge. Hive added ACID features a while back, but they are fiendishly complex: every record is given an ID. A delete or replace creates an edit entry in another file that uses the same ID. Hive then joins the main and update files to apply the updates, and anti-joins the deletes file to remove deletions. It seems to work, but it does seem fragile and expensive.
>
> For how long are your files open to revision? Days? Weeks? The whole school term? Can deletes arrive after the school term? I'm wondering how dynamic the data is. Is there a "frothy" present, but then a stable history? Or is the whole history frothy?
>
> If revisions are short-term (due to correcting errors, make-up tests, etc.), you are trying to create a streaming database as described in the book Streaming Systems [1]. If you read that book online, it has lots of nice animations showing the various issues, and how late-arriving values can replace earlier values. Might spark some ideas.
>
> Depending on how fancy you want to get, you can create a fake directory structure in S3 and let Drill's usual partition pruning reduce the number of files for each query.
>
> Or, you can create an index database that holds the complete list of your files, along with metadata (maybe the school, class, date, assignment, whatever). You can write custom code that first looks up filter conditions in your index DB to return a list of files. From that, pass the files to Drill in place of Drill's partition pruning. This is not an "out of the box" task; you'd basically be writing a plugin that replaces Drill's normal directory & partition push-down code.
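Paul's index-database idea can be sketched in a few lines, using an in-memory SQLite database as the index store. This is only an illustration of the lookup-then-hand-files-to-Drill flow; every table, column, school, and S3 path name below is a hypothetical stand-in, and a real system would also need the delayed-delete step he describes.

```python
import sqlite3

# Minimal sketch of an "index DB": a table mapping partition metadata
# to the currently live Parquet file locations.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE file_index (
           school    TEXT,
           course    TEXT,
           test_date TEXT,
           s3_path   TEXT)"""
)
conn.executemany(
    "INSERT INTO file_index VALUES (?, ?, ?, ?)",
    [
        ("lincoln", "math-101", "2020-03-01", "s3://bucket/lincoln/math-101/2020-03-01.parquet"),
        ("lincoln", "math-101", "2020-03-02", "s3://bucket/lincoln/math-101/2020-03-02.parquet"),
        ("roosevelt", "bio-201", "2020-03-01", "s3://bucket/roosevelt/bio-201/2020-03-01.parquet"),
    ],
)

def files_for(school):
    """Resolve a filter condition against the index, returning the file
    list that would be handed to Drill in place of a directory scan."""
    rows = conn.execute(
        "SELECT s3_path FROM file_index WHERE school = ? ORDER BY test_date",
        (school,),
    )
    return [path for (path,) in rows]

def replace_file(school, course, test_date, new_path):
    """Swap the index entry first; the obsolete Parquet file is only
    deleted later, once in-flight queries against it have drained."""
    conn.execute(
        "UPDATE file_index SET s3_path = ? "
        "WHERE school = ? AND course = ? AND test_date = ?",
        (new_path, school, course, test_date),
    )

print(files_for("lincoln"))
```

New queries pick up a replaced file as soon as its index row is updated, which is what makes the swap effectively atomic from the query side.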
> With the index DB, replacing a Parquet file is as easy as replacing the file at a particular coordinate in your index. Doing this also avoids race conditions: you can replace the index entry, wait a few hours to ensure all in-flight queries using the old file complete, then delete the obsolete Parquet file.
>
> If queries are local (for a single teacher, student, or school), the number of files can be small (with good data localization: all of the data from one district in one file, say). So, Drill might normally scan a handful (dozens, a few hundred) of files. If, however, you are computing nation-wide trends across all data, then all files might be involved. In the localized case, having a good index would help you keep the number of files per query small.
>
> Hive, by the way, helps with this because Hive maps table columns to directories. You could have partitions for, say, school district, teacher, class, student, and test, which would directly map to terms in your actual queries. With Drill, you have to do the mapping yourself, which is a hassle. With a roll-your-own index, you can reproduce the Hive behavior. (Yes, we should implement the Hive pattern in Drill - perhaps an outcome of the Drill metastore.)
>
> Are you planning to combine this with Mongo for the most recent data? If so, then your index mechanism can also identify when to use Mongo, and when data has migrated to S3.
>
> Just out of curiosity, if your data changes frequently, have you considered a distributed DB such as Cassandra or the like? A Google search suggested there are about 60 million students in the US. Let's assume you are wildly successful, and serve all of them. There are 250 school days per year (IIRC). Let's say each student takes 5 tests per day. (My kids would revolt, but still.) That is 75 billion data points per school year. At, say, 100 bytes per record, that is about 8 TB of data.
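The back-of-envelope sizing above checks out; restated as a quick calculation (all inputs are the estimates from the message, not independent figures):

```python
# Checking the sizing estimate: students x school days x tests/day,
# then x bytes/record.
students = 60_000_000      # ~60 million US students (Paul's figure)
school_days = 250          # school days per year
tests_per_day = 5
bytes_per_record = 100

records_per_year = students * school_days * tests_per_day
total_tb = records_per_year * bytes_per_record / 1e12

print(records_per_year)  # 75_000_000_000 -> the "75 billion data points"
print(total_tb)          # 7.5 -> roughly the "8 TB" quoted
```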
> Seems like something a small cluster could handle, with the number of nodes probably driven by query rate.
>
> Maybe break the problem into pieces? Ignoring the update issue, could you get good query performance from S3 with some partitioning strategy? Could you get good performance from Mongo (or Cassandra or whatever)? If a static solution won't perform well, a dynamic-update one probably won't be any better.
>
> Anyone else who's built this kind of system who can offer suggestions?
>
> Thanks,
> - Paul
>
> [1] https://learning.oreilly.com/library/view/streaming-systems/9781491983867/
>
>
> On Friday, March 13, 2020, 9:35:13 PM PDT, Dobes Vandermeer <dob...@gmail.com> wrote:
>
> Hi,
>
> I've been thinking about how I might be able to get a good level of performance from Drill while still having data that updates and while storing the data in S3. Maybe this is a pipe dream, but here are some thoughts and questions.
>
> What I would like to be able to do is to update, replace, and re-balance the Parquet files in S3, but I don't want to calculate and specify the whole list of files that are "current" in each query.
>
> I was thinking perhaps I could use a view, so when I replace a file I can add a new file, update the view to include it, and then delete the old file.
>
> But I'm worried that having a view with thousands of files could perform poorly.
>
> Building on that idea, it occurred to me that perhaps I could have a hierarchy of views - views of views. For example, a view for each day, rolled into a view for each month, rolled into a view for each year, rolled into a top-level view. This could be useful if Drill could somehow prune views, but I haven't seen any mention of that in the docs.
>
> It seems like Apache Iceberg is designed to help with this, but it doesn't support S3 currently, and I'm not sure if it will (or can) anytime soon.
>
> Does anyone have any thoughts or experience to share in this area?
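The views-of-views idea could be maintained mechanically, for example by generating the UNION ALL view DDL from the current file lists. This is a rough sketch only: the view names, the `dfs` workspace, and the file paths are hypothetical, and it says nothing about whether Drill would actually prune through such nested views.

```python
# Generate daily views over Parquet files, then roll daily views into
# a monthly view. All names and paths below are made up for the sketch.
def day_view(day, files):
    """One view per day, unioning that day's current Parquet files."""
    body = " UNION ALL ".join(f"SELECT * FROM dfs.`{f}`" for f in files)
    return f"CREATE OR REPLACE VIEW day_{day} AS {body}"

def rollup_view(name, child_views):
    """A higher-level view unioning child views (month over days, etc.)."""
    body = " UNION ALL ".join(f"SELECT * FROM {v}" for v in child_views)
    return f"CREATE OR REPLACE VIEW {name} AS {body}"

day = day_view("20200301", [
    "bucket/2020/03/01/a.parquet",
    "bucket/2020/03/01/b.parquet",
])
month = rollup_view("month_202003", ["day_20200301", "day_20200302"])
print(day)
print(month)
```

Replacing a file then means regenerating and re-running only that one day's view DDL, leaving the month and year rollups untouched.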
> Maybe what I am really looking for is some other database entirely - some kind of scalable database that supports updates and scales horizontally. Maybe Drill just isn't like that right now.