Paul,

Thanks for your helpful comments.
I am increasingly thinking that a distributed database might work out better for my purposes.

On Mar 14, 2020, 12:44 AM -0700, Paul Rogers <par0...@yahoo.com.invalid>, wrote:
> Hi Dobes,
>
> Updating data in this way does seem to be a challenge. Hive added ACID features a while back, but they are fiendishly complex: every record is given an ID. A delete or replace creates an edit entry in another file that uses the same ID. Hive then joins the main and update files to apply the updates, and anti-joins the deletes file to remove deletions. It seems to work, but it does seem fragile and expensive.
>
> For how long are your files open to revision? Days? Weeks? The whole school term? Can deletes arrive after the school term? I'm wondering how dynamic the data is. Is there a "frothy" present, but then a stable history? Or is the whole history frothy?
>
> If revisions are short-term (due to correcting errors, make-up tests, etc.), you are trying to create a streaming database as described in the book Streaming Systems [1]. If you read that book online, it has lots of nice animations showing the various issues, and how late-arriving values can replace earlier values. Might spark some ideas.
>
> Depending on how fancy you want to get, you can create a fake directory structure in S3 and let Drill's usual partition pruning reduce the number of files for each query.
>
> Or, you can create an index database that holds the complete list of your files, along with metadata (maybe the school, class, date, assignment, whatever). You can write custom code that first looks up filter conditions in your index DB to return a list of files. From that, pass the files to Drill in place of Drill's partition pruning. This is not an "out of the box" task; you'd basically be writing a plugin that replaces Drill's normal directory & partition push-down code.
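Paul's index-database idea can be sketched in a few lines, using an in-memory SQLite database as the index store. This is only an illustration of the lookup-then-hand-files-to-Drill flow; every table, column, school, and S3 path name below is a hypothetical stand-in, and a real system would also need the delayed-delete step he describes.

```python
import sqlite3

# Minimal sketch of an "index DB": a table mapping partition metadata
# to the currently live Parquet file locations.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE file_index (
           school    TEXT,
           course    TEXT,
           test_date TEXT,
           s3_path   TEXT)"""
)
conn.executemany(
    "INSERT INTO file_index VALUES (?, ?, ?, ?)",
    [
        ("lincoln", "math-101", "2020-03-01", "s3://bucket/lincoln/math-101/2020-03-01.parquet"),
        ("lincoln", "math-101", "2020-03-02", "s3://bucket/lincoln/math-101/2020-03-02.parquet"),
        ("roosevelt", "bio-201", "2020-03-01", "s3://bucket/roosevelt/bio-201/2020-03-01.parquet"),
    ],
)

def files_for(school):
    """Resolve a filter condition against the index, returning the file
    list that would be handed to Drill in place of a directory scan."""
    rows = conn.execute(
        "SELECT s3_path FROM file_index WHERE school = ? ORDER BY test_date",
        (school,),
    )
    return [path for (path,) in rows]

def replace_file(school, course, test_date, new_path):
    """Swap the index entry first; the obsolete Parquet file is only
    deleted later, once in-flight queries against it have drained."""
    conn.execute(
        "UPDATE file_index SET s3_path = ? "
        "WHERE school = ? AND course = ? AND test_date = ?",
        (new_path, school, course, test_date),
    )

print(files_for("lincoln"))
```

New queries pick up a replaced file as soon as its index row is updated, which is what makes the swap effectively atomic from the query side.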
> With the index DB, replacing a Parquet file is as easy as replacing the file at a particular coordinate in your index. Doing this also avoids race conditions: you can replace the index entry, wait a few hours to ensure all in-flight queries using the old file complete, then delete the obsolete Parquet file.
>
> If queries are local (for a single teacher, student, or school), the number of files can be small (with good data localization: all of the data from one district in one file, say). So, Drill might normally scan a handful (dozens, a few hundred) of files. If, however, you are computing nation-wide trends across all data, then all files might be involved. In the localized case, having a good index would help you keep the number of files per query small.
>
> Hive, by the way, helps with this because Hive maps table columns to directories. You could have partitions for, say, school district, teacher, class, student, and test, which would directly map to terms in your actual queries. With Drill, you have to do the mapping yourself, which is a hassle. With a roll-your-own index, you can reproduce the Hive behavior. (Yes, we should implement the Hive pattern in Drill - perhaps an outcome of the Drill metastore.)
>
> Are you planning to combine this with Mongo for the most recent data? If so, then your index mechanism can also identify when to use Mongo, and when data has migrated to S3.
>
> Just out of curiosity, if your data changes frequently, have you considered a distributed DB such as Cassandra or the like? A Google search suggested there are about 60 million students in the US. Let's assume you are wildly successful, and serve all of them. There are 250 school days per year (IIRC). Let's say each student takes 5 tests per day. (My kids would revolt, but still.) That is 75 billion data points per school year. At, say, 100 bytes per record, that is about 8 TB of data.
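The back-of-envelope sizing above checks out; restated as a quick calculation (all inputs are the estimates from the message, not independent figures):

```python
# Checking the sizing estimate: students x school days x tests/day,
# then x bytes/record.
students = 60_000_000      # ~60 million US students (Paul's figure)
school_days = 250          # school days per year
tests_per_day = 5
bytes_per_record = 100

records_per_year = students * school_days * tests_per_day
total_tb = records_per_year * bytes_per_record / 1e12

print(records_per_year)  # 75_000_000_000 -> the "75 billion data points"
print(total_tb)          # 7.5 -> roughly the "8 TB" quoted
```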
> Seems like something a small cluster could handle, with the number of nodes probably driven by query rate.
>
> Maybe break the problem into pieces? Ignoring the update issue, could you get good query performance from S3 with some partitioning strategy? Could you get good performance from Mongo (or Cassandra or whatever)? If a static solution won't perform well, a dynamic-update one probably won't be any better.
>
> Anyone else who's built this kind of system who can offer suggestions?
>
> Thanks,
> - Paul
>
> [1] https://learning.oreilly.com/library/view/streaming-systems/9781491983867/
>
>
> On Friday, March 13, 2020, 9:35:13 PM PDT, Dobes Vandermeer <dob...@gmail.com> wrote:
>
> Hi,
>
> I've been thinking about how I might be able to get a good level of performance from Drill while still having data that updates and while storing the data in S3. Maybe this is a pipe dream, but here are some thoughts and questions.
>
> What I would like to be able to do is to update, replace, and re-balance the Parquet files in S3, but I don't want to calculate and specify the whole list of files that are "current" in each query.
>
> I was thinking perhaps I could use a view, so when I replace a file I can add a new file, update the view to include it, and then delete the old file.
>
> But I'm worried that having a view with thousands of files could perform poorly.
>
> Building on that idea, it occurred to me that perhaps I could have a hierarchy of views - views of views. For example, a view for each day, rolled into a view for each month, rolled into a view for each year, rolled into a top-level view. This could be useful if Drill could somehow prune views, but I haven't seen any mention of that in the docs.
>
> It seems like Apache Iceberg is designed to help with this, but it doesn't support S3 currently, and I'm not sure if it will (or can) anytime soon.
>
> Does anyone have any thoughts or experience to share in this area?
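The views-of-views idea could be maintained mechanically, for example by generating the UNION ALL view DDL from the current file lists. This is a rough sketch only: the view names, the `dfs` workspace, and the file paths are hypothetical, and it says nothing about whether Drill would actually prune through such nested views.

```python
# Generate daily views over Parquet files, then roll daily views into
# a monthly view. All names and paths below are made up for the sketch.
def day_view(day, files):
    """One view per day, unioning that day's current Parquet files."""
    body = " UNION ALL ".join(f"SELECT * FROM dfs.`{f}`" for f in files)
    return f"CREATE OR REPLACE VIEW day_{day} AS {body}"

def rollup_view(name, child_views):
    """A higher-level view unioning child views (month over days, etc.)."""
    body = " UNION ALL ".join(f"SELECT * FROM {v}" for v in child_views)
    return f"CREATE OR REPLACE VIEW {name} AS {body}"

day = day_view("20200301", [
    "bucket/2020/03/01/a.parquet",
    "bucket/2020/03/01/b.parquet",
])
month = rollup_view("month_202003", ["day_20200301", "day_20200302"])
print(day)
print(month)
```

Replacing a file then means regenerating and re-running only that one day's view DDL, leaving the month and year rollups untouched.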
> Maybe what I am really looking for is some other database entirely - some kind of scalable database that supports updates and scales horizontally. Maybe Drill just isn't like that right now.