Re: Mongo column types

2020-02-27 Thread Paul Rogers
Hi Dobes, I like your idea! I think it would be a great addition to Drill in general and will show the way for other storage plugins. Technically, you are right, there should not be that much work. We have working examples in other plugins of what would be needed. Perhaps the biggest cost is

Re: Mongo column types

2020-02-27 Thread Dobes Vandermeer
Hi Paul, After looking at the Mongo stuff a bit more today, I realized that probably the simplest solution would be to put some kind of schema mapping into the storage plugin configuration: some subset of JSON Schema using the Drill config syntax. What do you think of this idea? Might actually

RE: Patterns for data updating?

2020-02-27 Thread Lee, David
I update parquet files as follows: A. First save your data in row groups. B. Modify any row groups by removing DELETED records. Delete the row group from the parquet file and append the modified row group to the file. C. Add any new INSERTS as a new row group appended to the file. Alternative

Re: Patterns for data updating?

2020-02-27 Thread Dobes Vandermeer
On 2/27/2020 1:04:07 PM, Nicolas PARIS wrote: > However, updating parquet files can be a bit troublesome. You might be interested in delta-lake which provides an implementation of the sql merge statement on top of parquet files. Implementing a drill connector on this should be feasible. This

Re: Patterns for data updating?

2020-02-27 Thread Dobes Vandermeer
Paul, Ted: Thanks for your excellent responses, I will mull this over. We have 443 million answers currently, so I suppose we are approaching the billions threshold. I had originally thought to do as you suggest - load the most recent data from MongoDB and the history from parquet files. This

Re: Patterns for data updating?

2020-02-27 Thread Nicolas PARIS
> However, updating parquet files can be a bit troublesome. The files cannot easily be appended to. So some process has to periodically re-write the parquet files. Also, we don't want to have hundreds or thousands of separate files, as this can slow down query executing. So we don't

Re: Patterns for data updating?

2020-02-27 Thread Ted Dunning
Yes. I have seen things like this before. Typically, if you have short time-to-visibility requirements, some kind of database is required. If you have large data and long retention requirements, it can be advantageous to roll out to a columnar compressed form like parquet. The design that I have

Re: Patterns for data updating?

2020-02-27 Thread Paul Rogers
Hi Dobes, Also, if Ted is still lurking on this list, he's an expert at this stuff. Here are some patterns I've seen. What you describe is a pretty standard pattern. Substitute anything for "scores" (logs, sales, clicks, GPS tracking locations) and you find that many folks have solved the

Patterns for data updating?

2020-02-27 Thread Dobes Vandermeer
Hi, I am trying to figure out a system that can offer low latency in generating reports and low latency between data being collected and being available in the reporting system, while avoiding glitches and errors. In our system users are collecting responses from their students. We want to

Re: Apache Drill: cast boolean to number

2020-02-27 Thread Зиновьев Олег
Hi Thanks so much for your answ On the other hand, I wonder if that would cause undesired interference with the SQL Boolean rules: maybe there is some place where the ability to treat a Bit as both an INT and a BOOLEAN causes ambiguities in the planner or in run-time type resolution code. On