Re: Re: Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement

2018-08-21 Thread Paul Rogers
Hi Alex, I agree with your description of what Drill could do. Someone could certainly add this code. My only point was to note that such code does not yet exist, and to set expectations about the work involved to add such a feature. Would the single-step map/reduce be new code? Take a look at Dri…

Re: Re: Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement

2018-08-21 Thread Oleksandr Kalinin
Hi Paul (and all), I would like to come back to your remark in the context of metadata collection distribution: "Drill does not have a general purpose task distribution (map/reduce) mechanism". The question is: does it have to be a full-blown map-reduce mechanism, or could it be something simpler…
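The "something simpler" Oleksandr hints at could be a single scatter/gather round rather than a general map/reduce framework. A minimal sketch of that idea, assuming a hypothetical `read_footer` function standing in for a per-file metadata read (the names and returned fields are illustrative, not Drill's actual API):

```python
from concurrent.futures import ThreadPoolExecutor

def read_footer(file_path):
    # Stand-in for reading one file's footer metadata; in a real system this
    # would run on (or near) the node that stores the file.
    return {"file": file_path, "row_count": 1000}

def gather_metadata(files, max_workers=4):
    # "Scatter": footer reads fan out in parallel.
    # "Gather": the foreman merges the partial results into one
    # planning-time metadata list -- a single map + single reduce step.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_footer, files))

metadata = gather_metadata(["a.parquet", "b.parquet"])
```

The point of the sketch is that one fan-out/fan-in round needs far less machinery than a general task-distribution framework, though it still needs the "run this near the data" piece that Drill lacks today.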

Re: Re: Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement

2018-08-11 Thread Ted Dunning
Actually, I would strongly disagree that a central metadata repository is a good thing for distributed data. HDFS is a great example of how centralized metadata turns into a major reliability and consistency nightmare. It would be much, much better to keep the metadata distributed near the data.

Re: Re: Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement

2018-08-11 Thread Vitalii Diravka
Hi all! I agree with Paul and Parth: Hive Metastore with its RDBMS is the easiest way to manage metadata and statistics better than Drill does now. And it can be used not only for Parquet, so it will be a good enhancement for Drill. Of course Drill will have its own API for the Metastore, so later other tools…

Re: Re: Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement

2018-08-10 Thread Paul Rogers
Can't resist just a final couple of thoughts on this. First, we discussed how Drill infers schema from input files. We've discussed elsewhere how that can lead to ambiguities (reader 1 does not have any way to know what schema reader 2 might read, and so reader 1 has to guess a type if columns a…
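The ambiguity Paul describes can be shown with a toy example. The data below and the `"guessed-INT"` fallback are made up for illustration (Drill's actual behavior is more involved), but they capture the core problem: two readers scanning different files of the same table cannot see each other's data, so each guesses independently.

```python
# Reader 1's file has no "price" column; reader 2's file does.
file1 = [{"id": 1}]                 # reader 1 must guess a type for "price"
file2 = [{"id": 2, "price": 9.99}]  # reader 2 observes a float

def infer_type(rows, column):
    # Each reader infers a column type only from the rows it can see.
    for row in rows:
        if column in row:
            return type(row[column]).__name__
    return "guessed-INT"  # illustrative fallback when no value is observed

t1 = infer_type(file1, "price")  # reader 1 guesses
t2 = infer_type(file2, "price")  # reader 2 observes "float"
# t1 != t2: downstream operators receive conflicting schemas for one column.
```

With an up-front schema (e.g. from a metastore), both readers would agree before scanning a single row, which is exactly why the metadata discussion keeps coming back.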

Re: Re: Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement

2018-08-10 Thread Oleksandr Kalinin
Hi Paul, Parth, Thanks a lot for your insightful, enlightening comments. It seems possible to avoid or minimise the described problem by carefully selecting the input file size and the number of files per query. However, the "real life" observation is that these factors are often not easy to cont…

Re: Re: Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement

2018-08-09 Thread Parth Chandra
In addition to directory and row group pruning, physical plan generation looks at data locality for every row group and schedules the scan for the row group on the node where the data is local. (Remote reads can kill performance like nothing else can.) Short answer: query planning requires metadat…

Re: Re: Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement

2018-08-09 Thread Paul Rogers
Hi Alex, Perhaps Parth can jump in here as he has deeper knowledge of Parquet. My understanding is that the planning-time metadata is used for partition (directory) and row group pruning. By scanning the footer of each Parquet row group, Drill can determine whether that group contains data that…
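Row-group pruning from footer statistics can be illustrated with per-column min/max ranges: a row group whose range cannot contain the filter value is skipped without being read. The footer dicts below are invented for the sketch; in reality Drill reads these statistics from the Parquet file footers:

```python
# Invented planning-time metadata: one entry per row group, with the
# (min, max) range of the "ts" column taken from the Parquet footer.
footers = [
    {"path": "f1.parquet", "stats": {"ts": (0, 100)}},
    {"path": "f2.parquet", "stats": {"ts": (101, 200)}},
]

def prune(footers, column, value):
    # Keep only row groups whose [min, max] range could contain `value`
    # for an equality filter; all others are skipped before any data read.
    keep = []
    for f in footers:
        lo, hi = f["stats"][column]
        if lo <= value <= hi:
            keep.append(f["path"])
    return keep

survivors = prune(footers, "ts", 150)  # only f2.parquet can match ts = 150
```

This is why the footer scan pays off for selective filters: the cost of reading metadata is traded against skipping entire row groups at execution time.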

Re: Re: Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement

2018-08-09 Thread Oleksandr Kalinin
Hi Paul and Drill developers, Apologies if this is slightly off-topic, but I noticed that Drill's foreman collects metadata of all queried files in the PLANNING state (ref. class e.g. MetadataGatherer), at least in the case of Parquet when using the dfs plugin. That costs a lot of time when the number of queried fi…

Re: Re: Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement

2018-08-07 Thread Paul Rogers
Hi Qiaoyi, In general, optimal performance occurs when a system knows the schema at the start and can fully optimize based on that schema. Think C or C++ compilers compared with Java or Python. On the other hand, the JVM HotSpot optimizer has shown that one can achieve very good performance vi…
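Paul's compiler analogy, and the whole-stage-codegen idea in the thread's subject, both come down to the same trade: once the schema and expression are fixed, the engine can generate one specialized function instead of re-interpreting an expression tree for every row. A loose sketch of that contrast (the tuple-based expression tree and `compile_expr` are invented for illustration, not Drill's or Spark's actual codegen):

```python
expr = ("gt", "a", 10)  # expression tree for: a > 10

def interpret(expr, row):
    # Interpreted path: walk the tree and dispatch on the operator
    # for every single row -- flexible, but pays the dispatch cost each time.
    op, col, const = expr
    if op == "gt":
        return row[col] > const
    raise ValueError(op)

def compile_expr(expr):
    # "Codegen" path: specialize the expression into one function up front,
    # so per-row evaluation is a direct comparison with no tree walk.
    op, col, const = expr
    assert op == "gt"
    src = f"lambda row: row[{col!r}] > {const}"
    return eval(src)  # illustrative; real engines emit bytecode/Java source

rows = [{"a": 5}, {"a": 42}]
compiled = compile_expr(expr)
results = [compiled(r) for r in rows]  # same answers, cheaper per row
```

The HotSpot point is the interesting middle ground: even without schema up front, a runtime can observe what actually flows through and specialize late, which is roughly the bet schema-on-read engines make.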