Hi Alex,
I agree with your description of what Drill could do. Someone could certainly
add this code. My only point was to note that such code does not yet exist, and
set expectations about the work involved to add such a feature.
Would the single step map/reduce be new code? Take a look at Dri
Hi Paul (and all),
I would like to come back to your remark in the context of metadata
collection distribution: "Drill does not have a general purpose task
distribution (map/reduce) mechanism".
The question is: does it have to be a full-blown map-reduce mechanism, or
could it be something simpler?
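To make the question concrete: a minimal scatter/gather sketch in Java (not
Drill code; FileMeta and readFooterFor() are hypothetical placeholders). The
"foreman" fans one footer-read task out per file and merges the results,
which is far less machinery than a general map/reduce framework:

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.stream.Collectors;

    public class ScatterGatherSketch {
        // Hypothetical per-file metadata record.
        record FileMeta(String path, long rowCount) {}

        // Hypothetical stand-in for reading one file's footer on a worker.
        static FileMeta readFooterFor(String path) {
            return new FileMeta(path, 0L);
        }

        public static void main(String[] args) {
            List<String> files = List.of("/data/a.parquet", "/data/b.parquet");
            // The thread pool stands in for remote worker nodes.
            ExecutorService pool = Executors.newFixedThreadPool(4);

            // Scatter: submit one task per file; gather: join and merge.
            List<FileMeta> merged = files.stream()
                .map(f -> CompletableFuture.supplyAsync(() -> readFooterFor(f), pool))
                .collect(Collectors.toList())
                .stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());

            pool.shutdown();
            System.out.println("Gathered metadata for " + merged.size() + " files");
        }
    }

The point is just that "fan out, wait, merge" over an existing RPC layer may
be enough for metadata collection, without a general-purpose task framework.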
Actually, I would strongly disagree that a central metadata repository is a
good thing for distributed data. HDFS is a great example of how centralized
metadata (the NameNode) turns into a major reliability and consistency
nightmare.
It would be much, much better to keep the metadata distributed near the
data.
Hi all!
I agree with Paul and Parth: Hive Metastore with its RDBMS is the easiest
way to manage metadata and statistics better than we do today. And it can be
used not only for Parquet,
so it would be a good enhancement for Drill. Of course, Drill would have its own API
for the Metastore, so later other tools could use it as well.
Can't resist just a final couple of thoughts on this.
First, we discussed how Drill infers schema from input files. We've discussed
elsewhere how that can lead to ambiguities (reader 1 does not have any way to
know what schema reader 2 might read, and so reader 1 has to guess a type if
columns are missing from its own files).
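A toy sketch to make the ambiguity concrete (not Drill code; the file
contents and the VARCHAR default are made up for illustration): reader 1 sees
only nulls for column a, so it must commit to some type before it can know
that reader 2 will infer a numeric one.

    import java.util.List;

    public class SchemaGuessSketch {
        public static void main(String[] args) {
            // Two files scanned by two independent readers.
            List<String> file1 = List.of("{\"a\": null}", "{\"a\": null}");
            List<String> file2 = List.of("{\"a\": 123}");

            // Reader 1: every value of a is null, so any declared type is a guess.
            boolean allNull = file1.stream().allMatch(r -> r.contains("null"));
            System.out.println("reader 1 declares a as: "
                + (allNull ? "VARCHAR (an arbitrary default)" : "inferred from data"));

            // Reader 2 independently infers a numeric type from 123, so the two
            // scan fragments now disagree about the schema of the same column.
            System.out.println("reader 2 declares a as: BIGINT");
        }
    }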
Hi Paul, Parth,
Thanks a lot for your insightful, enlightening comments.
It seems that it is possible to avoid or minimise the described problem by
careful selection of the input file size and the number of files for the query.
However, the "real life" observation is that these factors are often not easy
to control.
In addition to directory and row group pruning, the physical plan
generation looks at data locality for every row group and schedules the
scan for each row group on the node where its data is local. (Remote reads
can kill performance like nothing else can.)
Short answer: query planning requires metadata.
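As a sketch of the locality idea (hypothetical code, not Drill's planner;
the host map stands in for HDFS block locations): for each row group, prefer
an executor node that also holds the data, and accept a remote read only
when no such node exists.

    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class LocalityAssignSketch {
        public static void main(String[] args) {
            // row group -> nodes holding its data (assumed block locations)
            Map<String, List<String>> hosts = Map.of(
                "file1.parquet#rg0", List.of("node1", "node3"),
                "file1.parquet#rg1", List.of("node2"));
            Set<String> executors = Set.of("node1", "node2", "node4");

            hosts.forEach((rowGroup, nodes) -> {
                // Prefer a node where the data is local; falling back to any
                // executor means a remote read, the expensive case above.
                String target = nodes.stream()
                    .filter(executors::contains)
                    .findFirst()
                    .orElse(executors.iterator().next());
                System.out.println(rowGroup + " -> " + target);
            });
        }
    }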
Hi Alex,
Perhaps Parth can jump in here as he has deeper knowledge of Parquet.
My understanding is that the planning-time metadata is used for partition
(directory) and row group pruning. By scanning each Parquet file's footer,
Drill can determine whether a given row group contains data that could match
the query's filters.
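For reference, those per-row-group statistics live in each file's footer and
can be read with the parquet-hadoop library along roughly these lines (a
minimal sketch; the path is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class FooterStatsSketch {
        public static void main(String[] args) throws Exception {
            Path file = new Path("hdfs:///data/example.parquet"); // hypothetical path
            try (ParquetFileReader reader = ParquetFileReader.open(
                    HadoopInputFile.fromPath(file, new Configuration()))) {
                ParquetMetadata footer = reader.getFooter();
                int rg = 0;
                // One BlockMetaData per row group.
                for (BlockMetaData block : footer.getBlocks()) {
                    System.out.println("row group " + rg++ + ": "
                        + block.getRowCount() + " rows");
                    for (ColumnChunkMetaData col : block.getColumns()) {
                        // Min/max per column chunk: enough to decide whether a
                        // predicate like a > 100 can skip this row group.
                        System.out.println("  " + col.getPath()
                            + " min=" + col.getStatistics().genericGetMin()
                            + " max=" + col.getStatistics().genericGetMax());
                    }
                }
            }
        }
    }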
Hi Paul and Drill developers,
I am sorry if this is slightly off-topic, but I noticed that Drill's foreman
collects metadata for all queried files in the PLANNING state (see e.g. the
MetadataGatherer class), at least in the case of Parquet when using the dfs
plugin. That costs a lot of time when the number of queried files is large.
Hi Qiaoyi,
In general, optimal performance occurs when a system knows the schema at the
start and can fully optimize based on that schema. Think C or C++ compilers
compared with Java or Python.
On the other hand, the JVM HotSpot optimizer has shown that one can achieve
very good performance via runtime (just-in-time) optimization.