Thanks Ted. That fits with my understanding of columnar data stores. I'm trying to get a handle on how Drill deals with Parquet. Am I correct in assuming that it will allocate a thread for each core available across all Drillbits and read that many columns in parallel? If so, with 48 cores available and a file with 48 columns, should the time to query a single column roughly equal the time to query all 48 columns? That assumes all other factors, such as data types, are the same, of course.

July 7 2015 2:14 AM, "Ted Dunning" <[email protected]> wrote:
> How many columns do you have?
>
> Do you understand about columnar data stores and how selecting only a
> single column means that much less data needs to be read? If your data
> consists, say, of integers, then Drill only needs to read 160MB to satisfy
> your query which is quite reasonable to be read in a second or less.
>
> If your records are much wider than that (say 50 columns or so), then
> reading * could easily take a minute, especially if you don't have disk
> bandwidth to read that much data in parallel.
>
> On Mon, Jul 6, 2015 at 7:11 PM, Yousef Lasi <[email protected]> wrote:
>
>> I'm hoping someone can expand my understanding of the mechanics of a query
>> against a parquet file. We're finding that selecting a single column in a
>> record from a file with > 40 million records is extremely fast - typically
>> less than a second. However, running a 'select *" query against the same
>> record using the same criteria is somewhat slow - as in greater than 60
>> seconds.
>>
>> This might be expected behavior, but hopefully a better understanding of
>> why this occurs might help us optimize the structure of our data files
>> better as we create them.
>>
>> Thanks
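
As a rough sanity check on the figures quoted above, the 160MB number follows from 40 million records times 4 bytes per integer. The sketch below just repeats that arithmetic for 1 and 50 columns. It is not a model of how Drill actually schedules reads: it assumes uncompressed 4-byte integers in every column, and the 200 MB/s disk bandwidth is a purely hypothetical value chosen for illustration, as are the function names.

```python
# Back-of-envelope estimate of data scanned by a columnar read.
# Assumptions (not from Drill itself): every column holds uncompressed
# 4-byte integers, and the disk sustains a hypothetical 200 MB/s.

ROWS = 40_000_000                     # record count mentioned in the thread
BYTES_PER_VALUE = 4                   # assumed 32-bit integer per value
ASSUMED_BANDWIDTH = 200_000_000       # hypothetical bytes per second


def scan_bytes(rows: int, columns: int, bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Bytes that must be read when only `columns` columns are selected."""
    return rows * columns * bytes_per_value


def scan_seconds(total_bytes: int, bandwidth: int = ASSUMED_BANDWIDTH) -> float:
    """Time to read `total_bytes` sequentially at the assumed bandwidth."""
    return total_bytes / bandwidth


if __name__ == "__main__":
    for columns in (1, 50):
        size = scan_bytes(ROWS, columns)
        print(f"{columns:>2} column(s): {size / 1e6:,.0f} MB, "
              f"about {scan_seconds(size):.1f} s at the assumed bandwidth")
```

Under those assumptions a single column is 160 MB and 50 columns are 8,000 MB, so the gap between a sub-second query and one that takes most of a minute is about what the data volume alone would predict when disk bandwidth is the limit.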
