I considered using the filename in the query, but it appears that filenames
are exposed per row, so a distinct query of filenames would be a performance
issue.
Path switching is probably the best approach.
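If the per-row filename column is too expensive, one option might be to work out which CSV files are still unconverted from directory listings instead of from SQL. A minimal sketch, assuming both folders are reachable as ordinary paths (for example over an NFS mount) and that each CSV produces one Parquet output with the same base name; the function name and layout are hypothetical:

```python
from pathlib import Path


def unconverted_csvs(csv_dir, parquet_dir):
    """Return the CSV files in csv_dir that have no same-named output in parquet_dir.

    Assumption: inputfile.csv is converted to either inputfile.parquet or a
    table directory named inputfile, so comparing base names is enough.
    """
    converted = {p.stem for p in Path(parquet_dir).iterdir()}
    return sorted(
        p for p in Path(csv_dir).glob("*.csv") if p.stem not in converted
    )
```

This only inspects names, so it says nothing about whether a conversion finished cleanly.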
On 24 Oct 2016, at 17:28, Jim Scott wrote:
I would think that the best way to accommodate this would be:
When landing the CSV files, place them into folder A, then convert them to
Parquet format and put them in folder B...
This will give you isolation between the file formats, and you can then
choose to only query the Parquet files. This is the simplest way to
guarantee that you don't double count.
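As a rough illustration of the folder A / folder B idea, the CTAS statement for each landed file could be generated like this. The workspace names `dfs.landing` and `dfs.parquet` are assumptions (they would have to exist in the storage plugin configuration, and the target must be writable), and the sketch relies on Drill's `store.format` option being set to `parquet`:

```python
# Session option that makes CTAS write Parquet output.
SET_FORMAT = "ALTER SESSION SET `store.format` = 'parquet'"


def ctas_statement(csv_name, landing="dfs.landing", parquet="dfs.parquet"):
    """Build a CTAS statement that converts one landed CSV into a Parquet table.

    landing and parquet are hypothetical workspace names standing in for
    folder A and folder B.
    """
    if "`" in csv_name:
        raise ValueError("backticks are not allowed in file names")
    table = csv_name.rsplit(".", 1)[0]  # inputfile.csv -> inputfile
    return (
        f"CREATE TABLE {parquet}.`{table}` AS "
        f"SELECT * FROM {landing}.`{csv_name}`"
    )
```

Queries that should see only converted data would then select from the `dfs.parquet` workspace and never touch the landing folder.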
Alternatively you could probably try to get fancy by leveraging the newer
feature which exposes the filename as part of the query results:
https://issues.apache.org/jira/browse/DRILL-3474
I can't really begin to think of the explicit SQL syntax you would need in
order to get inputfile.csv and not inputfile.parquet, etc... I would think
it would be rather complicated.
On Mon, Oct 24, 2016 at 3:49 PM, MattK <[email protected]> wrote:
I have a cluster that receives log files in CSV format on a per-minute
basis, and those files are immediately available to Drill users. For
performance I create Parquet files from them in batch using CTAS commands.
I would like to script a process that makes the Parquet files available on
creation, perhaps through a UNION view, but that does not serve duplicate
data through both an original CSV and its converted Parquet file at the
same time.
Is there a common practice for making data available once converted,
something similar to a transactional batch of "convert then (re)move source
csv files"?
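A "convert then (re)move" batch could be sketched roughly as follows. The `convert` callable is a placeholder for whatever runs the CTAS for one file, and the directory arguments are hypothetical; the only firm assumption is that the landing and archive folders sit on the same filesystem, so that the rename is a single step:

```python
import os
from pathlib import Path


def convert_then_archive(landing_dir, archive_dir, convert):
    """Convert each landed CSV, then move it out of the queried folder.

    convert is a caller-supplied function taking the CSV path; it should
    raise on failure so the source file stays where it is.
    """
    landing, archive = Path(landing_dir), Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for csv_path in sorted(landing.glob("*.csv")):
        convert(csv_path)  # an exception here aborts before the move
        os.replace(csv_path, archive / csv_path.name)  # rename out of the way
        moved.append(csv_path.name)
    return moved
```

This is not truly transactional: between the conversion finishing and the rename, a query over both locations could still see the same rows twice, so it narrows the window rather than closing it.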