Re: Querying multiple TSV or CSV files at once?

Kristine Hahn Sun, 08 Feb 2015 13:53:23 -0800

I sent the attachment to you in a private email. I think the problem is
that you need to specify the workspace, as shown below. Also included below
is a portion of the steps for querying multiple files in a directory.

Connect to and Query a File

When querying the same data source repeatedly, avoiding long path names is
important. This exercise demonstrates how to simplify the query. Instead of
using the full path to the Ngram file, you use dot notation in the FROM
clause.

<workspace name>.`<location>`

This syntax assumes you connected to a storage plugin that defines the
location of the data. To query the data source while you *not* connected to
that storage plugin, include the plugin name:

<plugin name>.<workspace name>.`<location>`

This exercise shows how to query Ngram data when you are, and when you are
not, connected to myplugin.

   1. Connect to the ngram file through the custom storage plugin.
   USE myplugin;

2.     Get data about "Zoological Journal of the Linnean" that appears more
than 250 times a year in the books that Google scans. In the FROM clause,
instead of using the full path to the file as you did in the last exercise,
connect to the data using the storage plugin workspace name ngram.

 SELECT COLUMNS[0],          COLUMNS[1],          COLUMNS[2]   FROM
ngram.`/googlebooks-eng-all-5gram-20120701-zo.tsv`   WHERE
((columns[0] = 'Zoological Journal of the Linnean')    AND (columns[2]
> 250))   LIMIT 10;

The output consists of 5 rows of data.

3.     Switch to the dfs storage plugin.

 0: jdbc:drill:zk=local> USE dfs;  +------------+------------+  |
ok     |  summary   |  +------------+------------+  | true       |
Default schema changed to 'dfs' |  +------------+------------+  1 row
selected (0.019 seconds)

4.     Query the TSV file again. Because you switched to dfs, Drill does
not know the location of the file. To provide the information to Drill,
preface the file name with the storage plugin and workspace names in the
FROM clause.

 SELECT COLUMNS[0],          COLUMNS[1],          COLUMNS[2]   FROM
myplugin.ngram.`/googlebooks-eng-all-5gram-20120701-zo.tsv`   WHERE
((columns[0] = 'Zoological Journal of the Linnean')     AND
(columns[2] > 250))   LIMIT 10;

Query Multiple Files in a Directory

In this exercise, first you create a subdirectory in the ngram directory.
Next, you download, unzip, and add an extension to a second Ngram file. You
move both Ngram TSV files to the subdirectory. Finally, using the custom
plugin workspace, you query both files. In the FROM clause, simply
reference the subdirectory.

1.     Download a second file of compressed Google Ngram data from this
location:

http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-2gram-20120701-ze.gz

2.     Unzip googlebooks-eng-all-2gram-20120701-ze.gz and move
googlebooks-eng-all-2gram-20120701-ze to the ngram/myfiles subdirectory.

3.     Change the name of googlebooks-eng-all-2gram-20120701-ze to add a
.tsv extension.

4.     Move the 5gram file you worked with earlier
googlebooks-eng-all-5gram-20120701-zo.tsv from the ngram directory to the
ngram/myfiles subdirectory.

5.     At the SQLLine prompt, use the myplugin.ngrams workspace.

  USE myplugin.ngram;

6.     Query the myfiles directory for the "Zoological Journal of the
Linnean" or "zero temperatures" in books published in 1998.

  SELECT *    FROM myfiles    WHERE (((COLUMNS[0] = 'Zoological
Journal of the Linnean')     OR (COLUMNS[0] = 'zero temperatures'))
  AND (COLUMNS[1] = '1998'));

The output lists ngrams from both files.

  +------------+   |  columns   |   +------------+   | ["Zoological
Journal of the Linnean","1998","157","53"] |   | ["zero
temperatures","1998","628","487"] |   +------------+   2 rows selected
(5.316 seconds)

On Sun, Feb 8, 2015 at 1:15 PM, Minnow Noir <[email protected]> wrote:

> Thanks for confirming that I should be able to query multiple files,
> Kristine. Your attachment didn't make it through.  Was my syntax off, or do
> I have another problem?
>
> On Sun, Feb 8, 2015 at 4:00 PM, Kristine Hahn <[email protected]> wrote:
>
> > Minnow, a single-query of multiple TSV files works for me. Here's how:
> > "Query Multiple Files in a Directory" at the end of the attached PDF.
> > Please overlook the code formatting and some links that didn't copy/paste
> > well.
> >
> > On Sun, Feb 8, 2015 at 12:35 PM, Minnow Noir <[email protected]>
> wrote:
> >
> >> "Failure while running fragment."
> >>
> >> Here's an example doing it the way you suggest.  (Last 100 lines of
> >> sqlline.log are also attached.)  It sounds like you think the multi-file
> >> query should actually work?
> >>
> >> Thanks
> >>
> >>
> >>
> >>
> >> 0: jdbc:drill:zk=local> select * from dfs.`/data` limit 5;
> >> +------------+
> >> |  columns   |
> >> +------------+
> >>
> >>  <first 5 rows of data from first file>
> >>
> >> Query failed: Query failed: Failure while running fragment., refCnt: 0,
> >> decrement: 1 [ 29674d0f-bec3-4175-b1df-3afb68c6e76f on
> >> sandbox.hortonworks.com:31010 ]
> >> [ 29674d0f-bec3-4175-b1df-3afb68c6e76f on sandbox.hortonworks.com:31010
> ]
> >>
> >>
> >> java.lang.RuntimeException: java.sql.SQLException: Failure while
> >> executing query.
> >>         at sqlline.SqlLine$IncrementalRows.hasNext(SqlLine.java:2514)
> >>         at sqlline.SqlLine$TableOutputFormat.print(SqlLine.java:2148)
> >>         at sqlline.SqlLine.print(SqlLine.java:1809)
> >>         at sqlline.SqlLine$Commands.execute(SqlLine.java:3766)
> >>         at sqlline.SqlLine$Commands.sql(SqlLine.java:3663)
> >>         at sqlline.SqlLine.dispatch(SqlLine.java:889)
> >>         at sqlline.SqlLine.begin(SqlLine.java:763)
> >>         at sqlline.SqlLine.start(SqlLine.java:498)
> >>         at sqlline.SqlLine.main(SqlLine.java:460)
> >>
> >> On Sun, Feb 8, 2015 at 3:21 PM, Neeraja Rentachintala <
> >> [email protected]> wrote:
> >>
> >>> What is the error that you are seeing?
> >>> Can you simply point it to the directory (without *.csv) to see if it
> >>> helps.
> >>>
> >>>
> >>> On Sun, Feb 8, 2015 at 10:33 AM, Minnow Noir <[email protected]>
> >>> wrote:
> >>>
> >>> > I'm trying to do ad-hoc exploration/analysis over multiple files
> >>> without
> >>> > having to concatenate them.  New files show up on a regular basis,
> and
> >>> > creating large, redundant concatenated files seems inelegant for data
> >>> > exploration.  I've tried the obvious (... from dfs.`/dir/*.csv` but
> >>> that
> >>> > only returns lines from the first file it finds, and then an error
> for
> >>> the
> >>> > next file.
> >>> >
> >>> > Is there any current way to do this?
> >>> >
> >>> >
> >>> > Thanks
> >>> >
> >>>
> >>
> >>
> >
>

Re: Querying multiple TSV or CSV files at once?

Reply via email to