Re: Querying multiple TSV or CSV files at once?

Minnow Noir Sun, 08 Feb 2015 14:58:10 -0800

Thanks, Kristine.  Those are some helpful pointers.

On Sun, Feb 8, 2015 at 4:50 PM, Kristine Hahn <[email protected]> wrote:


> I sent the attachment to you in a private email. I think the problem is
> that you need to specify the workspace, as shown below. Also included below
> is a portion of the steps for querying multiple files in a directory.
>
> Connect to and Query a File
>
> When querying the same data source repeatedly, avoiding long path names is
> important. This exercise demonstrates how to simplify the query. Instead of
> using the full path to the Ngram file, you use dot notation in the FROM
> clause.
>
> <workspace name>.`<location>`
>
> This syntax assumes you connected to a storage plugin that defines the
> location of the data. To query the data source while you *not* connected to
> that storage plugin, include the plugin name:
>
> <plugin name>.<workspace name>.`<location>`
>
> This exercise shows how to query Ngram data when you are, and when you are
> not, connected to myplugin.
>
>    1. Connect to the ngram file through the custom storage plugin.
>    USE myplugin;
>
> 2.     Get data about "Zoological Journal of the Linnean" that appears more
> than 250 times a year in the books that Google scans. In the FROM clause,
> instead of using the full path to the file as you did in the last exercise,
> connect to the data using the storage plugin workspace name ngram.
>
>  SELECT COLUMNS[0],          COLUMNS[1],          COLUMNS[2]   FROM
> ngram.`/googlebooks-eng-all-5gram-20120701-zo.tsv`   WHERE
> ((columns[0] = 'Zoological Journal of the Linnean')    AND (columns[2]
> > 250))   LIMIT 10;
>
> The output consists of 5 rows of data.
>
> 3.     Switch to the dfs storage plugin.
>
>  0: jdbc:drill:zk=local> USE dfs;  +------------+------------+  |
> ok     |  summary   |  +------------+------------+  | true       |
> Default schema changed to 'dfs' |  +------------+------------+  1 row
> selected (0.019 seconds)
>
> 4.     Query the TSV file again. Because you switched to dfs, Drill does
> not know the location of the file. To provide the information to Drill,
> preface the file name with the storage plugin and workspace names in the
> FROM clause.
>
>  SELECT COLUMNS[0],          COLUMNS[1],          COLUMNS[2]   FROM
> myplugin.ngram.`/googlebooks-eng-all-5gram-20120701-zo.tsv`   WHERE
> ((columns[0] = 'Zoological Journal of the Linnean')     AND
> (columns[2] > 250))   LIMIT 10;
>
> Query Multiple Files in a Directory
>
> In this exercise, first you create a subdirectory in the ngram directory.
> Next, you download, unzip, and add an extension to a second Ngram file. You
> move both Ngram TSV files to the subdirectory. Finally, using the custom
> plugin workspace, you query both files. In the FROM clause, simply
> reference the subdirectory.
>
> 1.     Download a second file of compressed Google Ngram data from this
> location:
>
>
> http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-2gram-20120701-ze.gz
>
> 2.     Unzip googlebooks-eng-all-2gram-20120701-ze.gz and move
> googlebooks-eng-all-2gram-20120701-ze to the ngram/myfiles subdirectory.
>
> 3.     Change the name of googlebooks-eng-all-2gram-20120701-ze to add a
> .tsv extension.
>
> 4.     Move the 5gram file you worked with earlier
> googlebooks-eng-all-5gram-20120701-zo.tsv from the ngram directory to the
> ngram/myfiles subdirectory.
>
> 5.     At the SQLLine prompt, use the myplugin.ngrams workspace.
>
>   USE myplugin.ngram;
>
> 6.     Query the myfiles directory for the "Zoological Journal of the
> Linnean" or "zero temperatures" in books published in 1998.
>
>   SELECT *    FROM myfiles    WHERE (((COLUMNS[0] = 'Zoological
> Journal of the Linnean')     OR (COLUMNS[0] = 'zero temperatures'))
>   AND (COLUMNS[1] = '1998'));
>
> The output lists ngrams from both files.
>
>   +------------+   |  columns   |   +------------+   | ["Zoological
> Journal of the Linnean","1998","157","53"] |   | ["zero
> temperatures","1998","628","487"] |   +------------+   2 rows selected
> (5.316 seconds)
>
>
>
>
>
> On Sun, Feb 8, 2015 at 1:15 PM, Minnow Noir <[email protected]> wrote:
>
> > Thanks for confirming that I should be able to query multiple files,
> > Kristine. Your attachment didn't make it through.  Was my syntax off, or
> do
> > I have another problem?
> >
> > On Sun, Feb 8, 2015 at 4:00 PM, Kristine Hahn <[email protected]>
> wrote:
> >
> > > Minnow, a single-query of multiple TSV files works for me. Here's how:
> > > "Query Multiple Files in a Directory" at the end of the attached PDF.
> > > Please overlook the code formatting and some links that didn't
> copy/paste
> > > well.
> > >
> > > On Sun, Feb 8, 2015 at 12:35 PM, Minnow Noir <[email protected]>
> > wrote:
> > >
> > >> "Failure while running fragment."
> > >>
> > >> Here's an example doing it the way you suggest.  (Last 100 lines of
> > >> sqlline.log are also attached.)  It sounds like you think the
> multi-file
> > >> query should actually work?
> > >>
> > >> Thanks
> > >>
> > >>
> > >>
> > >>
> > >> 0: jdbc:drill:zk=local> select * from dfs.`/data` limit 5;
> > >> +------------+
> > >> |  columns   |
> > >> +------------+
> > >>
> > >>  <first 5 rows of data from first file>
> > >>
> > >> Query failed: Query failed: Failure while running fragment., refCnt:
> 0,
> > >> decrement: 1 [ 29674d0f-bec3-4175-b1df-3afb68c6e76f on
> > >> sandbox.hortonworks.com:31010 ]
> > >> [ 29674d0f-bec3-4175-b1df-3afb68c6e76f on
> sandbox.hortonworks.com:31010
> > ]
> > >>
> > >>
> > >> java.lang.RuntimeException: java.sql.SQLException: Failure while
> > >> executing query.
> > >>         at sqlline.SqlLine$IncrementalRows.hasNext(SqlLine.java:2514)
> > >>         at sqlline.SqlLine$TableOutputFormat.print(SqlLine.java:2148)
> > >>         at sqlline.SqlLine.print(SqlLine.java:1809)
> > >>         at sqlline.SqlLine$Commands.execute(SqlLine.java:3766)
> > >>         at sqlline.SqlLine$Commands.sql(SqlLine.java:3663)
> > >>         at sqlline.SqlLine.dispatch(SqlLine.java:889)
> > >>         at sqlline.SqlLine.begin(SqlLine.java:763)
> > >>         at sqlline.SqlLine.start(SqlLine.java:498)
> > >>         at sqlline.SqlLine.main(SqlLine.java:460)
> > >>
> > >> On Sun, Feb 8, 2015 at 3:21 PM, Neeraja Rentachintala <
> > >> [email protected]> wrote:
> > >>
> > >>> What is the error that you are seeing?
> > >>> Can you simply point it to the directory (without *.csv) to see if it
> > >>> helps.
> > >>>
> > >>>
> > >>> On Sun, Feb 8, 2015 at 10:33 AM, Minnow Noir <[email protected]>
> > >>> wrote:
> > >>>
> > >>> > I'm trying to do ad-hoc exploration/analysis over multiple files
> > >>> without
> > >>> > having to concatenate them.  New files show up on a regular basis,
> > and
> > >>> > creating large, redundant concatenated files seems inelegant for
> data
> > >>> > exploration.  I've tried the obvious (... from dfs.`/dir/*.csv` but
> > >>> that
> > >>> > only returns lines from the first file it finds, and then an error
> > for
> > >>> the
> > >>> > next file.
> > >>> >
> > >>> > Is there any current way to do this?
> > >>> >
> > >>> >
> > >>> > Thanks
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>

Re: Querying multiple TSV or CSV files at once?

Reply via email to