I sent the attachment to you in a private email. I think the problem is that you need to specify the workspace, as shown below. Also included below is a portion of the steps for querying multiple files in a directory.
Connect to and Query a File When querying the same data source repeatedly, avoiding long path names is important. This exercise demonstrates how to simplify the query. Instead of using the full path to the Ngram file, you use dot notation in the FROM clause. <workspace name>.`<location>` This syntax assumes you connected to a storage plugin that defines the location of the data. To query the data source while you *not* connected to that storage plugin, include the plugin name: <plugin name>.<workspace name>.`<location>` This exercise shows how to query Ngram data when you are, and when you are not, connected to myplugin. 1. Connect to the ngram file through the custom storage plugin. USE myplugin; 2. Get data about "Zoological Journal of the Linnean" that appears more than 250 times a year in the books that Google scans. In the FROM clause, instead of using the full path to the file as you did in the last exercise, connect to the data using the storage plugin workspace name ngram. SELECT COLUMNS[0], COLUMNS[1], COLUMNS[2] FROM ngram.`/googlebooks-eng-all-5gram-20120701-zo.tsv` WHERE ((columns[0] = 'Zoological Journal of the Linnean') AND (columns[2] > 250)) LIMIT 10; The output consists of 5 rows of data. 3. Switch to the dfs storage plugin. 0: jdbc:drill:zk=local> USE dfs; +------------+------------+ | ok | summary | +------------+------------+ | true | Default schema changed to 'dfs' | +------------+------------+ 1 row selected (0.019 seconds) 4. Query the TSV file again. Because you switched to dfs, Drill does not know the location of the file. To provide the information to Drill, preface the file name with the storage plugin and workspace names in the FROM clause. SELECT COLUMNS[0], COLUMNS[1], COLUMNS[2] FROM myplugin.ngram.`/googlebooks-eng-all-5gram-20120701-zo.tsv` WHERE ((columns[0] = 'Zoological Journal of the Linnean') AND (columns[2] > 250)) LIMIT 10; Query Multiple Files in a Directory In this exercise, first you create a subdirectory in the ngram directory. Next, you download, unzip, and add an extension to a second Ngram file. You move both Ngram TSV files to the subdirectory. Finally, using the custom plugin workspace, you query both files. In the FROM clause, simply reference the subdirectory. 1. Download a second file of compressed Google Ngram data from this location: http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-2gram-20120701-ze.gz 2. Unzip googlebooks-eng-all-2gram-20120701-ze.gz and move googlebooks-eng-all-2gram-20120701-ze to the ngram/myfiles subdirectory. 3. Change the name of googlebooks-eng-all-2gram-20120701-ze to add a .tsv extension. 4. Move the 5gram file you worked with earlier googlebooks-eng-all-5gram-20120701-zo.tsv from the ngram directory to the ngram/myfiles subdirectory. 5. At the SQLLine prompt, use the myplugin.ngrams workspace. USE myplugin.ngram; 6. Query the myfiles directory for the "Zoological Journal of the Linnean" or "zero temperatures" in books published in 1998. SELECT * FROM myfiles WHERE (((COLUMNS[0] = 'Zoological Journal of the Linnean') OR (COLUMNS[0] = 'zero temperatures')) AND (COLUMNS[1] = '1998')); The output lists ngrams from both files. +------------+ | columns | +------------+ | ["Zoological Journal of the Linnean","1998","157","53"] | | ["zero temperatures","1998","628","487"] | +------------+ 2 rows selected (5.316 seconds) On Sun, Feb 8, 2015 at 1:15 PM, Minnow Noir <[email protected]> wrote: > Thanks for confirming that I should be able to query multiple files, > Kristine. Your attachment didn't make it through. Was my syntax off, or do > I have another problem? > > On Sun, Feb 8, 2015 at 4:00 PM, Kristine Hahn <[email protected]> wrote: > > > Minnow, a single-query of multiple TSV files works for me. Here's how: > > "Query Multiple Files in a Directory" at the end of the attached PDF. > > Please overlook the code formatting and some links that didn't copy/paste > > well. > > > > On Sun, Feb 8, 2015 at 12:35 PM, Minnow Noir <[email protected]> > wrote: > > > >> "Failure while running fragment." > >> > >> Here's an example doing it the way you suggest. (Last 100 lines of > >> sqlline.log are also attached.) It sounds like you think the multi-file > >> query should actually work? > >> > >> Thanks > >> > >> > >> > >> > >> 0: jdbc:drill:zk=local> select * from dfs.`/data` limit 5; > >> +------------+ > >> | columns | > >> +------------+ > >> > >> <first 5 rows of data from first file> > >> > >> Query failed: Query failed: Failure while running fragment., refCnt: 0, > >> decrement: 1 [ 29674d0f-bec3-4175-b1df-3afb68c6e76f on > >> sandbox.hortonworks.com:31010 ] > >> [ 29674d0f-bec3-4175-b1df-3afb68c6e76f on sandbox.hortonworks.com:31010 > ] > >> > >> > >> java.lang.RuntimeException: java.sql.SQLException: Failure while > >> executing query. > >> at sqlline.SqlLine$IncrementalRows.hasNext(SqlLine.java:2514) > >> at sqlline.SqlLine$TableOutputFormat.print(SqlLine.java:2148) > >> at sqlline.SqlLine.print(SqlLine.java:1809) > >> at sqlline.SqlLine$Commands.execute(SqlLine.java:3766) > >> at sqlline.SqlLine$Commands.sql(SqlLine.java:3663) > >> at sqlline.SqlLine.dispatch(SqlLine.java:889) > >> at sqlline.SqlLine.begin(SqlLine.java:763) > >> at sqlline.SqlLine.start(SqlLine.java:498) > >> at sqlline.SqlLine.main(SqlLine.java:460) > >> > >> On Sun, Feb 8, 2015 at 3:21 PM, Neeraja Rentachintala < > >> [email protected]> wrote: > >> > >>> What is the error that you are seeing? > >>> Can you simply point it to the directory (without *.csv) to see if it > >>> helps. > >>> > >>> > >>> On Sun, Feb 8, 2015 at 10:33 AM, Minnow Noir <[email protected]> > >>> wrote: > >>> > >>> > I'm trying to do ad-hoc exploration/analysis over multiple files > >>> without > >>> > having to concatenate them. New files show up on a regular basis, > and > >>> > creating large, redundant concatenated files seems inelegant for data > >>> > exploration. I've tried the obvious (... from dfs.`/dir/*.csv` but > >>> that > >>> > only returns lines from the first file it finds, and then an error > for > >>> the > >>> > next file. > >>> > > >>> > Is there any current way to do this? > >>> > > >>> > > >>> > Thanks > >>> > > >>> > >> > >> > > >
