That worked, Kristine. Thanks

On Sun, Feb 8, 2015 at 4:50 PM, Kristine Hahn <[email protected]> wrote:

> I sent the attachment to you in a private email. I think the problem is
> that you need to specify the workspace, as shown below. Also included below
> is a portion of the steps for querying multiple files in a directory.
>
> Connect to and Query a File
>
> When querying the same data source repeatedly, avoiding long path names is
> important. This exercise demonstrates how to simplify the query. Instead of
> using the full path to the Ngram file, you use dot notation in the FROM
> clause:
>
>     <workspace name>.`<location>`
>
> This syntax assumes you are connected to a storage plugin that defines the
> location of the data. To query the data source while you are *not* connected
> to that storage plugin, include the plugin name:
>
>     <plugin name>.<workspace name>.`<location>`
>
> This exercise shows how to query Ngram data when you are, and when you are
> not, connected to myplugin.
>
> 1. Connect to the ngram file through the custom storage plugin.
>
>        USE myplugin;
>
> 2. Get data about "Zoological Journal of the Linnean" that appears more
>    than 250 times a year in the books that Google scans. In the FROM clause,
>    instead of using the full path to the file as you did in the last
>    exercise, connect to the data using the storage plugin workspace name
>    ngram.
>
>        SELECT COLUMNS[0], COLUMNS[1], COLUMNS[2]
>        FROM ngram.`/googlebooks-eng-all-5gram-20120701-zo.tsv`
>        WHERE ((columns[0] = 'Zoological Journal of the Linnean') AND (columns[2] > 250))
>        LIMIT 10;
>
>    The output consists of 5 rows of data.
>
> 3. Switch to the dfs storage plugin.
>
>        0: jdbc:drill:zk=local> USE dfs;
>        +------------+------------+
>        |     ok     |  summary   |
>        +------------+------------+
>        | true       | Default schema changed to 'dfs' |
>        +------------+------------+
>        1 row selected (0.019 seconds)
>
> 4. Query the TSV file again. Because you switched to dfs, Drill does not
>    know the location of the file. To provide the information to Drill,
>    preface the file name with the storage plugin and workspace names in the
>    FROM clause.
>
>        SELECT COLUMNS[0], COLUMNS[1], COLUMNS[2]
>        FROM myplugin.ngram.`/googlebooks-eng-all-5gram-20120701-zo.tsv`
>        WHERE ((columns[0] = 'Zoological Journal of the Linnean') AND (columns[2] > 250))
>        LIMIT 10;
>
> Query Multiple Files in a Directory
>
> In this exercise, first you create a myfiles subdirectory in the ngram
> directory. Next, you download, unzip, and add an extension to a second
> Ngram file. You move both Ngram TSV files to the subdirectory. Finally,
> using the custom plugin workspace, you query both files. In the FROM
> clause, simply reference the subdirectory.
>
> 1. Download a second file of compressed Google Ngram data from this
>    location:
>
>    http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-2gram-20120701-ze.gz
>
> 2. Unzip googlebooks-eng-all-2gram-20120701-ze.gz and move
>    googlebooks-eng-all-2gram-20120701-ze to the ngram/myfiles subdirectory.
>
> 3. Change the name of googlebooks-eng-all-2gram-20120701-ze to add a
>    .tsv extension.
>
> 4. Move the 5gram file you worked with earlier,
>    googlebooks-eng-all-5gram-20120701-zo.tsv, from the ngram directory to
>    the ngram/myfiles subdirectory.
>
> 5. At the SQLLine prompt, use the myplugin.ngram workspace.
>
>        USE myplugin.ngram;
>
> 6. Query the myfiles directory for "Zoological Journal of the Linnean" or
>    "zero temperatures" in books published in 1998.
>
>        SELECT * FROM myfiles
>        WHERE (((COLUMNS[0] = 'Zoological Journal of the Linnean')
>            OR (COLUMNS[0] = 'zero temperatures'))
>           AND (COLUMNS[1] = '1998'));
>
>    The output lists ngrams from both files.
>
>        +------------+
>        |  columns   |
>        +------------+
>        | ["Zoological Journal of the Linnean","1998","157","53"] |
>        | ["zero temperatures","1998","628","487"] |
>        +------------+
>        2 rows selected (5.316 seconds)
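>
> (A variation that is not in the exercise above, sketched on the assumption
> that a directory can be fully qualified the same way as the single file in
> step 4 of the first exercise: if you are not connected to myplugin, you
> should be able to reference the myfiles directory by prefixing it with the
> plugin and workspace names in the FROM clause.)
>
>        -- untested sketch; myplugin, ngram, and myfiles are the names used above
>        SELECT COLUMNS[0], COLUMNS[1], COLUMNS[2]
>        FROM myplugin.ngram.`myfiles`
>        WHERE COLUMNS[1] = '1998'
>        LIMIT 10;
>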
>
> On Sun, Feb 8, 2015 at 1:15 PM, Minnow Noir <[email protected]> wrote:
>
> > Thanks for confirming that I should be able to query multiple files,
> > Kristine. Your attachment didn't make it through. Was my syntax off, or
> > do I have another problem?
> >
> > On Sun, Feb 8, 2015 at 4:00 PM, Kristine Hahn <[email protected]> wrote:
> >
> > > Minnow, a single query of multiple TSV files works for me. Here's how:
> > > "Query Multiple Files in a Directory" at the end of the attached PDF.
> > > Please overlook the code formatting and some links that didn't
> > > copy/paste well.
> > >
> > > On Sun, Feb 8, 2015 at 12:35 PM, Minnow Noir <[email protected]> wrote:
> > >
> > >> "Failure while running fragment."
> > >>
> > >> Here's an example doing it the way you suggest. (Last 100 lines of
> > >> sqlline.log are also attached.) It sounds like you think the
> > >> multi-file query should actually work?
> > >>
> > >> Thanks
> > >>
> > >> 0: jdbc:drill:zk=local> select * from dfs.`/data` limit 5;
> > >> +------------+
> > >> |  columns   |
> > >> +------------+
> > >>
> > >> <first 5 rows of data from first file>
> > >>
> > >> Query failed: Query failed: Failure while running fragment., refCnt: 0,
> > >> decrement: 1 [ 29674d0f-bec3-4175-b1df-3afb68c6e76f on
> > >> sandbox.hortonworks.com:31010 ]
> > >> [ 29674d0f-bec3-4175-b1df-3afb68c6e76f on sandbox.hortonworks.com:31010 ]
> > >>
> > >> java.lang.RuntimeException: java.sql.SQLException: Failure while
> > >> executing query.
> > >>         at sqlline.SqlLine$IncrementalRows.hasNext(SqlLine.java:2514)
> > >>         at sqlline.SqlLine$TableOutputFormat.print(SqlLine.java:2148)
> > >>         at sqlline.SqlLine.print(SqlLine.java:1809)
> > >>         at sqlline.SqlLine$Commands.execute(SqlLine.java:3766)
> > >>         at sqlline.SqlLine$Commands.sql(SqlLine.java:3663)
> > >>         at sqlline.SqlLine.dispatch(SqlLine.java:889)
> > >>         at sqlline.SqlLine.begin(SqlLine.java:763)
> > >>         at sqlline.SqlLine.start(SqlLine.java:498)
> > >>         at sqlline.SqlLine.main(SqlLine.java:460)
> > >>
> > >> On Sun, Feb 8, 2015 at 3:21 PM, Neeraja Rentachintala
> > >> <[email protected]> wrote:
> > >>
> > >>> What is the error that you are seeing?
> > >>> Can you simply point it to the directory (without *.csv) to see if
> > >>> it helps.
> > >>>
> > >>> On Sun, Feb 8, 2015 at 10:33 AM, Minnow Noir <[email protected]>
> > >>> wrote:
> > >>>
> > >>> > I'm trying to do ad-hoc exploration/analysis over multiple files
> > >>> > without having to concatenate them. New files show up on a regular
> > >>> > basis, and creating large, redundant concatenated files seems
> > >>> > inelegant for data exploration. I've tried the obvious
> > >>> > (... from dfs.`/dir/*.csv`) but that only returns lines from the
> > >>> > first file it finds, and then an error for the next file.
> > >>> >
> > >>> > Is there any current way to do this?
> > >>> >
> > >>> > Thanks
