OK, that is the simplest way to get going; see how that solution works for you.
The difference between the local FS and working with a cluster's DFS can be a
little confusing at first.

When dealing with large data volumes, I have found it much easier to use NFS on
the MapR cluster to move data directly to the DFS, rather than landing it on the
local FS first and then copying it to the DFS. That skips a step, and it is also
a lot more robust and faster to get the data directly onto the DFS.
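
For illustration, that direct copy over the MapR NFS mount can look roughly like
the following (the /mapr/my.cluster.com mount point and the stage path are just
placeholders for a typical setup, not your cluster):

~~~
# Assumes the MapR NFS gateway is running and the cluster file system is
# mounted at the usual /mapr/<cluster-name> location; paths are illustrative.
cp /localdata/*.csv /mapr/my.cluster.com/user/mapr/stage/
~~~

The files land straight on MapR-FS and are visible to every Drillbit, so you can
run CTAS against them right away.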



On May 27, 2015, at 8:37 AM, Matt <[email protected]> wrote:

>> Drill can process a lot of data quickly, and for best performance and 
>> consistency you will likely find that the sooner you get the data to the DFS 
>> the better.
> 
> Already most of the way there. The initial confusion came from the features
> for querying the local / native filesystem, and how those do not fit a
> distributed Drill cluster well. In other words, it's really an embedded /
> single-node Drill feature.
> 
> Currently using the approach of doing a put from the local filesystem into
> HDFS, then CTAS into Parquet, if only for simplicity in testing (not
> performance).
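> 
> For reference, the CTAS step looks roughly like this (a sketch only: it assumes
> the dfs plugin points at HDFS, the CSV has already been put into the root
> workspace, and dfs.tmp is writable; names are placeholders for my test setup):
> 
> ~~~
> -- Parquet is the default CTAS output format, but setting it explicitly
> -- makes the intent clear.
> ALTER SESSION SET `store.format` = 'parquet';
> 
> CREATE TABLE dfs.tmp.`customer_reviews_parquet` AS
> SELECT columns[0] AS customer_id,
>        columns[1] AS review_date
> FROM dfs.root.`customer_reviews_1998.csv`;
> ~~~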
> 
> Thanks!
> 
>> On 27 May 2015, at 11:29, Andries Engelbrecht wrote:
>> 
>> You will be better off using the Drill cluster as a whole rather than trying
>> to juggle local vs. DFS storage.
>> 
>> A couple of ideas:
>> As previously mentioned, you can use the robust NFS on MapR to easily place
>> the CSV files on the DFS, and then use Drill CTAS to convert the files to
>> Parquet on the DFS.
>> 
>> You can also set up a remote NFS server and mount the same NFS export at the
>> same mount point on the local FS of each node. That way the files will be
>> consistently visible to all Drillbits in the cluster, and you can do CTAS to
>> create Parquet files on the DFS. This will likely be a lot slower than the
>> first option, however, as the NFS server's bandwidth will become a bottleneck
>> if you have a number of Drillbits in the cluster.
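>> 
>> As a rough sketch, on every node that would be something like the following
>> (the server name, export path, and mount point are made up for illustration):
>> 
>> ~~~
>> # Run on each Drill node so every Drillbit sees identical file paths.
>> sudo mkdir -p /data/stage
>> sudo mount -t nfs nfsserver:/export/stage /data/stage
>> ~~~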
>> 
>> Or just copy the files to one node in the cluster, use hadoop fs -put to move
>> them into the DFS, and then run CTAS from the DFS source to Parquet on the DFS.
>> 
>> You can even place the data on S3 and then query and CTAS from there; however,
>> security and bandwidth may be a concern for large data volumes, depending on
>> the use case.
>> 
>> I really think you will find the first option the most robust and fastest in 
>> the long run. You can point Drill at any FS source as long as it is 
>> consistent to all nodes in the cluster, but keep in mind that Drill can 
>> process a lot of data quickly, and for best performance and consistency you 
>> will likely find that the sooner you get the data to the DFS the better.
>> 
>> 
>> 
>> 
>>> On May 26, 2015, at 5:58 PM, Matt <[email protected]> wrote:
>>> 
>>> Thanks, I was incorrectly conflating the local file system with the cluster's
>>> data storage.
>>> 
>>> Looking to experiment with the Parquet format, and was looking at CTAS 
>>> queries as an import approach.
>>> 
>>> Are direct queries over local files meant for an embedded Drill, whereas on a
>>> cluster files should be moved into HDFS first?
>>> 
>>> That would make sense, as files present on only one node would limit queries
>>> to that node's local filesystem.
>>> 
>>>> On May 26, 2015, at 8:28 PM, Andries Engelbrecht 
>>>> <[email protected]> wrote:
>>>> 
>>>> You can use the HDFS shell to copy files from the local file system to HDFS:
>>>> 
>>>> hadoop fs -put
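>>>> 
>>>> For example (the /localdata and /stage paths here are just placeholders):
>>>> 
>>>> ~~~
>>>> # Create a staging directory in HDFS and copy the local CSV files into it.
>>>> hadoop fs -mkdir -p /stage
>>>> hadoop fs -put /localdata/*.csv /stage/
>>>> ~~~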
>>>> 
>>>> 
>>>> For a more robust mechanism from remote systems you can look at using NFS.
>>>> MapR has a really robust NFS integration, and you can use it with the
>>>> Community Edition.
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On May 26, 2015, at 5:11 PM, Matt <[email protected]> wrote:
>>>>> 
>>>>> 
>>>>> That might be the end goal, but currently I don't have an HDFS ingest 
>>>>> mechanism.
>>>>> 
>>>>> We are not currently a Hadoop shop - can you suggest simple approaches 
>>>>> for bulk loading data from delimited files into HDFS?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On May 26, 2015, at 8:04 PM, Andries Engelbrecht 
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> Perhaps I’m missing something here.
>>>>>> 
>>>>>> Why not create a DFS plugin for HDFS and put the file in HDFS?
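>>>>>> 
>>>>>> For illustration, a dfs-type storage plugin pointing at HDFS instead of the
>>>>>> local FS might look roughly like this (the namenode host/port and the /stage
>>>>>> path are placeholders, and the rest of the plugin definition is omitted):
>>>>>> 
>>>>>> ~~~
>>>>>> {
>>>>>>   "type": "file",
>>>>>>   "enabled": true,
>>>>>>   "connection": "hdfs://namenode:8020",
>>>>>>   "workspaces": {
>>>>>>     "stage": {
>>>>>>       "location": "/stage",
>>>>>>       "writable": true,
>>>>>>       "defaultInputFormat": null
>>>>>>     },
>>>>>>     . . .
>>>>>> ~~~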
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On May 26, 2015, at 4:54 PM, Matt <[email protected]> wrote:
>>>>>>> 
>>>>>>> New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes; it appears text
>>>>>>> files need to be on all nodes in the cluster?
>>>>>>> 
>>>>>>> Using the dfs config below, I am only able to query a CSV file if it is on
>>>>>>> all 4 nodes. If the file is only on the local node and not the others, I
>>>>>>> get errors of the form:
>>>>>>> 
>>>>>>> ~~~
>>>>>>> 0: jdbc:drill:zk=es05:2181> select * from 
>>>>>>> root.`customer_reviews_1998.csv`;
>>>>>>> Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 
>>>>>>> 'root.customer_reviews_1998.csv' not found
>>>>>>> ~~~
>>>>>>> 
>>>>>>> ~~~
>>>>>>> {
>>>>>>>   "type": "file",
>>>>>>>   "enabled": true,
>>>>>>>   "connection": "file:///",
>>>>>>>   "workspaces": {
>>>>>>>     "root": {
>>>>>>>       "location": "/localdata/hadoop/stage",
>>>>>>>       "writable": false,
>>>>>>>       "defaultInputFormat": null
>>>>>>>     },
>>>>>>> ~~~
>>>>>>> 
>>>>>>>> On 25 May 2015, at 20:39, Kristine Hahn wrote:
>>>>>>>> 
>>>>>>>> The storage plugin "location" needs to be the full path to the 
>>>>>>>> localdata
>>>>>>>> directory. This partial storage plugin definition works for the user 
>>>>>>>> named
>>>>>>>> mapr:
>>>>>>>> 
>>>>>>>> {
>>>>>>>>   "type": "file",
>>>>>>>>   "enabled": true,
>>>>>>>>   "connection": "file:///",
>>>>>>>>   "workspaces": {
>>>>>>>>     "root": {
>>>>>>>>       "location": "/home/mapr/localdata",
>>>>>>>>       "writable": false,
>>>>>>>>       "defaultInputFormat": null
>>>>>>>>     },
>>>>>>>> . . .
>>>>>>>> 
>>>>>>>> Here's a working query for the data in localdata:
>>>>>>>> 
>>>>>>>> 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
>>>>>>>> . . . . . . . > COLUMNS[1] AS Publication_Date,
>>>>>>>> . . . . . . . > COLUMNS[2] AS Frequency
>>>>>>>> . . . . . . . > FROM dfs.root.`mydata.csv`
>>>>>>>> . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the 
>>>>>>>> Linnean')
>>>>>>>> . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
>>>>>>>> 
>>>>>>>> A complete example, not yet published on the Drill site, shows the steps
>>>>>>>> involved in detail:
>>>>>>>> http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Kristine Hahn
>>>>>>>> Sr. Technical Writer
>>>>>>>> 415-497-8107 @krishahn
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Sun, May 24, 2015 at 1:56 PM, Matt <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> I have used a single-node install (unzip and run) to query local text /
>>>>>>>>> CSV files, but on a 3-node cluster (installed via MapR CE), a query
>>>>>>>>> against local files results in:
>>>>>>>>> 
>>>>>>>>> ~~~
>>>>>>>>> sqlline version 1.1.6
>>>>>>>>> 0: jdbc:drill:> select * from dfs.`testdata.csv`;
>>>>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 
>>>>>>>>> 17:
>>>>>>>>> Table 'dfs./localdata/testdata.csv' not found
>>>>>>>>> 
>>>>>>>>> 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
>>>>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 
>>>>>>>>> 17:
>>>>>>>>> Table 'dfs./localdata/testdata.csv' not found
>>>>>>>>> ~~~
>>>>>>>>> 
>>>>>>>>> Is there a special config for local file querying? An initial doc 
>>>>>>>>> search
>>>>>>>>> did not point me to a solution, but I may simply not have found the
>>>>>>>>> relevant sections.
>>>>>>>>> 
>>>>>>>>> I have tried modifying the default dfs config to no avail:
>>>>>>>>> 
>>>>>>>>> ~~~
>>>>>>>>> "type": "file",
>>>>>>>>> "enabled": true,
>>>>>>>>> "connection": "file:///",
>>>>>>>>> "workspaces": {
>>>>>>>>> "root": {
>>>>>>>>> "location": "/localdata",
>>>>>>>>> "writable": false,
>>>>>>>>> "defaultInputFormat": null
>>>>>>>>> }
>>>>>>>>> ~~~
>>>> 
