First, I don't think this is a default setting, so you will need to
explicitly add the extractHeader setting under every text format plugin
("csv", "tsv", ...) inside every dfs storage plugin (if you have more than
one). Then turn on the new text reader system/session option before you
query.
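
As a rough sketch of that last step (the option name is the one quoted
further down in this thread; whether you want ALTER SYSTEM or ALTER SESSION
depends on your setup, so treat this as an assumption rather than the exact
commands for your cluster):

~~~
-- Enable the new text reader for all sessions
ALTER SYSTEM SET `exec.storage.enable_new_text_reader` = true;

-- Or enable it only for the current session
ALTER SESSION SET `exec.storage.enable_new_text_reader` = true;

-- Check the current value of the option
SELECT * FROM sys.options
WHERE name = 'exec.storage.enable_new_text_reader';
~~~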

Second, if you are running in distributed mode, you only need to do this
once (for example, via the Drill Web UI on node 1). If you are running
Drill in embedded or single-node mode on every node (which I don't think
you intend to), you would need to set it on each node separately. I'm not
sure why you observed different behavior. Can you maybe try it again to
confirm?
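
To check which case you are in, something like the following should list
every Drillbit registered in the cluster (I'm assuming the sys.drillbits
system table is available in your build). If all four nodes show up when
you run it from any one node, they are most likely running as a single
distributed cluster rather than as separate embedded instances:

~~~
-- List the Drillbits that the current node knows about
SELECT * FROM sys.drillbits;
~~~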

On Mon, Apr 18, 2016 at 9:17 AM, Matt <[email protected]> wrote:

> I found that the dfs storage section for csv file types did not all have
> the extractHeader setting in place. Manually putting it in all four of my
> nodes may have resolved the issue.
>
> In my vanilla Hadoop 2.7.0 setup on the same servers, I don't recall
> having to set it on all nodes.
>
> Did I perhaps miss something in the MapR cluster setup?
>
>
>
> On 15 Apr 2016, at 14:16, Abhishek Girish wrote:
>
> Hello,
>>
>> This is my format setting:
>>
>>     "csv": {
>>       "type": "text",
>>       "extensions": [
>>         "csv"
>>       ],
>>       "extractHeader": true,
>>       "delimiter": ","
>>     }
>>
>> I was able to extract the header and get expected results:
>>
>> select * from mfs.tmp.`abcd.csv`;
>>
>> +----+----+----+----+
>> | A  | B  | C  | D  |
>> +----+----+----+----+
>> | 1  | 2  | 3  | 4  |
>> | 2  | 3  | 4  | 5  |
>> | 3  | 4  | 5  | 6  |
>> +----+----+----+----+
>> 3 rows selected (0.196 seconds)
>>
>> select A from mfs.tmp.`abcd.csv`;
>>
>> +----+
>> | A  |
>> +----+
>> | 1  |
>> | 2  |
>> | 3  |
>> +----+
>> 3 rows selected (0.16 seconds)
>>
>> I am using a MapR cluster with Drill 1.6.0. I had also enabled the new
>> text reader.
>>
>> Note: My initial query failed to extract the header, similar to what you
>> reported. I had to set the "skipFirstLine" option to true for it to work.
>> Strangely, subsequent queries work even after removing or disabling the
>> "skipFirstLine" option. This could be a bug, but I'm not able to
>> reproduce it right now. I will file a JIRA once I have more clarity.
>>
>>
>>
>> Regards,
>> Abhishek
>>
>> On Fri, Apr 15, 2016 at 10:53 AM, Matt <[email protected]> wrote:
>>
>>> With files in the local filesystem, and an embedded Drillbit from the
>>> download on drill.apache.org, I can successfully query csv data by column
>>> name with the extractHeader option on, as in SELECT customer_id FROM
>>> `file`;
>>>
>>> But in a MapR cluster (v. 5.1.0.37549.GA) with the data in MapR-FS, the
>>> extractHeader option does not seem to be taking effect. A plain "SELECT *"
>>> returns rows with the header as a data row, not in the columns list.
>>>
>>> I have verified that exec.storage.enable_new_text_reader is true, and in
>>> both cases csv storage is defined as:
>>>
>>> ~~~
>>>     "csv": {
>>>       "type": "text",
>>>       "extensions": [
>>>         "csv"
>>>       ],
>>>       "extractHeader": true,
>>>       "delimiter": ","
>>>     }
>>> ~~~
>>>
>>> Of course, with the csv reader not extracting the columns, an attempt to
>>> reference columns by name results in:
>>>
>>> Error: DATA_READ ERROR: Selected column 'customer_id' must have name
>>> 'columns' or must be plain '*'.
>>>
>>> In trying to diagnose the issue, I noted that at times the file header
>>> row was not part of the SELECT * results, but was also not being used to
>>> detect column names.
>>>
>>> Both cases are Drill v1.6.0, but the MapR installed version has a
>>> different commit than the standalone copy I am using:
>>>
>>> MapR:
>>>
>>> ~~~
>>> +----------+-------------------------------------------+----------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------------------+
>>> | version  |                 commit_id                 |                                              commit_message                                              |        commit_time         | build_email  |         build_time         |
>>> +----------+-------------------------------------------+----------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------------------+
>>> | 1.6.0    | 2d532bd206d7ae9f3cb703ee7f51ae3764374d43  | MD-850: Treat the type of decimal literals as DOUBLE only when planner.enable_decimal_data_type is true  | 31.03.2016 @ 04:47:25 UTC  | Unknown      | 31.03.2016 @ 04:40:54 UTC  |
>>> +----------+-------------------------------------------+----------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------------------+
>>> ~~~
>>>
>>> Local:
>>>
>>> ~~~
>>> +----------+-------------------------------------------+-----------------------------------------------------+----------------------------+--------------------+----------------------------+
>>> | version  |                 commit_id                 |                   commit_message                    |        commit_time         |    build_email     |         build_time         |
>>> +----------+-------------------------------------------+-----------------------------------------------------+----------------------------+--------------------+----------------------------+
>>> | 1.6.0    | d51f7fc14bd71d3e711ece0d02cdaa4d4c385eeb  | [maven-release-plugin] prepare release drill-1.6.0  | 10.03.2016 @ 16:34:37 PST  | [email protected]  | 10.03.2016 @ 17:45:29 PST  |
>>> +----------+-------------------------------------------+-----------------------------------------------------+----------------------------+--------------------+----------------------------+
>>> ~~~
>>>
>>
