One correction: the new text reader is turned on (set to true) by default.
I was confused by the doc, which asks the user to set the option (though it
does mention that the value is true by default).
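
The rest of the earlier note should still hold: extractHeader has to be
present under each text format of each dfs storage plugin. As a rough sketch
of that check (not an official Drill tool), the Python below takes a storage
plugin config as a dict and sets "extractHeader": true on every format whose
type is "text". The function name enable_extract_header and the sample config
are made up for illustration, and it assumes the format blocks sit under a
"formats" key, as in the plugin JSON shown in the Web UI.

```python
import copy
import json


def enable_extract_header(plugin_config):
    """Return (patched_copy, changed_names) for a storage plugin config dict.

    Assumption: format blocks such as "csv" live under a top-level "formats"
    key, and text formats are the ones with "type": "text".
    """
    patched = copy.deepcopy(plugin_config)  # leave the caller's dict untouched
    changed = []
    for name, fmt in patched.get("formats", {}).items():
        if fmt.get("type") == "text" and fmt.get("extractHeader") is not True:
            fmt["extractHeader"] = True
            changed.append(name)
    return patched, changed


# Hypothetical plugin config, trimmed down to the parts that matter here.
example = {
    "type": "file",
    "formats": {
        "csv": {"type": "text", "extensions": ["csv"], "delimiter": ","},
        "tsv": {"type": "text", "extensions": ["tsv"], "delimiter": "\t"},
        "parquet": {"type": "parquet"},
    },
}

patched, changed = enable_extract_header(example)
print(json.dumps(patched["formats"]["csv"], indent=2))
print("updated formats:", changed)
```

The sketch only edits the dict; the patched JSON would still have to be saved
back into each dfs storage plugin, and only formats whose files really carry a
header row should get the setting.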

On Mon, Apr 18, 2016 at 11:06 AM, Abhishek Girish <[email protected]> wrote:

> Firstly, I don't think this is a default setting, so you will need to
> explicitly add it under every text format plugin ("csv", "tsv", ...), and
> inside every dfs storage plugin (if you have more than one). Then turn on
> the new text reader system/session option before you query.
>
> Secondly, if you are running in distributed mode, you only need to do this
> once (for example, via the Drill Web UI accessed via node 1). If you are
> running embedded or single-node setups on every node (which I don't think
> you intend to), you might need to set it on all nodes. I'm not sure why you
> observe differently. Can you maybe try it again to confirm?
>
> On Mon, Apr 18, 2016 at 9:17 AM, Matt <[email protected]> wrote:
>
>> I found that the dfs storage section for csv file types did not all have
>> the extractHeader setting in place. Manually putting it in all four of my
>> nodes may have resolved the issue.
>>
>> In my vanilla Hadoop 2.7.0 setup on the same servers, I don't recall
>> having to set it on all nodes.
>>
>> Did I perhaps miss something in the MapR cluster setup?
>>
>>
>>
>> On 15 Apr 2016, at 14:16, Abhishek Girish wrote:
>>
>>> Hello,
>>>
>>> This is my format setting:
>>>
>>>     "csv": {
>>>       "type": "text",
>>>       "extensions": [
>>>         "csv"
>>>       ],
>>>       "extractHeader": true,
>>>       "delimiter": ","
>>>     }
>>>
>>> I was able to extract the header and get expected results:
>>>
>>>
>>> select * from mfs.tmp.`abcd.csv`;
>>>>
>>> +----+----+----+----+
>>> | A  | B  | C  | D  |
>>> +----+----+----+----+
>>> | 1  | 2  | 3  | 4  |
>>> | 2  | 3  | 4  | 5  |
>>> | 3  | 4  | 5  | 6  |
>>> +----+----+----+----+
>>> 3 rows selected (0.196 seconds)
>>>
>>> select A from mfs.tmp.`abcd.csv`;
>>>>
>>> +----+
>>> | A  |
>>> +----+
>>> | 1  |
>>> | 2  |
>>> | 3  |
>>> +----+
>>> 3 rows selected (0.16 seconds)
>>>
>>> I am using a MapR cluster with Drill 1.6.0. I had also enabled the new
>>> text reader.
>>>
>>> Note: My initial query failed to extract the header, similar to what you
>>> reported. I had to set the "skipFirstLine" option to true for it to work.
>>> Strangely, for subsequent queries, it works even after removing or
>>> disabling the "skipFirstLine" option. This could be a bug, but I'm not
>>> able to reproduce it right now. Will file a JIRA once I have more clarity.
>>>
>>>
>>>
>>> Regards,
>>> Abhishek
>>>
>>> On Fri, Apr 15, 2016 at 10:53 AM, Matt <[email protected]> wrote:
>>>
>>> With files in the local filesystem, and an embedded drillbit from the
>>>> download on drill.apache.org, I can successfully query csv data by column
>>>> name with the extractHeader option on, as in SELECT customer_id FROM
>>>> `file`;
>>>>
>>>> But in a MapR cluster (v. 5.1.0.37549.GA) with the data in MapR-FS, the
>>>> extractHeader option does not seem to be taking effect. A plain "SELECT *"
>>>> returns rows with the header as a data row, not in the columns list.
>>>>
>>>> I have verified that exec.storage.enable_new_text_reader is true, and in
>>>> both cases csv storage is defined as:
>>>>
>>>> ~~~
>>>>     "csv": {
>>>>       "type": "text",
>>>>       "extensions": [
>>>>         "csv"
>>>>       ],
>>>>       "extractHeader": true,
>>>>       "delimiter": ","
>>>>     }
>>>> ~~~
>>>>
>>>> Of course with the csv reader not extracting the columns, an attempt to
>>>> reference columns by name results in:
>>>>
>>>> Error: DATA_READ ERROR: Selected column 'customer_id' must have name
>>>> 'columns' or must be plain '*'.
>>>>
>>>> In trying to diagnose the issue, I noted that at times the file header row
>>>> was not part of the SELECT * results, but was also not being used to
>>>> detect column names.
>>>>
>>>> Both cases are Drill v1.6.0, but the MapR installed version has a
>>>> different commit than the standalone copy I am using:
>>>>
>>>> MapR:
>>>>
>>>> ~~~
>>>> +----------+-------------------------------------------+----------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------------------+
>>>> | version  |                 commit_id                 |                                              commit_message                                              |        commit_time         | build_email  |         build_time         |
>>>> +----------+-------------------------------------------+----------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------------------+
>>>> | 1.6.0    | 2d532bd206d7ae9f3cb703ee7f51ae3764374d43  | MD-850: Treat the type of decimal literals as DOUBLE only when planner.enable_decimal_data_type is true  | 31.03.2016 @ 04:47:25 UTC  | Unknown      | 31.03.2016 @ 04:40:54 UTC  |
>>>> +----------+-------------------------------------------+----------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------------------+
>>>> ~~~
>>>>
>>>> Local:
>>>>
>>>> ~~~
>>>> +----------+-------------------------------------------+-----------------------------------------------------+----------------------------+--------------------+----------------------------+
>>>> | version  |                 commit_id                 |                   commit_message                    |        commit_time         |    build_email     |         build_time         |
>>>> +----------+-------------------------------------------+-----------------------------------------------------+----------------------------+--------------------+----------------------------+
>>>> | 1.6.0    | d51f7fc14bd71d3e711ece0d02cdaa4d4c385eeb  | [maven-release-plugin] prepare release drill-1.6.0  | 10.03.2016 @ 16:34:37 PST  | [email protected]  | 10.03.2016 @ 17:45:29 PST  |
>>>> +----------+-------------------------------------------+-----------------------------------------------------+----------------------------+--------------------+----------------------------+
>>>> ~~~
>>>>
>>>
>
