One correction: the new text reader is turned on (set to true) by default. I was confused by the doc, which asks the user to set the option, although it does mention that the value is true by default.
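For anyone who wants to confirm this on their own cluster, the value can be checked from sqlline or the Web UI. This is only a sketch: the option name is the one quoted further down in this thread, and it assumes the `sys.options` system table and `ALTER SESSION` behave as on a stock Drill 1.6.0 install.

```sql
-- Show the current value of the new text reader option.
SELECT * FROM sys.options
WHERE name = 'exec.storage.enable_new_text_reader';

-- Assumption: only needed if the query above reports false.
ALTER SESSION SET `exec.storage.enable_new_text_reader` = true;
```

Using ALTER SYSTEM instead of ALTER SESSION would apply the setting beyond the current session.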
On Mon, Apr 18, 2016 at 11:06 AM, Abhishek Girish <[email protected]> wrote:

> Firstly, I don't think this is a default setting, so you will need to
> explicitly add this under every text format plugin ("csv", "tsv", ...),
> and inside every dfs storage plugin (if you have more than one). Then
> turn on the new text reader system/session option before you query.
>
> Secondly, if you are running in distributed mode, you only need to do
> this once (for example, via the Drill Web UI accessed via node 1). If
> you are running in embedded or single-node setups on every node (which
> I don't think you intend to), you might need to set it on all nodes.
> I'm not sure why you observe differently - can you maybe try it another
> time to confirm?
>
> On Mon, Apr 18, 2016 at 9:17 AM, Matt <[email protected]> wrote:
>
>> I found that the dfs storage section for csv file types did not all
>> have the extractHeader setting in place. Manually putting it in all
>> four of my nodes may have resolved the issue.
>>
>> In my vanilla Hadoop 2.7.0 setup on the same servers, I don't recall
>> having to set it on all nodes.
>>
>> Did I perhaps miss something in the MapR cluster setup?
>>
>> On 15 Apr 2016, at 14:16, Abhishek Girish wrote:
>>
>>> Hello,
>>>
>>> This is my format setting:
>>>
>>> "csv": {
>>>   "type": "text",
>>>   "extensions": [
>>>     "csv"
>>>   ],
>>>   "extractHeader": true,
>>>   "delimiter": ","
>>> }
>>>
>>> I was able to extract the header and get expected results:
>>>
>>> select * from mfs.tmp.`abcd.csv`;
>>> +----+----+----+----+
>>> | A  | B  | C  | D  |
>>> +----+----+----+----+
>>> | 1  | 2  | 3  | 4  |
>>> | 2  | 3  | 4  | 5  |
>>> | 3  | 4  | 5  | 6  |
>>> +----+----+----+----+
>>> 3 rows selected (0.196 seconds)
>>>
>>> select A from mfs.tmp.`abcd.csv`;
>>> +----+
>>> | A  |
>>> +----+
>>> | 1  |
>>> | 2  |
>>> | 3  |
>>> +----+
>>> 3 rows selected (0.16 seconds)
>>>
>>> I am using a MapR cluster with Drill 1.6.0. I had also enabled the
>>> new text reader.
>>>
>>> Note: My initial query failed to extract the header, similar to what
>>> you reported. I had to set the "skipFirstLine" option to true for it
>>> to work. Strangely, for subsequent queries, it works even after
>>> removing / disabling the "skipFirstLine" option. This could be a bug,
>>> but I'm not able to reproduce it right now. Will file a JIRA once I
>>> have more clarity.
>>>
>>> Regards,
>>> Abhishek
>>>
>>> On Fri, Apr 15, 2016 at 10:53 AM, Matt <[email protected]> wrote:
>>>
>>>> With files in the local filesystem, and an embedded drillbit from
>>>> the download on drill.apache.org, I can successfully query csv data
>>>> by column name with the extractHeader option on, as in
>>>> SELECT customer_id FROM `file`;
>>>>
>>>> But in a MapR cluster (v. 5.1.0.37549.GA) with the data in MapR-FS,
>>>> the extractHeader option does not seem to be taking effect. A plain
>>>> "SELECT *" returns rows with the header as a data row, not in the
>>>> columns list.
>>>>
>>>> I have verified that exec.storage.enable_new_text_reader is true,
>>>> and in both cases csv storage is defined as:
>>>>
>>>> ~~~
>>>> "csv": {
>>>>   "type": "text",
>>>>   "extensions": [
>>>>     "csv"
>>>>   ],
>>>>   "extractHeader": true,
>>>>   "delimiter": ","
>>>> }
>>>> ~~~
>>>>
>>>> Of course, with the csv reader not extracting the columns, an
>>>> attempt to reference columns by name results in:
>>>>
>>>> Error: DATA_READ ERROR: Selected column 'customer_id' must have name
>>>> 'columns' or must be plain '*'.
>>>>
>>>> In trying to diagnose the issue, I noted that at times the file
>>>> header row was not part of the SELECT * results, but was also not
>>>> being used to detect column names.
>>>>
>>>> Both cases are Drill v1.6.0, but the MapR installed version has a
>>>> different commit than the standalone copy I am using:
>>>>
>>>> MapR:
>>>>
>>>> ~~~
>>>> version:        1.6.0
>>>> commit_id:      2d532bd206d7ae9f3cb703ee7f51ae3764374d43
>>>> commit_message: MD-850: Treat the type of decimal literals as DOUBLE
>>>>                 only when planner.enable_decimal_data_type is true
>>>> commit_time:    31.03.2016 @ 04:47:25 UTC
>>>> build_email:    Unknown
>>>> build_time:     31.03.2016 @ 04:40:54 UTC
>>>> ~~~
>>>>
>>>> Local:
>>>>
>>>> ~~~
>>>> version:        1.6.0
>>>> commit_id:      d51f7fc14bd71d3e711ece0d02cdaa4d4c385eeb
>>>> commit_message: [maven-release-plugin] prepare release drill-1.6.0
>>>> commit_time:    10.03.2016 @ 16:34:37 PST
>>>> build_email:    [email protected]
>>>> build_time:     10.03.2016 @ 17:45:29 PST
>>>> ~~~
