[jira] [Commented] (DRILL-8457) Allow configuring csv parser in http storage plugin configuration

2023-10-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781429#comment-17781429
 ] 

ASF GitHub Bot commented on DRILL-8457:
---

cgivre merged PR #2840:
URL: https://github.com/apache/drill/pull/2840




> Allow configuring csv parser in http storage plugin configuration
> -
>
> Key: DRILL-8457
> URL: https://issues.apache.org/jira/browse/DRILL-8457
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - HTTP
>Affects Versions: Future
>Reporter: Zbigniew Tomanek
>Priority: Minor
> Fix For: Future
>
>
> Currently there is no way to configure csv parser when http plugin is used. 
> Because of that some kind of files cannot be parsed (e.g. when any column has 
> more than 4096 chars or file has a delimiter different from `,`).
> Since in DataWalk we utilize http plugin quite often we've changed our 
> internal fork of Drill so following parser/format properties can be 
> configured using additional `csvOptions` field:
>  
> {code:json}
> {
>   "csvOptions": {
>     "delimiter": "\t",
>     "quote": "\"",
>     "quote_escape": "\"",
>     "line_separator": "\n",
>     "header_extraction_enabled": null,
>     "number_of_rows_to_skip": 0,
>     "number_of_records_to_read": -1,
>     "line_separator_detection_enabled": true,
>     "max_columns": 512,
>     "max_chars_per_column": 4096,
>     "skip_empty_lines": true,
>     "ignore_leading_whitespaces": true,
>     "ignore_trailing_whitespaces": true,
>     "null_value": null
>   }
> }{code}
> I'd be glad to get feedback whether creating PR with these changes would 
> bring any value to the Drill



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8457) Allow configuring csv parser in http storage plugin configuration

2023-10-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781046#comment-17781046
 ] 

ASF GitHub Bot commented on DRILL-8457:
---

ztomanek-dw commented on PR #2840:
URL: https://github.com/apache/drill/pull/2840#issuecomment-1785556883

   @cgivre 
   Thanks for the clarification, I was not sure if I could push multiple 
commits per one jira issue. 
   I've applied your suggestions and made sure it's rebased to current master




> Allow configuring csv parser in http storage plugin configuration
> -
>
> Key: DRILL-8457
> URL: https://issues.apache.org/jira/browse/DRILL-8457
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - HTTP
>Affects Versions: Future
>Reporter: Zbigniew Tomanek
>Priority: Minor
> Fix For: Future
>
>
> Currently there is no way to configure csv parser when http plugin is used. 
> Because of that some kind of files cannot be parsed (e.g. when any column has 
> more than 4096 chars or file has a delimiter different from `,`).
> Since in DataWalk we utilize http plugin quite often we've changed our 
> internal fork of Drill so following parser/format properties can be 
> configured using additional `csvOptions` field:
>  
> {code:json}
> {
>   "csvOptions": {
>     "delimiter": "\t",
>     "quote": "\"",
>     "quote_escape": "\"",
>     "line_separator": "\n",
>     "header_extraction_enabled": null,
>     "number_of_rows_to_skip": 0,
>     "number_of_records_to_read": -1,
>     "line_separator_detection_enabled": true,
>     "max_columns": 512,
>     "max_chars_per_column": 4096,
>     "skip_empty_lines": true,
>     "ignore_leading_whitespaces": true,
>     "ignore_trailing_whitespaces": true,
>     "null_value": null
>   }
> }{code}
> I'd be glad to get feedback whether creating PR with these changes would 
> bring any value to the Drill



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8457) Allow configuring csv parser in http storage plugin configuration

2023-10-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781042#comment-17781042
 ] 

ASF GitHub Bot commented on DRILL-8457:
---

cgivre commented on code in PR #2840:
URL: https://github.com/apache/drill/pull/2840#discussion_r1376443490


##
contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpCSVOptions.java:
##
@@ -0,0 +1,287 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.http;
+
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
+import com.fasterxml.jackson.databind.annotation.JsonPOJOBuilder;
+
+import java.util.Objects;
+
+@JsonInclude(JsonInclude.Include.NON_DEFAULT)
+@JsonDeserialize(builder = HttpCSVOptions.HttpCSVOptionsBuilder.class)
+public class HttpCSVOptions {
+
+
+  @JsonProperty
+  private final String delimiter;
+
+  @JsonProperty
+  private final char quote;
+
+  @JsonProperty
+  private final char quoteEscape;
+
+  @JsonProperty
+  private final String lineSeparator;
+
+  @JsonProperty
+  private final Boolean headerExtractionEnabled;
+
+  @JsonProperty
+  private final long numberOfRowsToSkip;
+
+  @JsonProperty
+  private final long numberOfRecordsToRead;
+
+  @JsonProperty
+  private final boolean lineSeparatorDetectionEnabled;
+
+  @JsonProperty
+  private final int maxColumns;
+
+  @JsonProperty
+  private final int maxCharsPerColumn;
+
+  @JsonProperty
+  private final boolean skipEmptyLines;
+
+  @JsonProperty
+  private final boolean ignoreLeadingWhitespaces;
+
+  @JsonProperty
+  private final boolean ignoreTrailingWhitespaces;
+
+  @JsonProperty
+  private final String nullValue;
+
+  HttpCSVOptions(HttpCSVOptionsBuilder builder) {
+this.delimiter = builder.delimiter;
+this.quote = builder.quote;
+this.quoteEscape = builder.quoteEscape;
+this.lineSeparator = builder.lineSeparator;
+this.headerExtractionEnabled = builder.headerExtractionEnabled;
+this.numberOfRowsToSkip = builder.numberOfRowsToSkip;
+this.numberOfRecordsToRead = builder.numberOfRecordsToRead;
+this.lineSeparatorDetectionEnabled = builder.lineSeparatorDetectionEnabled;
+this.maxColumns = builder.maxColumns;
+this.maxCharsPerColumn = builder.maxCharsPerColumn;
+this.skipEmptyLines = builder.skipEmptyLines;
+this.ignoreLeadingWhitespaces = builder.ignoreLeadingWhitespaces;
+this.ignoreTrailingWhitespaces = builder.ignoreTrailingWhitespaces;
+this.nullValue = builder.nullValue;
+  }
+
+  public static HttpCSVOptionsBuilder builder() {
+return new HttpCSVOptionsBuilder();
+  }
+
+  public String getDelimiter() {
+return delimiter;
+  }
+
+  public char getQuote() {
+return quote;
+  }
+
+  public char getQuoteEscape() {
+return quoteEscape;
+  }
+
+  public String getLineSeparator() {
+return lineSeparator;
+  }
+
+  public Boolean getHeaderExtractionEnabled() {
+return headerExtractionEnabled;
+  }
+
+  public long getNumberOfRowsToSkip() {
+return numberOfRowsToSkip;
+  }
+
+  public long getNumberOfRecordsToRead() {
+return numberOfRecordsToRead;
+  }
+
+  public boolean isLineSeparatorDetectionEnabled() {
+return lineSeparatorDetectionEnabled;
+  }
+
+  public int getMaxColumns() {
+return maxColumns;
+  }
+
+  public int getMaxCharsPerColumn() {
+return maxCharsPerColumn;
+  }
+
+  public boolean isSkipEmptyLines() {
+return skipEmptyLines;
+  }
+
+  public boolean isIgnoreLeadingWhitespaces() {
+return ignoreLeadingWhitespaces;
+  }
+
+  public boolean isIgnoreTrailingWhitespaces() {
+return ignoreTrailingWhitespaces;
+  }
+
+  public String getNullValue() {
+return nullValue;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+if (this == o) {
+  return true;
+}
+if (o == null || getClass() != o.getClass()) {
+  return false;
+}
+HttpCSVOptions that = (HttpCSVOptions) o;
+return quote == that.quote && quoteEscape == that.quoteEscape && 
numberOfRowsToSkip == that.numberOfRowsToSkip && 

[jira] [Commented] (DRILL-8457) Allow configuring csv parser in http storage plugin configuration

2023-10-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781003#comment-17781003
 ] 

ASF GitHub Bot commented on DRILL-8457:
---

ztomanek-dw commented on PR #2840:
URL: https://github.com/apache/drill/pull/2840#issuecomment-1785165480

   @cgivre 
   Thanks for your feedback!
   
   According to your comments:
- written unit tests for `HttpCSVOptions`
- written unit tests for `HttpApiConfig`, by the way fixing small bug on 
`HttpMethod` validation
- added tsv parsing test to `TestHttpPlugin` 
- documented `csvOptions` configuration in `CSV_Options.md`
   
   Let me know if you see anything else to cover :)




> Allow configuring csv parser in http storage plugin configuration
> -
>
> Key: DRILL-8457
> URL: https://issues.apache.org/jira/browse/DRILL-8457
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - HTTP
>Affects Versions: Future
>Reporter: Zbigniew Tomanek
>Priority: Minor
> Fix For: Future
>
>
> Currently there is no way to configure csv parser when http plugin is used. 
> Because of that some kind of files cannot be parsed (e.g. when any column has 
> more than 4096 chars or file has a delimiter different from `,`).
> Since in DataWalk we utilize http plugin quite often we've changed our 
> internal fork of Drill so following parser/format properties can be 
> configured using additional `csvOptions` field:
>  
> {code:json}
> {
>   "csvOptions": {
>     "delimiter": "\t",
>     "quote": "\"",
>     "quote_escape": "\"",
>     "line_separator": "\n",
>     "header_extraction_enabled": null,
>     "number_of_rows_to_skip": 0,
>     "number_of_records_to_read": -1,
>     "line_separator_detection_enabled": true,
>     "max_columns": 512,
>     "max_chars_per_column": 4096,
>     "skip_empty_lines": true,
>     "ignore_leading_whitespaces": true,
>     "ignore_trailing_whitespaces": true,
>     "null_value": null
>   }
> }{code}
> I'd be glad to get feedback whether creating PR with these changes would 
> bring any value to the Drill



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8457) Allow configuring csv parser in http storage plugin configuration

2023-10-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779443#comment-17779443
 ] 

ASF GitHub Bot commented on DRILL-8457:
---

ztomanek-dw opened a new pull request, #2840:
URL: https://github.com/apache/drill/pull/2840

   # [DRILL-8457](https://issues.apache.org/jira/browse/DRILL-8457): Allow 
configuring csv parser in http storage plugin configuration
   
   ## Description
   
   HttpApiConfiguration was extended with `csvOptions` field which allows 
setting a following properties:
   
   ```json
   {
 "csvOptions": {
   "delimiter": ",",
   "quote": "\"",
   "quoteEscape": "\"",
   "lineSeparator": "\n",
   "headerExtractionEnabled": null,
   "numberOfRowsToSkip": 0,
   "numberOfRecordsToRead": -1,
   "lineSeparatorDetectionEnabled": true,
   "maxColumns": 512,
   "maxCharsPerColumn": 4096,
   "skipEmptyLines": true,
   "ignoreLeadingWhitespaces": true,
   "ignoreTrailingWhitespaces": true,
   "nullValue": null
 }
   }
   ```
   
   this provides greater csv parsing flexibility since user can set different 
delimiters, number of columns or max column size. 
   
   Also backward compatibility is ensured and parser works same as before if 
`csvOptions` is null.
   
   ## Documentation
   
   Add a following paragraph into 
https://drill.apache.org/docs/http-storage-plugin/#configuring-the-api-connections
   
   ```
   # CSV parser options
   
   CSV parser of HTTP Storage plugin can be configured using `csvOptions`.
   
   ```json
   {
 "csvOptions": {
   "delimiter": ",",
   "quote": "\"",
   "quoteEscape": "\"",
   "lineSeparator": "\n",
   "headerExtractionEnabled": null,
   "numberOfRowsToSkip": 0,
   "numberOfRecordsToRead": -1,
   "lineSeparatorDetectionEnabled": true,
   "maxColumns": 512,
   "maxCharsPerColumn": 4096,
   "skipEmptyLines": true,
   "ignoreLeadingWhitespaces": true,
   "ignoreTrailingWhitespaces": true,
   "nullValue": null
 }
   }
   ```
   
   E.g. to parse `.tsv` files you can use a following config:
   
   ```json
   {
 "csvOptions": {
   "delimiter": "\t"
 }
   }
   ```
   
   ```
   
   ## Testing
   
   Create a following storage plugin with name `github`
   
   
   ```json
   {
 "type": "http",
 "connections": {
   "test-data": {
 "url": 
"https://raw.githubusercontent.com/semantic-web-company/wic-tsv/master/data/de/Test/test_examples.txt;,
 "requireTail": false,
 "method": "GET",
 "authType": "none",
 "inputType": "csv",
 "xmlDataLevel": 1,
 "postParameterLocation": "QUERY_STRING",
 "csvOptions": {
   "delimiter": "\t",
   "quote": "\"",
   "quoteEscape": "\"",
   "lineSeparator": "\n",
   "numberOfRecordsToRead": -1,
   "lineSeparatorDetectionEnabled": true,
   "maxColumns": 512,
   "maxCharsPerColumn": 4096,
   "skipEmptyLines": true,
   "ignoreLeadingWhitespaces": true,
   "ignoreTrailingWhitespaces": true
 },
 "verifySSLCert": true
   }
 },
 "timeout": 5,
 "retryDelay": 1000,
 "proxyType": "direct",
 "authMode": "SHARED_USER",
 "enabled": true
   }
   ```
   
   Then query tsv file with 
   
   ```sql
   SELECT * from github.`test-data`
   ```.
   
   You should see a result set containing three columns
   




> Allow configuring csv parser in http storage plugin configuration
> -
>
> Key: DRILL-8457
> URL: https://issues.apache.org/jira/browse/DRILL-8457
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - HTTP
>Affects Versions: Future
>Reporter: Zbigniew Tomanek
>Priority: Minor
> Fix For: Future
>
>
> Currently there is no way to configure csv parser when http plugin is used. 
> Because of that some kind of files cannot be parsed (e.g. when any column has 
> more than 4096 chars or file has a delimiter different from `,`).
> Since in DataWalk we utilize http plugin quite often we've changed our 
> internal fork of Drill so following parser/format properties can be 
> configured using additional `csvOptions` field:
>  
> {code:json}
> {
>   "csvOptions": {
>     "delimiter": "\t",
>     "quote": "\"",
>     "quote_escape": "\"",
>     "line_separator": "\n",
>     "header_extraction_enabled": null,
>     "number_of_rows_to_skip": 0,
>     "number_of_records_to_read": -1,
>     "line_separator_detection_enabled": true,
>     "max_columns": 512,
>     "max_chars_per_column": 4096,
>     "skip_empty_lines": true,
>     "ignore_leading_whitespaces": true,
>     "ignore_trailing_whitespaces": true,
>     "null_value": null
>   }