[jira] [Commented] (DRILL-8450) Add Data Type Inference to XML Format Plugin

2023-08-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752112#comment-17752112
 ] 

ASF GitHub Bot commented on DRILL-8450:
---

mbeckerle commented on code in PR #2819:
URL: https://github.com/apache/drill/pull/2819#discussion_r1287322034


##
common/src/main/java/org/apache/drill/common/Typifier.java:
##
@@ -88,6 +96,40 @@ public class Typifier {
   // If a String contains any of these, try to evaluate it as an equation
   private static final char[] MathCharacters = new char[]{'+', '-', '/', '*', '='};
 
+  /**
+   * This function infers the Drill data type of unknown data.
+   * @param data The input text of unknown data type.
+   * @return A {@link MinorType} of the Drill data type.
+   */
+  public static MinorType typifyToDrill (String data) {
+Entry result = Typifier.typify(data);
+String dataType = result.getKey().getSimpleName();
+
+// If the string is empty, return UNKNOWN

Review Comment:
   Makes perfect sense. 
   
   For XML you need XSD to know what's potentially repeating. 
   
   Sometimes that is easy because of minOccurs/maxOccurs.
   
   But there's also these "implied arrays".
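
As a rough illustration of the easy case only, the sketch below scans an XSD for element declarations whose `maxOccurs` allows more than one occurrence. It is not Drill code: the class name `RepeatingElements`, the method `repeating`, and the sample schema are all made up for this example, and it does nothing for the "implied arrays" case.

```java
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RepeatingElements {
  private static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

  // Hypothetical schema, invented for this illustration.
  static final String SAMPLE_XSD =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='book'><xs:complexType><xs:sequence>"
      + "<xs:element name='title' type='xs:string'/>"
      + "<xs:element name='author' type='xs:string' maxOccurs='unbounded'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

  /** Names of element declarations whose maxOccurs permits more than one occurrence. */
  static Set<String> repeating(String xsd) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xsd)));
      NodeList declarations = doc.getElementsByTagNameNS(XSD_NS, "element");
      Set<String> result = new TreeSet<>();
      for (int i = 0; i < declarations.getLength(); i++) {
        Element decl = (Element) declarations.item(i);
        // getAttribute returns "" when the attribute is absent; the XSD default is 1.
        String max = decl.getAttribute("maxOccurs");
        if (max.equals("unbounded") || (!max.isEmpty() && Long.parseLong(max) > 1)) {
          result.add(decl.hasAttribute("name")
              ? decl.getAttribute("name")
              : decl.getAttribute("ref"));
        }
      }
      return result;
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not read schema", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(repeating(SAMPLE_XSD)); // prints [author]
  }
}
```

Matching on the bare element name is a simplification; a real mapping would have to track where in the schema each declaration sits.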

> Add Data Type Inference to XML Format Plugin
> 
>
> Key: DRILL-8450
> URL: https://issues.apache.org/jira/browse/DRILL-8450
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Format - XML
>Affects Versions: 1.21.1
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.22.0
>
>
> This PR adds data type inference to the XML format plugin.  Like other 
> plugins, it adds a new configuration parameter, allTextMode, which, when set 
> to true, reads all data as strings.  The default is true.
> Note that the inference is limited to doubles, dates, timestamps, booleans, 
> and strings.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8450) Add Data Type Inference to XML Format Plugin

2023-08-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752099#comment-17752099
 ] 

ASF GitHub Bot commented on DRILL-8450:
---

cgivre commented on code in PR #2819:
URL: https://github.com/apache/drill/pull/2819#discussion_r1287295957


##
common/src/main/java/org/apache/drill/common/Typifier.java:
##
@@ -88,6 +96,40 @@ public class Typifier {
   // If a String contains any of these, try to evaluate it as an equation
   private static final char[] MathCharacters = new char[]{'+', '-', '/', '*', '='};
 
+  /**
+   * This function infers the Drill data type of unknown data.
+   * @param data The input text of unknown data type.
+   * @return A {@link MinorType} of the Drill data type.
+   */
+  public static MinorType typifyToDrill (String data) {
+Entry result = Typifier.typify(data);
+String dataType = result.getKey().getSimpleName();
+
+// If the string is empty, return UNKNOWN

Review Comment:
   @mbeckerle Drill doesn't really have an `UNKNOWN` data type.  The way the 
typifier works is that if it can't determine the data type, it falls back to 
string, which can accept basically anything.
   
   Regarding the lists: the issue is that to create a list, you have to set 
the data mode to `REPEATED`.  The problem with XML is that there's no real way 
to know whether a field is repeated.  Consider this:
   
   ```xml
   
   
     <author>a</author>
   
   
     <author>a1</author>
     <author>a2</author>
   
   ```
   
   Since Drill uses the streaming reader, when it first encounters the `author` 
field, it adds an entry for a VARCHAR field.  However, when it gets to the 
next author record, the field should be a list, but there's no way to really 
know that without a schema.  
   
   With JSON we don't have this problem because it uses `[` to denote lists. 

   Does that make sense?
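
To make the streaming problem concrete, here is a small sketch using the JDK's StAX reader. It is not the Drill XML reader: the class `StreamingRepeatDemo`, the method `firstRepeat`, and the wrapper elements `books`/`book` in the sample are assumptions made for illustration. For each direct child of a record element, it reports the record number in which the reader first saw that child twice.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StreamingRepeatDemo {
  // Hypothetical input; the wrapper element names are invented.
  static final String SAMPLE_XML =
      "<books>"
      + "<book><author>a</author></book>"
      + "<book><author>a1</author><author>a2</author></book>"
      + "</books>";

  /** Maps each repeated field to the 1-based record in which the repeat was first seen. */
  static Map<String, Integer> firstRepeat(String xml, String recordTag) {
    Map<String, Integer> firstRepeatAt = new LinkedHashMap<>();
    Map<String, Integer> counts = new HashMap<>(); // per-record child counts
    int record = 0;       // number of records started so far
    int depth = 0;        // depth of the current element
    int recordDepth = -1; // depth of the open record, -1 when none is open
    try {
      XMLStreamReader reader =
          XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
      while (reader.hasNext()) {
        int event = reader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          depth++;
          String name = reader.getLocalName();
          if (recordDepth < 0 && name.equals(recordTag)) {
            recordDepth = depth;
            record++;
            counts.clear();
          } else if (recordDepth >= 0 && depth == recordDepth + 1) {
            // Only on the second occurrence does a one-pass reader learn the field repeats.
            if (counts.merge(name, 1, Integer::sum) == 2) {
              firstRepeatAt.putIfAbsent(name, record);
            }
          }
        } else if (event == XMLStreamConstants.END_ELEMENT) {
          if (depth == recordDepth) {
            recordDepth = -1;
          }
          depth--;
        }
      }
      reader.close();
    } catch (XMLStreamException e) {
      throw new IllegalArgumentException("Malformed XML", e);
    }
    return firstRepeatAt;
  }

  public static void main(String[] args) {
    System.out.println(firstRepeat(SAMPLE_XML, "book")); // prints {author=2}
  }
}
```

The repeat only shows up in record 2, after a one-pass reader has already treated `author` as a scalar in record 1.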
   
   
   
   







[jira] [Commented] (DRILL-8450) Add Data Type Inference to XML Format Plugin

2023-08-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752087#comment-17752087
 ] 

ASF GitHub Bot commented on DRILL-8450:
---

mbeckerle commented on code in PR #2819:
URL: https://github.com/apache/drill/pull/2819#discussion_r1287251884


##
contrib/format-xml/README.md:
##
@@ -15,12 +15,15 @@ The default configuration is shown below:
   "extensions": [
 "xml"
   ],
+  "allTextMode": true,
   "dataLevel": 2
 }
 ```
 
 ## Data Types
-All fields are read as strings.  Nested fields are read as maps.  Future functionality could include support for lists.
+The XML reader has an `allTextMode` which, when set to `true` reads all data fields as strings.
+When set to `false`, Drill will attempt to infer data types.
+Nested fields are read as maps.  Future functionality could include support for lists.

Review Comment:
   Not really part of this change set, but I don't know what you are suggesting 
by "future functionality could include support for lists." I'd like to 
understand that plan/idea just as part of grokking all of this XML mapping. 



##
common/src/main/java/org/apache/drill/common/Typifier.java:
##
@@ -88,6 +96,40 @@ public class Typifier {
   // If a String contains any of these, try to evaluate it as an equation
   private static final char[] MathCharacters = new char[]{'+', '-', '/', '*', '='};
 
+  /**
+   * This function infers the Drill data type of unknown data.
+   * @param data The input text of unknown data type.
+   * @return A {@link MinorType} of the Drill data type.
+   */
+  public static MinorType typifyToDrill (String data) {
+Entry result = Typifier.typify(data);
+String dataType = result.getKey().getSimpleName();
+
+// If the string is empty, return UNKNOWN

Review Comment:
   The next line of code contradicts this comment by returning VARCHAR. 
   (Unless VARCHAR == UNKNOWN, which is news to me.)







[jira] [Commented] (DRILL-8450) Add Data Type Inference to XML Format Plugin

2023-08-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752065#comment-17752065
 ] 

ASF GitHub Bot commented on DRILL-8450:
---

cgivre opened a new pull request, #2819:
URL: https://github.com/apache/drill/pull/2819

   # [DRILL-8450](https://issues.apache.org/jira/browse/DRILL-8450): Add Data 
Type Inference to XML Format Plugin
   
   ## Description
   
   This PR adds data type inference to the XML format plugin.  Like other 
plugins, it adds a new configuration parameter, `allTextMode`, which, when set 
to `true`, reads all data as strings.  The default is `true`.
   Note that the inference is limited to doubles, dates, timestamps, booleans, 
and strings.
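
For readers unfamiliar with this style of inference, the following is a minimal sketch of the idea: try progressively less specific parsers and fall back to a string. It is not the `Typifier` code from this PR. The class `InferenceSketch`, the enum `Inferred`, and the choice of ISO-8601 date formats are assumptions made for illustration.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeParseException;

final class InferenceSketch {
  // Hypothetical result type; these are not Drill's MinorType constants.
  enum Inferred { BOOLEAN, DOUBLE, DATE, TIMESTAMP, STRING }

  /** Guesses a type for a text value, falling back to STRING when nothing else fits. */
  static Inferred infer(String text) {
    String s = text == null ? "" : text.trim();
    if (s.isEmpty()) {
      return Inferred.STRING; // nothing to infer from, so use the catch-all type
    }
    if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false")) {
      return Inferred.BOOLEAN;
    }
    try {
      // Note: parseDouble also accepts forms such as "NaN" and "1d".
      Double.parseDouble(s);
      return Inferred.DOUBLE;
    } catch (NumberFormatException ignored) {
      // not a number, keep trying
    }
    try {
      LocalDate.parse(s); // ISO-8601 date, e.g. 2023-08-08
      return Inferred.DATE;
    } catch (DateTimeParseException ignored) {
      // not a date, keep trying
    }
    try {
      LocalDateTime.parse(s); // ISO-8601 date-time, e.g. 2023-08-08T10:15:30
      return Inferred.TIMESTAMP;
    } catch (DateTimeParseException ignored) {
      // not a timestamp
    }
    return Inferred.STRING;
  }

  public static void main(String[] args) {
    System.out.println(infer("3.14"));       // prints DOUBLE
    System.out.println(infer("2023-08-08")); // prints DATE
    System.out.println(infer(""));           // prints STRING
  }
}
```

With `allTextMode` set to `true`, a reader would skip this step entirely and treat every value as a string.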
   
   ## Documentation
   Updated README
   
   ## Testing
   Added unit test.






[jira] [Updated] (DRILL-8450) Add Data Type Inference to XML Format Plugin

2023-08-08 Thread Charles Givre (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Givre updated DRILL-8450:
-
Description: 
This PR adds data type inference to the XML format plugin.  Like other 
plugins, it adds a new configuration parameter, allTextMode, which, when set 
to true, reads all data as strings.  The default is true.

Note that the inference is limited to doubles, dates, timestamps, booleans, 
and strings.



[jira] [Created] (DRILL-8450) Add Data Type Inference to XML Format Plugin

2023-08-08 Thread Charles Givre (Jira)
Charles Givre created DRILL-8450:


 Summary: Add Data Type Inference to XML Format Plugin
 Key: DRILL-8450
 URL: https://issues.apache.org/jira/browse/DRILL-8450
 Project: Apache Drill
  Issue Type: Improvement
  Components: Format - XML
Affects Versions: 1.21.1
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.22.0





