[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2024-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806651#comment-17806651
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

jnturton merged PR #2515:
URL: https://github.com/apache/drill/pull/2515




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2024-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806649#comment-17806649
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

jnturton commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r1446231938


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -171,107 +168,109 @@ public HDF5ReaderConfig(HDF5FormatPlugin plugin, 
HDF5FormatConfig formatConfig)
 }
   }
 
-  public HDF5BatchReader(HDF5ReaderConfig readerConfig, int maxRecords) {
-this.readerConfig = readerConfig;
-this.maxRecords = maxRecords;
+  public HDF5BatchReader(HDF5ReaderConfig config, EasySubScan scan, 
FileSchemaNegotiator negotiator) {
+errorContext = negotiator.parentErrorContext();
+file = negotiator.file();
+readerConfig = config;
 dataWriters = new ArrayList<>();
-this.showMetadataPreview = readerConfig.formatConfig.showPreview();
-  }
+showMetadataPreview = readerConfig.formatConfig.showPreview();
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {
-split = negotiator.split();
-errorContext = negotiator.parentErrorContext();
 // Since the HDF file reader uses a stream to actually read the file, the 
file name from the
 // module is incorrect.
-fileName = split.getPath().getName();
-try {
-  openFile(negotiator);
-} catch (IOException e) {
-  throw UserException
-.dataReadError(e)
-.addContext("Failed to close input file: %s", split.getPath())
-.addContext(errorContext)
-.build(logger);
-}
+fileName = file.split().getPath().getName();
 
-ResultSetLoader loader;
-if (readerConfig.defaultPath == null) {
-  // Get file metadata
-  List metadata = getFileMetadata(hdfFile, new 
ArrayList<>());
-  metadataIterator = metadata.iterator();
-
-  // Schema for Metadata query
-  SchemaBuilder builder = new SchemaBuilder()
-.addNullable(PATH_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_TYPE_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(FILE_NAME_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_SIZE_COLUMN_NAME, MinorType.BIGINT)
-.addNullable(IS_LINK_COLUMN_NAME, MinorType.BIT)
-.addNullable(ELEMENT_COUNT_NAME, MinorType.BIGINT)
-.addNullable(DATASET_DATA_TYPE_NAME, MinorType.VARCHAR)
-.addNullable(DIMENSIONS_FIELD_NAME, MinorType.VARCHAR);
-
-  negotiator.tableSchema(builder.buildSchema(), false);
-
-  loader = negotiator.build();
-  dimensions = new int[0];
-  rowWriter = loader.writer();
-
-} else {
-  // This is the case when the default path is specified. Since the user 
is explicitly asking for a dataset
-  // Drill can obtain the schema by getting the datatypes below and 
ultimately mapping that schema to columns
-  Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
-  dimensions = dataSet.getDimensions();
-
-  loader = negotiator.build();
-  rowWriter = loader.writer();
-  writerSpec = new WriterSpec(rowWriter, negotiator.providedSchema(),
-  negotiator.parentErrorContext());
-  if (dimensions.length <= 1) {
-buildSchemaFor1DimensionalDataset(dataSet);
-  } else if (dimensions.length == 2) {
-buildSchemaFor2DimensionalDataset(dataSet);
-  } else {
-// Case for datasets of greater than 2D
-// These are automatically flattened
-buildSchemaFor2DimensionalDataset(dataSet);
+{ // Opens an HDF5 file

Review Comment:
   I guess some of these could become private methods but it's a minor point 
for me.



##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -171,107 +164,104 @@ public HDF5ReaderConfig(HDF5FormatPlugin plugin, 
HDF5FormatConfig formatConfig)
 }
   }
 
-  public HDF5BatchReader(HDF5ReaderConfig readerConfig, int maxRecords) {
-this.readerConfig = readerConfig;
-this.maxRecords = maxRecords;
+  public HDF5BatchReader(HDF5ReaderConfig config, EasySubScan scan, 
FileSchemaNegotiator negotiator) {
+errorContext = negotiator.parentErrorContext();
+file = negotiator.file();
+readerConfig = config;
 dataWriters = new ArrayList<>();
-this.showMetadataPreview = readerConfig.formatConfig.showPreview();
-  }
+showMetadataPreview = readerConfig.formatConfig.showPreview();
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {
-split = negotiator.split();
-errorContext = negotiator.parentErrorContext();
 // Since the HDF file reader uses a stream to actually read the file, the 
file name from the
 // module is incorrect.
-fileName = split.getPath().getName();
-try {
-  openFile(negotiator);
-} catch (IOException e) {
-  throw 

[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2024-01-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805821#comment-17805821
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

cgivre commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1888037847

   > Did the recent EVF revisions allow the tests for this PR to pass? Is there 
anything that is still missing? Also, did the excitement over my botched merge 
settle down and are we good now?
   
   All the unit tests pass. Whether that means that everything is working is 
another question, but this plugin has a decent amount of tests, so I'd feel 
pretty good. 




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2024-01-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805790#comment-17805790
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

paul-rogers commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1887901054

   Did the recent EVF revisions allow the tests for this PR to pass? Is there 
anything that is still missing? Also, did the excitement over my botched merge 
settle down and are we good now?




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2024-01-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805457#comment-17805457
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

jnturton commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1886695348

   > @paul-rogers I attempted to fix. I kind of suck at git, so I think it's 
more or less correct now, but there was probably a better way to do this.
   
   Just working through the review comments that @paul-rogers left (the ones 
unrelated to the needed functionality that was missing from EVF2).




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2024-01-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805188#comment-17805188
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

cgivre commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1885044910

   @jnturton I did as you suggested.  Would you mind please taking a look?




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2024-01-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804571#comment-17804571
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

jnturton commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1882403741

I see Git's "patch contents already upstream" feature doesn't automatically 
clean up the unwanted commits. I've dropped them manually in a new branch in my 
fork and now suggest
```
   git reset --hard origin/master
   git pull --rebase https://github.com/jnturton/drill.git 8188-hdf5-evf2
   git push --force # to luocooong's fork
   ```




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2024-01-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804568#comment-17804568
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

jnturton commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1882389703

   > @paul-rogers I attempted to fix. I kind of suck at git, so I think it's 
more or less correct now, but there was probably a better way to do this.
   
   I think you still want something like
   ```
   git pull --rebase upstream master
   git push --force-with-lease
   ```




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2024-01-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804563#comment-17804563
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

cgivre commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1882377677

   @paul-rogers I attempted to fix.  I kind of suck at git, so I think it's 
more or less correct now, but there was probably a better way to do this.




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2024-01-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804552#comment-17804552
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

paul-rogers commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1882246203

   It seems you did this work on top of the master with my unsquashed commits. 
When you try to push, those commits come along for the ride. I think you should 
grab the latest master, then rebase your branch on it.
   
   Plan B is to a) grab the latest master, and b) create a new branch that 
cherry-picks the  commit(s) you meant to add.
   
   If even this doesn't work, then I'll clean up this branch for you since I 
created the mess in the first place...




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2024-01-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804547#comment-17804547
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

cgivre commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1882168615

   I think I hosed the version control somehow. This PR should only modify a 
few files in the HDF5 reader. 




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-11-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627746#comment-17627746
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

cgivre commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1300548353

   Hey @luocooong @paul-rogers I hope all is well.  I wanted to check in on 
this PR to see where we are.  At this point, nearly all the other format 
plugins have been converted to EVF V2.
   
   The other outstanding ones are the image format and LTSV.  I'd really like 
to see this merged so that we can remove the EVF V1 code. 
   
   Do you think we could get this ready to go soon?
   




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
> Fix For: 2.0.0
>
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-07-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564864#comment-17564864
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

jnturton commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1180116018

   Converted to draft to prevent merging.




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-05-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542305#comment-17542305
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1138051837

   Hi @cgivre Thank you for paying attention to this PR.
   The pull request cannot be merged right now: Paul is going to re-review the V2 
section code, and we're going to fix the bugs above in the framework.




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-05-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542292#comment-17542292
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

cgivre commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1138029864

   Hi @luocooong Thank you for this PR.  Where are we in terms of getting it 
merged?




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527369#comment-17527369
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1108226622

   > Found the bug. It is in `ColumnBuilder` which seems to be missing code to 
handle an unprojected repeated list. This bug then caused the other "bugs" that 
we discussed in the review: those bits of code are working as they should. The 
problem is that the result set loader is materializing a vector when it should 
not. It will take some time to remember how all this stuff works. Stay tuned.
   
   @paul-rogers Great! Thank you for the quick work.
   From the end of the most recent discussion, I completely rejected my 
previous code revision, guessing that the unprojected handling might have been lost. 
Actually, I've added this function locally, but I'm not sure it's correct. 
Would you mind checking mine before you submit the new revision?




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527363#comment-17527363
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

paul-rogers commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1108207457

   Found the bug. It is in `ColumnBuilder` which seems to be missing code to 
handle an unprojected repeated list. This bug then caused the other "bugs" that 
we discussed in the review: those bits of code are working as they should. The 
problem is that the result set loader is materializing a vector when it should 
not. It will take some time to remember how all this stuff works. Stay tuned.




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524062#comment-17524062
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

paul-rogers commented on PR #2515:
URL: https://github.com/apache/drill/pull/2515#issuecomment-1102092604

   This PR is getting a bit complex with the bug or two that it uncovered, so 
let me explain a bit about how EVF2 works. There are two cases: wildcard 
projection (SELECT *) and explicit projection (SELECT a, b, c). EVF2 behaves 
differently in these two cases.
   
   Then, for each reader, there are three other cases. The reader might know 
all its columns before the file is even opened. The PCAP reader is an example: 
all PCAP files have the same schema, so we don't need to look at the file to 
know the schema. The second case is files where we can learn the schema when 
opening the file. Parquet and CSV are examples: we can learn the Parquet schema 
from the file metadata, and the CSV schema from the headers. The last case is 
where we don't know the schema until we read each row. JSON is the best example.
   
   So, now we have six cases to consider. This is why EVF2 is so complex!
   
   For the wildcard, EVF2 "discovers" columns as the reader creates them: 
either via the up-front schema, or as the reader reads data. In JSON, for 
example, we can discover a new column at any time. Once a column is added, EVF2 
will automatically fill in null values if values are missing. In the extreme 
case, it can fill in nulls for an entire batch. Because of the wildcard, all 
discovered columns are materialized and added to the result set. If reading 
JSON, and a column does not appear until the third batch, then the first two 
won't contain that column, but the third batch will have a schema change and 
will include the column. This can cause a problem for operators such as joins, 
sort or aggregation that have to store a collection of rows, not all can handle 
a schema change.
   
   Now, for the explicit schema case, EVF2 knows what columns the user wants: 
those in the list. EVF2 waits as long as it can, hoping the reader will provide 
the columns. Again, the reader can provide them up front, before the first 
record, or as the read proceeds (as in JSON.) As the reader provides each 
column, EVF2 has to decide: do we need that column? If so, we create a vector 
and a column writer: we materialize the column. If the column is not needed, 
EVF2 creates a dummy column writer.
   Now the interesting part. Suppose we get to the end of the first batch, the 
query wants column c, and the reader has never defined column c? What do we do? 
In this case, we have to make something up. Historically, Drill would make up a 
Nullable Int, with all-null values. EVF added the ability to specify the type 
for such columns, and we use that. If a provided schema is available, then the 
user tells us the type.
   
   Now we get to another interesting part. What if we guessed, say, Varchar, 
but the column later shows up as a JSON array? We're stuck: we can't go back 
and redo the old batches. We end up with a "hard" schema change. Bad things 
happen unless the query is really simple. This is the fun of Drill's schemaless 
system.
   
   With that background, we can try to answer your question. The answer is: it 
depends. If the reader says, "hey Mr. EVF2, here is the full schema I will 
read, I promise not to discover more columns", then EVF2 will throw an 
exception if later you say, "ha! just kidding. Actually, I discovered another 
one." I wonder if that's what is happening here.
   
   If, however, the reader left the schema open, and said, "here are the 
columns I know about now, but I might find more later", then EVF2 will expect 
more columns, and will handle them as above: materialize them if they are 
projected or if we have a wildcard, or provide a dummy writer if we have explicit 
projection and the column is not projected.
   
   In this PR, we have two separate cases in the reader constructor.
   
   * In the `if` path, we define a "reader schema", and reserve the right to 
add more columns later. That's what the `false` argument to `tableSchema()` 
means.
   * In the `else` path, we define no schema at all: we don't call 
`tableSchema()`.
   
   This means the reader is doing two entirely different things. In the `if` 
case, we define the schema and we just ask for column writers by name. In the 
`else` case, we don't define a schema, and we have to define the column when we 
ask for the column writers.
   
   This seems horribly complicated! I wonder, are we missing logic in the 
`then` case? Or, should there be two distinct readers, each of which implements 
one of the above cases?




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: 

[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524061#comment-17524061
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

paul-rogers commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r852601292


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -171,107 +164,104 @@ public HDF5ReaderConfig(HDF5FormatPlugin plugin, 
HDF5FormatConfig formatConfig)
 }
   }
 
-  public HDF5BatchReader(HDF5ReaderConfig readerConfig, int maxRecords) {
-this.readerConfig = readerConfig;
-this.maxRecords = maxRecords;
+  public HDF5BatchReader(HDF5ReaderConfig config, EasySubScan scan, 
FileSchemaNegotiator negotiator) {
+errorContext = negotiator.parentErrorContext();
+file = negotiator.file();
+readerConfig = config;
 dataWriters = new ArrayList<>();
-this.showMetadataPreview = readerConfig.formatConfig.showPreview();
-  }
+showMetadataPreview = readerConfig.formatConfig.showPreview();
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {
-split = negotiator.split();
-errorContext = negotiator.parentErrorContext();
 // Since the HDF file reader uses a stream to actually read the file, the 
file name from the
 // module is incorrect.
-fileName = split.getPath().getName();
-try {
-  openFile(negotiator);
-} catch (IOException e) {
-  throw UserException
-.dataReadError(e)
-.addContext("Failed to close input file: %s", split.getPath())
-.addContext(errorContext)
-.build(logger);
+fileName = file.split().getPath().getName();
+
+{ // Opens an HDF5 file
+  try (InputStream in = 
file.fileSystem().openPossiblyCompressedStream(file.split().getPath())) {
+/*
+ * As a possible future improvement, the jhdf reader has the ability 
to read hdf5 files from
+ * a byte array or byte buffer. This implementation is better in that 
it does not require creating
+ * a temporary file which must be deleted later.  However, it could 
result in memory issues in the
+ * event of large files.
+ */
+hdfFile = HdfFile.fromInputStream(in);
+  } catch (IOException e) {
+throw UserException
+  .dataReadError(e)
+  .message("Failed to open input file: %s", file.split().getPath())
+  .addContext(errorContext)
+  .build(logger);
+  }
 }
 
-ResultSetLoader loader;
-if (readerConfig.defaultPath == null) {
-  // Get file metadata
-  List metadata = getFileMetadata(hdfFile, new 
ArrayList<>());
-  metadataIterator = metadata.iterator();
-
-  // Schema for Metadata query
-  SchemaBuilder builder = new SchemaBuilder()
-.addNullable(PATH_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_TYPE_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(FILE_NAME_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_SIZE_COLUMN_NAME, MinorType.BIGINT)
-.addNullable(IS_LINK_COLUMN_NAME, MinorType.BIT)
-.addNullable(ELEMENT_COUNT_NAME, MinorType.BIGINT)
-.addNullable(DATASET_DATA_TYPE_NAME, MinorType.VARCHAR)
-.addNullable(DIMENSIONS_FIELD_NAME, MinorType.VARCHAR);
-
-  negotiator.tableSchema(builder.buildSchema(), false);
-
-  loader = negotiator.build();
-  dimensions = new int[0];
-  rowWriter = loader.writer();
-
-} else {
-  // This is the case when the default path is specified. Since the user 
is explicitly asking for a dataset
-  // Drill can obtain the schema by getting the datatypes below and 
ultimately mapping that schema to columns
-  Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
-  dimensions = dataSet.getDimensions();
-
-  loader = negotiator.build();
-  rowWriter = loader.writer();
-  writerSpec = new WriterSpec(rowWriter, negotiator.providedSchema(),
-  negotiator.parentErrorContext());
-  if (dimensions.length <= 1) {
-buildSchemaFor1DimensionalDataset(dataSet);
-  } else if (dimensions.length == 2) {
-buildSchemaFor2DimensionalDataset(dataSet);
+{ // Build the schema and initial the writer
+  ResultSetLoader loader;
+  if (readerConfig.defaultPath == null) {
+// Get file metadata
+List metadata = getFileMetadata(hdfFile, new 
ArrayList<>());
+metadataIterator = metadata.iterator();
+
+// Schema for Metadata query
+SchemaBuilder builder = new SchemaBuilder()
+  .addNullable(PATH_COLUMN_NAME, MinorType.VARCHAR)
+  .addNullable(DATA_TYPE_COLUMN_NAME, MinorType.VARCHAR)
+  .addNullable(FILE_NAME_COLUMN_NAME, MinorType.VARCHAR)
+  .addNullable(DATA_SIZE_COLUMN_NAME, MinorType.BIGINT)
+  .addNullable(IS_LINK_COLUMN_NAME, MinorType.BIT)
+  

[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523102#comment-17523102
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851626806


##
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/v3/schema/ScanSchemaResolver.java:
##
@@ -189,7 +189,7 @@ private void insertColumn(ColumnMetadata col) {
 switch (mode) {
   case FIRST_READER_SCHEMA:
   case READER_SCHEMA:
-if (schema.projectionType() != ProjectionType.ALL) {
+if (schema.projectionType() != ProjectionType.ALL && !col.isArray()) {

Review Comment:
   Let's look at this case:
   1. The batch reader receives the projected columns, from ` SELECT path, data_type, 
file_name FROM ` in the constructor.
   2. The schema is [TupleSchema [ProjectedColumn [`path` (LATE:REQUIRED)]], 
[ProjectedColumn [`data_type` (LATE:REQUIRED)]], [ProjectedColumn [`file_name` 
(LATE:REQUIRED)]]].
   3. ProjectionType = 'SOME'.
   4. The batch reader creates a new repeated list column in next().
   5. The projection is applied to the schema:
   ```java
   ColumnHandle existing = schema.find(colSchema.name());
   if (existing == null) {
     insertColumn(colSchema);
   }
   ```
   6. It catches and throws an `IllegalStateException`.
   7. Reading the format data fails.
   
   In EVF1, creating repeated list fields dynamically is allowed and does not 
throw such exceptions.
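   
   A rough sketch of step 4 above, with a hypothetical column name; `rowWriter` 
stands for the reader's `RowSetLoader`, and the `addRepeatedList()`/`addArray()` 
builder calls are the ones used in this PR's tests:
   ```java
   import org.apache.drill.common.types.TypeProtos.MinorType;
   import org.apache.drill.exec.physical.resultSet.RowSetLoader;
   import org.apache.drill.exec.record.metadata.SchemaBuilder;
   import org.apache.drill.exec.record.metadata.TupleMetadata;
   
   class DynamicListColumn {
     // Called from next(): the reader discovers a repeated list column that is
     // not in the explicit projection list (SELECT path, data_type, file_name ...).
     static void addIntListColumn(RowSetLoader rowWriter) {
       TupleMetadata listSchema = new SchemaBuilder()
           .addRepeatedList("int_list")   // illustrative column name
             .addArray(MinorType.INT)
             .resumeSchema()
           .buildSchema();
       // With ProjectionType.SOME, ScanSchemaResolver.insertColumn() rejected this
       // unprojected array column before the fix, surfacing as IllegalStateException.
       rowWriter.addColumn(listSchema.metadata("int_list"));
     }
   }
   ```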





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523100#comment-17523100
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851627884


##
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/v3/schema/MutableTupleSchema.java:
##
@@ -173,7 +173,7 @@ public ColumnHandle insert(int posn, ColumnMetadata col) {
   }
 
   public ColumnHandle insert(ColumnMetadata col) {
-return insert(insertPoint++, col);
+return insert(insertPoint == -1 ? size() : insertPoint++, col);

Review Comment:
   When I fixed the following error, I received a new one: dynamically creating 
a repeated list field (successfully) in next() can leave that field assigned 
an index value of -1.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523098#comment-17523098
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851627022


##
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/scan/v3/TestScanOuputSchema.java:
##
@@ -327,4 +375,50 @@ public void 
testStrictProvidedSchemaWithWildcardAndSpecialCols() {
 assertFalse(scan.next());
 scanFixture.close();
   }
+
+  @Test
+  public void testProvidedSchemaWithListArray() {
+TupleMetadata providedSchema = new SchemaBuilder()
+.add("a", MinorType.INT)
+.buildSchema();
+
+BaseScanFixtureBuilder builder = new BaseScanFixtureBuilder(fixture);
+builder.addReader(negotiator -> new MockSimpleReader(negotiator, true));
+builder.builder.providedSchema(providedSchema);
+builder.setProjection(new String[] { "a", "b", "c" });
+builder.builder.nullType(Types.optional(MinorType.VARCHAR));
+ScanFixture scanFixture = builder.build();
+ScanOperatorExec scan = scanFixture.scanOp;
+
+TupleMetadata expectedSchema = new SchemaBuilder()
+.add("a", MinorType.INT)
+.add("b", MinorType.VARCHAR)
+.add("c", MinorType.VARCHAR)
+.addRepeatedList("int_list")
+  .addArray(MinorType.INT)
+  .resumeSchema()
+.addRepeatedList("long_list")

Review Comment:
   These tests verify the above fix. Would it be better to split them into a new 
test class?





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523097#comment-17523097
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851626806


##
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/v3/schema/ScanSchemaResolver.java:
##
@@ -189,7 +189,7 @@ private void insertColumn(ColumnMetadata col) {
 switch (mode) {
   case FIRST_READER_SCHEMA:
   case READER_SCHEMA:
-if (schema.projectionType() != ProjectionType.ALL) {
+if (schema.projectionType() != ProjectionType.ALL && !col.isArray()) {

Review Comment:
   Let's look at this case:
   1. The batch reader receives the projected columns, from ` SELECT path, data_type, 
file_name FROM ` in the constructor.
   2. The schema is [TupleSchema [ProjectedColumn [`path` (LATE:REQUIRED)]], 
[ProjectedColumn [`data_type` (LATE:REQUIRED)]], [ProjectedColumn [`file_name` 
(LATE:REQUIRED)]]].
   3. ProjectionType = 'SOME'.
   4. The batch reader creates a new array column in next().
   5. The projection is applied to the schema:
   ```java
   ColumnHandle existing = schema.find(colSchema.name());
   if (existing == null) {
     insertColumn(colSchema);
   }
   ```
   6. It catches and throws an `IllegalStateException`.
   7. Reading the format data fails.
   
   In EVF1, creating array fields dynamically is allowed and does not throw 
such exceptions.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523093#comment-17523093
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851624724


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -171,107 +168,109 @@ public HDF5ReaderConfig(HDF5FormatPlugin plugin, 
HDF5FormatConfig formatConfig)
 }
   }
 
-  public HDF5BatchReader(HDF5ReaderConfig readerConfig, int maxRecords) {
-this.readerConfig = readerConfig;
-this.maxRecords = maxRecords;
+  public HDF5BatchReader(HDF5ReaderConfig config, EasySubScan scan, 
FileSchemaNegotiator negotiator) {
+errorContext = negotiator.parentErrorContext();
+file = negotiator.file();
+readerConfig = config;
 dataWriters = new ArrayList<>();
-this.showMetadataPreview = readerConfig.formatConfig.showPreview();
-  }
+showMetadataPreview = readerConfig.formatConfig.showPreview();
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {
-split = negotiator.split();
-errorContext = negotiator.parentErrorContext();
 // Since the HDF file reader uses a stream to actually read the file, the 
file name from the
 // module is incorrect.
-fileName = split.getPath().getName();
-try {
-  openFile(negotiator);
-} catch (IOException e) {
-  throw UserException
-.dataReadError(e)
-.addContext("Failed to close input file: %s", split.getPath())
-.addContext(errorContext)
-.build(logger);
-}
+fileName = file.split().getPath().getName();
 
-ResultSetLoader loader;
-if (readerConfig.defaultPath == null) {
-  // Get file metadata
-  List metadata = getFileMetadata(hdfFile, new 
ArrayList<>());
-  metadataIterator = metadata.iterator();
-
-  // Schema for Metadata query
-  SchemaBuilder builder = new SchemaBuilder()
-.addNullable(PATH_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_TYPE_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(FILE_NAME_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_SIZE_COLUMN_NAME, MinorType.BIGINT)
-.addNullable(IS_LINK_COLUMN_NAME, MinorType.BIT)
-.addNullable(ELEMENT_COUNT_NAME, MinorType.BIGINT)
-.addNullable(DATASET_DATA_TYPE_NAME, MinorType.VARCHAR)
-.addNullable(DIMENSIONS_FIELD_NAME, MinorType.VARCHAR);
-
-  negotiator.tableSchema(builder.buildSchema(), false);
-
-  loader = negotiator.build();
-  dimensions = new int[0];
-  rowWriter = loader.writer();
-
-} else {
-  // This is the case when the default path is specified. Since the user 
is explicitly asking for a dataset
-  // Drill can obtain the schema by getting the datatypes below and 
ultimately mapping that schema to columns
-  Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
-  dimensions = dataSet.getDimensions();
-
-  loader = negotiator.build();
-  rowWriter = loader.writer();
-  writerSpec = new WriterSpec(rowWriter, negotiator.providedSchema(),
-  negotiator.parentErrorContext());
-  if (dimensions.length <= 1) {
-buildSchemaFor1DimensionalDataset(dataSet);
-  } else if (dimensions.length == 2) {
-buildSchemaFor2DimensionalDataset(dataSet);
-  } else {
-// Case for datasets of greater than 2D
-// These are automatically flattened
-buildSchemaFor2DimensionalDataset(dataSet);
+{ // Opens an HDF5 file

Review Comment:
   An interesting thing: with EVF2, the batch reader only needs a constructor to 
perform all of its initialization. But that means we end up writing a lot of 
code in that one function: initializing the final variables, opening files, 
defining the schema, and so on. So I used code blocks to group these actions 
and make them easier to read.
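   
   Schematically, and purely as an illustration of that grouping style (the class 
and comments below are placeholders, not the actual HDF5 reader; the negotiator 
import path is assumed):
   ```java
   import org.apache.drill.exec.physical.impl.scan.v3.file.FileSchemaNegotiator;
   
   // With EVF2 all initialization happens in the constructor, so anonymous
   // blocks group the steps purely for readability.
   class GroupedInitReader {
     GroupedInitReader(FileSchemaNegotiator negotiator) {
       // Plain statements first: capture config and context from the negotiator.
   
       { // Open the file (stream handling, HdfFile.fromInputStream(), ...).
       }
   
       { // Build the schema and initialize the column writers
         // (negotiator.tableSchema(...), negotiator.build(), loader.writer()).
       }
     }
   }
   ```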





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523091#comment-17523091
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623857


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -410,8 +390,22 @@ public boolean next() {
 return true;
   }
 
+  @Override
+  public void close() {
+AutoCloseables.closeSilently(hdfFile, reader);
+/*
+ * The current implementation of the HDF5 reader creates a temp file which 
needs to be removed
+ * when the batch reader is closed. A possible future functionality might 
be to used to.
+ */
+boolean result = hdfFile.getFile().delete();

Review Comment:
   
https://github.com/jamesmudd/jhdf/blob/3136bba6d5c6d6956bea006865c7241b6ddc6f25/jhdf/src/main/java/io/jhdf/HdfFile.java#L167-L169
   In the code linked above, these temporary files are deleted only when the 
application exits.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523086#comment-17523086
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623334


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -410,8 +390,22 @@ public boolean next() {
 return true;
   }
 
+  @Override
+  public void close() {
+AutoCloseables.closeSilently(hdfFile, reader);
+/*
+ * The current implementation of the HDF5 reader creates a temp file which 
needs to be removed
+ * when the batch reader is closed. A possible future functionality might 
be to used to.

Review Comment:
   Done, I added new text for the note.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523083#comment-17523083
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623187


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -171,107 +168,109 @@ public HDF5ReaderConfig(HDF5FormatPlugin plugin, 
HDF5FormatConfig formatConfig)
 }
   }
 
-  public HDF5BatchReader(HDF5ReaderConfig readerConfig, int maxRecords) {
-this.readerConfig = readerConfig;
-this.maxRecords = maxRecords;
+  public HDF5BatchReader(HDF5ReaderConfig config, EasySubScan scan, 
FileSchemaNegotiator negotiator) {
+errorContext = negotiator.parentErrorContext();
+file = negotiator.file();
+readerConfig = config;
 dataWriters = new ArrayList<>();
-this.showMetadataPreview = readerConfig.formatConfig.showPreview();
-  }
+showMetadataPreview = readerConfig.formatConfig.showPreview();
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {
-split = negotiator.split();
-errorContext = negotiator.parentErrorContext();
 // Since the HDF file reader uses a stream to actually read the file, the 
file name from the
 // module is incorrect.
-fileName = split.getPath().getName();
-try {
-  openFile(negotiator);
-} catch (IOException e) {
-  throw UserException
-.dataReadError(e)
-.addContext("Failed to close input file: %s", split.getPath())
-.addContext(errorContext)
-.build(logger);
-}
+fileName = file.split().getPath().getName();
 
-ResultSetLoader loader;
-if (readerConfig.defaultPath == null) {
-  // Get file metadata
-  List metadata = getFileMetadata(hdfFile, new 
ArrayList<>());
-  metadataIterator = metadata.iterator();
-
-  // Schema for Metadata query
-  SchemaBuilder builder = new SchemaBuilder()
-.addNullable(PATH_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_TYPE_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(FILE_NAME_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_SIZE_COLUMN_NAME, MinorType.BIGINT)
-.addNullable(IS_LINK_COLUMN_NAME, MinorType.BIT)
-.addNullable(ELEMENT_COUNT_NAME, MinorType.BIGINT)
-.addNullable(DATASET_DATA_TYPE_NAME, MinorType.VARCHAR)
-.addNullable(DIMENSIONS_FIELD_NAME, MinorType.VARCHAR);
-
-  negotiator.tableSchema(builder.buildSchema(), false);
-
-  loader = negotiator.build();
-  dimensions = new int[0];
-  rowWriter = loader.writer();
-
-} else {
-  // This is the case when the default path is specified. Since the user 
is explicitly asking for a dataset
-  // Drill can obtain the schema by getting the datatypes below and 
ultimately mapping that schema to columns
-  Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
-  dimensions = dataSet.getDimensions();
-
-  loader = negotiator.build();
-  rowWriter = loader.writer();
-  writerSpec = new WriterSpec(rowWriter, negotiator.providedSchema(),
-  negotiator.parentErrorContext());
-  if (dimensions.length <= 1) {
-buildSchemaFor1DimensionalDataset(dataSet);
-  } else if (dimensions.length == 2) {
-buildSchemaFor2DimensionalDataset(dataSet);
-  } else {
-// Case for datasets of greater than 2D
-// These are automatically flattened
-buildSchemaFor2DimensionalDataset(dataSet);
+{ // Opens an HDF5 file
+  InputStream in = null;
+  try {

Review Comment:
   Done, thanks.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523084#comment-17523084
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623209


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -171,107 +168,109 @@ public HDF5ReaderConfig(HDF5FormatPlugin plugin, 
HDF5FormatConfig formatConfig)
 }
   }
 
-  public HDF5BatchReader(HDF5ReaderConfig readerConfig, int maxRecords) {
-this.readerConfig = readerConfig;
-this.maxRecords = maxRecords;
+  public HDF5BatchReader(HDF5ReaderConfig config, EasySubScan scan, 
FileSchemaNegotiator negotiator) {
+errorContext = negotiator.parentErrorContext();
+file = negotiator.file();
+readerConfig = config;
 dataWriters = new ArrayList<>();
-this.showMetadataPreview = readerConfig.formatConfig.showPreview();
-  }
+showMetadataPreview = readerConfig.formatConfig.showPreview();
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {
-split = negotiator.split();
-errorContext = negotiator.parentErrorContext();
 // Since the HDF file reader uses a stream to actually read the file, the 
file name from the
 // module is incorrect.
-fileName = split.getPath().getName();
-try {
-  openFile(negotiator);
-} catch (IOException e) {
-  throw UserException
-.dataReadError(e)
-.addContext("Failed to close input file: %s", split.getPath())
-.addContext(errorContext)
-.build(logger);
-}
+fileName = file.split().getPath().getName();
 
-ResultSetLoader loader;
-if (readerConfig.defaultPath == null) {
-  // Get file metadata
-  List metadata = getFileMetadata(hdfFile, new 
ArrayList<>());
-  metadataIterator = metadata.iterator();
-
-  // Schema for Metadata query
-  SchemaBuilder builder = new SchemaBuilder()
-.addNullable(PATH_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_TYPE_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(FILE_NAME_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_SIZE_COLUMN_NAME, MinorType.BIGINT)
-.addNullable(IS_LINK_COLUMN_NAME, MinorType.BIT)
-.addNullable(ELEMENT_COUNT_NAME, MinorType.BIGINT)
-.addNullable(DATASET_DATA_TYPE_NAME, MinorType.VARCHAR)
-.addNullable(DIMENSIONS_FIELD_NAME, MinorType.VARCHAR);
-
-  negotiator.tableSchema(builder.buildSchema(), false);
-
-  loader = negotiator.build();
-  dimensions = new int[0];
-  rowWriter = loader.writer();
-
-} else {
-  // This is the case when the default path is specified. Since the user 
is explicitly asking for a dataset
-  // Drill can obtain the schema by getting the datatypes below and 
ultimately mapping that schema to columns
-  Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
-  dimensions = dataSet.getDimensions();
-
-  loader = negotiator.build();
-  rowWriter = loader.writer();
-  writerSpec = new WriterSpec(rowWriter, negotiator.providedSchema(),
-  negotiator.parentErrorContext());
-  if (dimensions.length <= 1) {
-buildSchemaFor1DimensionalDataset(dataSet);
-  } else if (dimensions.length == 2) {
-buildSchemaFor2DimensionalDataset(dataSet);
-  } else {
-// Case for datasets of greater than 2D
-// These are automatically flattened
-buildSchemaFor2DimensionalDataset(dataSet);
+{ // Opens an HDF5 file
+  InputStream in = null;
+  try {
+/*
+ * As a possible future improvement, the jhdf reader has the ability 
to read hdf5 files from
+ * a byte array or byte buffer. This implementation is better in that 
it does not require creating
+ * a temporary file which must be deleted later.  However, it could 
result in memory issues in the
+ * event of large files.
+ */
+in = 
file.fileSystem().openPossiblyCompressedStream(file.split().getPath());
+hdfFile = HdfFile.fromInputStream(in);
+reader = new BufferedReader(new InputStreamReader(in));

Review Comment:
   Good catch, Done.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523082#comment-17523082
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623160


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -171,107 +168,109 @@ public HDF5ReaderConfig(HDF5FormatPlugin plugin, 
HDF5FormatConfig formatConfig)
 }
   }
 
-  public HDF5BatchReader(HDF5ReaderConfig readerConfig, int maxRecords) {
-this.readerConfig = readerConfig;
-this.maxRecords = maxRecords;
+  public HDF5BatchReader(HDF5ReaderConfig config, EasySubScan scan, 
FileSchemaNegotiator negotiator) {
+errorContext = negotiator.parentErrorContext();
+file = negotiator.file();
+readerConfig = config;
 dataWriters = new ArrayList<>();
-this.showMetadataPreview = readerConfig.formatConfig.showPreview();
-  }
+showMetadataPreview = readerConfig.formatConfig.showPreview();
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {
-split = negotiator.split();
-errorContext = negotiator.parentErrorContext();
 // Since the HDF file reader uses a stream to actually read the file, the 
file name from the
 // module is incorrect.
-fileName = split.getPath().getName();
-try {
-  openFile(negotiator);
-} catch (IOException e) {
-  throw UserException
-.dataReadError(e)
-.addContext("Failed to close input file: %s", split.getPath())
-.addContext(errorContext)
-.build(logger);
-}
+fileName = file.split().getPath().getName();
 
-ResultSetLoader loader;
-if (readerConfig.defaultPath == null) {
-  // Get file metadata
-  List metadata = getFileMetadata(hdfFile, new 
ArrayList<>());
-  metadataIterator = metadata.iterator();
-
-  // Schema for Metadata query
-  SchemaBuilder builder = new SchemaBuilder()
-.addNullable(PATH_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_TYPE_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(FILE_NAME_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_SIZE_COLUMN_NAME, MinorType.BIGINT)
-.addNullable(IS_LINK_COLUMN_NAME, MinorType.BIT)
-.addNullable(ELEMENT_COUNT_NAME, MinorType.BIGINT)
-.addNullable(DATASET_DATA_TYPE_NAME, MinorType.VARCHAR)
-.addNullable(DIMENSIONS_FIELD_NAME, MinorType.VARCHAR);
-
-  negotiator.tableSchema(builder.buildSchema(), false);
-
-  loader = negotiator.build();
-  dimensions = new int[0];
-  rowWriter = loader.writer();
-
-} else {
-  // This is the case when the default path is specified. Since the user 
is explicitly asking for a dataset
-  // Drill can obtain the schema by getting the datatypes below and 
ultimately mapping that schema to columns
-  Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
-  dimensions = dataSet.getDimensions();
-
-  loader = negotiator.build();
-  rowWriter = loader.writer();
-  writerSpec = new WriterSpec(rowWriter, negotiator.providedSchema(),
-  negotiator.parentErrorContext());
-  if (dimensions.length <= 1) {
-buildSchemaFor1DimensionalDataset(dataSet);
-  } else if (dimensions.length == 2) {
-buildSchemaFor2DimensionalDataset(dataSet);
-  } else {
-// Case for datasets of greater than 2D
-// These are automatically flattened
-buildSchemaFor2DimensionalDataset(dataSet);
+{ // Opens an HDF5 file
+  InputStream in = null;
+  try {
+/*
+ * As a possible future improvement, the jhdf reader has the ability 
to read hdf5 files from
+ * a byte array or byte buffer. This implementation is better in that 
it does not require creating
+ * a temporary file which must be deleted later.  However, it could 
result in memory issues in the
+ * event of large files.
+ */
+in = 
file.fileSystem().openPossiblyCompressedStream(file.split().getPath());
+hdfFile = HdfFile.fromInputStream(in);
+reader = new BufferedReader(new InputStreamReader(in));
+  } catch (IOException e) {
+throw UserException
+  .dataReadError(e)
+  .message("Failed to open input file: %s", file.split().getPath())
+  .addContext(errorContext)
+  .build(logger);
+  } finally {
+AutoCloseables.closeSilently(in);
   }
 }
-if (readerConfig.defaultPath == null) {
-  pathWriter = rowWriter.scalar(PATH_COLUMN_NAME);
-  dataTypeWriter = rowWriter.scalar(DATA_TYPE_COLUMN_NAME);
-  fileNameWriter = rowWriter.scalar(FILE_NAME_COLUMN_NAME);
-  dataSizeWriter = rowWriter.scalar(DATA_SIZE_COLUMN_NAME);
-  linkWriter = rowWriter.scalar(IS_LINK_COLUMN_NAME);
-  

[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521460#comment-17521460
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

paul-rogers commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r849083600


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -171,107 +168,109 @@ public HDF5ReaderConfig(HDF5FormatPlugin plugin, 
HDF5FormatConfig formatConfig)
 }
   }
 
-  public HDF5BatchReader(HDF5ReaderConfig readerConfig, int maxRecords) {
-this.readerConfig = readerConfig;
-this.maxRecords = maxRecords;
+  public HDF5BatchReader(HDF5ReaderConfig config, EasySubScan scan, 
FileSchemaNegotiator negotiator) {
+errorContext = negotiator.parentErrorContext();
+file = negotiator.file();
+readerConfig = config;
 dataWriters = new ArrayList<>();
-this.showMetadataPreview = readerConfig.formatConfig.showPreview();
-  }
+showMetadataPreview = readerConfig.formatConfig.showPreview();
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {
-split = negotiator.split();
-errorContext = negotiator.parentErrorContext();
 // Since the HDF file reader uses a stream to actually read the file, the 
file name from the
 // module is incorrect.
-fileName = split.getPath().getName();
-try {
-  openFile(negotiator);
-} catch (IOException e) {
-  throw UserException
-.dataReadError(e)
-.addContext("Failed to close input file: %s", split.getPath())
-.addContext(errorContext)
-.build(logger);
-}
+fileName = file.split().getPath().getName();
 
-ResultSetLoader loader;
-if (readerConfig.defaultPath == null) {
-  // Get file metadata
-  List metadata = getFileMetadata(hdfFile, new 
ArrayList<>());
-  metadataIterator = metadata.iterator();
-
-  // Schema for Metadata query
-  SchemaBuilder builder = new SchemaBuilder()
-.addNullable(PATH_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_TYPE_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(FILE_NAME_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_SIZE_COLUMN_NAME, MinorType.BIGINT)
-.addNullable(IS_LINK_COLUMN_NAME, MinorType.BIT)
-.addNullable(ELEMENT_COUNT_NAME, MinorType.BIGINT)
-.addNullable(DATASET_DATA_TYPE_NAME, MinorType.VARCHAR)
-.addNullable(DIMENSIONS_FIELD_NAME, MinorType.VARCHAR);
-
-  negotiator.tableSchema(builder.buildSchema(), false);
-
-  loader = negotiator.build();
-  dimensions = new int[0];
-  rowWriter = loader.writer();
-
-} else {
-  // This is the case when the default path is specified. Since the user 
is explicitly asking for a dataset
-  // Drill can obtain the schema by getting the datatypes below and 
ultimately mapping that schema to columns
-  Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
-  dimensions = dataSet.getDimensions();
-
-  loader = negotiator.build();
-  rowWriter = loader.writer();
-  writerSpec = new WriterSpec(rowWriter, negotiator.providedSchema(),
-  negotiator.parentErrorContext());
-  if (dimensions.length <= 1) {
-buildSchemaFor1DimensionalDataset(dataSet);
-  } else if (dimensions.length == 2) {
-buildSchemaFor2DimensionalDataset(dataSet);
-  } else {
-// Case for datasets of greater than 2D
-// These are automatically flattened
-buildSchemaFor2DimensionalDataset(dataSet);
+{ // Opens an HDF5 file

Review Comment:
   Is the nested block necessary? It is handy when we want to reuse variable 
names (as sometimes occurs in tests), but is it needed here?



##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -410,8 +390,22 @@ public boolean next() {
 return true;
   }
 
+  @Override
+  public void close() {
+AutoCloseables.closeSilently(hdfFile, reader);
+/*
+ * The current implementation of the HDF5 reader creates a temp file which 
needs to be removed
+ * when the batch reader is closed. A possible future functionality might 
be to used to.

Review Comment:
   seems an incomplete comment: "used to" do what?



##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -410,8 +390,22 @@ public boolean next() {
 return true;
   }
 
+  @Override
+  public void close() {
+AutoCloseables.closeSilently(hdfFile, reader);
+/*
+ * The current implementation of the HDF5 reader creates a temp file which 
needs to be removed
+ * when the batch reader is closed. A possible future functionality might 
be to used to.
+ */
+boolean result = hdfFile.getFile().delete();

Review Comment:
   There was a comment above about reading 

[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516860#comment-17516860
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong opened a new pull request, #2515:
URL: https://github.com/apache/drill/pull/2515

   # [DRILL-8188](https://issues.apache.org/jira/browse/DRILL-8188): Convert 
HDF5 format to EVF2
   
   ## Description
   Use EVF V2 instead of old V1.
   
   Included two bug fixes in the V2 framework:
   
   1. An error where an unprojected column was projected for an array object
   2. An IndexOutOfBoundsException when adding a column
   
   ## Documentation
   N/A
   
   ## Testing
   
   1. Use the CI.
   2. Add the tests for the bugfix. (To-Do)
   




> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)