date:20200217

Re: [DISCUSS] Schema queries - solutions?

2020-02-17 Thread Paul Rogers

Hi Igor,

Thanks! I should have remembered that bit of SQL.

Yes, if we can generalize `DESCRIBE`, we could create another path of some kind 
through the plugins that says, "return schema, not data."

Then, for the HDF5 use case we could have:

DESCRIBE TABLE `dfs`.`myFile.hdf5` -- returns schema


And

SELECT * FROM `dfs`.`myFile.hdf5` -- returns data


Nice solution! I'll file a feature request.


The next interesting bit about HDF5 is that it is a file system: it contains 
multiple data sets. It would be great to be able to express that in the FROM 
clause:

SELECT * FROM `dfs`.`myFile.hdf5`.`dataSet1`

From my random walks through Calcite, it appears that we can have any level of 
schema/table path. True? We'd need some way to resolve a name part to a file, 
then ask the format plugin for that file if it supports additional parts. This 
seems pretty obscure. Have we done anything like that before? Maybe in a storage 
(rather than format) plugin?

Thanks,
- Paul

 

On Monday, February 17, 2020, 11:34:48 PM PST, Igor Guzenko 
 wrote:  
 
 Hello Paul,

Seems like we simply need to improve our DESCRIBE [1] table functionality.

[1] https://drill.apache.org/docs/describe/

Thanks,
Igor


[GitHub] [drill] paul-rogers commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files

2020-02-17 Thread GitBox

paul-rogers commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-587322871
 
 
   @cgivre, one more design-level comment about this particular file format. 
You've mentioned several times that HDF5 is "a file system within a file." It 
finally clicked: we need to treat this file as a directory, not a file. 
This means adding a layer of schema in Calcite planning:
   
   ```
   SELECT * FROM `dfs`.`some/path/myFile.hdf5`.`dataSet1`
   ```
   
   This would let the reader load only data from `dataSet1`, using only the 
schema from that data set.
   
   (Can't use slashes; that is a notation for the Hadoop file system.)
   
   Fortunately, Calcite seems to allow any number of schema levels. It is why 
we can have plugins, workspaces, etc. The challenge is to provide some way for 
a format plugin to influence the planner and say, "hey, if you do a query 
against me, ask me to resolve all path elements below my file name."
   
   Again, not something for this PR. But, it is something we can think about as 
we try to improve our storage plugin API.
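
A rough sketch of the name-resolution idea, in plain Java with no Calcite or Drill classes. Everything here is hypothetical: `ResolvedTable`, `TablePathSketch.resolve()` and the suffix check stand in for "ask the format plugin whether it accepts path elements below its file name", and the sketch assumes the simplest layout of plugin, then file, then optional data set names (no workspace level).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical result of splitting a multi-part table name. */
final class ResolvedTable {
  final String plugin;         // e.g. "dfs"
  final String file;           // e.g. "some/path/myFile.hdf5"
  final List<String> subPath;  // name parts the format plugin must resolve itself

  ResolvedTable(String plugin, String file, List<String> subPath) {
    this.plugin = plugin;
    this.file = file;
    this.subPath = subPath;
  }
}

final class TablePathSketch {
  /**
   * Splits name parts into plugin, file, and the remainder.
   * The suffix test is an assumed stand-in for asking the format plugin
   * whether it supports additional parts below the file name.
   */
  static ResolvedTable resolve(List<String> parts, String containerSuffix) {
    if (parts.size() < 2) {
      throw new IllegalArgumentException("Need at least a plugin and a file name");
    }
    String plugin = parts.get(0);
    String file = parts.get(1);
    List<String> subPath = new ArrayList<>(parts.subList(2, parts.size()));
    if (!subPath.isEmpty() && !file.endsWith(containerSuffix)) {
      throw new IllegalArgumentException("Format does not support names below the file: " + file);
    }
    return new ResolvedTable(plugin, file, subPath);
  }
}
```

With the query above, the parts `dfs`, `some/path/myFile.hdf5`, `dataSet1` would resolve to the file plus a one-element sub-path that is handed to the reader.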


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

Re: [DISCUSS] Schema queries - solutions?

2020-02-17 Thread Igor Guzenko

Hello Paul,

Seems like we simply need to improve our DESCRIBE [1] table functionality.

[1] https://drill.apache.org/docs/describe/

Thanks,
Igor


[DISCUSS] Schema queries - solutions?

2020-02-17 Thread Paul Rogers

Hi All,

Charles has a little PR,  #1978, that adds a convenient feature to his HDF5 
format reader: the ability to query the schema of the file. (It seems that HDF5 
is a bit like a zip file: it contains a set of files. Unlike zip, each file is 
a data set with a schema.) Charles added a clever way to tell the reader that 
the user wants a schema rather than data.

If we think a bit, we realize that a schema query would be handy for any data 
source. Maybe I want to know the fields in a JSON or Parquet file without 
getting the data for those fields (and, for example, inferring type and 
nullability from data.)

In a relational DB, we'd get the schema by querying system tables. We'd do the 
same thing in Hive because Hive requires an up-front schema. But, Drill is 
unique in that it can infer schema at run time; no previous schema required. 
So, we have no system tables to answer schema questions. Instead, we want to 
get the schema directly from the data source itself by doing a query.

(This feature would be in addition to the case when the Metastore does hold a 
schema.)


How might we accomplish the same result? Can we create some kind of "virtual" 
system table that tells us to rewrite the query to get schema? Something like:

SELECT * FROM sys.columns WHERE tableName = `dfs`.`my/path/someFile.json`

Or, maybe some implied columns in the table schema?


SELECT schema.* FROM `dfs`.`my/path/someFile.json`


Or, maybe some special schema name space?

SELECT schema.* FROM schema.`dfs`.`my/path/someFile.json`


Anyone know of any system that has an elegant solution we could mimic? Other 
suggestions?


Thanks,
- Paul

[GitHub] [drill] paul-rogers commented on a change in pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files

2020-02-17 Thread GitBox

paul-rogers commented on a change in pull request #1978: DRILL-7578: HDF5 
Metadata Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380488669
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -1069,26 +1125,30 @@ private void getAndMapCompoundData(String path, 
List fieldNames, IHDF5Re
   for (int col = 0; col < values[row].length; col++) {
 assert fieldNames != null;
 currentFieldName = fieldNames.get(col);
-ArrayWriter innerWriter = listWriter.array(currentFieldName);
-if (values[row][col] instanceof Integer) {
-  innerWriter.scalar().setInt((Integer) values[row][col]);
-} else if (values[row][col] instanceof Short) {
-  innerWriter.scalar().setInt((Short) values[row][col]);
-} else if (values[row][col] instanceof Byte) {
-  innerWriter.scalar().setInt((Byte) values[row][col]);
-} else if (values[row][col] instanceof Long) {
-  innerWriter.scalar().setLong((Long) values[row][col]);
-} else if (values[row][col] instanceof Float) {
-  innerWriter.scalar().setDouble((Float) values[row][col]);
-} else if (values[row][col] instanceof Double) {
-  innerWriter.scalar().setDouble((Double) values[row][col]);
-} else if (values[row][col] instanceof BitSet || values[row][col] 
instanceof Boolean) {
-  innerWriter.scalar().setBoolean((Boolean) values[row][col]);
-} else if (values[row][col] instanceof String) {
-  innerWriter.scalar().setString((String) values[row][col]);
-}
-if (col == values[row].length) {
-  innerWriter.save();
+try {
+  ArrayWriter innerWriter = listWriter.array(currentFieldName);
+  if (values[row][col] instanceof Integer) {
 
 Review comment:
   I realize that this is existing code, but boxing and comparing each value 
will be slow and will thrash the heap. Far better if we can use "shims" that 
can read the data as the Java primitive type and write it directly to the 
corresponding `set()` method without boxing.
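
To make the "shim" suggestion concrete, here is a minimal sketch in plain Java, with no Drill or HDF5 classes. All names are hypothetical: `ColumnShim`, `IntColumnShim` and `DoubleColumnShim` are illustrations, and `IntConsumer`/`DoubleConsumer` stand in for a scalar writer's `setInt()`/`setDouble()`. The idea is that the type test happens once per column, not once per value, and the values stay primitive throughout.

```java
import java.util.function.DoubleConsumer;
import java.util.function.IntConsumer;

/** Hypothetical per-column shim: chosen once per column, reused for every row. */
interface ColumnShim {
  void copy(int row);
}

/** Copies an int column without boxing; the IntConsumer stands in for setInt(). */
final class IntColumnShim implements ColumnShim {
  private final int[] source;
  private final IntConsumer sink;

  IntColumnShim(int[] source, IntConsumer sink) {
    this.source = source;
    this.sink = sink;
  }

  @Override
  public void copy(int row) {
    sink.accept(source[row]);  // primitive in, primitive out
  }
}

/** Same idea for doubles; the DoubleConsumer stands in for setDouble(). */
final class DoubleColumnShim implements ColumnShim {
  private final double[] source;
  private final DoubleConsumer sink;

  DoubleColumnShim(double[] source, DoubleConsumer sink) {
    this.source = source;
    this.sink = sink;
  }

  @Override
  public void copy(int row) {
    sink.accept(source[row]);
  }
}

final class ShimSketch {
  /** Picks the shim from the array's runtime type: one check per column. */
  static ColumnShim shimFor(Object column, IntConsumer intSink, DoubleConsumer doubleSink) {
    if (column instanceof int[]) {
      return new IntColumnShim((int[]) column, intSink);
    } else if (column instanceof double[]) {
      return new DoubleColumnShim((double[]) column, doubleSink);
    }
    throw new IllegalArgumentException("Unsupported column type: " + column.getClass());
  }

  /** Tiny driver: pushes every value of an int column through its shim. */
  static long sumInts(int[] data) {
    long[] total = new long[1];
    ColumnShim shim = shimFor(data, v -> total[0] += v, v -> { });
    for (int row = 0; row < data.length; row++) {
      shim.copy(row);
    }
    return total[0];
  }
}
```

Whether the HDF5 library can hand back primitive arrays for compound data is something to check; the sketch only shows the shape of the write side.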



[GitHub] [drill] paul-rogers commented on a change in pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files

2020-02-17 Thread GitBox

paul-rogers commented on a change in pull request #1978: DRILL-7578: HDF5 
Metadata Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380490693
 
 

 ##
 File path: 
contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Format.java
 ##
 @@ -98,9 +98,9 @@ public void testStarQuery() throws Exception {
 
 testBuilder()
   .sqlQuery("SELECT * FROM dfs.`hdf5/dset.h5`")
-  .unOrdered()
-  .baselineColumns("path", "data_type", "file_name", "int_data")
-  .baselineValues("/dset", "DATASET", "dset.h5", finalList)
+  .ordered()
 
 Review comment:
   This might be the place to ask the question about schema. We have two 
distinct views of a data set. The general rule of the wildcard (`*`) is to 
return all available columns. Here, we special-case wildcard to mean "return 
metadata." This is, unfortunately, quite nonstandard.
   
   We need some way to express two views of the file. The same problem occurs 
for any database. We could even use it for JSON, CSV and other file formats.
   
   The challenge is, how do we tell the query we want metadata and not data? In 
a normal DB, we query system tables. Perhaps we could jimmy up something in 
Drill:
   
   ```
   SELECT * FROM sys.schema.dfs.`hdf5/dset.h5`
   ```
   
   Or, maybe think of the table as a namespace, and have an optional `.schema` 
tail:
   
   ```
   SELECT * FROM dfs.`hdf5/dset.h5`.schema
   ```
   
   The point is not for you to implement this, or even to design the solution. 
Rather, the point is that the current solution is a hack, and that we need a 
better solution.



[GitHub] [drill] paul-rogers commented on a change in pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files

2020-02-17 Thread GitBox

paul-rogers commented on a change in pull request #1978: DRILL-7578: HDF5 
Metadata Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380487624
 
 

 ##
 File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
 ##
 @@ -92,6 +93,20 @@
 
   private static final String LONG_COLUMN_NAME = "long_data";
 
+  private static final String DATA_SIZE_COLUMN_NAME = "data_size";
+
+  private static final String ELEMENT_COUNT_NAME = "element_count";
+
+  private static final String IS_TIMESTAMP_NAME = "is_timestamp";
 
 Review comment:
   The two `is` columns appear mutually exclusive. I wonder, does it make sense 
to define an `extended_type` column if `data_type` is the Drill type? That is, 
for most columns, `extended_type` would be null. For these two it would be, say 
`TIMESTAMP` or `TIME_DURATION`. Though, truth be told, Drill has `TIMESTAMP` 
and `INTERVAL` columns, so if we mapped the HDF5 type to these Drill types, we 
would not need the extended type (or these two Boolean columns).
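
As a small illustration of folding the two Boolean columns into one nullable column, here is a sketch in plain Java. The class and method names are hypothetical, and the string values are just the ones floated above; nothing here is taken from the reader's actual code.

```java
final class ExtendedTypeSketch {
  /**
   * Collapses two mutually exclusive flags into a single nullable value.
   * Returns null for the common case where neither flag applies.
   */
  static String extendedType(boolean isTimestamp, boolean isTimeDuration) {
    if (isTimestamp && isTimeDuration) {
      throw new IllegalArgumentException("Flags are expected to be mutually exclusive");
    }
    if (isTimestamp) {
      return "TIMESTAMP";
    }
    if (isTimeDuration) {
      return "TIME_DURATION";
    }
    return null;  // most columns: no extended type
  }
}
```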



[GitHub] [drill] paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions iss

2020-02-17 Thread GitBox

paul-rogers commented on a change in pull request #1987: DRILL-7589: Set 
temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in 
BloomFilterTest, fix permissions issue for TestGracefulShutdown tests
URL: https://github.com/apache/drill/pull/1987#discussion_r380485944
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/test/TestGracefulShutdown.java
 ##
 @@ -262,17 +265,15 @@ private boolean 
waitAndAssertDrillbitCount(ClusterFixture cluster, int zkRefresh
   }
 
   private static void setupFile(int file_num) throws Exception {
-final String file = "employee"+file_num+".json";
-final Path path = dirTestWatcher.getRootDir().toPath().resolve(file);
-try(PrintWriter out = new PrintWriter(new BufferedWriter(new 
FileWriter(path.toFile(), true)))) {
+String file = "employee" + file_num + ".json";
+Path path = dirTestWatcher.getRootDir().toPath().resolve(file);
+try (PrintWriter out = new PrintWriter(new BufferedWriter(new 
FileWriter(path.toFile(), true)))) {
 
 Review comment:
   I realize the code here is original; but it might be a bit cleaner to put 
the data in a resource file than in Java.



[GitHub] [drill] paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions iss

2020-02-17 Thread GitBox

paul-rogers commented on a change in pull request #1987: DRILL-7589: Set 
temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in 
BloomFilterTest, fix permissions issue for TestGracefulShutdown tests
URL: https://github.com/apache/drill/pull/1987#discussion_r380485298
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/work/filter/BloomFilterTest.java
 ##
 @@ -133,214 +135,227 @@ public boolean hasFailed() {
 }
   }
 
-
   @Test
   public void testNotExist() throws Exception {
-Drillbit bit = new Drillbit(c, RemoteServiceSet.getLocalServiceSet(), 
ClassPathScanner.fromPrescan(c));
-bit.run();
-DrillbitContext bitContext = bit.getContext();
-FunctionImplementationRegistry registry = 
bitContext.getFunctionImplementationRegistry();
-FragmentContextImpl context = new FragmentContextImpl(bitContext, 
BitControl.PlanFragment.getDefaultInstance(), null, registry);
-BufferAllocator bufferAllocator = bitContext.getAllocator();
-//create RecordBatch
-VarCharVector vector = new VarCharVector(SchemaBuilder.columnSchema("a", 
TypeProtos.MinorType.VARCHAR, TypeProtos.DataMode.REQUIRED), bufferAllocator);
-vector.allocateNew();
-int valueCount = 3;
-VarCharVector.Mutator mutator = vector.getMutator();
-mutator.setSafe(0, "a".getBytes());
-mutator.setSafe(1, "b".getBytes());
-mutator.setSafe(2, "c".getBytes());
-mutator.setValueCount(valueCount);
-VectorContainer vectorContainer = new VectorContainer();
-TypedFieldId fieldId = vectorContainer.add(vector);
-RecordBatch recordBatch = new TestRecordBatch(vectorContainer);
-//construct hash64
-ValueVectorReadExpression exp = new ValueVectorReadExpression(fieldId);
-LogicalExpression[] expressions = new LogicalExpression[1];
-expressions[0] = exp;
-TypedFieldId[] fieldIds = new TypedFieldId[1];
-fieldIds[0] = fieldId;
-ValueVectorHashHelper valueVectorHashHelper = new 
ValueVectorHashHelper(recordBatch, context);
-ValueVectorHashHelper.Hash64 hash64 = 
valueVectorHashHelper.getHash64(expressions, fieldIds);
-
-//construct BloomFilter
-int numBytes = BloomFilter.optimalNumOfBytes(3, 0.03);
-
-BloomFilter bloomFilter = new BloomFilter(numBytes, bufferAllocator);
-for (int i = 0; i < valueCount; i++) {
-  long hashCode = hash64.hash64Code(i, 0, 0);
-  bloomFilter.insert(hashCode);
+int userPort = QueryTestUtil.getFreePortNumber(31170, 300);
+int bitPort = QueryTestUtil.getFreePortNumber(31180, 300);
+ClusterFixtureBuilder clusterFixtureBuilder = 
ClusterFixture.bareBuilder(dirTestWatcher)
+.configProperty(ExecConstants.INITIAL_USER_PORT, userPort)
+.configProperty(ExecConstants.INITIAL_BIT_PORT, bitPort)
+.configProperty(ExecConstants.ALLOW_LOOPBACK_ADDRESS_BINDING, true);
+try (ClusterFixture cluster = clusterFixtureBuilder.build()) {
+  Drillbit bit = cluster.drillbit();
+  DrillbitContext bitContext = bit.getContext();
+  FunctionImplementationRegistry registry = 
bitContext.getFunctionImplementationRegistry();
+  FragmentContextImpl context = new FragmentContextImpl(bitContext, 
BitControl.PlanFragment.getDefaultInstance(), null, registry);
+  BufferAllocator bufferAllocator = bitContext.getAllocator();
+  //create RecordBatch
+  VarCharVector vector = new VarCharVector(SchemaBuilder.columnSchema("a", 
TypeProtos.MinorType.VARCHAR, TypeProtos.DataMode.REQUIRED), bufferAllocator);
+  vector.allocateNew();
+  int valueCount = 3;
+  VarCharVector.Mutator mutator = vector.getMutator();
+  mutator.setSafe(0, "a".getBytes());
+  mutator.setSafe(1, "b".getBytes());
+  mutator.setSafe(2, "c".getBytes());
+  mutator.setValueCount(valueCount);
+  VectorContainer vectorContainer = new VectorContainer();
+  TypedFieldId fieldId = vectorContainer.add(vector);
+  RecordBatch recordBatch = new TestRecordBatch(vectorContainer);
+  //construct hash64
+  ValueVectorReadExpression exp = new ValueVectorReadExpression(fieldId);
+  LogicalExpression[] expressions = new LogicalExpression[1];
+  expressions[0] = exp;
+  TypedFieldId[] fieldIds = new TypedFieldId[1];
+  fieldIds[0] = fieldId;
+  ValueVectorHashHelper valueVectorHashHelper = new 
ValueVectorHashHelper(recordBatch, context);
+  ValueVectorHashHelper.Hash64 hash64 = 
valueVectorHashHelper.getHash64(expressions, fieldIds);
+
+  //construct BloomFilter
+  int numBytes = BloomFilter.optimalNumOfBytes(3, 0.03);
+
+  BloomFilter bloomFilter = new BloomFilter(numBytes, bufferAllocator);
+  for (int i = 0; i < valueCount; i++) {
+long hashCode = hash64.hash64Code(i, 0, 0);
+bloomFilter.insert(hashCode);
+  }
+
+  //-create probe side RecordBatch-
+  VarCharVector probeVector = new 
VarCharVector(SchemaBuilder.columnSchema("a",

[GitHub] [drill] paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions iss

2020-02-17 Thread GitBox

paul-rogers commented on a change in pull request #1987: DRILL-7589: Set 
temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in 
BloomFilterTest, fix permissions issue for TestGracefulShutdown tests
URL: https://github.com/apache/drill/pull/1987#discussion_r380481139
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/work/filter/BloomFilterTest.java
 ##
 @@ -133,214 +135,227 @@ public boolean hasFailed() {
 }
   }
 
-
   @Test
   public void testNotExist() throws Exception {
-Drillbit bit = new Drillbit(c, RemoteServiceSet.getLocalServiceSet(), 
ClassPathScanner.fromPrescan(c));
-bit.run();
-DrillbitContext bitContext = bit.getContext();
-FunctionImplementationRegistry registry = 
bitContext.getFunctionImplementationRegistry();
-FragmentContextImpl context = new FragmentContextImpl(bitContext, 
BitControl.PlanFragment.getDefaultInstance(), null, registry);
-BufferAllocator bufferAllocator = bitContext.getAllocator();
-//create RecordBatch
-VarCharVector vector = new VarCharVector(SchemaBuilder.columnSchema("a", 
TypeProtos.MinorType.VARCHAR, TypeProtos.DataMode.REQUIRED), bufferAllocator);
-vector.allocateNew();
-int valueCount = 3;
-VarCharVector.Mutator mutator = vector.getMutator();
-mutator.setSafe(0, "a".getBytes());
-mutator.setSafe(1, "b".getBytes());
-mutator.setSafe(2, "c".getBytes());
-mutator.setValueCount(valueCount);
-VectorContainer vectorContainer = new VectorContainer();
-TypedFieldId fieldId = vectorContainer.add(vector);
-RecordBatch recordBatch = new TestRecordBatch(vectorContainer);
-//construct hash64
-ValueVectorReadExpression exp = new ValueVectorReadExpression(fieldId);
-LogicalExpression[] expressions = new LogicalExpression[1];
-expressions[0] = exp;
-TypedFieldId[] fieldIds = new TypedFieldId[1];
-fieldIds[0] = fieldId;
-ValueVectorHashHelper valueVectorHashHelper = new 
ValueVectorHashHelper(recordBatch, context);
-ValueVectorHashHelper.Hash64 hash64 = 
valueVectorHashHelper.getHash64(expressions, fieldIds);
-
-//construct BloomFilter
-int numBytes = BloomFilter.optimalNumOfBytes(3, 0.03);
-
-BloomFilter bloomFilter = new BloomFilter(numBytes, bufferAllocator);
-for (int i = 0; i < valueCount; i++) {
-  long hashCode = hash64.hash64Code(i, 0, 0);
-  bloomFilter.insert(hashCode);
+int userPort = QueryTestUtil.getFreePortNumber(31170, 300);
+int bitPort = QueryTestUtil.getFreePortNumber(31180, 300);
+ClusterFixtureBuilder clusterFixtureBuilder = 
ClusterFixture.bareBuilder(dirTestWatcher)
 
 Review comment:
   Do you need a full cluster for this? There is a `SubOperatorTest` that will 
give you a fragment context and allocator so you can create vectors and invoke 
"sub-operator" functionality such as the BloomFilter stuff.
   
   If any of the code under test needs the `DrillbitContext`, perhaps look at 
modifying it so that it doesn't. There is nothing a Bloom filter should need.



[GitHub] [drill] paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions iss

2020-02-17 Thread GitBox

paul-rogers commented on a change in pull request #1987: DRILL-7589: Set 
temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in 
BloomFilterTest, fix permissions issue for TestGracefulShutdown tests
URL: https://github.com/apache/drill/pull/1987#discussion_r380479013
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/udf/dynamic/TestDynamicUDFSupport.java
 ##
 @@ -104,8 +104,10 @@ public static void buildAndStoreDefaultJars() throws 
IOException {
   @Before
   public void setupNewDrillbit() throws Exception {
 udfDir = dirTestWatcher.makeSubDir(Paths.get("udf"));
+File udfLocalDir = dirTestWatcher.makeSubDir(Paths.get("udf", "local"));
 Properties overrideProps = new Properties();
 overrideProps.setProperty(ExecConstants.UDF_DIRECTORY_ROOT, 
udfDir.getAbsolutePath());
+overrideProps.setProperty(ExecConstants.UDF_DIRECTORY_LOCAL, 
udfLocalDir.getAbsolutePath());
 
 Review comment:
   We've got lots of local directory properties. Hard to keep them all in sync. 
I wonder if we can use a feature of HOCON to default them to a known structure:
   
   ```
   exec: {
     ...
     local: {
       baseDir: "/tmp/drill",
       udfDir: ${drill.exec.local.baseDir}"/udf",
       pluginDir: ${drill.exec.local.baseDir}"/plugins",
       ...
     }
   },
   ```
   
   Probably some setup to do in the `ClusterFixture` and `DirTestWatcher` to 
get everything set up.



[GitHub] [drill] paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions iss

2020-02-17 Thread GitBox

paul-rogers commented on a change in pull request #1987: DRILL-7589: Set 
temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in 
BloomFilterTest, fix permissions issue for TestGracefulShutdown tests
URL: https://github.com/apache/drill/pull/1987#discussion_r380485678
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/test/TestGracefulShutdown.java
 ##
 @@ -73,31 +73,39 @@ private static void 
enableDrillPortHunting(ClusterFixtureBuilder builder) {
 builder.configBuilder.put(ExecConstants.DRILL_PORT_HUNT, true);
 builder.configBuilder.put(ExecConstants.GRACE_PERIOD, 500);
 builder.configBuilder.put(ExecConstants.ALLOW_LOOPBACK_ADDRESS_BINDING, 
true);
+
+setTestDirectories(builder);
+  }
+
+  private static void setTestDirectories(ClusterFixtureBuilder builder) {
+builder.configBuilder.put(ExecConstants.DRILL_TMP_DIR, 
dirTestWatcher.getTmpDir().getAbsolutePath());
+builder.configBuilder.put(ExecConstants.SYS_STORE_PROVIDER_LOCAL_PATH, 
dirTestWatcher.getStoreDir().getAbsolutePath());
 
 Review comment:
   Can this be done in `ClusterFixture` or its builder so we use a consistent 
set of directories everywhere? I've been burned by these being a bit 
ill-defined.



[GitHub] [drill] paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions iss

2020-02-17 Thread GitBox

paul-rogers commented on a change in pull request #1987: DRILL-7589: Set 
temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in 
BloomFilterTest, fix permissions issue for TestGracefulShutdown tests
URL: https://github.com/apache/drill/pull/1987#discussion_r380476183
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/udf/dynamic/TestDynamicUDFSupport.java
 ##
 @@ -104,8 +104,10 @@ public static void buildAndStoreDefaultJars() throws 
IOException {
   @Before
   public void setupNewDrillbit() throws Exception {
 udfDir = dirTestWatcher.makeSubDir(Paths.get("udf"));
+File udfLocalDir = dirTestWatcher.makeSubDir(Paths.get("udf", "local"));
 
 Review comment:
   The `DirTestWatcher` has internal support for each of Drill's working 
directories. Might we want to add another directory for UDF files?



[GitHub] [drill] paul-rogers commented on issue #1971: DRILL-7572: JSON structure parser

2020-02-17 Thread GitBox

paul-rogers commented on issue #1971: DRILL-7572: JSON structure parser
URL: https://github.com/apache/drill/pull/1971#issuecomment-587300916
 
 
   @vvysotskyi, thanks for pointing out the question; I missed it when reading 
the code comments.
   
   Looked at `CountingJsonReader`. It looks like it creates a series of rows, one per 
input row, with just a bit field set to 1. This reader could do exactly the 
same by projecting none of the columns and instead writing that bit = 1 value 
for the start of each top-level object. The non-projected columns will 
"free-wheel" over the incoming JSON.
   
   A better solution is to actually return the count. Maybe we need another 
option on the format plugin, `supportsCountPushDown()`, so that we return the 
per-file row count rather than grind through the effort of making trivial rows.
   
   EVF has support for this idea with its notion of "project none", which occurs 
when the scan asks for no columns, as in a `COUNT(*)`.
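
To show the difference between the two approaches, here is a toy sketch in plain Java. It is not Drill code: `CountableFormatSketch`, `SketchFormat` and `CountSketch` are hypothetical, and only the method name `supportsCountPushDown()` comes from the suggestion above. A list of strings stands in for the records of one file.

```java
import java.util.List;

/** Hypothetical format-plugin option: can the format report a row count directly? */
interface CountableFormatSketch {
  boolean supportsCountPushDown();

  long rowCount(List<String> records);
}

/** Toy format whose push-down support is chosen at construction time. */
final class SketchFormat implements CountableFormatSketch {
  private final boolean pushDown;

  SketchFormat(boolean pushDown) {
    this.pushDown = pushDown;
  }

  @Override
  public boolean supportsCountPushDown() {
    return pushDown;
  }

  @Override
  public long rowCount(List<String> records) {
    return records.size();  // stands in for a per-file row count
  }
}

final class CountSketch {
  /** One trivial row (bit = 1) per input record, summed afterwards. */
  static long countByTrivialRows(List<String> records) {
    long count = 0;
    for (String ignored : records) {
      int bit = 1;
      count += bit;
    }
    return count;
  }

  /** Uses the direct count when the format offers it, else falls back to trivial rows. */
  static long count(CountableFormatSketch format, List<String> records) {
    return format.supportsCountPushDown()
        ? format.rowCount(records)
        : countByTrivialRows(records);
  }
}
```

Both paths return the same number; the push-down path simply skips producing the intermediate rows.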



[GitHub] [drill] paul-rogers opened a new pull request #1988: DRILL-7590: Refactor plugin registry

2020-02-17 Thread GitBox

paul-rogers opened a new pull request #1988: DRILL-7590: Refactor plugin 
registry
URL: https://github.com/apache/drill/pull/1988
 
 
   Performs a thorough "spring cleaning" of the storage plugin registry to 
prepare it to add a proper plugin API.
   
   This is a complex PR with lots going on.
   
   The plugin registry connects configurations, stored in ZK, with 
implementations, which are Java classes. The registry handles a large number of 
tasks including:
   
   * Populating "bootstrap" plugin configurations and handling upgrades.
   * Reading from, and writing to, the persistent store in ZK.
   * Handling "normal" (configured) plugins and special system plugins (which 
have no configuration).
   * Handling format plugins, which are always associated with the DFS storage 
plugin.
   * Handling "ephemeral" plugins, which correspond to configs not stored in the 
registry.
   * And so on.
   
   The code has grown overly complex. As we look to add a new, cleaner plugin 
mechanism, we will start by cleaning up what we have to allow the new mechanism 
to be one of many.
   
   ## Terminology
   
   There is no more confusing term in Drill than "plugin." That single term can 
mean:
   
   * The stored JSON definition for a plugin config. (What we see in the web 
console.)
   * The config object which holds the configuration.
   * The storage plugin instance with the config attached. This is the 
functional aspect of a plugin.
   * The storage plugin class itself.
   
   To make the following discussion clearer, we redefine terms as:
   
   * *Connector*: the storage plugin class (which needs a config to be useful)
   * *Plugin*: the configuration of a plugin in any of its three forms: JSON, 
config-only, or as part of a connector + config pair.
   
   ## Standard and System Connectors
   
   The registry class handled many tasks itself, making the code hard to 
follow. The first task is to split apart responsibilities into separate 
classes.
   
   The registry handles two kinds of plugins at present:
   
   * "Classic" plugins are those defined by a `StoragePluginConfig` subclass 
and a `StoragePlugin` subclass with a specific constructor. Their configs are 
persistently stored in ZK. These are the storage plugins most of us think about.
   * System plugins are a special case: they are always defined by default, and 
have no (or, actually, an implicit) config. Examples: `sys` and 
`information_schema`. System plugins have the `` annotation, are created at 
boot time, and do not reside in the ZK store.
   
   The first step is to split out these two kinds of plugins into separate 
"provider" classes, along with a common interface. A new `ConnectorProvider` 
interface has two implementations: one for "classic" plugins and another for 
system plugins. Then, when we add the new mechanism, it becomes a third plugin 
provider.
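
As a rough sketch of the provider split described above: `ConnectorProvider` is the interface name from the PR, but the `locate()` method, the two implementation class names, `ProviderChain`, and the use of plain strings in place of real connector and config objects are all assumptions made for illustration. The actual signatures in the PR may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Interface name taken from the PR; the method is a hypothetical simplification.
interface ConnectorProvider {
  // Describes the connector for a plugin name, if this provider owns that name.
  Optional<String> locate(String pluginName);
}

// "Classic" plugins: resolved from stored configs (a map stands in for the ZK store).
final class ClassicConnectorProvider implements ConnectorProvider {
  private final Map<String, String> storedConfigs;

  ClassicConnectorProvider(Map<String, String> storedConfigs) {
    this.storedConfigs = new HashMap<>(storedConfigs);
  }

  @Override
  public Optional<String> locate(String pluginName) {
    String config = storedConfigs.get(pluginName);
    return config == null
        ? Optional.empty()
        : Optional.of("classic:" + pluginName + ":" + config);
  }
}

// System plugins: always defined, with an implicit config and no stored entry.
final class SystemConnectorProvider implements ConnectorProvider {
  private static final Set<String> NAMES =
      new HashSet<>(Arrays.asList("sys", "information_schema"));

  @Override
  public Optional<String> locate(String pluginName) {
    return NAMES.contains(pluginName)
        ? Optional.of("system:" + pluginName)
        : Optional.empty();
  }
}

// The registry side: ask each provider in turn. A third provider is just one more entry.
final class ProviderChain {
  private final List<ConnectorProvider> providers;

  ProviderChain(List<ConnectorProvider> providers) {
    this.providers = new ArrayList<>(providers);
  }

  Optional<String> locate(String pluginName) {
    for (ConnectorProvider provider : providers) {
      Optional<String> found = provider.locate(pluginName);
      if (found.isPresent()) {
        return found;
      }
    }
    return Optional.empty();
  }
}
```

With this shape, adding the new plugin mechanism would mean appending another `ConnectorProvider` to the list rather than touching the registry logic.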
   
   ## Bootstrap and Upgrade
   
   The registry also handles the process of initializing a newly installed 
Drill, or upgrading an existing one. The code for this is pulled out into a 
separate class.
   
   Moved the names of the bootstrap plugins and plugins upgrade files into the 
config system
   to allow easier testing with test-specific files. Added complete unit tests.
   
   ## Plugin Lifecycle
   
   Plugins have a surprisingly robust lifecycle. Revised the code to better 
model the nuances of the
   lifecycle (and fix a number of subtle bugs).
   
   Plugin instances must be created, but only for standard plugins (not system 
plugins). Added a
   `ConnectorHandle` so we can track the source of each connector so that the 
locator can create
   connector instances (for standard plugins) or not (for system plugins.)
   
   Plugins are defined by persistent storage as a (name, config) pair. There is 
no reason to
   create a connector instance just to load plugins from storage. So, added a 
`PluginHandle` class
   to hold onto the (name, config, `ConnectorHandle`) triple.
   
   This handle then allows us to do lazy instantiation of the connector class. 
Rather than creating
   it on load, we wait until some code actually needs the plugin. (Some code 
still demands that we load all plugins; this can be fixed in a later PR.)
   
   The registry API was changed to make this clear. `createOrUpdate()` is 
renamed to `put()` and no longer returns the plugin instance (which, it turned 
out, was never used). Now, we don't create the connector instance until 
`getPlugin()` is called. Added a new `getConfig()` method for the many times we 
only want the config and don't actually need the instance.
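
The lazy instantiation described above might look something like the following minimal sketch. The names `LazyPluginHandle` and `DemoConnector` are hypothetical, a `Function` stands in for the role the PR gives to `ConnectorHandle`, and a string stands in for the config object. Only `getPlugin()` and `getConfig()` are names taken from the description, and here they sit on the handle rather than on the registry.

```java
import java.util.function.Function;

// Hypothetical stand-in for a connector instance.
final class DemoConnector {
  final String config;

  DemoConnector(String config) {
    this.config = config;
  }
}

// Holds the (name, config, factory) triple and creates the connector only on first use.
final class LazyPluginHandle {
  private final String name;
  private final String config;
  private final Function<String, DemoConnector> factory; // plays the ConnectorHandle role
  private DemoConnector connector; // stays null until getPlugin() is first called

  LazyPluginHandle(String name, String config, Function<String, DemoConnector> factory) {
    this.name = name;
    this.config = config;
    this.factory = factory;
  }

  String getName() {
    return name;
  }

  // Config-only access: never instantiates the connector.
  String getConfig() {
    return config;
  }

  // Creates the connector on first call and reuses it afterwards.
  synchronized DemoConnector getPlugin() {
    if (connector == null) {
      connector = factory.apply(config);
    }
    return connector;
  }
}
```

Loading plugins from storage then only builds handles. Code that needs just the config never pays for connector construction.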
   
   Drill is a concurrent, distributed system. Plugin (configurations) can 
change at any time.
   We might change `dfs` while queries run. The registry supports "ephemeral" 
plugins, those
   that occur in a query execution plan, but do not match a name in persistent 
storage.
   
   Previously, ephemeral plugins were not connected to normal named plugins. 
Revised this so that

[GitHub] [drill] cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files

2020-02-17 Thread GitBox

cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with 
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-587259759
 
 
   @paul-rogers @vvysotskyi 
   See above comment.  I removed the config option and added logger warnings if 
the data is truncated.  Again, this is just for "preview" mode so real data 
queries are not affected.  
   In doing this PR, I discovered that the HDF5 format allows for arrays within 
compound fields. 
   
   This functionality is not supported by Drill so I added a warning for that.  
In the future, or if anyone asks for it, I may add it but for now, I'm leaving 
that alone.  



[GitHub] [drill] vvysotskyi opened a new pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions issue for TestGr

2020-02-17 Thread GitBox

vvysotskyi opened a new pull request #1987: DRILL-7589: Set temporary tests 
folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix 
permissions issue for TestGracefulShutdown tests
URL: https://github.com/apache/drill/pull/1987
 
 
   # [DRILL-7589](https://issues.apache.org/jira/browse/DRILL-7589): Set 
temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in 
BloomFilterTest, fix permissions issue for TestGracefulShutdown tests
   
   ## Description
   
   Initially, `UDF_DIRECTORY_LOCAL` had a default value for tests and was set to 
`/tmp/drill/udf/udf/local`. Changed its value to refer to the test directory. 
Hopefully this will help to fix the CI failures.
   
   Fixed the following error for `TestGracefulShutdown` tests (it was only 
logged; the tests still passed):
   ```
   Unable to store data for the path 
[file:/var/log/drill/profiles/21b7ceae-680b-91ab-3cd2-24f6d5d53a7d.sys.drill]: 
Mkdirs failed to create file:/var/log/drill/profiles (exists=false, 
cwd=file:/home/runner/work/drill/drill/exec/java-exec)
   ```
   
   Fixed allocator closing in `BloomFilterTest`. The following error was logged 
after the tests from this class finished:
   ```
   java.lang.IllegalStateException: Allocator[ROOT] closed with outstanding 
buffers allocated (1).
   ```
   
   ## Documentation
   NA
   
   ## Testing
   Checked several times on GitHub Actions Jobs on the forked repo.
   



[jira] [Created] (DRILL-7590) Refactor plugin registry

2020-02-17 Thread Paul Rogers (Jira)

Paul Rogers created DRILL-7590:
--

 Summary: Refactor plugin registry
 Key: DRILL-7590
 URL: https://issues.apache.org/jira/browse/DRILL-7590
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


The plugin registry connects configurations, stored in ZK, with 
implementations, which are Java classes. The registry handles a large number of 
tasks including:

* Populating "bootstrap" plugin configurations and handling upgrades.
* Reading from, and writing to, the persistent store in ZK.
* Handling "normal" (configured) plugins and special system plugins (which have 
no configuration.)
* Handling format plugins, which are always associated with the DFS storage plugin.
* And so on.

The code has grown overly complex. As we look to add a new, cleaner plugin 
mechanism, we will start by cleaning up what we have to allow the new mechanism 
to be one of many.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (DRILL-7589) TestDynamicUDFSupport fails on GitHub Actions

2020-02-17 Thread Vova Vysotskyi (Jira)

Vova Vysotskyi created DRILL-7589:
-

 Summary: TestDynamicUDFSupport fails on GitHub Actions
 Key: DRILL-7589
 URL: https://issues.apache.org/jira/browse/DRILL-7589
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.18.0
Reporter: Vova Vysotskyi
Assignee: Vova Vysotskyi
 Fix For: 1.18.0


{{TestDynamicUDFSupport}} tests fail intermittently when run in a GitHub Actions 
job: for a given JDK version, the same job sometimes passes and sometimes fails.
Also, different tests from the same test class may fail.

When enabling logs for tests, the following stack traces are logged:
{noformat}
2020-02-15T10:56:33.8624913Z 10:56:33.855 [21b8319e-7e24-a9b9-34b7-74e1d27f64e8:foreman] ERROR o.a.d.e.e.f.FunctionImplementationRegistry - Problem during remote functions load from drill-custom-abs.jar
2020-02-15T10:56:33.8626171Z java.io.IOException: Error during jar [drill-custom-abs-sources.jar] coping from [/home/runner/work/drill/drill/exec/java-exec/target/org.apache.drill.exec.udf.dynamic.TestDynamicUDFSupport/udf/drill/udf/registry] to [/tmp/drill/udf/udf/local/]
2020-02-15T10:56:33.8626499Z     at org.apache.drill.exec.expr.fn.FunctionImplementationRegistry.copyJarToLocal(FunctionImplementationRegistry.java:573)
2020-02-15T10:56:33.8626758Z     at org.apache.drill.exec.expr.fn.FunctionImplementationRegistry.syncWithRemoteRegistry(FunctionImplementationRegistry.java:369)
2020-02-15T10:56:33.8627312Z     at org.apache.drill.exec.planner.sql.DrillSqlWorker.convertPlan(DrillSqlWorker.java:135)
2020-02-15T10:56:33.8627544Z     at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:93)
2020-02-15T10:56:33.8628086Z     at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:590)
2020-02-15T10:56:33.8628315Z     at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:275)
2020-02-15T10:56:33.8628522Z     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2020-02-15T10:56:33.8628749Z     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2020-02-15T10:56:33.8628961Z     at java.base/java.lang.Thread.run(Thread.java:834)
2020-02-15T10:56:33.8629569Z Caused by: org.apache.hadoop.util.Shell$ExitCodeException: chmod: cannot access '/tmp/drill/udf/udf/local/.drill-custom-abs-sources.jar.crc': No such file or directory
2020-02-15T10:56:33.8629777Z 
2020-02-15T10:56:33.8629975Z     at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)
2020-02-15T10:56:33.8630183Z     at org.apache.hadoop.util.Shell.run(Shell.java:901)
2020-02-15T10:56:33.8630396Z     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
2020-02-15T10:56:33.8630618Z     at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307)
2020-02-15T10:56:33.8630813Z     at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289)
2020-02-15T10:56:33.8631031Z     at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:865)
2020-02-15T10:56:33.8631283Z     at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:252)
2020-02-15T10:56:33.8631519Z     at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:232)
2020-02-15T10:56:33.8631876Z     at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:331)
2020-02-15T10:56:33.8632094Z     at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:320)
2020-02-15T10:56:33.8632306Z     at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:351)
2020-02-15T10:56:33.8632528Z     at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:405)
2020-02-15T10:56:33.8632748Z     at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:464)
2020-02-15T10:56:33.8632961Z     at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:443)
2020-02-15T10:56:33.8633171Z     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
2020-02-15T10:56:33.8633380Z     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
2020-02-15T10:56:33.8633580Z     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
2020-02-15T10:56:33.8633780Z     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
2020-02-15T10:56:33.8633986Z     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
2020-02-15T10:56:33.8634187Z     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
2020-02-15T10:56:33.8634398Z     at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:88)
2020-02-15T10:56:33.8634613Z     at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2379)
2020-02-15T10:56:33.8634845Z     at

[GitHub] [drill] vvysotskyi commented on issue #1984: DRILL-7586: drill-hive-exec-shaded contains commons-lang3 version 3.1

2020-02-17 Thread GitBox

vvysotskyi commented on issue #1984: DRILL-7586: drill-hive-exec-shaded 
contains commons-lang3 version 3.1
URL: https://github.com/apache/drill/pull/1984#issuecomment-587094713
 
 
   @oleg-zinovev, thanks for the PR and making changes. Could you please also 
update your commit message to reflect its changes?



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380192497
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill 
infers the schema by scanning your table 
+ in the same way as it does during a regular select, and computes metadata 
such as `MIN` / `MAX` column values and
+ `NULLS_COUNT` to enable more optimizations such as filter push-down. If the
+ `planner.statistics.use` option is enabled, the `ANALYZE TABLE` command also 
calculates and stores table statistics in the Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+The Metastore implementation is selected by the config property 
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note that the Iceberg Metastore is currently the only implementation available 
out of the box and is the default. However, a custom
+ implementation can be added by placing a JAR that contains an implementation 
of the
+ `org.apache.drill.metastore.Metastore` interface on the classpath and 
specifying the custom class in `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including 
general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and 
top-level segments metadata.
+Table general information contains basic table information and corresponds to 
the `BaseTableMetadata` class.
+
+A table can be non-partitioned or partitioned. Non-partitioned tables have 
only one top-level segment 
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned 
tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row 
groups, and partitions.
+
+A unique table identifier in Metastore Tables is a combination of storage 
plugin, workspace, and table name.
+Table metadata is grouped by top-level segment. The unique identifier of a 
top-level segment and its metadata
+is the combination of storage plugin, workspace, table name, and metadata key.
+
+### Related Session/System Options
+
+The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` 
or via the Drill Web console.
+
+- **metastore.enabled**
+Enables the Drill Metastore so that table metadata can be stored when 
`ANALYZE TABLE` commands run and
+ read during regular queries or when querying some 
INFORMATION_SCHEMA tables. Default is `false`.
+-

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380218259
 
 


[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380209257
 
 


[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380229001
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill 
infers the schema by scanning your table 
+ in the same way as it is done during regular select and computes some 
metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more 
optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate 
and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property 
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the 
default one. Though any custom
+ implementation can be added by placing the JAR into classpath which has the 
implementation of
+ `org.apache.drill.metastore.Metastore` interface and indicating custom class 
in the `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including 
general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and 
top-level segments metadata.
+Table general information contains basic table information and corresponds to 
the `BaseTableMetadata` class.
+
+A table can be either non-partitioned or partitioned. Non-partitioned tables have 
only one top-level segment 
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned 
tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row 
groups, and partitions.
+
+A unique table identifier in Metastore Tables is a combination of storage 
plugin, workspace, and table name.
+Within a table, metadata is grouped by top-level segment. The unique identifier of a top-level segment and its metadata
+is the combination of storage plugin, workspace, table name, and metadata key.
+
+### Related Session/System Options
+
+The following options are set via `ALTER SYSTEM SET`, `ALTER SESSION SET`, 
or the Drill Web console.
+
+- **metastore.enabled**
+Enables Drill Metastore usage to be able to store table metadata during 
ANALYZE TABLE commands execution and to be able
+ to read table metadata during regular queries execution or when querying some 
INFORMATION_SCHEMA tables. Default is `false`.
+-

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380275860
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md
 ##
 @@ -0,0 +1,69 @@
+---
+title: "Drill Iceberg Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill uses Iceberg Metastore implementation based on [Iceberg 
tables](http://iceberg.incubator.apache.org). For Drill 1.17,
+ this is the default Drill Metastore implementation. For details on how to 
configure Iceberg Metastore implementation and
+ its option descriptions, please refer to [Iceberg Metastore 
docs](https://github.com/apache/drill/blob/master/metastore/iceberg-metastore/README.md).
+
+{% include startnote.html %}
+Iceberg table supports concurrent writes and transactions but they are only 
effective on file systems that support
+ atomic rename.
+If the file system does not support atomic rename, it could lead to 
inconsistencies during concurrent writes.
+{% include endnote.html %}
+
+### Iceberg Tables Location
+
+Iceberg tables will reside on the file system in the location based on
+Iceberg Metastore base location `drill.metastore.iceberg.location.base_path` 
and component specific location.
+If the Iceberg Metastore base location is `/drill/metastore/iceberg`
+and the tables component location is `tables`, the Iceberg table for the tables component
+will be located in the `/drill/metastore/iceberg/tables` folder.
+
+Metastore metadata will be stored inside Iceberg table location provided
+in the configuration file. Drill table metadata location will be constructed
+based on specific component storage keys. For example, for `tables` component,
+storage keys are storage plugin, workspace and table name: unique table 
identifier in Drill.
 
 Review comment:
   Thanks, replaced.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379568993
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill infers the schema by scanning your table
+ the same way a regular SELECT does, and computes values such as `MIN` / `MAX` column values and
+ `NULLS_COUNT`, collectively called "metadata", which enable additional optimizations such as filter push-down. If the
+ `planner.statistics.use` option is enabled, this command also calculates and stores table statistics in the Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+The Metastore implementation is selected by the class named in the config property
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note that currently only the Iceberg Metastore is available out of the box, and it is the default.
+ A custom implementation can be added by placing a JAR that implements the
+ `org.apache.drill.metastore.Metastore` interface on the classpath and setting
+ `drill.metastore.implementation.class` to the custom class.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including 
general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and 
top-level segments metadata.
+Table general information contains basic table information and corresponds to 
the `BaseTableMetadata` class.
+
+A table can be either non-partitioned or partitioned. Non-partitioned tables have 
only one top-level segment 
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned 
tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row 
groups, and partitions.
+
+A unique table identifier in Metastore Tables is a combination of storage 
plugin, workspace, and table name.
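
The table identifier described in the quoted docs can be pictured as a composite key. The following is only an illustrative sketch in Python (Drill itself is written in Java); `TableKey` and the sample values are hypothetical names chosen for this example, not part of Drill's API.

```python
from typing import NamedTuple


class TableKey(NamedTuple):
    """Hypothetical composite key: storage plugin + workspace + table name."""
    storage_plugin: str
    workspace: str
    table_name: str


# Two keys are the same table only if all three parts match.
nation = TableKey("dfs", "tmp", "nation")
same_nation = TableKey("dfs", "tmp", "nation")
other_workspace = TableKey("dfs", "root", "nation")

# A plain dict stands in for the Metastore Tables component here.
metadata_by_table = {nation: "metadata for dfs.tmp.nation"}
metadata_by_table[other_workspace] = "metadata for dfs.root.nation"

print(nation == same_nation)      # same plugin, workspace, and name
print(nation == other_workspace)  # differs in workspace
print(len(metadata_by_table))
```

Because the key is a tuple of all three parts, tables with the same name in different workspaces or plugins never collide.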
 
 Review comment:
   Thanks, replaced. Currently, user can delete only metadata for an existing 
table. Added this info also.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380209127
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill infers the schema by scanning your table
+ the same way a regular SELECT does, and computes values such as `MIN` / `MAX` column values and
+ `NULLS_COUNT`, collectively called "metadata", which enable additional optimizations such as filter push-down. If the
+ `planner.statistics.use` option is enabled, this command also calculates and stores table statistics in the Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+The Metastore implementation is selected by the class named in the config property
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note that currently only the Iceberg Metastore is available out of the box, and it is the default.
+ A custom implementation can be added by placing a JAR that implements the
+ `org.apache.drill.metastore.Metastore` interface on the classpath and setting
+ `drill.metastore.implementation.class` to the custom class.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including 
general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and 
top-level segments metadata.
+Table general information contains basic table information and corresponds to 
the `BaseTableMetadata` class.
+
+A table can be either non-partitioned or partitioned. Non-partitioned tables have 
only one top-level segment 
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned 
tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row 
groups, and partitions.
+
+A unique table identifier in Metastore Tables is a combination of storage 
plugin, workspace, and table name.
+Within a table, metadata is grouped by top-level segment. The unique identifier of a top-level segment and its metadata
+is the combination of storage plugin, workspace, table name, and metadata key.
+
+### Related Session/System Options
+
+The following options are set via `ALTER SYSTEM SET`, `ALTER SESSION SET`, 
or the Drill Web console.
+
+- **metastore.enabled**
+Enables Drill Metastore usage to be able to store table metadata during 
ANALYZE TABLE commands execution and to be able
+ to read table metadata during regular queries execution or when querying some 
INFORMATION_SCHEMA tables. Default is `false`.
+-

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380230120
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill infers the schema by scanning your table
+ the same way a regular SELECT does, and computes values such as `MIN` / `MAX` column values and
+ `NULLS_COUNT`, collectively called "metadata", which enable additional optimizations such as filter push-down. If the
+ `planner.statistics.use` option is enabled, this command also calculates and stores table statistics in the Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+The Metastore implementation is selected by the class named in the config property
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note that currently only the Iceberg Metastore is available out of the box, and it is the default.
+ A custom implementation can be added by placing a JAR that implements the
+ `org.apache.drill.metastore.Metastore` interface on the classpath and setting
+ `drill.metastore.implementation.class` to the custom class.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including 
general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and 
top-level segments metadata.
+Table general information contains basic table information and corresponds to 
the `BaseTableMetadata` class.
+
+A table can be either non-partitioned or partitioned. Non-partitioned tables have 
only one top-level segment 
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned 
tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row 
groups, and partitions.
+
+A unique table identifier in Metastore Tables is a combination of storage 
plugin, workspace, and table name.
+Within a table, metadata is grouped by top-level segment. The unique identifier of a top-level segment and its metadata
+is the combination of storage plugin, workspace, table name, and metadata key.
+
+### Related Session/System Options
+
+The following options are set via `ALTER SYSTEM SET`, `ALTER SESSION SET`, 
or the Drill Web console.
+
+- **metastore.enabled**
+Enables Drill Metastore usage to be able to store table metadata during 
ANALYZE TABLE commands execution and to be able
+ to read table metadata during regular queries execution or when querying some 
INFORMATION_SCHEMA tables. Default is `false`.
+-

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380276160
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md
 ##
 @@ -0,0 +1,69 @@
+---
+title: "Drill Iceberg Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill uses Iceberg Metastore implementation based on [Iceberg 
tables](http://iceberg.incubator.apache.org). For Drill 1.17,
+ this is the default Drill Metastore implementation. For details on how to 
configure Iceberg Metastore implementation and
+ its option descriptions, please refer to [Iceberg Metastore 
docs](https://github.com/apache/drill/blob/master/metastore/iceberg-metastore/README.md).
+
+{% include startnote.html %}
+Iceberg table supports concurrent writes and transactions but they are only 
effective on file systems that support
+ atomic rename.
+If the file system does not support atomic rename, it could lead to 
inconsistencies during concurrent writes.
+{% include endnote.html %}
+
+### Iceberg Tables Location
+
+Iceberg tables will reside on the file system in the location based on
+Iceberg Metastore base location `drill.metastore.iceberg.location.base_path` 
and component specific location.
+If the Iceberg Metastore base location is `/drill/metastore/iceberg`
+and the tables component location is `tables`, the Iceberg table for the tables component
+will be located in the `/drill/metastore/iceberg/tables` folder.
+
+Metastore metadata will be stored inside Iceberg table location provided
+in the configuration file. Drill table metadata location will be constructed
+based on specific component storage keys. For example, for `tables` component,
+storage keys are storage plugin, workspace and table name: unique table 
identifier in Drill.
+
+Assuming the Iceberg table location is `/drill/metastore/iceberg/tables`, metadata 
for the table
+`dfs.tmp.nation` will be stored in the 
`/drill/metastore/iceberg/tables/dfs/tmp/nation` folder.
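
The location rule in the quoted docs (base path, then component location, then the table's storage keys) can be sketched as a simple path join. This is a minimal illustration in Python under the assumption that the parts are joined as plain path segments; `table_metadata_location` is a hypothetical helper for this example and is not Drill's implementation.

```python
import posixpath


def table_metadata_location(base_path, component_location,
                            storage_plugin, workspace, table_name):
    """Join the Metastore base path, the component location, and the
    table's storage keys into one folder path (illustrative only)."""
    # Base location + component location gives the Iceberg table location.
    iceberg_table_location = posixpath.join(base_path, component_location)
    # The storage keys (plugin, workspace, table name) are appended in order.
    return posixpath.join(iceberg_table_location,
                          storage_plugin, workspace, table_name)


location = table_metadata_location(
    "/drill/metastore/iceberg",  # value of drill.metastore.iceberg.location.base_path
    "tables",                    # tables component location
    "dfs", "tmp", "nation",      # storage keys for the table dfs.tmp.nation
)
print(location)
```

With the values taken from the docs, the helper yields the same folder the docs name for `dfs.tmp.nation`.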
 
 Review comment:
   Thanks, updated the docs as proposed.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380146051
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill infers the schema by scanning your table
+ the same way a regular SELECT does, and computes values such as `MIN` / `MAX` column values and
+ `NULLS_COUNT`, collectively called "metadata", which enable additional optimizations such as filter push-down. If the
+ `planner.statistics.use` option is enabled, this command also calculates and stores table statistics in the Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+The Metastore implementation is selected by the class named in the config property
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note that currently only the Iceberg Metastore is available out of the box, and it is the default.
+ A custom implementation can be added by placing a JAR that implements the
+ `org.apache.drill.metastore.Metastore` interface on the classpath and setting
+ `drill.metastore.implementation.class` to the custom class.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including 
general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and 
top-level segments metadata.
+Table general information contains basic table information and corresponds to 
the `BaseTableMetadata` class.
+
+A table can be either non-partitioned or partitioned. Non-partitioned tables have 
only one top-level segment 
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned 
tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row 
groups, and partitions.
+
+A unique table identifier in Metastore Tables is a combination of storage 
plugin, workspace, and table name.
+Within a table, metadata is grouped by top-level segment. The unique identifier of a top-level segment and its metadata
+is the combination of storage plugin, workspace, table name, and metadata key.
+
+### Related Session/System Options
+
+The following options are set via `ALTER SYSTEM SET`, `ALTER SESSION SET`, 
or the Drill Web console.
+
+- **metastore.enabled**
+Enables Drill Metastore usage to be able to store table metadata during 
ANALYZE TABLE commands execution and to be able
+ to read table metadata during regular queries execution or when querying some 
INFORMATION_SCHEMA tables. Default is `false`.
+-

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380210657
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill infers the schema by scanning your table
+ the same way a regular SELECT does, and computes values such as `MIN` / `MAX` column values and
+ `NULLS_COUNT`, collectively called "metadata", which enable additional optimizations such as filter push-down. If the
+ `planner.statistics.use` option is enabled, this command also calculates and stores table statistics in the Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+The Metastore implementation is selected by the class named in the config property
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note that currently only the Iceberg Metastore is available out of the box, and it is the default.
+ A custom implementation can be added by placing a JAR that implements the
+ `org.apache.drill.metastore.Metastore` interface on the classpath and setting
+ `drill.metastore.implementation.class` to the custom class.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including 
general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and 
top-level segments metadata.
+Table general information contains basic table information and corresponds to 
the `BaseTableMetadata` class.
+
+A table can be either non-partitioned or partitioned. Non-partitioned tables have 
only one top-level segment 
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned 
tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row 
groups, and partitions.
+
+A unique table identifier in Metastore Tables is a combination of storage 
plugin, workspace, and table name.
+Within a table, metadata is grouped by top-level segment. The unique identifier of a top-level segment and its metadata
+is the combination of storage plugin, workspace, table name, and metadata key.
+
+### Related Session/System Options
+
+The following options are set via `ALTER SYSTEM SET`, `ALTER SESSION SET`, 
or the Drill Web console.
+
+- **metastore.enabled**
+Enables Drill Metastore usage to be able to store table metadata during 
ANALYZE TABLE commands execution and to be able
+ to read table metadata during regular queries execution or when querying some 
INFORMATION_SCHEMA tables. Default is `false`.
+-

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379576801
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore, which stores table schemas and table statistics. Statistics help Drill create better query plans.
+
+The Metastore is a Beta feature and is subject to change. We encourage you to try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill infers the schema by scanning the table,
+ just as it does during a regular `SELECT`, and computes metadata such as `MIN` / `MAX` column values and
+ `NULLS_COUNT`. Drill uses this metadata to apply more optimizations, such as filter push-down. If the
+ `planner.statistics.use` option is enabled, the `ANALYZE TABLE` command also calculates table statistics and stores
+ them in the Drill Metastore.
+
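To make the filter push-down remark concrete, here is a minimal Python sketch of how per-file `MIN` / `MAX` and `NULLS_COUNT` values could let a planner skip files for a predicate such as `col > 100`. It is an illustration only and is not part of the proposed doc: `ColumnStats`, `may_contain_greater_than`, `prune_files`, and the file names are hypothetical and do not correspond to Drill classes or APIs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-file column metadata (not a Drill class)."""
    min_value: int
    max_value: int
    nulls_count: int
    row_count: int


def may_contain_greater_than(stats: ColumnStats, threshold: int) -> bool:
    """Return False only when no row in the file can satisfy `col > threshold`."""
    if stats.nulls_count == stats.row_count:
        # Every value is NULL, so the comparison can never be true.
        return False
    return stats.max_value > threshold


def prune_files(files: Dict[str, ColumnStats], threshold: int) -> List[str]:
    """Keep only the files that might hold matching rows."""
    return [name for name, stats in files.items()
            if may_contain_greater_than(stats, threshold)]


# Assumed sample metadata for three files of one table.
files = {
    "part-0.parquet": ColumnStats(min_value=1, max_value=50, nulls_count=0, row_count=100),
    "part-1.parquet": ColumnStats(min_value=40, max_value=120, nulls_count=3, row_count=100),
    "part-2.parquet": ColumnStats(min_value=0, max_value=0, nulls_count=100, row_count=100),
}

print(prune_files(files, threshold=100))  # only part-1.parquet can match
```

The check is deliberately conservative: a file is dropped only when the metadata proves that it cannot match, so stale or missing statistics would cost performance rather than correctness.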
+## Configuration
+
+The default Metastore configuration is defined in the `drill-metastore-default.conf` file.
+It can be overridden in `drill-metastore-override.conf`. Distribution-specific configuration can be
+specified in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in the `drill.metastore` namespace.
+The Metastore implementation is selected by the `drill.metastore.implementation.class` configuration property.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note that the Iceberg Metastore is currently the only implementation available out of the box, and it is the default.
+ A custom implementation can be added by placing a JAR that contains an implementation of the
+ `org.apache.drill.metastore.Metastore` interface on the classpath and specifying the custom class in
+ `drill.metastore.implementation.class`.
+
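The override mechanism described in this section can be pictured as layered merging. The Python sketch below is an illustration only and is not part of the proposed doc. It assumes the precedence is default, then distribution, then override (lowest to highest), which the text implies but does not state, and `com.example.CustomMetastore` is a hypothetical class name. Drill itself reads these files as HOCON, which this sketch does not attempt to parse.

```python
from typing import Any, Dict, List

ICEBERG = "org.apache.drill.metastore.iceberg.IcebergMetastore"


def deep_merge(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which `overlay` values win over `base` values."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(layers: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge layers listed from lowest to highest precedence (assumed order)."""
    result: Dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


def implementation_class(conf: Dict[str, Any]) -> str:
    """Read drill.metastore.implementation.class from a nested dict."""
    return conf["drill"]["metastore"]["implementation"]["class"]


# Stand-ins for the three files named in the text.
default_conf = {"drill": {"metastore": {"implementation": {"class": ICEBERG}}}}
distrib_conf: Dict[str, Any] = {}
override_conf = {"drill": {"metastore": {"implementation": {
    "class": "com.example.CustomMetastore"}}}}  # hypothetical custom class

effective = resolve([default_conf, distrib_conf, override_conf])
print(implementation_class(effective))
```

With the override layer left out, the same call falls back to the Iceberg implementation, which matches the default shown in the configuration snippet.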
+### Metastore Components
+
+The Metastore can store metadata for various components: tables, views, etc.
+The current implementation provides fully functioning support for the tables component.
+Support for the views component is not implemented, but stub methods are included to show
+how new Metastore components such as UDFs or storage plugins can be added in the future.
+
+### Metastore Tables
+
+The Metastore Tables component contains metadata about Drill tables, including general information as well as
+information about table segments, files, row groups, and partitions.
+
+Full table metadata consists of two major parts: general information and top-level segment metadata.
+General table information contains basic information about the table and corresponds to the `BaseTableMetadata` class.
+
+A table can be non-partitioned or partitioned. A non-partitioned table has only one top-level segment,
+which is called the default segment (`MetadataInfo#DEFAULT_SEGMENT_KEY`). A partitioned table may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row groups, and partitions.
+
+A unique table identifier in Metastore Tables is the combination of storage plugin, workspace, and table name.
+Table metadata is grouped by top-level segment. The unique identifier of a top-level segment and its metadata
+is the combination of storage plugin, workspace, table name, and metadata key.
+
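The two identifiers described above can be sketched as composite keys. This is an illustration only and is not part of the proposed doc: `TableId`, `SegmentId`, and `segment_ids` are hypothetical names, and the string used for `DEFAULT_SEGMENT_KEY` is a placeholder rather than the confirmed value of `MetadataInfo#DEFAULT_SEGMENT_KEY`.

```python
from typing import List, NamedTuple, Sequence

# Placeholder value; the real constant lives in MetadataInfo#DEFAULT_SEGMENT_KEY.
DEFAULT_SEGMENT_KEY = "DEFAULT_SEGMENT"


class TableId(NamedTuple):
    """Unique table identifier: storage plugin, workspace, table name."""
    storage_plugin: str
    workspace: str
    table_name: str


class SegmentId(NamedTuple):
    """Unique top-level segment identifier: the table identifier plus a metadata key."""
    storage_plugin: str
    workspace: str
    table_name: str
    metadata_key: str


def segment_ids(table: TableId, partition_keys: Sequence[str] = ()) -> List[SegmentId]:
    """A non-partitioned table gets exactly one segment, keyed by the default key."""
    keys = list(partition_keys) or [DEFAULT_SEGMENT_KEY]
    return [SegmentId(*table, key) for key in keys]


nation = TableId("dfs", "tmp", "nation")
print(segment_ids(nation))                    # one default segment
print(segment_ids(nation, ["2018", "2019"]))  # one segment per top-level partition
```

Because the segment key extends the table key, all metadata for one table shares a common prefix, which is one plausible reason for grouping metadata by top-level segment.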
+### Related Session/System Options
+
+The following options can be set via `ALTER SYSTEM SET`, `ALTER SESSION SET`, or the Drill Web console.
+
+- **metastore.enabled**
+Enables the Drill Metastore so that Drill can store table metadata when `ANALYZE TABLE` commands run and can
+ read table metadata during regular queries or when querying some `INFORMATION_SCHEMA` tables. Default is `false`.
+-

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380164420
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380209529
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380270959
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md
 ##
 @@ -0,0 +1,69 @@
+---
+title: "Drill Iceberg Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill uses the Iceberg Metastore implementation, which is based on [Iceberg tables](http://iceberg.incubator.apache.org). For Drill 1.17,
+ this is the default Drill Metastore implementation. For details on how to configure the Iceberg Metastore implementation and
+ for descriptions of its options, refer to the [Iceberg Metastore docs](https://github.com/apache/drill/blob/master/metastore/iceberg-metastore/README.md).
+
+{% include startnote.html %}
+Iceberg tables support concurrent writes and transactions, but these are only effective on file systems that support
+ atomic rename.
+If the file system does not support atomic rename, concurrent writes could lead to inconsistencies.
+{% include endnote.html %}
+
+### Iceberg Tables Location
+
+Iceberg tables reside on the file system in a location derived from the
+Iceberg Metastore base location (`drill.metastore.iceberg.location.base_path`) and the component-specific location.
+If Iceberg Metastore base location is `/drill/metastore/iceberg`
+and tables component location is `tables`. Iceberg table for tables component
 
 Review comment:
   Thanks, updated.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380199229
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380219490
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380250105
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md
 ##
 @@ -0,0 +1,69 @@
+### Iceberg Tables Location
+
+Iceberg tables reside on the file system in a location derived from the
+Iceberg Metastore base location (`drill.metastore.iceberg.location.base_path`) and the component-specific location.
 
 Review comment:
   Good point! Added a sentence before this one about configuration files and
specified that the above is a configuration property.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380159080
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379573331
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore, which stores the table schema and table statistics. Statistics help Drill create better query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill infers the schema by scanning your table,
+ the same way it does during a regular `SELECT`, and computes values such as `MIN` / `MAX` column values and
+ `NULLS_COUNT`. These values are designated as "metadata" and enable more optimizations, such as filter push-down. If the
+ `planner.statistics.use` option is enabled, the `ANALYZE TABLE` command also calculates table statistics and stores them in the Drill
+ Metastore.
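
As a rough illustration of why `MIN` / `MAX` metadata helps with filter push-down, the following Python sketch skips files whose stored value range cannot satisfy a filter. It is not Drill's implementation; `ColumnStats`, `may_match_greater_than`, and the file names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Per-file column metadata of the kind described above (hypothetical shape)."""
    min_value: int
    max_value: int
    nulls_count: int


def may_match_greater_than(stats: ColumnStats, threshold: int) -> bool:
    """Return True if some row in the file could satisfy `column > threshold`."""
    return stats.max_value > threshold


# Hypothetical files with their stored MIN / MAX / NULLS_COUNT values.
files = {
    "part-0.parquet": ColumnStats(min_value=1, max_value=100, nulls_count=0),
    "part-1.parquet": ColumnStats(min_value=101, max_value=200, nulls_count=0),
}

# For a filter such as `col > 150`, only files whose MAX exceeds 150 need scanning.
to_scan = [name for name, stats in files.items() if may_match_greater_than(stats, 150)]
print(to_scan)
```

The design point is that the decision uses stored metadata only, so a file that cannot contain matching rows is never opened.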
+
+## Configuration
+
+The default Metastore configuration is defined in the `drill-metastore-default.conf` file.
+It can be overridden in `drill-metastore-override.conf`. Distribution-specific configuration can be
+specified in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in the `drill.metastore` namespace.
+The Metastore implementation is selected by the `drill.metastore.implementation.class` config property.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note that the Iceberg Metastore is currently the only implementation available out of the box, and it is the default.
+ You can add a custom implementation by placing a JAR that contains an implementation of the
+ `org.apache.drill.metastore.Metastore` interface on the classpath and specifying the custom class
+ in `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+The Metastore can store metadata for various components: tables, views, etc.
+The current implementation fully supports the tables component.
+The views component is not implemented, but it contains stub methods that show
+how new Metastore components such as UDFs and storage plugins can be added in the future.
+
+### Metastore Tables
+
+The Metastore Tables component contains metadata about Drill tables, including general information, as well as
+information about table segments, files, row groups, and partitions.
+
+Full table metadata consists of two major parts: general information and top-level segment metadata.
+General table information contains basic information about the table and corresponds to the `BaseTableMetadata` class.
+
+A table can be non-partitioned or partitioned. Non-partitioned tables have only one top-level segment,
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row groups, and partitions.
+
+A unique table identifier in Metastore Tables is a combination of storage plugin, workspace, and table name.
+Table metadata is grouped by top-level segment. The unique identifier of a top-level segment and its metadata
+is the combination of storage plugin, workspace, table name, and metadata key.
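
The identifier scheme described above can be modeled as follows. This is a minimal Python sketch, not Drill code: `TableId`, `SegmentId`, `group_by_segment`, the file names, and the literal used for `DEFAULT_SEGMENT_KEY` are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

# Stand-in for the value behind MetadataInfo#DEFAULT_SEGMENT_KEY; the literal is an assumption.
DEFAULT_SEGMENT_KEY = "DEFAULT_SEGMENT"


@dataclass(frozen=True)
class TableId:
    """Unique table identifier: storage plugin, workspace, and table name."""
    storage_plugin: str
    workspace: str
    table_name: str


@dataclass(frozen=True)
class SegmentId:
    """Unique top-level segment identifier: the table identifier plus a metadata key."""
    table: TableId
    metadata_key: str


def group_by_segment(
    table: TableId, units: Iterable[Tuple[Optional[str], str]]
) -> Dict[SegmentId, List[str]]:
    """Group metadata units by top-level segment; units without a key go to the default segment."""
    grouped = defaultdict(list)
    for metadata_key, unit in units:
        grouped[SegmentId(table, metadata_key or DEFAULT_SEGMENT_KEY)].append(unit)
    return dict(grouped)


nation = TableId("dfs", "tmp", "nation")
# A non-partitioned table ends up with a single (default) top-level segment.
flat = group_by_segment(nation, [(None, "0_0_0.parquet"), (None, "0_0_1.parquet")])
# A partitioned table may have several top-level segments.
partitioned = group_by_segment(nation, [("2019", "a.parquet"), ("2020", "b.parquet")])
print(len(flat), len(partitioned))
```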
+
+### Related Session/System Options
+
+The following options are set via `ALTER SYSTEM SET`, `ALTER SESSION SET`, or the Drill Web console.
+
+- **metastore.enabled**
+Enables the Drill Metastore so that Drill can store table metadata when `ANALYZE TABLE` commands run and
+ read table metadata during regular queries or when querying some `INFORMATION_SCHEMA` tables. Default is `false`.
+-

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380278702
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md
 ##
 @@ -0,0 +1,69 @@
+---
+title: "Drill Iceberg Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill uses the Iceberg Metastore implementation, which is based on [Iceberg tables](http://iceberg.incubator.apache.org). For Drill 1.17,
+ this is the default Drill Metastore implementation. For details on how to configure the Iceberg Metastore implementation and
+ for descriptions of its options, refer to the [Iceberg Metastore docs](https://github.com/apache/drill/blob/master/metastore/iceberg-metastore/README.md).
+
+{% include startnote.html %}
+Iceberg tables support concurrent writes and transactions, but these are only effective on file systems that support
+ atomic rename.
+If the file system does not support atomic rename, concurrent writes could lead to inconsistencies.
+{% include endnote.html %}
+
+### Iceberg Tables Location
+
+Iceberg tables reside on the file system in a location based on the
+Iceberg Metastore base location `drill.metastore.iceberg.location.base_path` and the component-specific location.
+For example, if the Iceberg Metastore base location is `/drill/metastore/iceberg`
+and the tables component location is `tables`, the Iceberg table for the tables component
+is located in the `/drill/metastore/iceberg/tables` folder.
+
+Metastore metadata is stored inside the Iceberg table location provided
+in the configuration file. The Drill table metadata location is constructed
+from the storage keys of the specific component. For example, for the `tables` component,
+the storage keys are storage plugin, workspace, and table name, which together form the unique table identifier in Drill.
+
+For example, if the Iceberg table location is `/drill/metastore/iceberg/tables`, metadata for the table
+`dfs.tmp.nation` is stored in the `/drill/metastore/iceberg/tables/dfs/tmp/nation` folder.
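
The path construction described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Metastore's code: the helper names `component_location` and `table_metadata_location` are hypothetical, and the table identifier is assumed to have the form `plugin.workspace.table`.

```python
import posixpath


def component_location(base_location: str, component: str) -> str:
    """Location of the Iceberg table that backs one Metastore component."""
    return posixpath.join(base_location, component)


def table_metadata_location(base_location: str, component: str, table_id: str) -> str:
    """Folder that holds metadata for one Drill table.

    Only the first two dots in table_id are treated as separators, so dots
    inside the table name itself are preserved (an assumption of this sketch).
    """
    plugin, workspace, table = table_id.split(".", 2)
    return posixpath.join(component_location(base_location, component), plugin, workspace, table)


print(table_metadata_location("/drill/metastore/iceberg", "tables", "dfs.tmp.nation"))
# prints /drill/metastore/iceberg/tables/dfs/tmp/nation
```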
+
+The following example of the base Metastore configuration file `drill-metastore-override.conf` stores Iceberg tables in
+ HDFS:
+
+```
+drill.metastore.iceberg: {
+  config.properties: {
+    fs.defaultFS: "hdfs:///"
+  }
+
+  location: {
+    base_path: "/drill/metastore",
+    relative_path: "iceberg"
+  }
+}
+```
+
+### Metadata Storage Format
+
+Iceberg tables support data storage in three formats: Parquet, Avro, and ORC. Drill metadata is stored in Parquet files.
+This format was chosen over the others because it is column-oriented and efficient in terms of disk I/O when specific
+columns need to be queried.
+
+Each Parquet file holds information for one partition. Partition keys depend on the characteristics of the Metastore
+component. For example, for the tables component, the partition keys are storage plugin, workspace,
+table name, and metadata key.
+
+Parquet file names are based on UUIDs to ensure uniqueness. If a collision does occur, the modify operation
+in the Metastore fails.
 
 Review comment:
   Thanks, removed this section.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380199004
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380241423
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380164542
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379567240
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+A table can be non-partitioned or partitioned. Non-partitioned tables have only one top-level segment,
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row groups, and partitions.
 
 Review comment:
   The Metastore also supports single files.
   I added part of the info you proposed, along with references to the examples that describe how to query partition and segment metadata.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379576091
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill 
infers the schema by scanning your table 
+ in the same way as it is done during regular select and computes some 
metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more 
optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate 
and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property 
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the 
default one. Though any custom
+ implementation can be added by placing the JAR into classpath which has the 
implementation of
+ `org.apache.drill.metastore.Metastore` interface and indicating custom class 
in the `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including 
general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and 
top-level segments metadata.
+Table general information contains basic table information and corresponds to 
the `BaseTableMetadata` class.
+
+A table can be non-partitioned and partitioned. Non-partitioned tables have 
only one top-level segment 
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned 
tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row 
groups, and partitions.
+
+A unique table identifier in Metastore Tables is a combination of storage 
plugin, workspace, and table name.
+Table metadata inside is grouped by top-level segments, unique identifier of 
the top-level segment and its metadata
+is storage plugin, workspace, table name, and metadata key.
+
+### Related Session/System Options
+
+The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` 
or via the Drill Web console.
+
+- **metastore.enabled**
+Enables Drill Metastore usage to be able to store table metadata during 
ANALYZE TABLE commands execution and to be able
+ to read table metadata during regular queries execution or when querying some 
INFORMATION_SCHEMA tables. Default is `false`.
+-

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380244476
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md
 ##
 @@ -0,0 +1,69 @@
+---
+title: "Drill Iceberg Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill uses Iceberg Metastore implementation based on [Iceberg 
tables](http://iceberg.incubator.apache.org). For Drill 1.17,
+ this is default Drill Metastore implementation. For details on how to 
configure Iceberg Metastore implementation and
+ its option descriptions, please refer to [Iceberg Metastore 
docs](https://github.com/apache/drill/blob/master/metastore/iceberg-metastore/README.md).
+
+{% include startnote.html %}
+Iceberg table supports concurrent writes and transactions but they are only 
effective on file systems that support
+ atomic rename.
+If the file system does not support atomic rename, it could lead to 
inconsistencies during concurrent writes.
+{% include endnote.html %}
+
+### Iceberg Tables Location
+
 
 Review comment:
   Thanks, added.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379556928
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill 
infers the schema by scanning your table 
+ in the same way as it is done during regular select and computes some 
metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more 
optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate 
and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property 
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the 
default one. Though any custom
+ implementation can be added by placing the JAR into classpath which has the 
implementation of
+ `org.apache.drill.metastore.Metastore` interface and indicating custom class 
in the `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including 
general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and 
top-level segments metadata.
 
 Review comment:
   Yes, we have a section below with the real tables and examples of how to 
discover metastore metadata.
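
   As a rough illustration of that kind of discovery query (a sketch only; the 
schema `dfs.tmp` and table `lineitem` are hypothetical, and the Metastore must 
be enabled with metadata already collected for the table):

```sql
-- Hypothetical schema and table names; requires `metastore.enabled` = true.
-- General table information exposed through INFORMATION_SCHEMA.
SELECT *
FROM INFORMATION_SCHEMA.`TABLES`
WHERE TABLE_SCHEMA = 'dfs.tmp' AND TABLE_NAME = 'lineitem';

-- Column-level information for the same table.
SELECT *
FROM INFORMATION_SCHEMA.`COLUMNS`
WHERE TABLE_SCHEMA = 'dfs.tmp' AND TABLE_NAME = 'lineitem';
```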



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r374778009
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill 
infers the schema by scanning your table 
+ in the same way as it is done during regular select and computes some 
metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more 
optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate 
and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
 
 Review comment:
   Thanks, done.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380159969
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill 
infers the schema by scanning your table 
+ in the same way as it is done during regular select and computes some 
metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more 
optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate 
and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property 
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the 
default one. Though any custom
+ implementation can be added by placing the JAR into classpath which has the 
implementation of
+ `org.apache.drill.metastore.Metastore` interface and indicating custom class 
in the `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including 
general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and 
top-level segments metadata.
+Table general information contains basic table information and corresponds to 
the `BaseTableMetadata` class.
+
+A table can be non-partitioned and partitioned. Non-partitioned tables have 
only one top-level segment 
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned 
tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row 
groups, and partitions.
+
+A unique table identifier in Metastore Tables is a combination of storage 
plugin, workspace, and table name.
+Table metadata inside is grouped by top-level segments, unique identifier of 
the top-level segment and its metadata
+is storage plugin, workspace, table name, and metadata key.
+
+### Related Session/System Options
+
+The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` 
or via the Drill Web console.
+
+- **metastore.enabled**
+Enables Drill Metastore usage to be able to store table metadata during 
ANALYZE TABLE commands execution and to be able
+ to read table metadata during regular queries execution or when querying some 
INFORMATION_SCHEMA tables. Default is `false`.
+-

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379541595
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill 
infers the schema by scanning your table 
+ in the same way as it is done during regular select and computes some 
metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more 
optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate 
and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property 
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the 
default one. Though any custom
 
 Review comment:
   Thanks, I separated these two concepts and added links to the Iceberg 
documentation.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r37466
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
 
 Review comment:
   Thanks, reworded.
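
   For readers following the quoted section, enabling the Metastore and 
populating it for one table might look like the sketch below. The `SET` command 
is the one shown in the doc; the table `dfs.tmp.lineitem` is hypothetical, and 
since the feature is in Beta the `ANALYZE TABLE` syntax may change between 
releases.

```sql
-- Enable the Metastore for the current session (command taken from the quoted doc).
SET `metastore.enabled` = true;

-- Hypothetical table; collects table metadata and stores it in the Metastore.
ANALYZE TABLE dfs.tmp.`lineitem` REFRESH METADATA;
```

   Because Drill can query a table with or without a Metastore entry, this 
step is optional and is mainly worthwhile for large tables.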



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379551973
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill 
infers the schema by scanning your table 
+ in the same way as it is done during regular select and computes some 
metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more 
optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate 
and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property 
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the 
default one. Though any custom
+ implementation can be added by placing the JAR into classpath which has the 
implementation of
+ `org.apache.drill.metastore.Metastore` interface and indicating custom class 
in the `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including 
general information, as well as
 
 Review comment:
   Thanks, replaced as you proposed, but I kept the mention that we have 
metadata about segments, files, row groups, and partitions, since it is not 
described elsewhere in this doc yet.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379573563
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill 
infers the schema by scanning your table 
+ in the same way as it is done during regular select and computes some 
metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more 
optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate 
and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property 
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the 
default one. Though any custom
+ implementation can be added by placing the JAR into classpath which has the 
implementation of
+ `org.apache.drill.metastore.Metastore` interface and indicating custom class 
in the `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including 
general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and 
top-level segments metadata.
+Table general information contains basic table information and corresponds to 
the `BaseTableMetadata` class.
+
+A table can be non-partitioned and partitioned. Non-partitioned tables have 
only one top-level segment 
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned 
tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row 
groups, and partitions.
+
+A unique table identifier in Metastore Tables is a combination of storage 
plugin, workspace, and table name.
+Table metadata inside is grouped by top-level segments, unique identifier of 
the top-level segment and its metadata
+is storage plugin, workspace, table name, and metadata key.
+
+### Related Session/System Options
+
+The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` 
or via the Drill Web console.
+
+- **metastore.enabled**
+Enables Drill Metastore usage to be able to store table metadata during 
ANALYZE TABLE commands execution and to be able
+ to read table metadata during regular queries execution or when querying some 
INFORMATION_SCHEMA tables. Default is `false`.
+-

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379569561
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill 
infers the schema by scanning your table 
+ in the same way as it is done during regular select and computes some 
metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more 
optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate 
and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property 
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the 
default one. Though any custom
+ implementation can be added by placing the JAR into classpath which has the 
implementation of
+ `org.apache.drill.metastore.Metastore` interface and indicating custom class 
in the `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including 
general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and 
top-level segments metadata.
+Table general information contains basic table information and corresponds to 
the `BaseTableMetadata` class.
+
+A table can be non-partitioned and partitioned. Non-partitioned tables have 
only one top-level segment 
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned 
tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row 
groups, and partitions.
+
+A unique table identifier in Metastore Tables is a combination of storage 
plugin, workspace, and table name.
+Table metadata inside is grouped by top-level segments, unique identifier of 
the top-level segment and its metadata
+is storage plugin, workspace, table name, and metadata key.
+
+### Related Session/System Options
+
 
 Review comment:
   Thanks, replaced.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379534144
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill 
infers the schema by scanning your table 
+ in the same way as it is done during regular select and computes some 
metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more 
optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate 
and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
 
 Review comment:
   Thanks, reworded.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379543295
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill 
infers the schema by scanning your table 
+ in the same way as it is done during regular select and computes some 
metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more 
optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate 
and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property 
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the 
default one. Though any custom
+ implementation can be added by placing the JAR into classpath which has the 
implementation of
+ `org.apache.drill.metastore.Metastore` interface and indicating custom class 
in the `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
 
 Review comment:
   Thanks, I updated the section with the info you proposed and added a link to 
the main Jira.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379521993
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
 
 Review comment:
   Thanks, good idea. I have added a section that enumerates the problems the 
Metastore may help to solve.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379543754
 
 

 ##
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##
 @@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and 
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to 
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may 
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+   SET `metastore.enabled` = true;
+   ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill 
infers the schema by scanning your table 
+ in the same way as it is done during regular select and computes some 
metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more 
optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate 
and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` 
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution 
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property 
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+  implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the 
default one. Though any custom
+ implementation can be added by placing the JAR into classpath which has the 
implementation of
+ `org.apache.drill.metastore.Metastore` interface and indicating custom class 
in the `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in 
the future.
+
+### Metastore Tables
 
 Review comment:
   Thanks, I agree that it may seem a little confusing, so I changed it as you 
proposed.



[GitHub] [drill] vvysotskyi opened a new pull request #1986: Additional changes for Drill Metastore docs

2020-02-17 Thread GitBox

vvysotskyi opened a new pull request #1986: Additional changes for Drill 
Metastore docs
URL: https://github.com/apache/drill/pull/1986
 
 
   Changes after code review for https://github.com/apache/drill/pull/1953



[GitHub] [drill] vvysotskyi commented on issue #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox

vvysotskyi commented on issue #1985: DRILL-7565: ANALYZE TABLE ... REFRESH 
METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#issuecomment-587039702
 
 
   @KazydubB, thanks for the review, I have made the requested changes.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380237113
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/metadata/MetadataAggregateHelper.java
 ##
 @@ -117,16 +120,16 @@ private void createAggregatorInternal() {
   }
 }
 
-for (SchemaPath excludedColumn : excludedColumns) {
-  if (excludedColumn.equals(SchemaPath.getSimplePath(columnNamesOptions.rowGroupStart()))
-  || excludedColumn.equals(SchemaPath.getSimplePath(columnNamesOptions.rowGroupLength()))) {
-LogicalExpression lastModifiedTime = new FunctionCall("any_value",
+for (SchemaPath nonSchemaColumn : context.metadataColumns()) {
 
 Review comment:
   Sorry, missed it.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380236720
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/metastore/ColumnNamesOptions.java
 ##
 @@ -40,6 +41,7 @@ public ColumnNamesOptions(OptionManager optionManager) {
 this.rowGroupStart = 
optionManager.getOption(ExecConstants.IMPLICIT_ROW_GROUP_START_COLUMN_LABEL).string_val;
 this.rowGroupLength = 
optionManager.getOption(ExecConstants.IMPLICIT_ROW_GROUP_LENGTH_COLUMN_LABEL).string_val;
 this.lastModifiedTime = 
optionManager.getOption(ExecConstants.IMPLICIT_LAST_MODIFIED_TIME_COLUMN_LABEL).string_val;
+this.projectMetadataColumn = 
optionManager.getOption(ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL).string_val;
 
 Review comment:
   Good idea, thanks, done.



[GitHub] [drill] KazydubB commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox

KazydubB commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE 
... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380227482
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/metastore/ColumnNamesOptions.java
 ##
 @@ -40,6 +41,7 @@ public ColumnNamesOptions(OptionManager optionManager) {
 this.rowGroupStart = 
optionManager.getOption(ExecConstants.IMPLICIT_ROW_GROUP_START_COLUMN_LABEL).string_val;
 this.rowGroupLength = 
optionManager.getOption(ExecConstants.IMPLICIT_ROW_GROUP_LENGTH_COLUMN_LABEL).string_val;
 this.lastModifiedTime = 
optionManager.getOption(ExecConstants.IMPLICIT_LAST_MODIFIED_TIME_COLUMN_LABEL).string_val;
+this.projectMetadataColumn = 
optionManager.getOption(ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL).string_val;
 
 Review comment:
   I think it is better to declare 
`ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL` (and 
`ExecConstants.IMPLICIT_LAST_MODIFIED_TIME_COLUMN_LABEL`) as a `StringValidator` 
and use it as `this.projectMetadataColumn = 
optionManager.getOption(ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL);`.
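
   As a rough illustration of this suggestion, here is a minimal, self-contained 
sketch of the typed-validator pattern. The `OptionManager`, `TypedValidator`, 
and `StringValidator` types below are simplified, hypothetical stand-ins written 
for this example, not Drill's actual classes, and the option name is made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-ins for option classes; not Drill's real API. */
public class TypedOptionSketch {

  /** A validator that carries the option name and converts the raw value to type T. */
  interface TypedValidator<T> {
    String name();
    T convert(Object raw);
  }

  /** Validator for string-valued options. */
  static final class StringValidator implements TypedValidator<String> {
    private final String name;

    StringValidator(String name) {
      this.name = name;
    }

    @Override
    public String name() {
      return name;
    }

    @Override
    public String convert(Object raw) {
      if (!(raw instanceof String)) {
        throw new IllegalArgumentException("Option " + name + " is not a string");
      }
      return (String) raw;
    }
  }

  /** Minimal option store that offers both an untyped and a typed lookup. */
  static final class OptionManager {
    private final Map<String, Object> values = new HashMap<>();

    void set(String name, Object value) {
      values.put(name, value);
    }

    /** Untyped lookup: the caller has to extract the right kind of value itself. */
    Object getOption(String name) {
      return values.get(name);
    }

    /** Typed lookup: the validator decides the result type, so no cast at the call site. */
    <T> T getOption(TypedValidator<T> validator) {
      return validator.convert(values.get(validator.name()));
    }
  }

  /** Made-up option name, declared once as a typed validator constant. */
  static final StringValidator PROJECT_METADATA_COLUMN_LABEL =
      new StringValidator("example.project_metadata_column_label");

  public static void main(String[] args) {
    OptionManager optionManager = new OptionManager();
    optionManager.set(PROJECT_METADATA_COLUMN_LABEL.name(), "projectMetadata");

    // Untyped style: needs an explicit cast.
    String viaName = (String) optionManager.getOption(PROJECT_METADATA_COLUMN_LABEL.name());

    // Typed style: the validator supplies the type.
    String viaValidator = optionManager.getOption(PROJECT_METADATA_COLUMN_LABEL);

    System.out.println(viaName.equals(viaValidator));
  }
}
```

   The point of the pattern is that the constant itself carries the value type, 
so call sites do not need to reach into a value holder or cast.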



[GitHub] [drill] KazydubB commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox

KazydubB commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE 
... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380231898
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/metadata/MetadataAggregateHelper.java
 ##
 @@ -117,16 +120,16 @@ private void createAggregatorInternal() {
   }
 }
 
-for (SchemaPath excludedColumn : excludedColumns) {
-  if (excludedColumn.equals(SchemaPath.getSimplePath(columnNamesOptions.rowGroupStart()))
-  || excludedColumn.equals(SchemaPath.getSimplePath(columnNamesOptions.rowGroupLength()))) {
-LogicalExpression lastModifiedTime = new FunctionCall("any_value",
+for (SchemaPath nonSchemaColumn : context.metadataColumns()) {
 
 Review comment:
   Rename to `metadataColumn` or `implicitColumn`?



[GitHub] [drill] vvysotskyi commented on a change in pull request #1984: DRILL-7586: drill-hive-exec-shaded contains commons-lang3 version 3.1

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1984: DRILL-7586: 
drill-hive-exec-shaded contains commons-lang3 version 3.1
URL: https://github.com/apache/drill/pull/1984#discussion_r380155881
 
 

 ##
 File path: contrib/storage-hive/hive-exec-shade/pom.xml
 ##
 @@ -158,6 +158,8 @@
  you can use 
TestHiveStorage.readFromAlteredPartitionedTableWithEmptyGroupType() test case. 
-->
 org/apache/parquet/**
 shaded/parquet/org/**
+org/apache/commons/lang/**
 
 Review comment:
   I'm afraid this can break something, since Hive explicitly includes these 
libraries in the `hive-exec` jar: 
https://github.com/apache/hive/blob/master/ql/pom.xml#L958.
   
   As an alternative solution, I would recommend relocating them (as is done 
above for other libraries).



[GitHub] [drill] KazydubB commented on a change in pull request #1974: DRILL-7574: Generalize the projection parser

2020-02-17 Thread GitBox

KazydubB commented on a change in pull request #1974: DRILL-7574: Generalize 
the projection parser
URL: https://github.com/apache/drill/pull/1974#discussion_r380154237
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/resultSet/project/RequestedColumn.java
 ##
 @@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.resultSet.project;
+
+/**
+ * Plan-time properties of a requested column. Represents
+ * a consolidated view of the set of references to a column.
+ * For example, the project list might contain:
+ * SELECT columns[4], columns[8]
+ * SELECT a.b, a.c
+ * SELECT columns, columns[1]
+ * SELECT a, a.b
+ * In each case, the same column is referenced in different
+ * forms which are consolidated into this abstraction.
+ * 
+ * The resulting information is a "pattern": a form of reference
+ * with which a concrete type can be compatible or not. The project
+ * list does not contain sufficient information to definitively pick
+ * a type; it only excludes certain types.
+ * 
+ * Depending on the syntax, we can infer if a column must
+ * be an array or map. This is definitive: though we know that
+ * columns of the form above must be an array or a map,
+ * we cannot know if a simple column reference might refer
+ * to an array or map.
+ *
+ * Compatibility Rules
+ *
+ * The pattern given by projection is consistent with certain concrete types
+ * as follows. + means any number of additional qualifiers.
+ * 
+ * 
+ * Type | Consistent with
+ * Non-repeated MAP | {@code a+} {@code a.b+}
+ * Repeated MAP | {@code a+} {@code a.b+} {@code a[n].b+}
+ * Non-repeated Scalar | {@code a}
+ * Repeated Scalar | {@code a} {@code a[n]}
+ * Non-repeated DICT | {@code a} {@code a['key']}
+ * Repeated DICT | {@code a} {@code a[n]} {@code a['key']} {@code a[n]['key']}
 
 Review comment:
   I checked whether `m.a` is supported for `MAP` arrays: it looks like this is 
not supported in Drill.
   For a JSON file `file.json`:
   ```
   {"sa": [{"a": 1}, {"a": 2}, {"a": 3}]}
   {"sa": [{"a": 1}]}
   ```
   the following query ``select t.sa.a kv from dfs.`file.json` t`` produces two 
rows, each with a `null` value. (Should it have returned an error instead?)
   
   When types are known during planning, e.g. when querying a Hive table, 
the following validation error is raised: `VALIDATION ERROR: From line 1, 
column 27 to line 1, column 28: Cannot apply 'ITEM' to arguments of type 
'ITEM(, )'. Supported form(s): []
   []`
   (I used the following test in `TestHiveStructs.java`:
   ```
   @Test
   public void strWithArr2ByIdxP0111() throws Exception {
     HiveTestUtilities.assertNativeScanUsed(queryBuilder(), "struct_tbl_p");
     testBuilder()
         .sqlQuery("SELECT rid, t.str_wa_2.fa.sn p0 FROM hive.struct_tbl_p t")
         .unOrdered()
         .baselineColumns("rid", "p0")
         .expectsEmptyResultSet()
         .go();
   }
   ```
   )
   
   However, IIRC, Hive does support this behavior, but only for what Drill 
calls a repeated `MAP`, not for a repeated `DICT`.



[GitHub] [drill] oleg-zinovev commented on issue #1984: DRILL-7586: drill-hive-exec-shaded contains commons-lang3 version 3.1

2020-02-17 Thread GitBox

oleg-zinovev commented on issue #1984: DRILL-7586: drill-hive-exec-shaded 
contains commons-lang3 version 3.1
URL: https://github.com/apache/drill/pull/1984#issuecomment-586964736
 
 
   I cannot reproduce the error on any version of the JDK.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380128666
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/MetastoreAnalyzeTableHandler.java
 ##
 @@ -406,13 +406,13 @@ private DrillRel getTableAggRelNode(DrillRel 
convertedRelNode, boolean createNew
 SchemaPath lastModifiedTimeField =
 
SchemaPath.getSimplePath(config.getContext().getOptions().getString(ExecConstants.IMPLICIT_LAST_MODIFIED_TIME_COLUMN_LABEL));
 
-List<SchemaPath> excludedColumns = Arrays.asList(locationField, 
lastModifiedTimeField);
+List<SchemaPath> nonSchemaColumns = Arrays.asList(locationField, 
lastModifiedTimeField);
 
 Review comment:
   Thanks, renamed here and in other places.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380124653
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java
 ##
 @@ -237,6 +238,18 @@ private IterOutcome internalNext() {
   logger.trace("currentReader.next return recordCount={}", recordCount);
   Preconditions.checkArgument(recordCount >= 0, "recordCount from 
RecordReader.next() should not be negative");
   boolean isNewSchema = mutator.isNewSchema();
+  // adds additional record for the case of making scan for obtaining 
metadata if required
+  if (implicitValues != null) {
+String projectMetadataColumn = 
context.getOptions().getOption(ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL).string_val;
+if (recordCount > 0) {
+  // sets implicit value to false to signalize that some results were 
returned and there is no need for creating additional record
 
 Review comment:
   Thanks, updated the comment and added more details.
   
   Regarding the concept of the additional record, I will try to explain how 
Metastore collects the data in the general case; it may help to understand the 
reason for this decision.
   
   Drill Metastore may collect metadata for every file or row group, so 
aggregate calls for every column, grouped by the `fqn`, `rgi`, `dirX`... 
columns, were added.
   This approach works fine for non-empty files and row groups, but when an 
empty file is present, the Scan passes no data to the aggregation, so Metastore 
was ignoring such files.
   To solve this problem, I have added logic that returns a single record, 
with the correct values of the implicit columns, when no data was read. The 
additional implicit column helps to distinguish such records and collect info 
about the row count, schema, etc.
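
   The problem and the fix described above can be illustrated outside of Drill 
with a small plain-Java sketch. It is only a simplified model under stated 
assumptions: `Row`, `scan` and `rowCountPerFile` are hypothetical names and not 
Drill classes, and the `placeholder` flag merely stands in for the additional 
implicit column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EmptyFileScanSketch {

  // Hypothetical row: the file name (playing the role of `fqn`), a column value,
  // and a flag marking the single record emitted for a file without data.
  static final class Row {
    final String fqn;
    final Integer value;
    final boolean placeholder;

    Row(String fqn, Integer value, boolean placeholder) {
      this.fqn = fqn;
      this.value = value;
      this.placeholder = placeholder;
    }
  }

  // Simulated scan: an empty file produces no rows at all unless
  // emitPlaceholder is set, in which case it produces one marker row.
  static List<Row> scan(Map<String, List<Integer>> files, boolean emitPlaceholder) {
    List<Row> rows = new ArrayList<>();
    files.forEach((fqn, values) -> {
      if (values.isEmpty()) {
        if (emitPlaceholder) {
          rows.add(new Row(fqn, null, true));
        }
      } else {
        values.forEach(v -> rows.add(new Row(fqn, v, false)));
      }
    });
    return rows;
  }

  // Aggregation grouped by fqn: counts real rows only, so a placeholder row
  // creates the group for its file but contributes zero to the count.
  static Map<String, Long> rowCountPerFile(List<Row> rows) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (Row row : rows) {
      counts.merge(row.fqn, row.placeholder ? 0L : 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    Map<String, List<Integer>> files = new LinkedHashMap<>();
    files.put("a.parquet", Arrays.asList(1, 2, 3));
    files.put("empty.parquet", new ArrayList<>());

    // Without the placeholder the empty file never reaches the aggregation.
    System.out.println(rowCountPerFile(scan(files, false))); // {a.parquet=3}
    // With the placeholder the empty file is reported with a row count of 0.
    System.out.println(rowCountPerFile(scan(files, true)));  // {a.parquet=3, empty.parquet=0}
  }
}
```

   In this model the grouping key still comes from the marker row, which is why 
the record has to carry the correct values of the implicit columns.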



[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380077493
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
 ##
 @@ -511,6 +511,11 @@ private ExecConstants() {
   new OptionDescription("Available as of Drill 1.17. Sets the implicit 
column name for the lastModifiedTime column. " +
   "For internal usage when producing Metastore analyze."));
 
+  public static final String IMPLICIT_PROJECT_METADATA_COLUMN_LABEL = 
"drill.exec.storage.implicit.project_metadata.column.label";
+  public static final OptionValidator 
IMPLICIT_PROJECT_METADATA_COLUMN_LABEL_VALIDATOR = new 
StringValidator(IMPLICIT_PROJECT_METADATA_COLUMN_LABEL,
+  new OptionDescription("Available as of Drill 1.18. Sets the implicit 
column name for the $project_metadata$ column. " +
 
 Review comment:
   Good point. I specified the version here and in other places to be 
consistent with the other option descriptions.
   I think the version was added to option descriptions to simplify updating 
the docs on the Drill web site: there is no need to look up the commit date and 
the Drill version where the option was added; the text can simply be copied 
from the Drill Web UI or from this class.



[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380128453
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/metadata/MetadataAggregateHelper.java
 ##
 @@ -313,8 +317,19 @@ private void addColumnAggregateCalls(FieldReference 
fieldRef, String fieldName)
   if (interestingColumns == null || interestingColumns.contains(fieldRef)) 
{
 // collect statistics for all or only interesting columns if they are 
specified
 
AnalyzeColumnUtils.COLUMN_STATISTICS_FUNCTIONS.forEach((statisticsKind, 
sqlKind) -> {
+  // constructs "case when is not null projectMetadataColumn then 
column1 else null end" call
+  // to avoid using default values for required columns when data for 
empty result is obtained
 
 Review comment:
   Thanks for pointing this out. Unfortunately, we can't use a plain SQL 
approach to collect metadata, since we do not have information about the 
schema, so we create aggregate calls dynamically. Drill does, however, use 
built-in aggregate functions (`MIN`, `MAX`, ...) for collecting summary 
statistics.
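
   The guard mentioned in the code comment above ("case when is not null 
projectMetadataColumn then column1 else null end") can be sketched in plain 
Java as follows. This is an illustration only and rests on assumptions: `Row`, 
`guard` and `min` are hypothetical helpers, the marker is modelled as a 
nullable `Boolean`, and the marker row is assumed to hold the default value 0 
in a required `INT` column. In Drill the calls are built as expressions rather 
than written like this.

```java
import java.util.Arrays;
import java.util.List;

public class GuardedMinSketch {

  // Hypothetical row of a required (non-nullable) INT column. In this model a
  // row whose marker is null carries only the column's default value.
  static final class Row {
    final int value;
    final Boolean marker;

    Row(int value, Boolean marker) {
      this.value = value;
      this.marker = marker;
    }
  }

  // Counterpart of "case when marker is not null then value else null end".
  static Integer guard(Row row) {
    return row.marker != null ? Integer.valueOf(row.value) : null;
  }

  // MIN that skips nulls, as SQL aggregate functions do.
  // Returns null when no non-null input remains.
  static Integer min(List<Row> rows, boolean guarded) {
    Integer result = null;
    for (Row row : rows) {
      Integer v = guarded ? guard(row) : Integer.valueOf(row.value);
      if (v != null && (result == null || v < result)) {
        result = v;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Row> rows = Arrays.asList(
        new Row(5, Boolean.TRUE),
        new Row(7, Boolean.TRUE),
        new Row(0, null)); // default value, not real data

    System.out.println(min(rows, false)); // 0: the default value leaks into MIN
    System.out.println(min(rows, true));  // 5: the guard turns the default into null
  }
}
```

   Without the guard, the default value of the required column would be 
indistinguishable from real data in the collected statistics.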



[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox

vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380081293
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/metastore/analyze/MetadataAggregateContext.java
 ##
 @@ -63,8 +67,8 @@ public boolean createNewAggregations() {
   }
 
   @JsonProperty
-  public List excludedColumns() {
-return excludedColumns;
+  public List nonSchemaColumns() {
 
 Review comment:
   Thanks, the name `metadataColumns` looks better; I renamed this field.


