Re: [DISCUSS] Schema queries - solutions?
Hi Igor,

Thanks! I should have remembered that bit of SQL. Yes, if we can generalize `DESCRIBE`, we could create another path of some kind through the plugins that says, "return schema, not data." Then, for the HDF5 use case we could have:

DESCRIBE TABLE `dfs`.`myFile.hdf5` -- returns schema

And

SELECT * FROM `dfs`.`myFile.hdf5` -- returns data

Nice solution! I'll file a feature request.

The next interesting bit about HDF5 is that it is a file system: it contains multiple data sets. It would be great to be able to express that in the FROM clause:

SELECT * FROM `dfs`.`myFile.hdf5`.`dataSet1`

From my random walks through Calcite, it appears that we can have any level of schema/table path. True? We'd need some way to resolve a name part to a file, then ask the format plugin for that file if it supports additional parts. This seems pretty obscure. Have we done anything like that before? Maybe in a storage (rather than format) plugin?

Thanks,
- Paul

On Monday, February 17, 2020, 11:34:48 PM PST, Igor Guzenko wrote:

Hello Paul,

Seems like we simply need to improve our DESCRIBE [1] table functionality.

[1] https://drill.apache.org/docs/describe/

Thanks,
Igor
[GitHub] [drill] paul-rogers commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files
paul-rogers commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-587322871

@cgivre, one more design-level comment about this particular file format. You've mentioned several times that HDF5 is "a file system within a file." It finally clicked: we need to treat this file as a directory, not a file. This means adding a layer of schema in Calcite planning:

```
SELECT * FROM `dfs`.`some/path/myFile.hdf5`.`dataSet1`
```

This would let the reader load only data from `dataSet1`, using only the schema from that data set. (Can't use slashes; that is a notation for the Hadoop file system.)

Fortunately, Calcite seems to allow any number of schema levels. It is why we can have plugins, workspaces, etc. The challenge is to provide some way for a format plugin to influence the planner and say, "hey, if you do a query against me, ask me to resolve all path elements below my file name."

Again, not something for this PR. But, it is something we can think about as we try to improve our storage plugin API.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
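
The resolution step described above could look roughly like the sketch below. It is only an illustration under stated assumptions, not Drill's or Calcite's actual API: `SubPathFormat`, `Resolved` and `TablePathResolver` are hypothetical names, and the sketch assumes the first name part is the storage plugin and the second is the file path.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical hook: a format plugin that can resolve names below its file. */
interface SubPathFormat {
  /** True if this format treats the named file as a container of data sets. */
  boolean handles(String fileName);
}

/** Result of splitting a table path into a file and the parts inside it. */
final class Resolved {
  final String file;
  final List<String> innerParts;

  Resolved(String file, List<String> innerParts) {
    this.file = file;
    this.innerParts = innerParts;
  }
}

final class TablePathResolver {
  // Assumption for this sketch: parts.get(0) is the storage plugin name
  // (for example "dfs"), parts.get(1) is the file path, and anything after
  // that names something inside the file.
  static Optional<Resolved> resolve(List<String> parts, SubPathFormat format) {
    if (parts.size() < 2) {
      return Optional.empty();
    }
    String file = parts.get(1);
    List<String> inner = parts.subList(2, parts.size());
    // Extra parts are only legal if the format plugin claims the file.
    if (!inner.isEmpty() && !format.handles(file)) {
      return Optional.empty();
    }
    return Optional.of(new Resolved(file, inner));
  }

  public static void main(String[] args) {
    SubPathFormat hdf5 = name -> name.endsWith(".hdf5");
    Optional<Resolved> r =
        resolve(Arrays.asList("dfs", "some/path/myFile.hdf5", "dataSet1"), hdf5);
    System.out.println(r.get().file + " -> " + r.get().innerParts);
  }
}
```

The point of the design is that the planner never interprets the trailing parts itself; it only asks the format whether it is willing to.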
Re: [DISCUSS] Schema queries - solutions?
Hello Paul,

Seems like we simply need to improve our DESCRIBE [1] table functionality.

[1] https://drill.apache.org/docs/describe/

Thanks,
Igor
[DISCUSS] Schema queries - solutions?
Hi All,

Charles has a little PR, #1978, that adds a convenient feature to his HDF5 format reader: the ability to query the schema of the file. (It seems that HDF5 is a bit like a zip file: it contains a set of files. Unlike zip, each file is a data set with a schema.) Charles added a clever way to tell the reader that the user wants a schema rather than data.

If we think a bit, we realize that a schema query would be handy for any data source. Maybe I want to know the fields in a JSON or Parquet file without getting the data for those fields (and, for example, inferring type and nullability from data.)

In a relational DB, we'd get the schema by querying system tables. We'd do the same thing in Hive because Hive requires an up-front schema. But, Drill is unique in that it can infer schema at run time; no previous schema required. So, we have no system tables to answer schema questions. Instead, we want to get the schema directly from the data source itself by doing a query.

(This feature would be in addition to the case when the Metastore does hold a schema.)

How might we accomplish the same result? Can we create some kind of "virtual" system table that tells us to rewrite the query to get schema? Something like:

SELECT * FROM sys.columns WHERE tableName = `dfs`.`my/path/someFile.json`

Or, maybe some implied columns in the table schema?

SELECT schema.* FROM `dfs`.`my/path/someFile.json`

Or, maybe some special schema name space?

SELECT schema.* FROM schema.`dfs`.`my/path/someFile.json`

Anyone know of any system that has an elegant solution we could mimic? Other suggestions?

Thanks,
- Paul
[GitHub] [drill] paul-rogers commented on a change in pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files
paul-rogers commented on a change in pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380488669

## File path: contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java ##
@@ -1069,26 +1125,30 @@ private void getAndMapCompoundData(String path, List fieldNames, IHDF5Re
   for (int col = 0; col < values[row].length; col++) {
     assert fieldNames != null;
     currentFieldName = fieldNames.get(col);
-    ArrayWriter innerWriter = listWriter.array(currentFieldName);
-    if (values[row][col] instanceof Integer) {
-      innerWriter.scalar().setInt((Integer) values[row][col]);
-    } else if (values[row][col] instanceof Short) {
-      innerWriter.scalar().setInt((Short) values[row][col]);
-    } else if (values[row][col] instanceof Byte) {
-      innerWriter.scalar().setInt((Byte) values[row][col]);
-    } else if (values[row][col] instanceof Long) {
-      innerWriter.scalar().setLong((Long) values[row][col]);
-    } else if (values[row][col] instanceof Float) {
-      innerWriter.scalar().setDouble((Float) values[row][col]);
-    } else if (values[row][col] instanceof Double) {
-      innerWriter.scalar().setDouble((Double) values[row][col]);
-    } else if (values[row][col] instanceof BitSet || values[row][col] instanceof Boolean) {
-      innerWriter.scalar().setBoolean((Boolean) values[row][col]);
-    } else if (values[row][col] instanceof String) {
-      innerWriter.scalar().setString((String) values[row][col]);
-    }
-    if (col == values[row].length) {
-      innerWriter.save();
+    try {
+      ArrayWriter innerWriter = listWriter.array(currentFieldName);
+      if (values[row][col] instanceof Integer) {

Review comment:
I realize that this is existing code, but boxing and comparing each value will be slow and will thrash the heap. Far better if we can use "shims" that can read the data as the Java primitive type and write it directly to the corresponding `set()` method without boxing.
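
One way to read the "shim" idea is sketched below: pick a typed copier once per column, then move primitives without any `instanceof` checks or boxing per value. This is a minimal sketch under assumptions, not the HDF5 reader's code: `ColumnShim`, `IntShim`, `DoubleShim` and `ShimDemo` are hypothetical names, and the `IntConsumer`/`DoubleConsumer` sinks merely stand in for typed writer calls such as the `setInt()` and `setDouble()` seen in the diff.

```java
import java.util.function.DoubleConsumer;
import java.util.function.IntConsumer;

/** Hypothetical per-column shim: copies one value with no boxing. */
interface ColumnShim {
  void copy(int row);
}

/** Reads from a primitive int[] and writes to an int-typed sink. */
final class IntShim implements ColumnShim {
  private final int[] source;
  private final IntConsumer sink; // stands in for a typed setInt(int) call

  IntShim(int[] source, IntConsumer sink) {
    this.source = source;
    this.sink = sink;
  }

  @Override
  public void copy(int row) {
    sink.accept(source[row]);
  }
}

/** Same pattern for double data. */
final class DoubleShim implements ColumnShim {
  private final double[] source;
  private final DoubleConsumer sink; // stands in for a typed setDouble(double) call

  DoubleShim(double[] source, DoubleConsumer sink) {
    this.source = source;
    this.sink = sink;
  }

  @Override
  public void copy(int row) {
    sink.accept(source[row]);
  }
}

final class ShimDemo {
  /** Copies every row through the shim and totals what the sink received. */
  static long sumViaShim(int[] data) {
    long[] total = {0};
    // The type decision is made once, here, not once per value.
    ColumnShim shim = new IntShim(data, v -> total[0] += v);
    for (int row = 0; row < data.length; row++) {
      shim.copy(row);
    }
    return total[0];
  }

  public static void main(String[] args) {
    System.out.println(sumViaShim(new int[] {1, 2, 3}));
  }
}
```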
[GitHub] [drill] paul-rogers commented on a change in pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files
paul-rogers commented on a change in pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380490693

## File path: contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Format.java ##
@@ -98,9 +98,9 @@ public void testStarQuery() throws Exception {
   testBuilder()
     .sqlQuery("SELECT * FROM dfs.`hdf5/dset.h5`")
-    .unOrdered()
-    .baselineColumns("path", "data_type", "file_name", "int_data")
-    .baselineValues("/dset", "DATASET", "dset.h5", finalList)
+    .ordered()

Review comment:
This might be the place to ask the question about schema. We have two distinct views of a data set. The general rule of the wildcard (`*`) is to return all available columns. Here, we special-case the wildcard to mean "return metadata." This is, unfortunately, very nonstandard.

We need some way to express two views of the file. The same problem occurs for any database. We could even use it for JSON, CSV and other file formats. The challenge is, how do we tell the query we want metadata and not data?

In a normal DB, we query system tables. Perhaps we could jimmy up something in Drill:

```
SELECT * FROM sys.schema.dfs.`hdf5/dset.h5`
```

Or, maybe think of the table as a namespace, and have an optional `.schema` tail:

```
SELECT * FROM dfs.`hdf5/dset.h5`.schema
```

The point is not for you to implement this, or even to design the solution. Rather, the point is that the current solution is a hack, and that we need a better solution.
[GitHub] [drill] paul-rogers commented on a change in pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files
paul-rogers commented on a change in pull request #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380487624

## File path: contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java ##
@@ -92,6 +93,20 @@
   private static final String LONG_COLUMN_NAME = "long_data";

+  private static final String DATA_SIZE_COLUMN_NAME = "data_size";
+
+  private static final String ELEMENT_COUNT_NAME = "element_count";
+
+  private static final String IS_TIMESTAMP_NAME = "is_timestamp";

Review comment:
The two `is` columns appear mutually exclusive. I wonder, does it make sense to define an `extended_type` column if `data_type` is the Drill type? That is, for most columns, `extended_type` would be null. For these two it would be, say, `TIMESTAMP` or `TIME_DURATION`.

Though, truth be told, Drill has `TIMESTAMP` and `INTERVAL` columns, so if we mapped the HDF5 type to these Drill types, we would not need the extended type (or these two Boolean columns).
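
To make the `extended_type` suggestion concrete, here is a small sketch of collapsing two mutually exclusive flags into one nullable value. It is illustrative only: `ExtendedType` and `ExtendedTypes.fromFlags` are hypothetical names, and the second flag (`isTimeDuration`) is an assumption, since only `is_timestamp` appears in the quoted diff.

```java
/** Hypothetical values for a nullable extended_type column. */
enum ExtendedType { TIMESTAMP, TIME_DURATION }

final class ExtendedTypes {
  /**
   * Maps the two Boolean flags to a single value.
   * Returns null for ordinary columns, mirroring a SQL NULL.
   */
  static ExtendedType fromFlags(boolean isTimestamp, boolean isTimeDuration) {
    if (isTimestamp && isTimeDuration) {
      // The flags are described as mutually exclusive, so treat this as an error.
      throw new IllegalArgumentException("a column cannot be both timestamp and duration");
    }
    if (isTimestamp) {
      return ExtendedType.TIMESTAMP;
    }
    if (isTimeDuration) {
      return ExtendedType.TIME_DURATION;
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(fromFlags(true, false));
    System.out.println(fromFlags(false, false));
  }
}
```

A single column makes the illegal "both true" state unrepresentable in the result set, which two Boolean columns cannot do.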
[GitHub] [drill] paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions iss
paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions issue for TestGracefulShutdown tests
URL: https://github.com/apache/drill/pull/1987#discussion_r380485944

## File path: exec/java-exec/src/test/java/org/apache/drill/test/TestGracefulShutdown.java ##
@@ -262,17 +265,15 @@ private boolean waitAndAssertDrillbitCount(ClusterFixture cluster, int zkRefresh
   }

   private static void setupFile(int file_num) throws Exception {
-    final String file = "employee"+file_num+".json";
-    final Path path = dirTestWatcher.getRootDir().toPath().resolve(file);
-    try(PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(path.toFile(), true)))) {
+    String file = "employee" + file_num + ".json";
+    Path path = dirTestWatcher.getRootDir().toPath().resolve(file);
+    try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(path.toFile(), true)))) {

Review comment:
I realize the code here is original, but it might be a bit cleaner to put the data in a resource file than in Java.
[GitHub] [drill] paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions iss
paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions issue for TestGracefulShutdown tests
URL: https://github.com/apache/drill/pull/1987#discussion_r380485298

## File path: exec/java-exec/src/test/java/org/apache/drill/exec/work/filter/BloomFilterTest.java ##
@@ -133,214 +135,227 @@ public boolean hasFailed() {
     }
   }

-
   @Test
   public void testNotExist() throws Exception {
-    Drillbit bit = new Drillbit(c, RemoteServiceSet.getLocalServiceSet(), ClassPathScanner.fromPrescan(c));
-    bit.run();
-    DrillbitContext bitContext = bit.getContext();
-    FunctionImplementationRegistry registry = bitContext.getFunctionImplementationRegistry();
-    FragmentContextImpl context = new FragmentContextImpl(bitContext, BitControl.PlanFragment.getDefaultInstance(), null, registry);
-    BufferAllocator bufferAllocator = bitContext.getAllocator();
-    //create RecordBatch
-    VarCharVector vector = new VarCharVector(SchemaBuilder.columnSchema("a", TypeProtos.MinorType.VARCHAR, TypeProtos.DataMode.REQUIRED), bufferAllocator);
-    vector.allocateNew();
-    int valueCount = 3;
-    VarCharVector.Mutator mutator = vector.getMutator();
-    mutator.setSafe(0, "a".getBytes());
-    mutator.setSafe(1, "b".getBytes());
-    mutator.setSafe(2, "c".getBytes());
-    mutator.setValueCount(valueCount);
-    VectorContainer vectorContainer = new VectorContainer();
-    TypedFieldId fieldId = vectorContainer.add(vector);
-    RecordBatch recordBatch = new TestRecordBatch(vectorContainer);
-    //construct hash64
-    ValueVectorReadExpression exp = new ValueVectorReadExpression(fieldId);
-    LogicalExpression[] expressions = new LogicalExpression[1];
-    expressions[0] = exp;
-    TypedFieldId[] fieldIds = new TypedFieldId[1];
-    fieldIds[0] = fieldId;
-    ValueVectorHashHelper valueVectorHashHelper = new ValueVectorHashHelper(recordBatch, context);
-    ValueVectorHashHelper.Hash64 hash64 = valueVectorHashHelper.getHash64(expressions, fieldIds);
-
-    //construct BloomFilter
-    int numBytes = BloomFilter.optimalNumOfBytes(3, 0.03);
-
-    BloomFilter bloomFilter = new BloomFilter(numBytes, bufferAllocator);
-    for (int i = 0; i < valueCount; i++) {
-      long hashCode = hash64.hash64Code(i, 0, 0);
-      bloomFilter.insert(hashCode);
+    int userPort = QueryTestUtil.getFreePortNumber(31170, 300);
+    int bitPort = QueryTestUtil.getFreePortNumber(31180, 300);
+    ClusterFixtureBuilder clusterFixtureBuilder = ClusterFixture.bareBuilder(dirTestWatcher)
+        .configProperty(ExecConstants.INITIAL_USER_PORT, userPort)
+        .configProperty(ExecConstants.INITIAL_BIT_PORT, bitPort)
+        .configProperty(ExecConstants.ALLOW_LOOPBACK_ADDRESS_BINDING, true);
+    try (ClusterFixture cluster = clusterFixtureBuilder.build()) {
+      Drillbit bit = cluster.drillbit();
+      DrillbitContext bitContext = bit.getContext();
+      FunctionImplementationRegistry registry = bitContext.getFunctionImplementationRegistry();
+      FragmentContextImpl context = new FragmentContextImpl(bitContext, BitControl.PlanFragment.getDefaultInstance(), null, registry);
+      BufferAllocator bufferAllocator = bitContext.getAllocator();
+      //create RecordBatch
+      VarCharVector vector = new VarCharVector(SchemaBuilder.columnSchema("a", TypeProtos.MinorType.VARCHAR, TypeProtos.DataMode.REQUIRED), bufferAllocator);
+      vector.allocateNew();
+      int valueCount = 3;
+      VarCharVector.Mutator mutator = vector.getMutator();
+      mutator.setSafe(0, "a".getBytes());
+      mutator.setSafe(1, "b".getBytes());
+      mutator.setSafe(2, "c".getBytes());
+      mutator.setValueCount(valueCount);
+      VectorContainer vectorContainer = new VectorContainer();
+      TypedFieldId fieldId = vectorContainer.add(vector);
+      RecordBatch recordBatch = new TestRecordBatch(vectorContainer);
+      //construct hash64
+      ValueVectorReadExpression exp = new ValueVectorReadExpression(fieldId);
+      LogicalExpression[] expressions = new LogicalExpression[1];
+      expressions[0] = exp;
+      TypedFieldId[] fieldIds = new TypedFieldId[1];
+      fieldIds[0] = fieldId;
+      ValueVectorHashHelper valueVectorHashHelper = new ValueVectorHashHelper(recordBatch, context);
+      ValueVectorHashHelper.Hash64 hash64 = valueVectorHashHelper.getHash64(expressions, fieldIds);
+
+      //construct BloomFilter
+      int numBytes = BloomFilter.optimalNumOfBytes(3, 0.03);
+
+      BloomFilter bloomFilter = new BloomFilter(numBytes, bufferAllocator);
+      for (int i = 0; i < valueCount; i++) {
+        long hashCode = hash64.hash64Code(i, 0, 0);
+        bloomFilter.insert(hashCode);
+      }
+
+      //-create probe side RecordBatch-
+      VarCharVector probeVector = new VarCharVector(SchemaBuilder.columnSchema("a",
[GitHub] [drill] paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions iss
paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions issue for TestGracefulShutdown tests
URL: https://github.com/apache/drill/pull/1987#discussion_r380481139

## File path: exec/java-exec/src/test/java/org/apache/drill/exec/work/filter/BloomFilterTest.java ##
@@ -133,214 +135,227 @@ public boolean hasFailed() {
     }
   }

-
   @Test
   public void testNotExist() throws Exception {
-    Drillbit bit = new Drillbit(c, RemoteServiceSet.getLocalServiceSet(), ClassPathScanner.fromPrescan(c));
-    bit.run();
-    DrillbitContext bitContext = bit.getContext();
-    FunctionImplementationRegistry registry = bitContext.getFunctionImplementationRegistry();
-    FragmentContextImpl context = new FragmentContextImpl(bitContext, BitControl.PlanFragment.getDefaultInstance(), null, registry);
-    BufferAllocator bufferAllocator = bitContext.getAllocator();
-    //create RecordBatch
-    VarCharVector vector = new VarCharVector(SchemaBuilder.columnSchema("a", TypeProtos.MinorType.VARCHAR, TypeProtos.DataMode.REQUIRED), bufferAllocator);
-    vector.allocateNew();
-    int valueCount = 3;
-    VarCharVector.Mutator mutator = vector.getMutator();
-    mutator.setSafe(0, "a".getBytes());
-    mutator.setSafe(1, "b".getBytes());
-    mutator.setSafe(2, "c".getBytes());
-    mutator.setValueCount(valueCount);
-    VectorContainer vectorContainer = new VectorContainer();
-    TypedFieldId fieldId = vectorContainer.add(vector);
-    RecordBatch recordBatch = new TestRecordBatch(vectorContainer);
-    //construct hash64
-    ValueVectorReadExpression exp = new ValueVectorReadExpression(fieldId);
-    LogicalExpression[] expressions = new LogicalExpression[1];
-    expressions[0] = exp;
-    TypedFieldId[] fieldIds = new TypedFieldId[1];
-    fieldIds[0] = fieldId;
-    ValueVectorHashHelper valueVectorHashHelper = new ValueVectorHashHelper(recordBatch, context);
-    ValueVectorHashHelper.Hash64 hash64 = valueVectorHashHelper.getHash64(expressions, fieldIds);
-
-    //construct BloomFilter
-    int numBytes = BloomFilter.optimalNumOfBytes(3, 0.03);
-
-    BloomFilter bloomFilter = new BloomFilter(numBytes, bufferAllocator);
-    for (int i = 0; i < valueCount; i++) {
-      long hashCode = hash64.hash64Code(i, 0, 0);
-      bloomFilter.insert(hashCode);
+    int userPort = QueryTestUtil.getFreePortNumber(31170, 300);
+    int bitPort = QueryTestUtil.getFreePortNumber(31180, 300);
+    ClusterFixtureBuilder clusterFixtureBuilder = ClusterFixture.bareBuilder(dirTestWatcher)

Review comment:
Do you need a full cluster for this? There is a `SubOperatorTest` that will give you a fragment context and allocator so you can create vectors and invoke "sub-operator" functionality such as the BloomFilter stuff.

If any of the code under test needs the `DrillbitContext`, perhaps look at modifying it so that it doesn't. There is nothing a Bloom filter should need.
[GitHub] [drill] paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions iss
paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions issue for TestGracefulShutdown tests
URL: https://github.com/apache/drill/pull/1987#discussion_r380479013

## File path: exec/java-exec/src/test/java/org/apache/drill/exec/udf/dynamic/TestDynamicUDFSupport.java ##
@@ -104,8 +104,10 @@ public static void buildAndStoreDefaultJars() throws IOException {
   @Before
   public void setupNewDrillbit() throws Exception {
     udfDir = dirTestWatcher.makeSubDir(Paths.get("udf"));
+    File udfLocalDir = dirTestWatcher.makeSubDir(Paths.get("udf", "local"));
     Properties overrideProps = new Properties();
     overrideProps.setProperty(ExecConstants.UDF_DIRECTORY_ROOT, udfDir.getAbsolutePath());
+    overrideProps.setProperty(ExecConstants.UDF_DIRECTORY_LOCAL, udfLocalDir.getAbsolutePath());

Review comment:
We've got lots of local directory properties, and it is hard to keep them all in sync. I wonder if we can use a feature of HOCON to default them to a known structure:

```
exec: {
  ...
  local: {
    baseDir: "/tmp/drill",
    udfDir: ${drill.exec.local.baseDir}"/udf",
    pluginDir: ${drill.exec.local.baseDir}"/plugins",
    ...
  },
```

There is probably some work to do in the `ClusterFixture` and `DirTestWatcher` to get everything set up.
[GitHub] [drill] paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions iss
paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions issue for TestGracefulShutdown tests
URL: https://github.com/apache/drill/pull/1987#discussion_r380485678

## File path: exec/java-exec/src/test/java/org/apache/drill/test/TestGracefulShutdown.java ##
@@ -73,31 +73,39 @@ private static void enableDrillPortHunting(ClusterFixtureBuilder builder) {
     builder.configBuilder.put(ExecConstants.DRILL_PORT_HUNT, true);
     builder.configBuilder.put(ExecConstants.GRACE_PERIOD, 500);
     builder.configBuilder.put(ExecConstants.ALLOW_LOOPBACK_ADDRESS_BINDING, true);
+
+    setTestDirectories(builder);
+  }
+
+  private static void setTestDirectories(ClusterFixtureBuilder builder) {
+    builder.configBuilder.put(ExecConstants.DRILL_TMP_DIR, dirTestWatcher.getTmpDir().getAbsolutePath());
+    builder.configBuilder.put(ExecConstants.SYS_STORE_PROVIDER_LOCAL_PATH, dirTestWatcher.getStoreDir().getAbsolutePath());

Review comment:
Can this be done in `ClusterFixture` or its builder so we use a consistent set of directories everywhere? I've been burned by these being a bit ill-defined.
[GitHub] [drill] paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions iss
paul-rogers commented on a change in pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions issue for TestGracefulShutdown tests
URL: https://github.com/apache/drill/pull/1987#discussion_r380476183

## File path: exec/java-exec/src/test/java/org/apache/drill/exec/udf/dynamic/TestDynamicUDFSupport.java ##
@@ -104,8 +104,10 @@ public static void buildAndStoreDefaultJars() throws IOException {
   @Before
   public void setupNewDrillbit() throws Exception {
     udfDir = dirTestWatcher.makeSubDir(Paths.get("udf"));
+    File udfLocalDir = dirTestWatcher.makeSubDir(Paths.get("udf", "local"));

Review comment:
The `DirTestWatcher` has internal support for each of Drill's working directories. Might we want to add another directory for UDF files?
[GitHub] [drill] paul-rogers commented on issue #1971: DRILL-7572: JSON structure parser
paul-rogers commented on issue #1971: DRILL-7572: JSON structure parser
URL: https://github.com/apache/drill/pull/1971#issuecomment-587300916

@vvysotskyi, thanks for pointing out the question; I missed it when reading the code comments.

Looked at `CountingJsonReader`. It looks like it creates a series of rows, one per input row, with just a bit field set to 1. This reader could do exactly the same by projecting none of the columns and instead writing that bit = 1 value for the start of each top-level object. The non-projected columns will "free-wheel" over the incoming JSON.

A better solution is to actually return the count. Maybe we need another option on the format plugin, `supportsCountPushDown()`, so that we return the per-file row count rather than grinding through the effort of making trivial rows.

EVF has support for this idea with its notion of "project none", which occurs when the scan asks for no columns, as in a `COUNT(*)`.
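
A rough sketch of what the count push-down option might enable is shown below. Only the method name `supportsCountPushDown()` comes from the comment above; `CountingFormat`, `rowCount()` and `CountStarPlanner` are hypothetical names invented for illustration, not Drill APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical format-plugin capability, named after the suggestion above. */
interface CountingFormat {
  boolean supportsCountPushDown();

  /** Row count for one file, obtained without materializing rows. */
  long rowCount(String file);
}

final class CountStarPlanner {
  /**
   * Returns the total row count if the format can push down COUNT(*),
   * or empty to signal that the caller must fall back to a normal scan.
   */
  static OptionalLong pushedDownCount(List<String> files, CountingFormat format) {
    if (!format.supportsCountPushDown()) {
      return OptionalLong.empty();
    }
    long total = 0;
    for (String file : files) {
      total += format.rowCount(file);
    }
    return OptionalLong.of(total);
  }

  public static void main(String[] args) {
    // Toy format for the demo: pretends every file holds 10 rows.
    CountingFormat fixed = new CountingFormat() {
      @Override
      public boolean supportsCountPushDown() {
        return true;
      }

      @Override
      public long rowCount(String file) {
        return 10;
      }
    };
    System.out.println(pushedDownCount(Arrays.asList("a.json", "b.json"), fixed));
  }
}
```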
[GitHub] [drill] paul-rogers opened a new pull request #1988: DRILL-7590: Refactor plugin registry
paul-rogers opened a new pull request #1988: DRILL-7590: Refactor plugin registry URL: https://github.com/apache/drill/pull/1988 Performs a thorough "spring cleaning" of the storage plugin registry to prepare it to add a proper plugin API. This is a complex PR with lots going on. The plugin registry connects configurations, stored in ZK, with implementations, which are Java classes. The registry handles a large number of tasks including: * Populating "bootstrap" plugin configurations and handling upgrades. * Reading from, and writing to, the persistent store in ZK. * Handling "normal" (configured) plugins and special system plugins (which have no configuration.) * Handle format plugins which are always associated with the DFS storage plugin. * Handle "ephemeral" plugins which correspond to configs not stored in the registry. * And so on. The code has grown overly complex. As we look to add a new, cleaner plugin mechanism, we will start by cleaning up what we have to allow the new mechanism to be one of many. ## Terminology There is no more confusing term in Drill than "plugin." That single term can mean: * The stored JSON definition for a plugin config. (What we see in the web console.) * The config object which holds the configuation. * The storage plugin instance with the config attached. This is the functional aspect of a plugin. * The storage plugin class itself. To make the following discussion clearer, we redefine terms as: * *Connector*: the storage plugin class (which needs a config to be useful) * *Plugin*: the configuration of a plugin in any of its three forms: JSON, config-only or as part of a connector + config pair. ## Standard and System Connectors The registry class handled many tasks itself, making the code hard to follow. The first task is to split apart responsibilities into separate classses. 
The registry handles two kinds of plugins at present: * "Classic" plugins are those defined by a `StoragePluginConfig` subclass and a `StoragePlugin` subclass with a specific constructor. Their configs are persistently stored in ZK. That is, the storage plugins most of us think about. * System plugins are a special case: they are always defined by default, and have no (or, actually, an implicit) config. Examples: `sys` and `information_schema` System plugins have the `` annotation, are created at boot time, and do not reside in the ZK store. The first step is to split out these two kinds of plugins into separate "provider" classes, along with a common interface. A new `ConnectorProvider` interface has two implementations: one for "classic" plugins another for system plugins. Then, when we add the new mechanism, it becomes a third plugin provider. ## Bootstrap and Upgrade The registry also handles the process of initializing a newly installed Drill, or upgrading an existing one. The code for this is pulled out into a separate class. Moved the names of the bootstrap plugins and plugins upgrade files into the config system to allow easier testing with test-specific files. Added complete unit tests. ## Plugin Lifecycle Plugins have a surprisingly robust lifecycle. Revised the code to better model the nuances of the lifecycle (and fix a number of subtle bugs). Plugin instances must be created, but only for standard plugins (not system plugins). Added a `ConnectorHandle` so we can track the source of each connector so that the locator can create connector instances (for standard plugins) or not (for system plugins.) Plugins are defined by persistent storage as a (name, config) pair. There is no reason to create a connector instance just to load plugins from storage. So, added a `PluginHandle` class to hold onto the (name, config, `ConnectorHandle`) triple. This handle then allows us to do lazy instantiation of the connector class. 
Rather than creating it on load, we wait until some code actually needs the plugin. (Some code still demands that we load all plugins; this can be fixed in a later PR.) The registry API was changed to make this clear. `createOrUpdate()` is renamed to `put` and no longer returns the plugin instance (which, it turned out, was never used.) Now, we don't create the connector instance until `getPlugin()` is called. Added a new `getConfig()` method for the many times we only want the config and don't actually need the instance. Drill is a concurrent, distributed system. Plugin (configurations) can change at any time. We might change `dfs` while queries run. The registry supports "ephemeral" plugins, those that occur in a query execution plan, but do not match a name in persistent storage. Previously, ephemeral plugins were not connected to normal named plugins. Revised this so that
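The lazy-instantiation idea in the PR description above (store only the name and config on `put`, and create the connector instance when `getPlugin()` is first called) can be illustrated with a minimal sketch. The classes `SketchPluginHandle` and `SketchRegistry`, and the use of a `Function` as a stand-in for the connector handle, are assumptions made for illustration only; they are not the classes or signatures from the PR.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical stand-in for a (name, config, connector source) triple.
// C is the config type, P the connector instance type; neither is a Drill class.
final class SketchPluginHandle<C, P> {
  private final String name;
  private final C config;
  private final Function<C, P> connectorFactory; // stands in for the connector handle
  private P instance;                            // null until first requested

  SketchPluginHandle(String name, C config, Function<C, P> connectorFactory) {
    this.name = name;
    this.config = config;
    this.connectorFactory = connectorFactory;
  }

  String name() { return name; }

  // Returning the config never creates the connector instance.
  C config() { return config; }

  // Creates the connector instance on first use, then reuses it.
  synchronized P plugin() {
    if (instance == null) {
      instance = connectorFactory.apply(config);
    }
    return instance;
  }
}

// Hypothetical registry keyed by plugin name.
final class SketchRegistry<C, P> {
  private final Map<String, SketchPluginHandle<C, P>> handles = new ConcurrentHashMap<>();
  private final Function<C, P> connectorFactory;

  SketchRegistry(Function<C, P> connectorFactory) {
    this.connectorFactory = connectorFactory;
  }

  // Stores the (name, config) pair only; nothing is instantiated here.
  void put(String name, C config) {
    handles.put(name, new SketchPluginHandle<>(name, config, connectorFactory));
  }

  // Config-only lookup for callers that do not need the instance.
  C getConfig(String name) {
    SketchPluginHandle<C, P> handle = handles.get(name);
    return handle == null ? null : handle.config();
  }

  // Instance lookup; this is the only path that triggers creation.
  P getPlugin(String name) {
    SketchPluginHandle<C, P> handle = handles.get(name);
    return handle == null ? null : handle.plugin();
  }
}
```

In this sketch, callers that only need the configuration go through `getConfig()` and never pay for connector creation, which mirrors the motivation given in the description.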
[GitHub] [drill] cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files
cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files URL: https://github.com/apache/drill/pull/1978#issuecomment-587259759 @paul-rogers @vvysotskyi See above comment. I removed the config option and added logger warnings if the data is truncated. Again, this is just for "preview" mode so real data queries are not affected. In doing this PR, I discovered that the HDF5 format allows for arrays within compound fields. This functionality is not supported by Drill so I added a warning for that. In the future, or if anyone asks for it, I may add it but for now, I'm leaving that alone. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] vvysotskyi opened a new pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions issue for TestGr
vvysotskyi opened a new pull request #1987: DRILL-7589: Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions issue for TestGracefulShutdown tests URL: https://github.com/apache/drill/pull/1987 # [DRILL-7589](https://issues.apache.org/jira/browse/DRILL-7589): Set temporary tests folder for UDF_DIRECTORY_LOCAL, fix allocators closing in BloomFilterTest, fix permissions issue for TestGracefulShutdown tests ## Description Initially, `UDF_DIRECTORY_LOCAL` had a default value for tests and was set to `/tmp/drill/udf/udf/local`. Changed its value to refer to the test directory. Hope it will help to fix CI failures. Fixed the following error for `TestGracefulShutdown` tests (it was only logged; the tests still passed): ``` Unable to store data for the path [file:/var/log/drill/profiles/21b7ceae-680b-91ab-3cd2-24f6d5d53a7d.sys.drill]: Mkdirs failed to create file:/var/log/drill/profiles (exists=false, cwd=file:/home/runner/work/drill/drill/exec/java-exec) ``` Fixed closing allocators for `BloomFilterTest` tests; the following error was logged after the tests from this class finished: ``` java.lang.IllegalStateException: Allocator[ROOT] closed with outstanding buffers allocated (1). ``` ## Documentation NA ## Testing Checked several times on GitHub Actions Jobs on the forked repo.
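The change described above replaces a fixed, shared location (`/tmp/drill/udf/udf/local`) with a directory under the test's own folder. A minimal sketch of that general idea follows. The class name `TestUdfDirSketch`, the method `localUdfDir`, and the `udf/local` layout are hypothetical; this is not how the PR itself sets `UDF_DIRECTORY_LOCAL`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: derives the local UDF directory from a per-test base
// directory instead of a fixed path shared by every test run on the machine.
final class TestUdfDirSketch {
  // Builds <testBaseDir>/udf/local and creates it (with parents) if missing.
  static Path localUdfDir(Path testBaseDir) {
    Path dir = testBaseDir.resolve("udf").resolve("local");
    try {
      return Files.createDirectories(dir);
    } catch (IOException e) {
      throw new UncheckedIOException("Cannot create " + dir, e);
    }
  }
}
```

Because each test run gets its own base directory, concurrent or repeated runs no longer read and write the same files under `/tmp`.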
[jira] [Created] (DRILL-7590) Refactor plugin registry
Paul Rogers created DRILL-7590: -- Summary: Refactor plugin registry Key: DRILL-7590 URL: https://issues.apache.org/jira/browse/DRILL-7590 Project: Apache Drill Issue Type: Improvement Affects Versions: 1.17.0 Reporter: Paul Rogers Assignee: Paul Rogers Fix For: 1.18.0 The plugin registry connects configurations, stored in ZK, with implementations, which are Java classes. The registry handles a large number of tasks including: * Populating "bootstrap" plugin configurations and handling upgrades. * Reading from, and writing to, the persistent store in ZK. * Handling "normal" (configured) plugins and special system plugins (which have no configuration.) * Handle format plugins which are always associated with the DFS storage plugin. * And so on. The code has grown overly complex. As we look to add a new, cleaner plugin mechanism, we will start by cleaning up what we have to allow the new mechanism to be one of many. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7589) TestDynamicUDFSupport fails on GitHub Actions
Vova Vysotskyi created DRILL-7589: - Summary: TestDynamicUDFSupport fails on GitHub Actions Key: DRILL-7589 URL: https://issues.apache.org/jira/browse/DRILL-7589 Project: Apache Drill Issue Type: Bug Affects Versions: 1.18.0 Reporter: Vova Vysotskyi Assignee: Vova Vysotskyi Fix For: 1.18.0 {{TestDynamicUDFSupport}} tests fail intermittently when running in a GitHub Actions job, with no fixed JDK version: the same JDK sometimes passes and sometimes fails. Also, different tests from the same test class may fail. When enabling logs for tests, the following stack traces are logged:
{noformat}
2020-02-15T10:56:33.8624913Z 10:56:33.855 [21b8319e-7e24-a9b9-34b7-74e1d27f64e8:foreman] ERROR o.a.d.e.e.f.FunctionImplementationRegistry - Problem during remote functions load from drill-custom-abs.jar
2020-02-15T10:56:33.8626171Z java.io.IOException: Error during jar [drill-custom-abs-sources.jar] coping from [/home/runner/work/drill/drill/exec/java-exec/target/org.apache.drill.exec.udf.dynamic.TestDynamicUDFSupport/udf/drill/udf/registry] to [/tmp/drill/udf/udf/local/]
2020-02-15T10:56:33.8626499Z at org.apache.drill.exec.expr.fn.FunctionImplementationRegistry.copyJarToLocal(FunctionImplementationRegistry.java:573)
2020-02-15T10:56:33.8626758Z at org.apache.drill.exec.expr.fn.FunctionImplementationRegistry.syncWithRemoteRegistry(FunctionImplementationRegistry.java:369)
2020-02-15T10:56:33.8627312Z at org.apache.drill.exec.planner.sql.DrillSqlWorker.convertPlan(DrillSqlWorker.java:135)
2020-02-15T10:56:33.8627544Z at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:93)
2020-02-15T10:56:33.8628086Z at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:590)
2020-02-15T10:56:33.8628315Z at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:275)
2020-02-15T10:56:33.8628522Z at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2020-02-15T10:56:33.8628749Z at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2020-02-15T10:56:33.8628961Z at java.base/java.lang.Thread.run(Thread.java:834)
2020-02-15T10:56:33.8629569Z Caused by: org.apache.hadoop.util.Shell$ExitCodeException: chmod: cannot access '/tmp/drill/udf/udf/local/.drill-custom-abs-sources.jar.crc': No such file or directory
2020-02-15T10:56:33.8629777Z
2020-02-15T10:56:33.8629975Z at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)
2020-02-15T10:56:33.8630183Z at org.apache.hadoop.util.Shell.run(Shell.java:901)
2020-02-15T10:56:33.8630396Z at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
2020-02-15T10:56:33.8630618Z at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307)
2020-02-15T10:56:33.8630813Z at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289)
2020-02-15T10:56:33.8631031Z at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:865)
2020-02-15T10:56:33.8631283Z at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.(RawLocalFileSystem.java:252)
2020-02-15T10:56:33.8631519Z at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.(RawLocalFileSystem.java:232)
2020-02-15T10:56:33.8631876Z at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:331)
2020-02-15T10:56:33.8632094Z at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:320)
2020-02-15T10:56:33.8632306Z at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:351)
2020-02-15T10:56:33.8632528Z at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:405)
2020-02-15T10:56:33.8632748Z at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:464)
2020-02-15T10:56:33.8632961Z at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:443)
2020-02-15T10:56:33.8633171Z at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
2020-02-15T10:56:33.8633380Z at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
2020-02-15T10:56:33.8633580Z at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
2020-02-15T10:56:33.8633780Z at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
2020-02-15T10:56:33.8633986Z at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
2020-02-15T10:56:33.8634187Z at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
2020-02-15T10:56:33.8634398Z at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:88)
2020-02-15T10:56:33.8634613Z at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2379)
2020-02-15T10:56:33.8634845Z at
[GitHub] [drill] vvysotskyi commented on issue #1984: DRILL-7586: drill-hive-exec-shaded contains commons-lang3 version 3.1
vvysotskyi commented on issue #1984: DRILL-7586: drill-hive-exec-shaded contains commons-lang3 version 3.1 URL: https://github.com/apache/drill/pull/1984#issuecomment-587094713 @oleg-zinovev, thanks for the PR and making changes. Could you please also update your commit message to reflect its changes?
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r380192497 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. + +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as +information about table segments, files, row groups, partitions. + +Full table metadata consists of two major concepts: general information and top-level segments metadata. 
+Table general information contains basic table information and corresponds to the `BaseTableMetadata` class. + +A table can be non-partitioned and partitioned. Non-partitioned tables have only one top-level segment +which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments. +Each top-level segment can include metadata about inner segments, files, row groups, and partitions. + +A unique table identifier in Metastore Tables is a combination of storage plugin, workspace, and table name. +Table metadata inside is grouped by top-level segments, unique identifier of the top-level segment and its metadata +is storage plugin, workspace, table name, and metadata key. + +### Related Session/System Options + +The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` or via the Drill Web console. + +- **metastore.enabled** +Enables Drill Metastore usage to be able to store table metadata during ANALYZE TABLE commands execution and to be able + to read table metadata during regular queries execution or when querying some INFORMATION_SCHEMA tables. Default is `false`. +-
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r380218259 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ##
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r380209257 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ##
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r380229001 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ##
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r380275860 ## File path: _docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md ## @@ -0,0 +1,69 @@ +--- +title: "Drill Iceberg Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill uses Iceberg Metastore implementation based on [Iceberg tables](http://iceberg.incubator.apache.org). For Drill 1.17, + this is default Drill Metastore implementation. For details on how to configure Iceberg Metastore implementation and + its option descriptions, please refer to [Iceberg Metastore docs](https://github.com/apache/drill/blob/master/metastore/iceberg-metastore/README.md). + +{% include startnote.html %} +Iceberg table supports concurrent writes and transactions but they are only effective on file systems that support + atomic rename. +If the file system does not support atomic rename, it could lead to inconsistencies during concurrent writes. +{% include endnote.html %} + +### Iceberg Tables Location + +Iceberg tables will reside on the file system in the location based on +Iceberg Metastore base location `drill.metastore.iceberg.location.base_path` and component specific location. +If Iceberg Metastore base location is `/drill/metastore/iceberg` +and tables component location is `tables`. Iceberg table for tables component +will be located in `/drill/metastore/iceberg/tables` folder. + +Metastore metadata will be stored inside Iceberg table location provided +in the configuration file. Drill table metadata location will be constructed +based on specific component storage keys. For example, for `tables` component, +storage keys are storage plugin, workspace and table name: unique table identifier in Drill. Review comment: Thanks, replaced.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r379568993 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. + +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as +information about table segments, files, row groups, partitions. + +Full table metadata consists of two major concepts: general information and top-level segments metadata. 
+Table general information contains basic table information and corresponds to the `BaseTableMetadata` class. + +A table can be non-partitioned and partitioned. Non-partitioned tables have only one top-level segment +which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments. +Each top-level segment can include metadata about inner segments, files, row groups, and partitions. + +A unique table identifier in Metastore Tables is a combination of storage plugin, workspace, and table name. Review comment: Thanks, replaced. Currently, a user can delete only metadata for an existing table. Added this info as well.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r380209127 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. + +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as +information about table segments, files, row groups, partitions. + +Full table metadata consists of two major concepts: general information and top-level segments metadata. 
+Table general information contains basic table information and corresponds to the `BaseTableMetadata` class. + +A table can be non-partitioned and partitioned. Non-partitioned tables have only one top-level segment +which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments. +Each top-level segment can include metadata about inner segments, files, row groups, and partitions. + +A unique table identifier in Metastore Tables is a combination of storage plugin, workspace, and table name. +Table metadata inside is grouped by top-level segments, unique identifier of the top-level segment and its metadata +is storage plugin, workspace, table name, and metadata key. + +### Related Session/System Options + +The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` or via the Drill Web console. + +- **metastore.enabled** +Enables Drill Metastore usage to be able to store table metadata during ANALYZE TABLE commands execution and to be able + to read table metadata during regular queries execution or when querying some INFORMATION_SCHEMA tables. Default is `false`. +-
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r380230120 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. + +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as +information about table segments, files, row groups, partitions. + +Full table metadata consists of two major concepts: general information and top-level segments metadata. 
+Table general information contains basic table information and corresponds to the `BaseTableMetadata` class. + +A table can be non-partitioned and partitioned. Non-partitioned tables have only one top-level segment +which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments. +Each top-level segment can include metadata about inner segments, files, row groups, and partitions. + +A unique table identifier in Metastore Tables is a combination of storage plugin, workspace, and table name. +Table metadata inside is grouped by top-level segments, unique identifier of the top-level segment and its metadata +is storage plugin, workspace, table name, and metadata key. + +### Related Session/System Options + +The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` or via the Drill Web console. + +- **metastore.enabled** +Enables Drill Metastore usage to be able to store table metadata during ANALYZE TABLE commands execution and to be able + to read table metadata during regular queries execution or when querying some INFORMATION_SCHEMA tables. Default is `false`. +-
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380276160

## File path: _docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md ##
@@ -0,0 +1,69 @@
+---
+title: "Drill Iceberg Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill uses Iceberg Metastore implementation based on [Iceberg tables](http://iceberg.incubator.apache.org). For Drill 1.17,
+ this is default Drill Metastore implementation. For details on how to configure Iceberg Metastore implementation and
+ its option descriptions, please refer to [Iceberg Metastore docs](https://github.com/apache/drill/blob/master/metastore/iceberg-metastore/README.md).
+
+{% include startnote.html %}
+Iceberg table supports concurrent writes and transactions but they are only effective on file systems that support
+ atomic rename.
+If the file system does not support atomic rename, it could lead to inconsistencies during concurrent writes.
+{% include endnote.html %}
+
+### Iceberg Tables Location
+
+Iceberg tables will reside on the file system in the location based on
+Iceberg Metastore base location `drill.metastore.iceberg.location.base_path` and component specific location.
+If Iceberg Metastore base location is `/drill/metastore/iceberg`
+and tables component location is `tables`. Iceberg table for tables component
+will be located in `/drill/metastore/iceberg/tables` folder.
+
+Metastore metadata will be stored inside Iceberg table location provided
+in the configuration file. Drill table metadata location will be constructed
+based on specific component storage keys. For example, for `tables` component,
+storage keys are storage plugin, workspace and table name: unique table identifier in Drill.
+
+Assume Iceberg table location is `/drill/metastore/iceberg/tables`, metadata for the table
+`dfs.tmp.nation` will be stored in the `/drill/metastore/iceberg/tables/dfs/tmp/nation` folder.

Review comment:
Thanks, updated the docs as proposed.
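
For readers following the thread, the location rule quoted in the diff above can be illustrated with a short sketch. This is not Drill's implementation: the function name `table_metadata_location` and its parameters are hypothetical, and only the example values (`/drill/metastore/iceberg`, `tables`, `dfs.tmp.nation`) come from the quoted docs. It assumes the storage keys are simply appended as path segments, which is what the documented example suggests.

```python
from pathlib import PurePosixPath


def table_metadata_location(base_path: str, component: str,
                            plugin: str, workspace: str, table: str) -> str:
    """Join base path, component location and storage keys into one path.

    Hypothetical helper for illustration only; Drill's own code may differ.
    """
    return str(PurePosixPath(base_path) / component / plugin / workspace / table)


# Example values taken from the quoted documentation (table `dfs.tmp.nation`).
location = table_metadata_location(
    "/drill/metastore/iceberg", "tables", "dfs", "tmp", "nation")
print(location)
```

With the values from the docs, the helper yields the same folder the diff names for `dfs.tmp.nation`.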
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r380146051 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. + +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as +information about table segments, files, row groups, partitions. + +Full table metadata consists of two major concepts: general information and top-level segments metadata. 
+Table general information contains basic table information and corresponds to the `BaseTableMetadata` class. + +A table can be non-partitioned and partitioned. Non-partitioned tables have only one top-level segment +which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments. +Each top-level segment can include metadata about inner segments, files, row groups, and partitions. + +A unique table identifier in Metastore Tables is a combination of storage plugin, workspace, and table name. +Table metadata inside is grouped by top-level segments, unique identifier of the top-level segment and its metadata +is storage plugin, workspace, table name, and metadata key. + +### Related Session/System Options + +The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` or via the Drill Web console. + +- **metastore.enabled** +Enables Drill Metastore usage to be able to store table metadata during ANALYZE TABLE commands execution and to be able + to read table metadata during regular queries execution or when querying some INFORMATION_SCHEMA tables. Default is `false`. +-
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r380210657 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. + +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as +information about table segments, files, row groups, partitions. + +Full table metadata consists of two major concepts: general information and top-level segments metadata. 
+Table general information contains basic table information and corresponds to the `BaseTableMetadata` class. + +A table can be non-partitioned and partitioned. Non-partitioned tables have only one top-level segment +which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments. +Each top-level segment can include metadata about inner segments, files, row groups, and partitions. + +A unique table identifier in Metastore Tables is a combination of storage plugin, workspace, and table name. +Table metadata inside is grouped by top-level segments, unique identifier of the top-level segment and its metadata +is storage plugin, workspace, table name, and metadata key. + +### Related Session/System Options + +The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` or via the Drill Web console. + +- **metastore.enabled** +Enables Drill Metastore usage to be able to store table metadata during ANALYZE TABLE commands execution and to be able + to read table metadata during regular queries execution or when querying some INFORMATION_SCHEMA tables. Default is `false`. +-
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r379576801 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. + +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as +information about table segments, files, row groups, partitions. + +Full table metadata consists of two major concepts: general information and top-level segments metadata. 
+Table general information contains basic table information and corresponds to the `BaseTableMetadata` class. + +A table can be non-partitioned and partitioned. Non-partitioned tables have only one top-level segment +which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments. +Each top-level segment can include metadata about inner segments, files, row groups, and partitions. + +A unique table identifier in Metastore Tables is a combination of storage plugin, workspace, and table name. +Table metadata inside is grouped by top-level segments, unique identifier of the top-level segment and its metadata +is storage plugin, workspace, table name, and metadata key. + +### Related Session/System Options + +The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` or via the Drill Web console. + +- **metastore.enabled** +Enables Drill Metastore usage to be able to store table metadata during ANALYZE TABLE commands execution and to be able + to read table metadata during regular queries execution or when querying some INFORMATION_SCHEMA tables. Default is `false`. +-
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380164420

## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ##
@@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level with one of the following commands:
+
+ SET `metastore.enabled` = true;
+ ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill infers the schema by scanning your table
+ in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` file.
+It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property `drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+ implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom
+ implementation can be added by placing the JAR into classpath which has the implementation of
+ `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and top-level segments metadata.
+Table general information contains basic table information and corresponds to the `BaseTableMetadata` class.
+
+A table can be non-partitioned and partitioned. Non-partitioned tables have only one top-level segment
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row groups, and partitions.
+
+A unique table identifier in Metastore Tables is a combination of storage plugin, workspace, and table name.
+Table metadata inside is grouped by top-level segments, unique identifier of the top-level segment and its metadata
+is storage plugin, workspace, table name, and metadata key.
+
+### Related Session/System Options
+
+The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` or via the Drill Web console.
+
+- **metastore.enabled**
+Enables Drill Metastore usage to be able to store table metadata during ANALYZE TABLE commands execution and to be able
+ to read table metadata during regular queries execution or when querying some INFORMATION_SCHEMA tables. Default is `false`.
+-
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380209529

## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ##
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380270959

## File path: _docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md ##
@@ -0,0 +1,69 @@
+---
+title: "Drill Iceberg Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill uses Iceberg Metastore implementation based on [Iceberg tables](http://iceberg.incubator.apache.org). For Drill 1.17,
+ this is default Drill Metastore implementation. For details on how to configure Iceberg Metastore implementation and
+ its option descriptions, please refer to [Iceberg Metastore docs](https://github.com/apache/drill/blob/master/metastore/iceberg-metastore/README.md).
+
+{% include startnote.html %}
+Iceberg table supports concurrent writes and transactions but they are only effective on file systems that support
+ atomic rename.
+If the file system does not support atomic rename, it could lead to inconsistencies during concurrent writes.
+{% include endnote.html %}
+
+### Iceberg Tables Location
+
+Iceberg tables will reside on the file system in the location based on
+Iceberg Metastore base location `drill.metastore.iceberg.location.base_path` and component specific location.
+If Iceberg Metastore base location is `/drill/metastore/iceberg`
+and tables component location is `tables`. Iceberg table for tables component

Review comment:
Thanks, updated.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380199229

## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ##
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380219490

## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ##
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380250105

## File path: _docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md ##
@@ -0,0 +1,69 @@
+---
+title: "Drill Iceberg Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill uses Iceberg Metastore implementation based on [Iceberg tables](http://iceberg.incubator.apache.org). For Drill 1.17,
+ this is default Drill Metastore implementation. For details on how to configure Iceberg Metastore implementation and
+ its option descriptions, please refer to [Iceberg Metastore docs](https://github.com/apache/drill/blob/master/metastore/iceberg-metastore/README.md).
+
+{% include startnote.html %}
+Iceberg table supports concurrent writes and transactions but they are only effective on file systems that support
+ atomic rename.
+If the file system does not support atomic rename, it could lead to inconsistencies during concurrent writes.
+{% include endnote.html %}
+
+### Iceberg Tables Location
+
+Iceberg tables will reside on the file system in the location based on
+Iceberg Metastore base location `drill.metastore.iceberg.location.base_path` and component specific location.

Review comment:
Good point! Added a sentence before this one about configuration files and specified that the above is the configuration property.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380159080

## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ##
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379573331

## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ##
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380278702

## File path: _docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md ##
@@ -0,0 +1,69 @@
+---
+title: "Drill Iceberg Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill uses Iceberg Metastore implementation based on [Iceberg tables](http://iceberg.incubator.apache.org). For Drill 1.17,
+ this is default Drill Metastore implementation. For details on how to configure Iceberg Metastore implementation and
+ its option descriptions, please refer to [Iceberg Metastore docs](https://github.com/apache/drill/blob/master/metastore/iceberg-metastore/README.md).
+
+{% include startnote.html %}
+Iceberg table supports concurrent writes and transactions but they are only effective on file systems that support
+ atomic rename.
+If the file system does not support atomic rename, it could lead to inconsistencies during concurrent writes.
+{% include endnote.html %}
+
+### Iceberg Tables Location
+
+Iceberg tables will reside on the file system in the location based on
+Iceberg Metastore base location `drill.metastore.iceberg.location.base_path` and component specific location.
+If Iceberg Metastore base location is `/drill/metastore/iceberg`
+and tables component location is `tables`. Iceberg table for tables component
+will be located in `/drill/metastore/iceberg/tables` folder.
+
+Metastore metadata will be stored inside Iceberg table location provided
+in the configuration file. Drill table metadata location will be constructed
+based on specific component storage keys. For example, for `tables` component,
+storage keys are storage plugin, workspace and table name: unique table identifier in Drill.
+
+Assume Iceberg table location is `/drill/metastore/iceberg/tables`, metadata for the table
+`dfs.tmp.nation` will be stored in the `/drill/metastore/iceberg/tables/dfs/tmp/nation` folder.
+
+Example of base Metastore configuration file `drill-metastore-override.conf`, where Iceberg tables will be stored in
+ hdfs:
+
+```
+drill.metastore.iceberg: {
+  config.properties: {
+fs.defaultFS: "hdfs:///"
+  }
+
+  location: {
+base_path: "/drill/metastore",
+relative_path: "iceberg"
+  }
+}
+```
+
+### Metadata Storage Format
+
+Iceberg tables support data storage in three formats: Parquet, Avro, ORC. Drill metadata will be stored in Parquet files.
+This format was chosen over others since it is column oriented and efficient in terms of disk I/O when specific
+columns need to be queried.
+
+Each Parquet file will hold information for one partition. Partition keys will depend on Metastore
+component characteristics. For example, for tables component, partitions keys are storage plugin, workspace,
+table name and metadata key.
+
+Parquet files name will be based on UUID to ensure uniqueness. If somehow collision occurs, modify operation
+in Metastore will fail.

Review comment:
Thanks, removed this section.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services
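For anyone following the location rules quoted in this diff, here is a minimal sketch of how the folder for a table's metadata could be derived. It is not Drill code: the function names are hypothetical, and it assumes that `base_path` and `relative_path` are simply joined to form the Iceberg base location and that the table identifier splits on dots into storage plugin, workspace, and table name.

```python
from pathlib import PurePosixPath


def iceberg_table_location(base_path: str, relative_path: str, component: str) -> PurePosixPath:
    """Join base path, relative path, and component location (assumed to be plain concatenation)."""
    return PurePosixPath(base_path) / relative_path / component


def table_metadata_location(iceberg_location: PurePosixPath, table_id: str) -> PurePosixPath:
    """Append the storage keys (storage plugin, workspace, table name) to the Iceberg table location."""
    # Assumption: the first two dot-separated parts are plugin and workspace,
    # and everything after them is the table name.
    plugin, workspace, table = table_id.split(".", 2)
    return iceberg_location / plugin / workspace / table


tables = iceberg_table_location("/drill/metastore", "iceberg", "tables")
nation = table_metadata_location(tables, "dfs.tmp.nation")
print(tables)  # /drill/metastore/iceberg/tables
print(nation)  # /drill/metastore/iceberg/tables/dfs/tmp/nation
```

The two printed paths match the examples given in the quoted documentation for the `tables` component and the `dfs.tmp.nation` table.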
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r379567240 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. + +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as +information about table segments, files, row groups, partitions. + +Full table metadata consists of two major concepts: general information and top-level segments metadata. 
+Table general information contains basic table information and corresponds to the `BaseTableMetadata` class.
+
+A table can be non-partitioned and partitioned. Non-partitioned tables have only one top-level segment
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row groups, and partitions.

Review comment:
The Metastore also supports single files. I added part of the info you proposed, along with references to the examples that describe how to query partition and segment metadata.
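The identifier scheme described in the quoted text can be pictured with a small sketch. This is only an illustration, not the Metastore's classes: `TableId`, `SegmentId`, and `top_level_segments` are hypothetical names, the string used for `DEFAULT_SEGMENT_KEY` is a placeholder for whatever `MetadataInfo#DEFAULT_SEGMENT_KEY` holds, and using one metadata key per top-level partition is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass

# Placeholder for MetadataInfo#DEFAULT_SEGMENT_KEY; the real value is not given in this thread.
DEFAULT_SEGMENT_KEY = "DEFAULT_SEGMENT"


@dataclass(frozen=True)
class TableId:
    """Unique table identifier: storage plugin, workspace, and table name."""
    storage_plugin: str
    workspace: str
    table_name: str


@dataclass(frozen=True)
class SegmentId:
    """Unique top-level segment identifier: the table identifier plus a metadata key."""
    table: TableId
    metadata_key: str


def top_level_segments(table: TableId, partition_keys: list[str]) -> list[SegmentId]:
    """Return one segment per key, or the single default segment for a non-partitioned table."""
    keys = partition_keys or [DEFAULT_SEGMENT_KEY]
    return [SegmentId(table, key) for key in keys]


nation = TableId("dfs", "tmp", "nation")
print(top_level_segments(nation, []))                # single default segment
print(top_level_segments(nation, ["2018", "2019"]))  # one segment per top-level key
```

Freezing the dataclasses makes both identifiers hashable, so they can serve as dictionary keys when grouping metadata by segment.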
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r380244476

## File path: _docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md ##
@@ -0,0 +1,69 @@
+---
+title: "Drill Iceberg Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill uses Iceberg Metastore implementation based on [Iceberg tables](http://iceberg.incubator.apache.org). For Drill 1.17,
+ this is default Drill Metastore implementation. For details on how to configure Iceberg Metastore implementation and
+ its option descriptions, please refer to [Iceberg Metastore docs](https://github.com/apache/drill/blob/master/metastore/iceberg-metastore/README.md).
+
+{% include startnote.html %}
+Iceberg table supports concurrent writes and transactions but they are only effective on file systems that support
+ atomic rename.
+If the file system does not support atomic rename, it could lead to inconsistencies during concurrent writes.
+{% include endnote.html %}
+
+### Iceberg Tables Location
+

Review comment:
Thanks, added.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r379556928 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. + +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as +information about table segments, files, row groups, partitions. + +Full table metadata consists of two major concepts: general information and top-level segments metadata. 
Review comment: Yes, we have a section below with the real tables and examples of how to discover metastore metadata.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r374778009 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. Review comment: Thanks, done.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r380159969 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. + +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as +information about table segments, files, row groups, partitions. + +Full table metadata consists of two major concepts: general information and top-level segments metadata. 
+Table general information contains basic table information and corresponds to the `BaseTableMetadata` class. + +A table can be non-partitioned and partitioned. Non-partitioned tables have only one top-level segment +which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments. +Each top-level segment can include metadata about inner segments, files, row groups, and partitions. + +A unique table identifier in Metastore Tables is a combination of storage plugin, workspace, and table name. +Table metadata inside is grouped by top-level segments, unique identifier of the top-level segment and its metadata +is storage plugin, workspace, table name, and metadata key. + +### Related Session/System Options + +The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` or via the Drill Web console. + +- **metastore.enabled** +Enables Drill Metastore usage to be able to store table metadata during ANALYZE TABLE commands execution and to be able + to read table metadata during regular queries execution or when querying some INFORMATION_SCHEMA tables. Default is `false`. +-
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r379541595 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom Review comment: Thanks, separated these two concepts and added links to iceberg documentation.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r37466 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have Review comment: Thanks, reworded.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r379551973 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. 
+ +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as Review comment: Thanks, replaced as you proposed, but also left mentioning that we have metadata about segments, files, row groups, partitions since it wasn't described in this doc yet.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r379573563 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. + +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as +information about table segments, files, row groups, partitions. + +Full table metadata consists of two major concepts: general information and top-level segments metadata. 
+Table general information contains basic table information and corresponds to the `BaseTableMetadata` class. + +A table can be non-partitioned and partitioned. Non-partitioned tables have only one top-level segment +which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments. +Each top-level segment can include metadata about inner segments, files, row groups, and partitions. + +A unique table identifier in Metastore Tables is a combination of storage plugin, workspace, and table name. +Table metadata inside is grouped by top-level segments, unique identifier of the top-level segment and its metadata +is storage plugin, workspace, table name, and metadata key. + +### Related Session/System Options + +The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` or via the Drill Web console. + +- **metastore.enabled** +Enables Drill Metastore usage to be able to store table metadata during ANALYZE TABLE commands execution and to be able + to read table metadata during regular queries execution or when querying some INFORMATION_SCHEMA tables. Default is `false`. +-
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r379569561 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. + +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as +information about table segments, files, row groups, partitions. + +Full table metadata consists of two major concepts: general information and top-level segments metadata. 
+Table general information contains basic table information and corresponds to the `BaseTableMetadata` class. + +A table can be non-partitioned and partitioned. Non-partitioned tables have only one top-level segment +which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments. +Each top-level segment can include metadata about inner segments, files, row groups, and partitions. + +A unique table identifier in Metastore Tables is a combination of storage plugin, workspace, and table name. +Table metadata inside is grouped by top-level segments, unique identifier of the top-level segment and its metadata +is storage plugin, workspace, table name, and metadata key. + +### Related Session/System Options + Review comment: Thanks, replaced.
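Editorial sketch (not part of the PR): the identifier scheme described in the quoted docs can be pictured with a small Python model. The doc text only says that a table is identified by storage plugin, workspace, and table name, and that a top-level segment adds a metadata key. The class names `TableId` and `SegmentId`, the placeholder value of `DEFAULT_SEGMENT_KEY`, and the "one segment per partition key" rule are assumptions made for illustration; they are not Drill's actual classes or constants.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder: the real value of MetadataInfo#DEFAULT_SEGMENT_KEY
# is not shown in the quoted docs.
DEFAULT_SEGMENT_KEY = "DEFAULT_SEGMENT"


@dataclass(frozen=True)
class TableId:
    """Unique table identifier: storage plugin + workspace + table name."""
    storage_plugin: str
    workspace: str
    table_name: str


@dataclass(frozen=True)
class SegmentId:
    """Top-level segment identifier: the table identifier plus a metadata key."""
    table: TableId
    metadata_key: str = DEFAULT_SEGMENT_KEY


def top_level_segments(table: TableId, partition_keys: List[str]) -> List[SegmentId]:
    """Return the top-level segment identifiers for a table.

    A non-partitioned table has exactly one (default) segment. For a
    partitioned table this sketch assumes one segment per key, which is a
    simplification of "may have several top-level segments".
    """
    if not partition_keys:
        return [SegmentId(table)]
    return [SegmentId(table, key) for key in partition_keys]


lineitem = TableId("dfs", "tmp", "lineitem")
print(top_level_segments(lineitem, []))                # single default segment
print(top_level_segments(lineitem, ["2019", "2020"]))  # one segment per key
```

Frozen dataclasses are used so the identifiers are hashable and can serve as dictionary keys, which matches their role as unique keys.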
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore URL: https://github.com/apache/drill/pull/1953#discussion_r379534144 ## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. 
Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. Review comment: Thanks, reworded.
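Editorial sketch (not part of the PR): the three configuration files named in the quoted docs can be read as layers, with later layers replacing earlier values. The Python below models that with flat dictionaries. The precedence order (default, then distrib, then override) is an assumption, since the quoted text only states that the override file overrides the default. `resolve` is a hypothetical helper and `com.example.CustomMetastore` is a made-up class name. Real HOCON merging is nested and richer than this flat merge.

```python
from typing import Dict

IMPL_KEY = "drill.metastore.implementation.class"

# Default value as quoted in the docs under review.
DEFAULT_CONF: Dict[str, str] = {
    IMPL_KEY: "org.apache.drill.metastore.iceberg.IcebergMetastore",
}


def resolve(default: Dict[str, str],
            distrib: Dict[str, str],
            override: Dict[str, str]) -> Dict[str, str]:
    """Merge the three layers; later layers win (assumed precedence)."""
    merged = dict(default)   # drill-metastore-default.conf
    merged.update(distrib)   # drill-metastore-distrib.conf
    merged.update(override)  # drill-metastore-override.conf
    return merged


# With no overrides, the default Iceberg implementation is selected.
print(resolve(DEFAULT_CONF, {}, {})[IMPL_KEY])

# A user override replaces the implementation class (hypothetical class name).
user_override = {IMPL_KEY: "com.example.CustomMetastore"}
print(resolve(DEFAULT_CONF, {}, user_override)[IMPL_KEY])
```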
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379543295

## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ##

@@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level with one of the following commands:
+
+ SET `metastore.enabled` = true;
+ ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill infers the schema by scanning your table
+ in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` file.
+It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property `drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+ implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom
+ implementation can be added by placing the JAR into classpath which has the implementation of
+ `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.

Review comment:
Thanks, I updated the section with the info you proposed and added a link to the main Jira.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379521993

## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ##

@@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore

Review comment:
Thanks, good idea. I have added a section that enumerates the problems the Metastore may help to solve.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
vvysotskyi commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r379543754

## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ##

@@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level with one of the following commands:
+
+ SET `metastore.enabled` = true;
+ ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at `http://:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill infers the schema by scanning your table
+ in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf` file.
+It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property `drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+ implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom
+ implementation can be added by placing the JAR into classpath which has the implementation of
+ `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in the future.
+
+### Metastore Tables

Review comment:
Thanks, I agree that it may seem a little confusing, so I changed it as you proposed.
[GitHub] [drill] vvysotskyi opened a new pull request #1986: Additional changes for Drill Metastore docs
vvysotskyi opened a new pull request #1986: Additional changes for Drill Metastore docs
URL: https://github.com/apache/drill/pull/1986

Changes after code review for https://github.com/apache/drill/pull/1953
[GitHub] [drill] vvysotskyi commented on issue #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on issue #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#issuecomment-587039702

@KazydubB, thanks for the review, I have made the requested changes.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380237113

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/metadata/MetadataAggregateHelper.java ##

@@ -117,16 +120,16 @@ private void createAggregatorInternal() {
   }
 }
-for (SchemaPath excludedColumn : excludedColumns) {
-  if (excludedColumn.equals(SchemaPath.getSimplePath(columnNamesOptions.rowGroupStart()))
-  || excludedColumn.equals(SchemaPath.getSimplePath(columnNamesOptions.rowGroupLength( {
-LogicalExpression lastModifiedTime = new FunctionCall("any_value",
+for (SchemaPath nonSchemaColumn : context.metadataColumns()) {

Review comment:
Sorry, missed it.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380236720

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/metastore/ColumnNamesOptions.java ##

@@ -40,6 +41,7 @@ public ColumnNamesOptions(OptionManager optionManager) {
 this.rowGroupStart = optionManager.getOption(ExecConstants.IMPLICIT_ROW_GROUP_START_COLUMN_LABEL).string_val;
 this.rowGroupLength = optionManager.getOption(ExecConstants.IMPLICIT_ROW_GROUP_LENGTH_COLUMN_LABEL).string_val;
 this.lastModifiedTime = optionManager.getOption(ExecConstants.IMPLICIT_LAST_MODIFIED_TIME_COLUMN_LABEL).string_val;
+this.projectMetadataColumn = optionManager.getOption(ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL).string_val;

Review comment:
Good idea, thanks, done.
[GitHub] [drill] KazydubB commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
KazydubB commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380227482

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/metastore/ColumnNamesOptions.java ##

@@ -40,6 +41,7 @@ public ColumnNamesOptions(OptionManager optionManager) {
 this.rowGroupStart = optionManager.getOption(ExecConstants.IMPLICIT_ROW_GROUP_START_COLUMN_LABEL).string_val;
 this.rowGroupLength = optionManager.getOption(ExecConstants.IMPLICIT_ROW_GROUP_LENGTH_COLUMN_LABEL).string_val;
 this.lastModifiedTime = optionManager.getOption(ExecConstants.IMPLICIT_LAST_MODIFIED_TIME_COLUMN_LABEL).string_val;
+this.projectMetadataColumn = optionManager.getOption(ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL).string_val;

Review comment:
I think it is better to declare `ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL` (and `ExecConstants.IMPLICIT_LAST_MODIFIED_TIME_COLUMN_LABEL`) as a `StringValidator` and use it as `this.projectMetadataColumn = optionManager.getOption(ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL);`.
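
The suggestion above, looking an option up through a typed descriptor instead of by name plus `.string_val`, can be illustrated with a minimal, self-contained sketch. All class names here (`StringOptionSketch`, `OptionStoreSketch`, `ColumnLabelsSketch`) are hypothetical stand-ins and are not Drill's `OptionManager` or `StringValidator`. The option name is the one added to `ExecConstants` in this pull request, while the default value shown is only an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT Drill's classes.

/** A string-typed option descriptor: a name plus a default value. */
final class StringOptionSketch {
  final String name;
  final String defaultValue;

  StringOptionSketch(String name, String defaultValue) {
    this.name = name;
    this.defaultValue = defaultValue;
  }
}

/** A tiny option store offering both an untyped and a typed lookup. */
final class OptionStoreSketch {
  private final Map<String, Object> values = new HashMap<>();

  void set(String name, Object value) {
    values.put(name, value);
  }

  /** Untyped lookup by name: the caller has to know the value kind and cast. */
  Object getOption(String name) {
    return values.get(name);
  }

  /** Typed lookup: the descriptor's type fixes the return type. */
  String getOption(StringOptionSketch option) {
    Object value = values.get(option.name);
    if (value == null) {
      return option.defaultValue;
    }
    if (!(value instanceof String)) {
      throw new IllegalStateException("Option " + option.name + " is not a string");
    }
    return (String) value;
  }
}

final class ColumnLabelsSketch {
  // Option name as it appears in this pull request; the default value is an assumption.
  static final StringOptionSketch PROJECT_METADATA_COLUMN = new StringOptionSketch(
      "drill.exec.storage.implicit.project_metadata.column.label", "$project_metadata$");

  final String projectMetadataColumn;

  ColumnLabelsSketch(OptionStoreSketch options) {
    // The typed overload returns a String directly, so no unwrapping is needed here.
    this.projectMetadataColumn = options.getOption(PROJECT_METADATA_COLUMN);
  }
}
```

With a typed descriptor the call site cannot pick the wrong value field by accident, which is presumably the readability gain the reviewer is after.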
[GitHub] [drill] KazydubB commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
KazydubB commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380231898

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/metadata/MetadataAggregateHelper.java ##

@@ -117,16 +120,16 @@ private void createAggregatorInternal() {
   }
 }
-for (SchemaPath excludedColumn : excludedColumns) {
-  if (excludedColumn.equals(SchemaPath.getSimplePath(columnNamesOptions.rowGroupStart()))
-  || excludedColumn.equals(SchemaPath.getSimplePath(columnNamesOptions.rowGroupLength( {
-LogicalExpression lastModifiedTime = new FunctionCall("any_value",
+for (SchemaPath nonSchemaColumn : context.metadataColumns()) {

Review comment:
Rename to `metadataColumn` or `implicitColumn`?
[GitHub] [drill] vvysotskyi commented on a change in pull request #1984: DRILL-7586: drill-hive-exec-shaded contains commons-lang3 version 3.1
vvysotskyi commented on a change in pull request #1984: DRILL-7586: drill-hive-exec-shaded contains commons-lang3 version 3.1
URL: https://github.com/apache/drill/pull/1984#discussion_r380155881

## File path: contrib/storage-hive/hive-exec-shade/pom.xml ##

@@ -158,6 +158,8 @@
 you can use TestHiveStorage.readFromAlteredPartitionedTableWithEmptyGroupType() test case. -->
 org/apache/parquet/**
 shaded/parquet/org/**
+org/apache/commons/lang/**

Review comment:
I'm afraid it can break something, since Hive explicitly includes these libraries in the `hive-exec` jar: https://github.com/apache/hive/blob/master/ql/pom.xml#L958.
As an alternative solution, I would recommend relocating them (as is done above for other libraries).
[GitHub] [drill] KazydubB commented on a change in pull request #1974: DRILL-7574: Generalize the projection parser
KazydubB commented on a change in pull request #1974: DRILL-7574: Generalize the projection parser
URL: https://github.com/apache/drill/pull/1974#discussion_r380154237

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/resultSet/project/RequestedColumn.java ##

@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.resultSet.project;
+
+/**
+ * Plan-time properties of a requested column.Represents
+ * a consolidated view of the set of references to a column.
+ * For example, the project list might contain:
+ * SELECT columns[4], columns[8]
+ * SELECT a.b, a.c
+ * SELECT columns, columns[1]
+ * SELECT a, a.b
+ * In each case, the same column is referenced in different
+ * forms which are consolidated into this abstraction.
+ *
+ * The resulting information is a "pattern": a form of reference
+ * which which a concrete type can be compatible or not. The project
+ * list does not contain sufficient information to definitively pick
+ * a type; it only excludes certain types.
+ *
+ * Depending on the syntax, we can infer if a column must
+ * be an array or map. This is definitive: though we know that
+ * columns of the form above must be an array or a map,
+ * we cannot know if a simple column reference might refer
+ * to an array or map.
+ *
+ * Compatibility Rules
+ *
+ * The pattern given by projection is consistent with certain concrete types
+ * as follows. + means any number of additional qualifiers.
+ *
+ *
+ * Type                  Consistent with
+ * Non-repeated MAP
+ * {@code a+} {@code a.b+}
+ * Repeated MAP
+ * {@code a+} {@code a.b+} {@code a[n].b+}>
+ * Non-repeated Scalar
+ * {@code a}
+ * Repeated Scalar
+ * {@code a} {@code a[n]}
+ * Non-repeated DICT
+ * {@code a} {@code a['key']}
+ * Repeated DICT
+ * {@code a} {@code a[n]} {@code a['key']} {@code a[n]['key']}

Review comment:
I checked whether `m.a` is supported for `MAP` arrays: it looks like this is not supported in Drill.
For a JSON file `file.json`
```
{"sa": [{"a": 1}, {"a": 2}, {"a": 3}]}
{"sa": [{"a": 1}]}
```
the following query
``select t.sa.a kv from dfs.`file.json` t``
produces two rows, each with a `null` value. (Should it have returned an error instead?)

When types are known during planning, e.g. when querying a Hive table, the following validation error is raised:
`VALIDATION ERROR: From line 1, column 27 to line 1, column 28: Cannot apply 'ITEM' to arguments of type 'ITEM(, )'. Supported form(s): [] []`
(I used the following test in `TestHiveStructs.java`:
```
@Test
public void strWithArr2ByIdxP0111() throws Exception {
  HiveTestUtilities.assertNativeScanUsed(queryBuilder(), "struct_tbl_p");
  testBuilder()
      .sqlQuery("SELECT rid, t.str_wa_2.fa.sn p0 FROM hive.struct_tbl_p t")
      .unOrdered()
      .baselineColumns("rid", "p0")
      .expectsEmptyResultSet()
      .go();
}
```
)
However, such behavior is present in Hive, but only for repeated (Drill's) `MAP` and not for repeated `DICT`, IIRC.
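
For readers following the compatibility rules quoted in the Javadoc above, here is a minimal sketch that encodes that table as a lookup. It is an illustration under simplifying assumptions, not the parser from the pull request: the enum and class names are hypothetical, the `+` ("any number of additional qualifiers") suffix is ignored, and the `a.b` entry for repeated `MAP` is copied from the Javadoc as written even though the review comment above questions whether Drill actually supports it.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical names for illustration; not taken from the Drill code base.

/** Shape of a column reference in the project list. */
enum RefShape {
  SIMPLE,          // a
  MEMBER,          // a.b
  INDEXED,         // a[n]
  INDEXED_MEMBER,  // a[n].b
  KEYED,           // a['key']
  INDEXED_KEYED    // a[n]['key']
}

/** Concrete column kinds listed in the quoted Javadoc table. */
enum ColumnKind { MAP, REPEATED_MAP, SCALAR, REPEATED_SCALAR, DICT, REPEATED_DICT }

final class ProjectionCompatibilitySketch {
  private static final Map<ColumnKind, Set<RefShape>> CONSISTENT = new EnumMap<>(ColumnKind.class);

  static {
    // One entry per table row, in the same order as the Javadoc.
    CONSISTENT.put(ColumnKind.MAP, EnumSet.of(RefShape.SIMPLE, RefShape.MEMBER));
    CONSISTENT.put(ColumnKind.REPEATED_MAP,
        EnumSet.of(RefShape.SIMPLE, RefShape.MEMBER, RefShape.INDEXED_MEMBER));
    CONSISTENT.put(ColumnKind.SCALAR, EnumSet.of(RefShape.SIMPLE));
    CONSISTENT.put(ColumnKind.REPEATED_SCALAR, EnumSet.of(RefShape.SIMPLE, RefShape.INDEXED));
    CONSISTENT.put(ColumnKind.DICT, EnumSet.of(RefShape.SIMPLE, RefShape.KEYED));
    CONSISTENT.put(ColumnKind.REPEATED_DICT,
        EnumSet.of(RefShape.SIMPLE, RefShape.INDEXED, RefShape.KEYED, RefShape.INDEXED_KEYED));
  }

  /** Returns true if the table lists the reference shape for the given column kind. */
  static boolean isConsistent(ColumnKind kind, RefShape shape) {
    return CONSISTENT.get(kind).contains(shape);
  }
}
```

In this encoding, the question raised in the review comment amounts to whether `MEMBER` belongs in the `REPEATED_MAP` row.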
[GitHub] [drill] oleg-zinovev commented on issue #1984: DRILL-7586: drill-hive-exec-shaded contains commons-lang3 version 3.1
oleg-zinovev commented on issue #1984: DRILL-7586: drill-hive-exec-shaded contains commons-lang3 version 3.1
URL: https://github.com/apache/drill/pull/1984#issuecomment-586964736

I cannot reproduce the error on any JDK version.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380128666

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/MetastoreAnalyzeTableHandler.java ##

@@ -406,13 +406,13 @@ private DrillRel getTableAggRelNode(DrillRel convertedRelNode, boolean createNew
 SchemaPath lastModifiedTimeField = SchemaPath.getSimplePath(config.getContext().getOptions().getString(ExecConstants.IMPLICIT_LAST_MODIFIED_TIME_COLUMN_LABEL));
-List excludedColumns = Arrays.asList(locationField, lastModifiedTimeField);
+List nonSchemaColumns = Arrays.asList(locationField, lastModifiedTimeField);

Review comment:
Thanks, renamed here and in other places.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380124653

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java ##

@@ -237,6 +238,18 @@ private IterOutcome internalNext() {
 logger.trace("currentReader.next return recordCount={}", recordCount);
 Preconditions.checkArgument(recordCount >= 0, "recordCount from RecordReader.next() should not be negative");
 boolean isNewSchema = mutator.isNewSchema();
+ // adds additional record for the case of making scan for obtaining metadata if required
+ if (implicitValues != null) {
+String projectMetadataColumn = context.getOptions().getOption(ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL).string_val;
+if (recordCount > 0) {
+ // sets implicit value to false to signalize that some results were returned and there is no need for creating additional record

Review comment:
Thanks, updated the comment and added more details.

Regarding the concept of the additional record, I will try to explain how the Metastore collects data in the general case; it may help to understand the reason for this decision.
The Drill Metastore may collect metadata for every file or row group, so aggregation calls for every column, grouped by the `fqn`, `rgi`, `dirX`... columns, were added. This approach works fine for non-empty files and row groups, but when an empty file is present, no data is passed from the Scan to the aggregation, so the Metastore was ignoring such files.
To solve this problem, I added logic that returns a single record, with the correct values of the implicit columns, when no data was read. The additional implicit column helps to distinguish such records and to collect info about row count, schema, etc.
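
The reasoning in the comment above can be illustrated with a small, self-contained sketch: a group-by over scanned rows never sees a file that produced no rows, unless the scan emits one placeholder row carrying the file name and a marker. This is a simplified model under stated assumptions, not the `ScanBatch` logic from the pull request; the class names and the boolean `placeholder` field (standing in for the extra implicit column) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration; not Drill's ScanBatch or aggregate operators.

/** One row leaving the scan, reduced to the two implicit values that matter here. */
final class ScanRowSketch {
  final String file;          // plays the role of the implicit file-name column
  final boolean placeholder;  // plays the role of the extra implicit marker column

  ScanRowSketch(String file, boolean placeholder) {
    this.file = file;
    this.placeholder = placeholder;
  }
}

final class EmptyFileScanSketch {
  /** Simulates a scan; rowsPerFile maps a file name to its number of data rows. */
  static List<ScanRowSketch> scan(Map<String, Integer> rowsPerFile, boolean emitPlaceholder) {
    List<ScanRowSketch> out = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : rowsPerFile.entrySet()) {
      if (entry.getValue() == 0 && emitPlaceholder) {
        // No data was read: emit a single marked row so the file still reaches the aggregation.
        out.add(new ScanRowSketch(entry.getKey(), true));
      }
      for (int i = 0; i < entry.getValue(); i++) {
        out.add(new ScanRowSketch(entry.getKey(), false));
      }
    }
    return out;
  }

  /** Groups by file and counts data rows; placeholder rows create a group but add zero. */
  static Map<String, Long> rowCountPerFile(List<ScanRowSketch> rows) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (ScanRowSketch row : rows) {
      counts.merge(row.file, row.placeholder ? 0L : 1L, Long::sum);
    }
    return counts;
  }
}
```

Without the placeholder, an empty file has no group at all in the result. With it, the file appears with a row count of zero, which matches the behavior the comment describes.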
[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380077493

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ##

@@ -511,6 +511,11 @@ private ExecConstants() {
 new OptionDescription("Available as of Drill 1.17. Sets the implicit column name for the lastModifiedTime column. " +
 "For internal usage when producing Metastore analyze."));
+ public static final String IMPLICIT_PROJECT_METADATA_COLUMN_LABEL = "drill.exec.storage.implicit.project_metadata.column.label";
+ public static final OptionValidator IMPLICIT_PROJECT_METADATA_COLUMN_LABEL_VALIDATOR = new StringValidator(IMPLICIT_PROJECT_METADATA_COLUMN_LABEL,
+ new OptionDescription("Available as of Drill 1.18. Sets the implicit column name for the $project_metadata$ column. " +

Review comment:
Good point. I specified the version here and in other places to be consistent with the other option descriptions.
I think the version was added to option descriptions to simplify updating the docs for the Drill web site: there is no need to look up the commit date and the Drill version where the option was added; you can just copy and paste it from the Drill Web UI or from this class.
[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380128453

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/metadata/MetadataAggregateHelper.java ##

@@ -313,8 +317,19 @@ private void addColumnAggregateCalls(FieldReference fieldRef, String fieldName)
 if (interestingColumns == null || interestingColumns.contains(fieldRef)) {
 // collect statistics for all or only interesting columns if they are specified
 AnalyzeColumnUtils.COLUMN_STATISTICS_FUNCTIONS.forEach((statisticsKind, sqlKind) -> {
+ // constructs "case when is not null projectMetadataColumn then column1 else null end" call
+ // to avoid using default values for required columns when data for empty result is obtained

Review comment:
Thanks for pointing this out. Unfortunately, we can't use a plain SQL approach to collect metadata, since we do not have information about the schema, so we create aggregate calls dynamically. But Drill uses built-in aggregate functions for collecting summary statistics (`MIN`, `MAX`, ...).
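
The code comment in the diff above describes wrapping each column in a `CASE ... ELSE NULL END` guard so that a required column's default value does not leak into the statistics for an empty result. A minimal sketch of that idea follows, assuming a hypothetical row type with a boolean `placeholder` marker. It illustrates the guard only and is not the dynamically built aggregate call from the pull request, nor a statement about which marker value the real code tests for.

```java
import java.util.List;
import java.util.Objects;
import java.util.Optional;

// Hypothetical model for illustration; not Drill's aggregate call construction.

/** A row with one required (non-nullable) int column and a placeholder marker. */
final class ValueRowSketch {
  final int value;            // a required column holds a default (here 0) on placeholder rows
  final boolean placeholder;  // true for the row emitted when no data was read

  ValueRowSketch(int value, boolean placeholder) {
    this.value = value;
    this.placeholder = placeholder;
  }
}

final class GuardedAggregateSketch {
  /** Counterpart of "CASE WHEN <row is real> THEN column ELSE NULL END". */
  static Integer guard(ValueRowSketch row) {
    return row.placeholder ? null : row.value;
  }

  /** MIN over the raw column: a placeholder row contributes its default value. */
  static Optional<Integer> naiveMin(List<ValueRowSketch> rows) {
    return rows.stream().map(row -> row.value).min(Integer::compare);
  }

  /** MIN over the guarded column: nulls are skipped, as SQL aggregates skip them. */
  static Optional<Integer> guardedMin(List<ValueRowSketch> rows) {
    return rows.stream()
        .map(GuardedAggregateSketch::guard)
        .filter(Objects::nonNull)
        .min(Integer::compare);
  }
}
```

For a group that contains only the placeholder row, the naive minimum reports the default value, whereas the guarded minimum is empty, which corresponds to a `NULL` statistic for an empty file.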
[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380081293

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/metastore/analyze/MetadataAggregateContext.java ##

@@ -63,8 +67,8 @@ public boolean createNewAggregations() {
 }
 @JsonProperty
- public List excludedColumns() {
-return excludedColumns;
+ public List nonSchemaColumns() {

Review comment:
Thanks, the name `metadataColumns` looks better; I renamed this field.