[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613960#comment-15613960 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/592

> Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
> ---------------------------------------------------------------------------------
>
>                 Key: DRILL-4826
>                 URL: https://issues.apache.org/jira/browse/DRILL-4826
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Parth Chandra
>            Assignee: Padma Penumarthy
>
> Queries against INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.VIEWS slow
> down as the number of views increases.
> BI tools like Tableau issue a query like the following at connection time:
> {code}
> select TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE from
> INFORMATION_SCHEMA.`TABLES` WHERE TABLE_CATALOG LIKE 'DRILL' ESCAPE '\' AND
> TABLE_SCHEMA <> 'sys' AND TABLE_SCHEMA <> 'INFORMATION_SCHEMA' ORDER BY
> TABLE_TYPE, TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME
> {code}
> The time to query the information schema tables degrades as the number of
> views increases. On a test system:
> || Views || Time (secs) ||
> | 500 | 6 |
> | 1000 | 19 |
> | 1500 | 33 |
> This can result in a single connection taking more than a minute to establish.
> The problem occurs because we read the view file for every view, and this
> appears to take most of the time.
> Querying information_schema.tables does not, in fact, need to open the view
> file at all; it merely needs to get a listing of the view files. Eliminating
> the view file read will speed up the query tremendously.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
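The issue description above argues that listing the view files is enough to answer a query on INFORMATION_SCHEMA.TABLES, and that the view files themselves need not be opened. A minimal sketch of that idea follows. It is not the code from the pull request: the class ViewListing and its methods are hypothetical, and the ".view.drill" suffix is assumed to be how view files are named in a workspace directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Drill.
final class ViewListing {
  // Assumption: each view is stored as "<viewName>.view.drill".
  static final String VIEW_SUFFIX = ".view.drill";

  /** Derives view names from file names alone; no file content is read. */
  static List<String> viewNames(Collection<String> fileNames) {
    List<String> names = new ArrayList<>();
    for (String fileName : fileNames) {
      if (fileName.endsWith(VIEW_SUFFIX) && fileName.length() > VIEW_SUFFIX.length()) {
        names.add(fileName.substring(0, fileName.length() - VIEW_SUFFIX.length()));
      }
    }
    Collections.sort(names);
    return names;
  }

  /** Lists a directory once and maps the entries to view names. */
  static List<String> listViewNames(Path workspaceDir) {
    List<String> fileNames = new ArrayList<>();
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(workspaceDir)) {
      for (Path entry : entries) {
        fileNames.add(entry.getFileName().toString());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return viewNames(fileNames);
  }
}
```

The cost of this approach is one directory listing per workspace instead of one file read per view, which is the saving the description is after.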
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606077#comment-15606077 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user sudheeshkatkam commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r84967719

    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/DrillHiveMetaStoreClient.java ---
    @@ -233,6 +236,30 @@ private DrillHiveMetaStoreClient(final HiveConf hiveConf) throws MetaException {
         }
       }
    +  public static List getTablesByNamesByBulkLoadHelper(
    +      final HiveMetaStoreClient mClient, final List tableNames, final String schemaName,
    +      final int bulkSize) {
    +    final int totalTables = tableNames.size();
    +    final List tables = Lists.newArrayList();
    +
    +    // In each round, Drill asks for a sub-list of all the requested tables
    +    for (int fromIndex = 0; fromIndex < totalTables; fromIndex += bulkSize) {
    +      final int toIndex = Math.min(fromIndex + bulkSize, totalTables);
    +      final List eachBulkofTableNames = tableNames.subList(fromIndex, toIndex);
    +      List eachBulkofTables;
    +      // Retries once if the first call to fetch the metadata fails
    +      try {
    +        eachBulkofTables =
    --- End diff --

    `eachBulkofTables = getTableObjectsByNameHelper(mClient, schemaName, eachBulkofTableNames);`
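The diff in the message above fetches table metadata in rounds of at most bulkSize names and retries a failed round once. The same pattern can be sketched independently of Hive as shown below. The names BulkFetch, partition, fetchAll and onFailure are hypothetical and stand in for the metastore call and the reconnect step. Unlike the code under review, which logs a warning and returns an empty list, this sketch lets a second failure propagate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, for illustration only; not part of Drill or Hive.
final class BulkFetch {
  /** Splits names into consecutive sub-lists of at most bulkSize elements. */
  static <T> List<List<T>> partition(List<T> names, int bulkSize) {
    if (bulkSize <= 0) {
      throw new IllegalArgumentException("bulkSize must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int from = 0; from < names.size(); from += bulkSize) {
      int to = Math.min(from + bulkSize, names.size());
      batches.add(names.subList(from, to));
    }
    return batches;
  }

  /** Fetches every batch; a failed batch is retried once after onFailure runs. */
  static <N, R> List<R> fetchAll(List<N> names, int bulkSize,
      Function<List<N>, List<R>> fetcher, Runnable onFailure) {
    List<R> results = new ArrayList<>();
    for (List<N> batch : partition(names, bulkSize)) {
      List<R> fetched;
      try {
        fetched = fetcher.apply(batch);
      } catch (RuntimeException firstFailure) {
        onFailure.run();                // stands in for a reconnect
        fetched = fetcher.apply(batch); // a second failure propagates to the caller
      }
      results.addAll(fetched);
    }
    return results;
  }
}
```

Keeping the batching in one helper, as the review comment suggests for getTableObjectsByNameHelper, means the retry policy is written once rather than in every schema implementation.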
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573421#comment-15573421 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r83324591

    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/schema/HiveDatabaseSchema.java ---
    @@ -78,32 +79,49 @@ public String getTypeName() {
       }
       @Override
    -  public List> getTablesByNamesByBulkLoad(final List tableNames) {
    +  public List> getTablesByNamesByBulkLoad(final List tableNames,
    +      final int bulkSize) {
    +    final int totalTables = tableNames.size();
         final String schemaName = getName();
    -    final List> tableNameToTable = Lists.newArrayList();
    -    List tables;
    -    try {
    -      tables = DrillHiveMetaStoreClient.getTableObjectsByNameHelper(mClient, schemaName, tableNames);
    -    } catch (TException e) {
    -      logger.warn("Exception occurred while trying to list tables by names from {}: {}", schemaName, e.getCause());
    -      return tableNameToTable;
    +    final List tables = Lists.newArrayList();
    +
    +    // In each round, Drill asks for a sub-list of all the requested tables
    +    for (int fromIndex = 0; fromIndex < totalTables; fromIndex += bulkSize) {
    +      final int toIndex = Math.min(fromIndex + bulkSize, totalTables);
    +      final List eachBulkofTableNames = tableNames.subList(fromIndex, toIndex);
    +      List eachBulkofTables;
    +      // Retries once if the first call to fetch the metadata fails
    +      synchronized (mClient) {
    +        try {
    +          eachBulkofTables = mClient.getTableObjectsByName(schemaName, eachBulkofTableNames);
    +        } catch (TException tException) {
    +          try {
    +            mClient.reconnect();
    +            eachBulkofTables = mClient.getTableObjectsByName(schemaName, eachBulkofTableNames);
    +          } catch (Exception e) {
    +            logger.warn("Exception occurred while trying to read tables from {}: {}", schemaName,
    +                e.getCause());
    +            return ImmutableList.of();
    +          }
    +        }
    +        tables.addAll(eachBulkofTables);
    +      }
         }
    -    for(final org.apache.hadoop.hive.metastore.api.Table table : tables) {
    -      if(table == null) {
    +    final List> tableNameToTable = Lists.newArrayList();
    +    for (final org.apache.hadoop.hive.metastore.api.Table table : tables) {
    +      if (table == null) {
    --- End diff --

    can this table be null?
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573420#comment-15573420 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r83324461

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/ischema/InfoSchemaRecordGenerator.java ---
    @@ -290,28 +290,30 @@ public Tables(OptionManager optionManager) {
           return new PojoRecordReader<>(Records.Table.class, records.iterator());
         }
    -    @Override
    -    public void visitTables(String schemaPath, SchemaPlus schema) {
    +    @Override public void visitTables(String schemaPath, SchemaPlus schema) {
    --- End diff --

    why this change?
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573419#comment-15573419 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r83323951

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/AbstractSchema.java ---
    @@ -231,4 +231,21 @@ public void dropTable(String tableName) {
         }
         return tables;
       }
    -}
    \ No newline at end of file
    +
    +  public List> getTableNamesAndTypes(boolean bulkLoad, int bulkSize) {
    +    final List tableNames = Lists.newArrayList(getTableNames());
    +    final List> tableNamesAndTypes = Lists.newArrayList();
    +    final List> tables;
    +    if (bulkLoad) {
    +      tables = getTablesByNamesByBulkLoad(tableNames, bulkSize);
    --- End diff --

    Why do we even have this option to do bulkLoad or not? Why not just always do bulkLoad?
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508098#comment-15508098 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user parthchandra commented on the issue:

    https://github.com/apache/drill/pull/592

    Updated to address review comments

> Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
> ---------------------------------------------------------------------------------
>
>                 Key: DRILL-4826
>                 URL: https://issues.apache.org/jira/browse/DRILL-4826
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507922#comment-15507922 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79721752

    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/schema/HiveDatabaseSchema.java ---
    @@ -78,17 +79,34 @@ public String getTypeName() {
       }
       @Override
    -  public List> getTablesByNamesByBulkLoad(final List tableNames) {
    +  public List> getTablesByNamesByBulkLoad(final List tableNames, final int bulkSize) {
    +    final int totalTables = tableNames.size();
         final String schemaName = getName();
    -    final List> tableNameToTable = Lists.newArrayList();
    -    List tables;
    -    try {
    -      tables = DrillHiveMetaStoreClient.getTableObjectsByNameHelper(mClient, schemaName, tableNames);
    -    } catch (TException e) {
    -      logger.warn("Exception occurred while trying to list tables by names from {}: {}", schemaName, e.getCause());
    -      return tableNameToTable;
    +    final List tables = Lists.newArrayList();
    +
    +    // In each round, Drill asks for a sub-list of all the requested tables
    +    for(int fromIndex = 0; fromIndex < totalTables; fromIndex += bulkSize) {
    --- End diff --

    Where?
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507918#comment-15507918 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79721387

    --- Diff: exec/jdbc/src/test/java/org/apache/drill/jdbc/test/TestJdbcQuery.java ---
    @@ -122,6 +122,7 @@ public void testLikeNotLike() throws Exception{
         );
       }
    +  @Ignore("Returns results in different order depeding on forkCount")
    --- End diff --

    Or maybe I should just order the results?
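The comment above proposes ordering the results rather than ignoring the test. One way to make such a check independent of row order, sketched here under the assumption that each row can be rendered as a String, is to sort both the expected and the actual rows before comparing them. The class OrderInsensitive is hypothetical and is not taken from the Drill test suite; adding an ORDER BY clause to the query under test would be the other obvious route.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only; not part of the Drill tests.
final class OrderInsensitive {
  /** Returns a sorted copy, leaving the caller's list untouched. */
  static List<String> sorted(List<String> rows) {
    List<String> copy = new ArrayList<>(rows);
    Collections.sort(copy);
    return copy;
  }

  /** True when both lists hold the same rows with the same multiplicity, in any order. */
  static boolean sameRows(List<String> expected, List<String> actual) {
    return sorted(expected).equals(sorted(actual));
  }
}
```

Sorting keeps duplicates significant, so a result that repeats or drops a row still fails the comparison.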
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507919#comment-15507919 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79721423

    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/schema/HiveDatabaseSchema.java ---
    @@ -78,17 +79,34 @@ public String getTypeName() {
       }
       @Override
    -  public List> getTablesByNamesByBulkLoad(final List tableNames) {
    +  public List> getTablesByNamesByBulkLoad(final List tableNames, final int bulkSize) {
    +    final int totalTables = tableNames.size();
         final String schemaName = getName();
    -    final List> tableNameToTable = Lists.newArrayList();
    -    List tables;
    -    try {
    -      tables = DrillHiveMetaStoreClient.getTableObjectsByNameHelper(mClient, schemaName, tableNames);
    -    } catch (TException e) {
    -      logger.warn("Exception occurred while trying to list tables by names from {}: {}", schemaName, e.getCause());
    -      return tableNameToTable;
    +    final List tables = Lists.newArrayList();
    +
    +    // In each round, Drill asks for a sub-list of all the requested tables
    +    for(int fromIndex = 0; fromIndex < totalTables; fromIndex += bulkSize) {
    +      final int toIndex = Math.min(fromIndex + bulkSize, totalTables);
    +      final List eachBulkofTableNames = tableNames.subList(fromIndex, toIndex);
    +      List eachBulkofTables;
    +      // Retries once if the first call to fetch the metadata fails
    +      synchronized(mClient) {
    --- End diff --

    see previous comment
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507911#comment-15507911 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79721140

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/ischema/InfoSchemaRecordGenerator.java ---
    @@ -290,28 +290,30 @@ public Tables(OptionManager optionManager) {
           return new PojoRecordReader<>(Records.Table.class, records.iterator());
         }
    -    @Override
    -    public void visitTables(String schemaPath, SchemaPlus schema) {
    +    @Override public void visitTables(String schemaPath, SchemaPlus schema) {
           final AbstractSchema drillSchema = schema.unwrap(AbstractSchema.class);
    +      final List> tableNamesAndTypes = drillSchema
    +          .getTableNamesAndTypes(optionManager.getOption(ExecConstants.ENABLE_BULK_LOAD_TABLE_LIST),
    +              (int)optionManager.getOption(ExecConstants.BULK_LOAD_TABLE_LIST_BULK_SIZE));
    -      final List tableNames = Lists.newArrayList(schema.getTableNames());
    -      final List> tableNameToTables;
    -      if(optionManager.getOption(ExecConstants.ENABLE_BULK_LOAD_TABLE_LIST)) {
    -        tableNameToTables = drillSchema.getTablesByNamesByBulkLoad(tableNames);
    -      } else {
    -        tableNameToTables = drillSchema.getTablesByNames(tableNames);
    -      }
    -
    -      for(Pair tableNameToTable : tableNameToTables) {
    -        final String tableName = tableNameToTable.getKey();
    -        final Table table = tableNameToTable.getValue();
    +      for (Pair tableNameAndType : tableNamesAndTypes) {
    +        final String tableName = tableNameAndType.getKey();
    +        final TableType tableType = tableNameAndType.getValue();
             // Visit the table, and if requested ...
    -        if(shouldVisitTable(schemaPath, tableName)) {
    -          visitTable(schemaPath, tableName, table);
    +        if (shouldVisitTable(schemaPath, tableName)) {
    +          visitTableWithType(schemaPath, tableName, tableType);
             }
           }
         }
    +    public boolean visitTableWithType(String schemaName, String tableName, TableType type) {
    +      Preconditions
    +          .checkNotNull(type, "Error. Type information for table %s.%s provided is null.", schemaName,
    +              tableName);
    +      records.add(new Records.Table(IS_CATALOG_NAME, schemaName, tableName, type.toString()));
    +      return false;
    --- End diff --

    To keep it similar to visitTable, which does the same, unnecessarily. I suppose I could change it to return void.
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507882#comment-15507882 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79719241

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/WorkspaceSchemaFactory.java ---
    @@ -738,5 +740,49 @@ public void dropTable(String table) {
               .build(logger);
           }
         }
    +
    +    @Override public List> getTableNamesAndTypes(boolean bulkLoad, int bulkSize) {
    --- End diff --

    IntelliJ keeps reformatting this to be on the same line! Will fix.
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507879#comment-15507879 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79719219

    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/schema/HiveDatabaseSchema.java ---
    @@ -78,17 +79,34 @@ public String getTypeName() {
       }
       @Override
    -  public List> getTablesByNamesByBulkLoad(final List tableNames) {
    +  public List> getTablesByNamesByBulkLoad(final List tableNames, final int bulkSize) {
    +    final int totalTables = tableNames.size();
         final String schemaName = getName();
    -    final List> tableNameToTable = Lists.newArrayList();
    -    List tables;
    -    try {
    -      tables = DrillHiveMetaStoreClient.getTableObjectsByNameHelper(mClient, schemaName, tableNames);
    -    } catch (TException e) {
    -      logger.warn("Exception occurred while trying to list tables by names from {}: {}", schemaName, e.getCause());
    -      return tableNameToTable;
    +    final List tables = Lists.newArrayList();
    +
    +    // In each round, Drill asks for a sub-list of all the requested tables
    +    for(int fromIndex = 0; fromIndex < totalTables; fromIndex += bulkSize) {
    +      final int toIndex = Math.min(fromIndex + bulkSize, totalTables);
    +      final List eachBulkofTableNames = tableNames.subList(fromIndex, toIndex);
    +      List eachBulkofTables;
    +      // Retries once if the first call to fetch the metadata fails
    +      synchronized(mClient) {
    --- End diff --

    This is refactored code from the fix for DRILL-4577 (https://github.com/apache/drill/pull/461). I didn't really change it. Going through the code, it appears that mClient may be cached and reused, and so probably should be synchronized.
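The comment above reasons that a client which may be cached and reused should be guarded by a lock. A minimal sketch of that guard follows, assuming only that the client object is shared between threads and is not itself thread-safe. SharedClientAccess and withClient are hypothetical names; the pull request simply wraps the calls in synchronized (mClient).

```java
import java.util.function.Function;

// Hypothetical helper, for illustration only; not part of Drill.
final class SharedClientAccess {
  /**
   * Runs the call while holding the client's monitor, so a cached client
   * is used by at most one thread at a time.
   */
  static <C, R> R withClient(C client, Function<C, R> call) {
    synchronized (client) {
      return call.apply(client);
    }
  }
}
```

Placing the first attempt, the reconnect and the retry inside the same synchronized block matters: otherwise another thread could use the client between a failed call and its reconnect.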
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507515#comment-15507515 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user gparai commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79664175

    --- Diff: exec/jdbc/src/test/java/org/apache/drill/jdbc/test/TestJdbcQuery.java ---
    @@ -122,6 +122,7 @@ public void testLikeNotLike() throws Exception{
         );
       }

    +  @Ignore("Returns results in different order depeding on forkCount")
    --- End diff --

    typo: depending
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507513#comment-15507513 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user gparai commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79688118

    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/schema/HiveDatabaseSchema.java ---
    @@ -78,17 +79,34 @@ public String getTypeName() {
       }

       @Override
    -  public List> getTablesByNamesByBulkLoad(final List tableNames) {
    +  public List> getTablesByNamesByBulkLoad(final List tableNames, final int bulkSize) {
    +    final int totalTables = tableNames.size();
         final String schemaName = getName();
    -    final List> tableNameToTable = Lists.newArrayList();
    -    List tables;
    -    try {
    -      tables = DrillHiveMetaStoreClient.getTableObjectsByNameHelper(mClient, schemaName, tableNames);
    -    } catch (TException e) {
    -      logger.warn("Exception occurred while trying to list tables by names from {}: {}", schemaName, e.getCause());
    -      return tableNameToTable;
    +    final List tables = Lists.newArrayList();
    +
    +    // In each round, Drill asks for a sub-list of all the requested tables
    +    for(int fromIndex = 0; fromIndex < totalTables; fromIndex += bulkSize) {
    --- End diff --

    Space?
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507516#comment-15507516 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user gparai commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79663027

    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/schema/HiveDatabaseSchema.java ---
    @@ -108,7 +126,7 @@ public String getTypeName() {
         return tableNameToTable;
       }

    -  private static class HiveTableWithoutStatisticAndRowType implements Table {
    +  private static class HiveTableWithoutStatisticAndRowType implements Table {
    --- End diff --

    Extra space
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507517#comment-15507517 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user gparai commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79687833

    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/schema/HiveDatabaseSchema.java ---
    @@ -78,17 +79,34 @@ public String getTypeName() {
       }

       @Override
    -  public List> getTablesByNamesByBulkLoad(final List tableNames) {
    +  public List> getTablesByNamesByBulkLoad(final List tableNames, final int bulkSize) {
    +    final int totalTables = tableNames.size();
         final String schemaName = getName();
    -    final List> tableNameToTable = Lists.newArrayList();
    -    List tables;
    -    try {
    -      tables = DrillHiveMetaStoreClient.getTableObjectsByNameHelper(mClient, schemaName, tableNames);
    -    } catch (TException e) {
    -      logger.warn("Exception occurred while trying to list tables by names from {}: {}", schemaName, e.getCause());
    -      return tableNameToTable;
    +    final List tables = Lists.newArrayList();
    +
    +    // In each round, Drill asks for a sub-list of all the requested tables
    +    for(int fromIndex = 0; fromIndex < totalTables; fromIndex += bulkSize) {
    +      final int toIndex = Math.min(fromIndex + bulkSize, totalTables);
    +      final List eachBulkofTableNames = tableNames.subList(fromIndex, toIndex);
    +      List eachBulkofTables;
    +      // Retries once if the first call to fetch the metadata fails
    +      synchronized(mClient) {
    --- End diff --

    why do we synchronize?
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507514#comment-15507514 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user gparai commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79689068

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
    @@ -291,6 +291,9 @@
       String ENABLE_BULK_LOAD_TABLE_LIST_KEY = "exec.enable_bulk_load_table_list";
       BooleanValidator ENABLE_BULK_LOAD_TABLE_LIST = new BooleanValidator(ENABLE_BULK_LOAD_TABLE_LIST_KEY, false);

    +  String BULK_LOAD_TABLE_LIST_BULK_SIZE_KEY = "exec.bulk_load_table_list.bulk_size";
    --- End diff --

    Maybe a comment to describe the option?
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507222#comment-15507222 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user sudheeshkatkam commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79667525

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/ischema/InfoSchemaRecordGenerator.java ---
    @@ -290,28 +290,30 @@ public Tables(OptionManager optionManager) {
           return new PojoRecordReader<>(Records.Table.class, records.iterator());
         }

    -    @Override
    -    public void visitTables(String schemaPath, SchemaPlus schema) {
    +    @Override public void visitTables(String schemaPath, SchemaPlus schema) {
           final AbstractSchema drillSchema = schema.unwrap(AbstractSchema.class);
    +      final List> tableNamesAndTypes = drillSchema
    +          .getTableNamesAndTypes(optionManager.getOption(ExecConstants.ENABLE_BULK_LOAD_TABLE_LIST),
    +              (int)optionManager.getOption(ExecConstants.BULK_LOAD_TABLE_LIST_BULK_SIZE));
    -      final List tableNames = Lists.newArrayList(schema.getTableNames());
    -      final List> tableNameToTables;
    -      if(optionManager.getOption(ExecConstants.ENABLE_BULK_LOAD_TABLE_LIST)) {
    -        tableNameToTables = drillSchema.getTablesByNamesByBulkLoad(tableNames);
    -      } else {
    -        tableNameToTables = drillSchema.getTablesByNames(tableNames);
    -      }
    -
    -      for(Pair tableNameToTable : tableNameToTables) {
    -        final String tableName = tableNameToTable.getKey();
    -        final Table table = tableNameToTable.getValue();
    +      for (Pair tableNameAndType : tableNamesAndTypes) {
    +        final String tableName = tableNameAndType.getKey();
    +        final TableType tableType = tableNameAndType.getValue();
             // Visit the table, and if requested ...
    -        if(shouldVisitTable(schemaPath, tableName)) {
    -          visitTable(schemaPath, tableName, table);
    +        if (shouldVisitTable(schemaPath, tableName)) {
    +          visitTableWithType(schemaPath, tableName, tableType);
             }
           }
         }
    +    public boolean visitTableWithType(String schemaName, String tableName, TableType type) {
    +      Preconditions
    +          .checkNotNull(type, "Error. Type information for table %s.%s provided is null.", schemaName,
    +              tableName);
    +      records.add(new Records.Table(IS_CATALOG_NAME, schemaName, tableName, type.toString()));
    +      return false;
    --- End diff --

    why return `false`?
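The diff above replaces full table objects with (name, type) pairs when producing INFORMATION_SCHEMA.TABLES rows. A minimal sketch of that idea, under stated assumptions: `TableTypeListingSketch`, its `TableType` enum, and `toRows` are hypothetical names, a `Map` stands in for the list of pairs, and `Objects.requireNonNull` stands in for the Guava precondition used in the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration; not Drill's InfoSchemaRecordGenerator.
final class TableTypeListingSketch {
  private TableTypeListingSketch() {}

  enum TableType { TABLE, VIEW }

  /** Builds one (catalog, schema, name, type) row per entry without loading any table. */
  static List<List<String>> toRows(
      String catalog, String schemaName, Map<String, TableType> namesAndTypes) {
    final List<List<String>> records = new ArrayList<>();
    for (Map.Entry<String, TableType> entry : namesAndTypes.entrySet()) {
      final String tableName = entry.getKey();
      // Fail fast if a schema reports a table without a type.
      final TableType type = Objects.requireNonNull(
          entry.getValue(),
          "Type information for table " + schemaName + "." + tableName + " is null");
      records.add(Arrays.asList(catalog, schemaName, tableName, type.toString()));
    }
    return records;
  }
}
```

Only the name and the type string reach the output row, which is why the table (or view) definition never has to be read in this path.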
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507218#comment-15507218 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user sudheeshkatkam commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79664962

    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/schema/HiveDatabaseSchema.java ---
    @@ -78,17 +79,34 @@ public String getTypeName() {
       }

       @Override
    -  public List> getTablesByNamesByBulkLoad(final List tableNames) {
    +  public List> getTablesByNamesByBulkLoad(final List tableNames, final int bulkSize) {
    +    final int totalTables = tableNames.size();
         final String schemaName = getName();
    -    final List> tableNameToTable = Lists.newArrayList();
    -    List tables;
    -    try {
    -      tables = DrillHiveMetaStoreClient.getTableObjectsByNameHelper(mClient, schemaName, tableNames);
    -    } catch (TException e) {
    -      logger.warn("Exception occurred while trying to list tables by names from {}: {}", schemaName, e.getCause());
    -      return tableNameToTable;
    +    final List tables = Lists.newArrayList();
    +
    +    // In each round, Drill asks for a sub-list of all the requested tables
    +    for(int fromIndex = 0; fromIndex < totalTables; fromIndex += bulkSize) {
    +      final int toIndex = Math.min(fromIndex + bulkSize, totalTables);
    +      final List eachBulkofTableNames = tableNames.subList(fromIndex, toIndex);
    +      List eachBulkofTables;
    +      // Retries once if the first call to fetch the metadata fails
    +      synchronized(mClient) {
    +        try {
    +          eachBulkofTables = mClient.getTableObjectsByName(schemaName, eachBulkofTableNames);
    --- End diff --

    + Why not use the helper? Exception handling and reconnecting logic is different in the helper methods in [DrillHiveMetaStoreClient](https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/DrillHiveMetaStoreClient.java#L222).
    + Move this logic to a method in that class?
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507220#comment-15507220 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user sudheeshkatkam commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79666399

    --- Diff: exec/jdbc/src/test/java/org/apache/drill/jdbc/test/TestJdbcQuery.java ---
    @@ -122,6 +122,7 @@ public void testLikeNotLike() throws Exception{
         );
       }

    +  @Ignore("Returns results in different order depeding on forkCount")
    --- End diff --

    Is this a regression due to this patch? Otherwise, open a ticket for this issue.
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507221#comment-15507221 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user sudheeshkatkam commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79664917

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/WorkspaceSchemaFactory.java ---
    @@ -738,5 +740,49 @@ public void dropTable(String table) {
                 .build(logger);
           }
         }
    +
    +    @Override public List> getTableNamesAndTypes(boolean bulkLoad, int bulkSize) {
    --- End diff --

    Add annotation in a line above? There are other places too.
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507217#comment-15507217 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user sudheeshkatkam commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79665593

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/ischema/InfoSchemaRecordGenerator.java ---
    @@ -55,6 +54,7 @@
      * schema, table or field.
      */
     public abstract class InfoSchemaRecordGenerator {
    +  static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(InfoSchemaRecordGenerator.class);
    --- End diff --

    private
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507219#comment-15507219 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

Github user sudheeshkatkam commented on a diff in the pull request:

    https://github.com/apache/drill/pull/592#discussion_r79663056

    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/schema/HiveDatabaseSchema.java ---
    @@ -78,17 +79,34 @@ public String getTypeName() {
       }

       @Override
    -  public List> getTablesByNamesByBulkLoad(final List tableNames) {
    +  public List> getTablesByNamesByBulkLoad(final List tableNames, final int bulkSize) {
    +    final int totalTables = tableNames.size();
         final String schemaName = getName();
    -    final List> tableNameToTable = Lists.newArrayList();
    -    List tables;
    -    try {
    -      tables = DrillHiveMetaStoreClient.getTableObjectsByNameHelper(mClient, schemaName, tableNames);
    -    } catch (TException e) {
    -      logger.warn("Exception occurred while trying to list tables by names from {}: {}", schemaName, e.getCause());
    -      return tableNameToTable;
    +    final List tables = Lists.newArrayList();
    +
    +    // In each round, Drill asks for a sub-list of all the requested tables
    +    for(int fromIndex = 0; fromIndex < totalTables; fromIndex += bulkSize) {
    +      final int toIndex = Math.min(fromIndex + bulkSize, totalTables);
    +      final List eachBulkofTableNames = tableNames.subList(fromIndex, toIndex);
    +      List eachBulkofTables;
    +      // Retries once if the first call to fetch the metadata fails
    +      synchronized(mClient) {
    --- End diff --

    why synchronized?
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507131#comment-15507131 ]

ASF GitHub Bot commented on DRILL-4826:
---------------------------------------

GitHub user parthchandra opened a pull request:

    https://github.com/apache/drill/pull/592

    DRILL-4826: Query against INFORMATION_SCHEMA.TABLES degrades as the n…

    …umber of views increases

    Changed to get information for all views in a single call instead of one by one

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/parthchandra/drill DRILL-4826

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/592.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #592

----
commit 07fbb0ac224e53299217263cf2d0510482a4c9b3
Author: Parth Chandra
Date:   2016-08-04T06:02:01Z

    DRILL-4826: Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
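The issue attributes the slowdown to reading every view file, when a directory listing is enough to name the views. A minimal sketch of the listing-only approach, under loud assumptions: it uses the local filesystem through `java.nio.file` rather than Drill's DFS layer, `ViewListingSketch` and `listViewNames` are hypothetical names, and the `.view.drill` suffix is an assumed naming convention for view files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not Drill's WorkspaceSchemaFactory.
final class ViewListingSketch {
  private ViewListingSketch() {}

  // Assumed suffix for view definition files.
  static final String VIEW_SUFFIX = ".view.drill";

  /** Returns view names from one directory listing; no view file is opened. */
  static List<String> listViewNames(Path workspaceDir) {
    final List<String> names = new ArrayList<>();
    try (DirectoryStream<Path> entries =
        Files.newDirectoryStream(workspaceDir, "*" + VIEW_SUFFIX)) {
      for (Path entry : entries) {
        final String fileName = entry.getFileName().toString();
        // Strip the suffix to recover the view name.
        names.add(fileName.substring(0, fileName.length() - VIEW_SUFFIX.length()));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    Collections.sort(names);
    return names;
  }
}
```

The cost here is one listing call per directory instead of one file read per view, which is the shape of improvement the pull request description claims; the actual numbers would have to come from a measurement on a real cluster.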
[jira] [Commented] (DRILL-4826) Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
[ https://issues.apache.org/jira/browse/DRILL-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491044#comment-15491044 ]

Joel Bondurant commented on DRILL-4826:
---

0: jdbc:drill:> select TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE from INFORMATION_SCHEMA.`TABLES` WHERE TABLE_CATALOG LIKE 'DRILL' ESCAPE '\' AND TABLE_SCHEMA <> 'sys' AND TABLE_SCHEMA <> 'INFORMATION_SCHEMA'ORDER BY TABLE_TYPE, TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME;
+---------------+--------------+------------+------------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | TABLE_TYPE |
+---------------+--------------+------------+------------+
+---------------+--------------+------------+------------+
No rows selected (534.714 seconds)

> Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases
> ---------------------------------------------------------------------------------
>
> Key: DRILL-4826
> URL: https://issues.apache.org/jira/browse/DRILL-4826
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Parth Chandra
> Assignee: Parth Chandra
>
> Queries against INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.VIEWS slow
> down as the number of views increases.
> BI tools like Tableau issue a query like the following at connection time:
> {code}
> select TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE from
> INFORMATION_SCHEMA.`TABLES` WHERE TABLE_CATALOG LIKE 'DRILL' ESCAPE '\' AND
> TABLE_SCHEMA <> 'sys' AND TABLE_SCHEMA <> 'INFORMATION_SCHEMA'ORDER BY
> TABLE_TYPE, TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME
> {code}
> The time to query the information schema tables degrades as the number of
> views increases. On a test system:
> || Views || Time(secs) ||
> |500 | 6 |
> |1000 | 19 |
> |1500 | 33 |
> This can result in a single connection taking more than a minute to establish.
> The problem occurs because we read the view file for every view and this
> appears to take most of the time.
> Querying information_schema.tables does not, in fact, need to open the view
> file at all, it merely needs to get a listing of the view files. Eliminating
> the view file read will speed up the query tremendously.
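
For illustration only, the point in the description that a listing of the view files is enough can be sketched in plain Java. This is not the code from the pull request: the class name ViewListingSketch, the method listViewNames, and the use of a local directory with java.nio are all hypothetical, and the ".view.drill" file suffix is an assumption about how view definitions are stored. The sketch derives view names from file names alone and never opens a view file.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; not part of the Drill code base.
final class ViewListingSketch {
    // Assumed suffix of view definition files.
    static final String VIEW_SUFFIX = ".view.drill";

    /**
     * Returns the names of all views in the given directory, sorted.
     * Only the directory listing is consulted; no view file is read.
     */
    static List<String> listViewNames(Path workspaceDir) throws IOException {
        List<String> names = new ArrayList<>();
        // The glob is matched against file names, so other files are skipped.
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(workspaceDir, "*" + VIEW_SUFFIX)) {
            for (Path file : stream) {
                String fileName = file.getFileName().toString();
                // Strip the suffix to recover the view name.
                names.add(fileName.substring(0, fileName.length() - VIEW_SUFFIX.length()));
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Small demonstration on a temporary directory.
        Path dir = Files.createTempDirectory("views");
        Files.write(dir.resolve("orders_v.view.drill"), "{}".getBytes());
        Files.write(dir.resolve("customers_v.view.drill"), "{}".getBytes());
        Files.write(dir.resolve("data.parquet"), new byte[0]);
        System.out.println(listViewNames(dir));
    }
}
```

With this shape, the cost of answering a TABLES query grows with one directory listing rather than with one file open and parse per view, which is the saving the description argues for.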
-- This message was sent by Atlassian JIRA (v6.3.4#6332)