[GitHub] carbondata pull request #3023: [CARBONDATA-3197][BloomDataMap] Include bloom...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3023#discussion_r246663278 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/CarbonEnv.scala --- @@ -184,6 +184,9 @@ object CarbonEnv { .addListener(classOf[LoadTablePostExecutionEvent], new MergeIndexEventListener) .addListener(classOf[AlterTableCompactionPostEvent], new MergeIndexEventListener) .addListener(classOf[AlterTableMergeIndexEvent], new MergeIndexEventListener) + .addListener(classOf[LoadTablePreStatusUpdateEvent], new MergeBloomIndexEventListener) --- End diff -- @jackylk @ravipesala @KanakaKumar Please check and review ---
[GitHub] carbondata pull request #3059: [HOTFIX][DataLoad]fix task assignment issue u...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3059#discussion_r246376652 --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonLoaderUtil.java --- @@ -609,6 +609,14 @@ public static Dictionary getDictionary(AbsoluteTableIdentifier absoluteTableIden blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_SIZE_FIRST; } else { blockAssignmentStrategy = BlockAssignmentStrategy.BLOCK_NUM_FIRST; + // fall back to BLOCK_NUM_FIRST strategy need to reset + // the average expected size for each node + if (blockInfos.size() > 0) { --- End diff -- could be set to some value if use NODE_MIN_SIZE_FIRST but fall back to BLOCK_NUM_FIRST ---
[GitHub] carbondata pull request #3054: [CARBONDATA-3232] Add example and doc for all...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3054#discussion_r245902021 --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/util/ExampleUtils.scala --- @@ -30,13 +30,20 @@ object ExampleUtils { .getCanonicalPath val storeLocation: String = currentPath + "/target/store" - def createCarbonSession(appName: String, workThreadNum: Int = 1): SparkSession = { + def createCarbonSession (appName: String, workThreadNum: Int = 1, + storePath: String = null): SparkSession = { val rootPath = new File(this.getClass.getResource("/").getPath -+ "../../../..").getCanonicalPath -val storeLocation = s"$rootPath/examples/spark2/target/store" + + "../../../..").getCanonicalPath + val warehouse = s"$rootPath/examples/spark2/target/warehouse" val metaStoreDB = s"$rootPath/examples/spark2/target" +val storeLocation = if (null != storePath) { + storePath; --- End diff -- no need for `;` ---
[GitHub] carbondata pull request #3054: [CARBONDATA-3232] Optimize carbonData using a...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3054#discussion_r245641130 --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/util/ExampleUtils.scala --- @@ -30,13 +30,17 @@ object ExampleUtils { .getCanonicalPath val storeLocation: String = currentPath + "/target/store" - def createCarbonSession(appName: String, workThreadNum: Int = 1): SparkSession = { + def createCarbonSession (appName: String, workThreadNum: Int = 1, + storePath: String = null): SparkSession = { val rootPath = new File(this.getClass.getResource("/").getPath -+ "../../../..").getCanonicalPath -val storeLocation = s"$rootPath/examples/spark2/target/store" + + "../../../..").getCanonicalPath +var storeLocation = s"$rootPath/examples/spark2/target/store" val warehouse = s"$rootPath/examples/spark2/target/warehouse" val metaStoreDB = s"$rootPath/examples/spark2/target" +if (storePath != null) { + storeLocation = storePath; +} --- End diff -- ```suggestion val storeLocation = if (null != storePath) { storePath } else { s"$rootPath/examples/spark2/target/store" } ``` ---
[GitHub] carbondata pull request #3054: [CARBONDATA-3232] Optimize carbonData using a...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3054#discussion_r245639998 --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/AlluxioExample.scala --- @@ -26,48 +30,88 @@ import org.apache.carbondata.examples.util.ExampleUtils /** - * configure alluxio: - * 1.start alluxio - * 2.upload the jar :"/alluxio_path/core/client/target/ - * alluxio-core-client-YOUR-VERSION-jar-with-dependencies.jar" - * 3.Get more detail at:http://www.alluxio.org/docs/master/en/Running-Spark-on-Alluxio.html - */ + * configure alluxio: + * 1.start alluxio + * 2.upload the jar: "/alluxio_path/core/client/target/ + * alluxio-core-client-YOUR-VERSION-jar-with-dependencies.jar" + * 3.Get more detail at:http://www.alluxio.org/docs/master/en/Running-Spark-on-Alluxio.html + */ object AlluxioExample { - def main(args: Array[String]) { -val spark = ExampleUtils.createCarbonSession("AlluxioExample") -exampleBody(spark) -spark.close() + def main (args: Array[String]) { +val carbon = ExampleUtils.createCarbonSession("AlluxioExample", + storePath = "alluxio://localhost:19998/carbondata") +exampleBody(carbon) +carbon.close() } - def exampleBody(spark : SparkSession): Unit = { + def exampleBody (spark: SparkSession): Unit = { +val rootPath = new File(this.getClass.getResource("/").getPath + + "../../../..").getCanonicalPath spark.sparkContext.hadoopConfiguration.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem") FileFactory.getConfiguration.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem") // Specify date format based on raw data CarbonProperties.getInstance() .addProperty(CarbonCommonConstants.CARBON_DATE_FORMAT, "/MM/dd") -spark.sql("DROP TABLE IF EXISTS alluxio_table") +val mFsShell = new FileSystemShell() +val localFile = rootPath + "/hadoop/src/test/resources/data.csv" +val remotePath = "/carbon_alluxio.csv" +val remoteFile = "alluxio://localhost:19998/carbon_alluxio.csv" +mFsShell.run("rm", remotePath) --- End diff -- As an example, I think we should not do this operation ---
[GitHub] carbondata pull request #3054: [CARBONDATA-3232] Optimize carbonData using a...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3054#discussion_r245639533 --- Diff: docs/Integration/alluxio-guide.md --- @@ -0,0 +1,44 @@ + + + +# Presto guide +This tutorial provides a quick introduction to using Alluxio. + +## How to use Alluxio for CarbonData? +### Install and start Alluxio +Please refer to [https://www.alluxio.org/docs/1.8/en/Getting-Started.html#starting-alluxio](https://www.alluxio.org/docs/1.8/en/Getting-Started.html#starting-alluxio) +Access the Alluxio web: [http://localhost:1/home](http://localhost:1/home) +By command, for example: +```$xslt +./bin/alluxio fs ls / +``` +Result: +``` +drwxr-xr-x xubo staff1 NOT_PERSISTED 01-07-2019 15:39:24:960 DIR /carbondata +-rw-r--r-- xubo staff50686 NOT_PERSISTED 01-07-2019 11:37:48:924 100% /data.csv +``` +### Upload Alluxio jar to CarbonData +Upload the jar "/alluxio_path/client/alluxio-YOUR-VERSION-client.jar" to CarbonData --- End diff -- "Upload to CarbonData" is confusing. What we need to do is to add the alluxio client jar to classpath, right? ---
[GitHub] carbondata pull request #3045: [CARBONDATA-3222]Fix dataload failure after c...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3045#discussion_r245511058 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/preaaggregate/PreAggregateTableHelper.scala --- @@ -110,7 +110,29 @@ case class PreAggregateTableHelper( // Datamap table name and columns are automatically added prefix with parent table name // in carbon. For convenient, users can type column names same as the ones in select statement // when config dmproperties, and here we update column names with prefix. -val longStringColumn = tableProperties.get(CarbonCommonConstants.LONG_STRING_COLUMNS) +// If longStringColumn is not present in dm properties then we take long_string_columns from +// the parent table. +var longStringColumn = tableProperties.get(CarbonCommonConstants.LONG_STRING_COLUMNS) --- End diff -- The codes is not consistent with comments in line 113. Please fix it ---
[GitHub] carbondata pull request #3045: [CARBONDATA-3222]Fix dataload failure after c...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3045#discussion_r245511008 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/preaaggregate/PreAggregateTableHelper.scala --- @@ -110,7 +110,29 @@ case class PreAggregateTableHelper( // Datamap table name and columns are automatically added prefix with parent table name // in carbon. For convenient, users can type column names same as the ones in select statement // when config dmproperties, and here we update column names with prefix. -val longStringColumn = tableProperties.get(CarbonCommonConstants.LONG_STRING_COLUMNS) +// If longStringColumn is not present in dm properties then we take long_string_columns from +// the parent table. +var longStringColumn = tableProperties.get(CarbonCommonConstants.LONG_STRING_COLUMNS) --- End diff -- So user do not have to config dmproperties for long string anymore and this line is no use after this PR? ---
[GitHub] carbondata pull request #3023: [CARBONDATA-3197][BloomDataMap] Include bloom...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3023#discussion_r245480571 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/events/MergeBloomIndexEventListener.scala --- @@ -24,59 +24,96 @@ import org.apache.spark.internal.Logging import org.apache.spark.sql.SparkSession import org.apache.carbondata.common.logging.LogServiceFactory -import org.apache.carbondata.core.datamap.DataMapStoreManager +import org.apache.carbondata.core.datamap.{DataMapStoreManager, TableDataMap} import org.apache.carbondata.core.metadata.schema.datamap.DataMapClassProvider import org.apache.carbondata.core.metadata.schema.table.CarbonTable import org.apache.carbondata.datamap.CarbonMergeBloomIndexFilesRDD import org.apache.carbondata.events._ +import org.apache.carbondata.processing.loading.events.LoadEvents.LoadTablePreStatusUpdateEvent +import org.apache.carbondata.processing.merger.CarbonDataMergerUtil class MergeBloomIndexEventListener extends OperationEventListener with Logging { val LOGGER = LogServiceFactory.getLogService(this.getClass.getCanonicalName) override def onEvent(event: Event, operationContext: OperationContext): Unit = { +val sparkSession = SparkSession.getActiveSession.get event match { + case loadPreStatusUpdateEvent: LoadTablePreStatusUpdateEvent => +LOGGER.info("LoadTablePreStatusUpdateEvent called for bloom index merging") +// For loading process, segment can not be accessed at this time +val loadModel = loadPreStatusUpdateEvent.getCarbonLoadModel +val carbonTable = loadModel.getCarbonDataLoadSchema.getCarbonTable +val segmentId = loadModel.getSegmentId + +// filter out bloom datamap, skip lazy datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( +DataMapClassProvider.BLOOMFILTER.getShortName)) + .filter(!_.getDataMapSchema.isLazy).toList + +mergeBloomIndex(sparkSession, carbonTable, bloomDatamaps, Seq(segmentId)) + + case compactPreStatusUpdateEvent: AlterTableCompactionPreStatusUpdateEvent => +LOGGER.info("AlterTableCompactionPreStatusUpdateEvent called for bloom index merging") +// For compact process, segment can not be accessed at this time +val carbonTable = compactPreStatusUpdateEvent.carbonTable --- End diff -- The codes are following similar steps. For L44~L46, it depends on what info we can get from the Event. For L49~L52, do you want to say we can extract a common method get datamap of a table with additional constrains? ---
[GitHub] carbondata pull request #3023: [CARBONDATA-3197][BloomDataMap] Merge bloom i...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3023#discussion_r245480089 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/events/MergeBloomIndexEventListener.scala --- @@ -24,59 +24,96 @@ import org.apache.spark.internal.Logging import org.apache.spark.sql.SparkSession import org.apache.carbondata.common.logging.LogServiceFactory -import org.apache.carbondata.core.datamap.DataMapStoreManager +import org.apache.carbondata.core.datamap.{DataMapStoreManager, TableDataMap} import org.apache.carbondata.core.metadata.schema.datamap.DataMapClassProvider import org.apache.carbondata.core.metadata.schema.table.CarbonTable import org.apache.carbondata.datamap.CarbonMergeBloomIndexFilesRDD import org.apache.carbondata.events._ +import org.apache.carbondata.processing.loading.events.LoadEvents.LoadTablePreStatusUpdateEvent +import org.apache.carbondata.processing.merger.CarbonDataMergerUtil class MergeBloomIndexEventListener extends OperationEventListener with Logging { val LOGGER = LogServiceFactory.getLogService(this.getClass.getCanonicalName) override def onEvent(event: Event, operationContext: OperationContext): Unit = { +val sparkSession = SparkSession.getActiveSession.get event match { + case loadPreStatusUpdateEvent: LoadTablePreStatusUpdateEvent => +LOGGER.info("LoadTablePreStatusUpdateEvent called for bloom index merging") +// For loading process, segment can not be accessed at this time +val loadModel = loadPreStatusUpdateEvent.getCarbonLoadModel +val carbonTable = loadModel.getCarbonDataLoadSchema.getCarbonTable +val segmentId = loadModel.getSegmentId + +// filter out bloom datamap, skip lazy datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( +DataMapClassProvider.BLOOMFILTER.getShortName)) + .filter(!_.getDataMapSchema.isLazy).toList + +mergeBloomIndex(sparkSession, carbonTable, bloomDatamaps, Seq(segmentId)) + + case compactPreStatusUpdateEvent: AlterTableCompactionPreStatusUpdateEvent => +LOGGER.info("AlterTableCompactionPreStatusUpdateEvent called for bloom index merging") +// For compact process, segment can not be accessed at this time +val carbonTable = compactPreStatusUpdateEvent.carbonTable +val mergedLoadName = compactPreStatusUpdateEvent.mergedLoadName +val segmentId = CarbonDataMergerUtil.getLoadNumberFromLoadName(mergedLoadName) + +// filter out bloom datamap, skip lazy datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( +DataMapClassProvider.BLOOMFILTER.getShortName)) + .filter(!_.getDataMapSchema.isLazy).toList + +mergeBloomIndex(sparkSession, carbonTable, bloomDatamaps, Seq(segmentId)) + case datamapPostEvent: BuildDataMapPostExecutionEvent => -LOGGER.info("Load post status event-listener called for merge bloom index") +LOGGER.info("BuildDataMapPostExecutionEvent called for bloom index merging") +// For rebuild datamap process, datamap is disabled when rebuilding +if (!datamapPostEvent.isFromRebuild || null == datamapPostEvent.dmName) { + // ignore datamapPostEvent from loading and compaction for bloom index merging + // they use LoadTablePreStatusUpdateEvent and AlterTableCompactionPreStatusUpdateEvent + LOGGER.info("Ignore BuildDataMapPostExecutionEvent from loading and compaction") + return +} + val carbonTableIdentifier = datamapPostEvent.identifier val carbonTable = DataMapStoreManager.getInstance().getCarbonTable(carbonTableIdentifier) -val tableDataMaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable) -val sparkSession = SparkSession.getActiveSession.get -// filter out bloom datamap -var bloomDatamaps = tableDataMaps.asScala.filter( - _.getDataMapSchema.getProviderName.equalsIgnoreCase( +// filter out current rebuilt bloom datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( DataMapClassProvider.BLOOMFILTER.getShortName)) - -if (datamapPostEvent.isFromRe
[GitHub] carbondata pull request #3023: [CARBONDATA-3197][BloomDataMap] Merge bloom i...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3023#discussion_r245480061 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/CarbonEnv.scala --- @@ -184,6 +184,9 @@ object CarbonEnv { .addListener(classOf[LoadTablePostExecutionEvent], new MergeIndexEventListener) .addListener(classOf[AlterTableCompactionPostEvent], new MergeIndexEventListener) .addListener(classOf[AlterTableMergeIndexEvent], new MergeIndexEventListener) + .addListener(classOf[LoadTablePreStatusUpdateEvent], new MergeBloomIndexEventListener) --- End diff -- Yes. This is what this PR wants to do. ---
[GitHub] carbondata issue #2973: [WIP][CARBONDATA-3144] CarbonData support spark-2.4....
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2973 I test this PR by running `CarbonSessionExample`. Please check following exception: ``` java.lang.ClassCastException: org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener cannot be cast to org.apache.spark.sql.hive.HiveExternalCatalog at org.apache.spark.sql.hive.CarbonSessionStateBuilder.org$apache$spark$sql$hive$CarbonSessionStateBuilder$$externalCatalog(CarbonSessionState.scala:232) at org.apache.spark.sql.hive.CarbonSessionStateBuilder.resourceLoader$lzycompute(CarbonSessionState.scala:238) at org.apache.spark.sql.hive.CarbonSessionStateBuilder.resourceLoader(CarbonSessionState.scala:237) at org.apache.spark.sql.hive.CarbonSessionStateBuilder.catalog$lzycompute(CarbonSessionState.scala:226) at org.apache.spark.sql.hive.CarbonSessionStateBuilder.catalog(CarbonSessionState.scala:217) at org.apache.spark.sql.hive.CarbonSessionStateBuilder.catalog(CarbonSessionState.scala:196) at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$1.apply(BaseSessionStateBuilder.scala:291) at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$1.apply(BaseSessionStateBuilder.scala:291) at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:77) at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:77) at org.apache.spark.sql.CarbonEnv$.getInstance(CarbonEnv.scala:134) at org.apache.spark.sql.CarbonSession$.updateSessionInfoToCurrentThread(CarbonSession.scala:325) at org.apache.spark.sql.parser.CarbonSparkSqlParser.parsePlan(CarbonSparkSqlParser.scala:48) at org.apache.spark.sql.CarbonSession.withProfiler(CarbonSession.scala:125) at org.apache.spark.sql.CarbonSession.sql(CarbonSession.scala:88) ``` ---
[GitHub] carbondata pull request #2996: [WIP] Fix Rename-Fail & Datamap-creation-Fail
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2996#discussion_r244906592 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/schema/CarbonAlterTableRenameCommand.scala --- @@ -165,15 +167,22 @@ private[sql] case class CarbonAlterTableRenameCommand( case e: ConcurrentOperationException => throw e case e: Exception => +if (hiveRenameSuccess) { + sparkSession.sessionState.catalog.asInstanceOf[CarbonSessionCatalog].alterTableRename( --- End diff -- This is a revert operation when L139-142(hive update) is success but L145-146(carbon meta store update) is failed ---
[GitHub] carbondata pull request #2996: [WIP] Fix Rename-Fail & Datamap-creation-Fail
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2996#discussion_r244906489 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/schema/CarbonAlterTableRenameCommand.scala --- @@ -165,15 +167,22 @@ private[sql] case class CarbonAlterTableRenameCommand( case e: ConcurrentOperationException => throw e case e: Exception => +if (hiveRenameSuccess) { + sparkSession.sessionState.catalog.asInstanceOf[CarbonSessionCatalog].alterTableRename( +TableIdentifier(newTableName, Some(oldDatabaseName)), +TableIdentifier(oldTableName, Some(oldDatabaseName)), --- End diff -- Can we reuse `oldTableIdentifier` and `newTableIdentifier` in `alterTableRenameModel` ? ---
[GitHub] carbondata pull request #3017: [HOTFIX] remove this useless assignment
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3017#discussion_r244096878 --- Diff: core/src/main/java/org/apache/carbondata/core/locks/LocalFileLock.java --- @@ -112,7 +112,7 @@ public LocalFileLock(String lockFileLocation, String lockFile) { status = true; } } catch (IOException e) { - status = false; + // status = false; --- End diff -- Please review PR #2878 at your convenience. Changes follows rules in the table in description ---
[GitHub] carbondata pull request #3023: [CARBONDATA-3197][BloomDataMap] Merge bloom i...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3023#discussion_r243928456 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/events/MergeBloomIndexEventListener.scala --- @@ -24,59 +24,88 @@ import org.apache.spark.internal.Logging import org.apache.spark.sql.SparkSession import org.apache.carbondata.common.logging.LogServiceFactory -import org.apache.carbondata.core.datamap.DataMapStoreManager +import org.apache.carbondata.core.datamap.{DataMapStoreManager, TableDataMap} import org.apache.carbondata.core.metadata.schema.datamap.DataMapClassProvider import org.apache.carbondata.core.metadata.schema.table.CarbonTable import org.apache.carbondata.datamap.CarbonMergeBloomIndexFilesRDD import org.apache.carbondata.events._ +import org.apache.carbondata.processing.loading.events.LoadEvents.LoadTablePreStatusUpdateEvent +import org.apache.carbondata.processing.merger.CarbonDataMergerUtil class MergeBloomIndexEventListener extends OperationEventListener with Logging { val LOGGER = LogServiceFactory.getLogService(this.getClass.getCanonicalName) override def onEvent(event: Event, operationContext: OperationContext): Unit = { +val sparkSession = SparkSession.getActiveSession.get event match { + case loadPreStatusUpdateEvent: LoadTablePreStatusUpdateEvent => +// For loading process, segment can not be accessed at this time +val loadModel = loadPreStatusUpdateEvent.getCarbonLoadModel +val carbonTable = loadModel.getCarbonDataLoadSchema.getCarbonTable +val segmentId = loadModel.getSegmentId + +// filter out bloom datamap, skip lazy datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( +DataMapClassProvider.BLOOMFILTER.getShortName)) + .filter(!_.getDataMapSchema.isLazy).toList + +mergeBloomIndex(sparkSession, carbonTable, bloomDatamaps, Seq(segmentId)) + + case compactPreStatusUpdateEvent: AlterTableCompactionPreStatusUpdateEvent => +// For compact process, segment can not be accessed at this time +val carbonTable = compactPreStatusUpdateEvent.carbonTable +val mergedLoadName = compactPreStatusUpdateEvent.mergedLoadName +val segmentId = CarbonDataMergerUtil.getLoadNumberFromLoadName(mergedLoadName) + +// filter out bloom datamap, skip lazy datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( +DataMapClassProvider.BLOOMFILTER.getShortName)) + .filter(!_.getDataMapSchema.isLazy).toList + +mergeBloomIndex(sparkSession, carbonTable, bloomDatamaps, Seq(segmentId)) + case datamapPostEvent: BuildDataMapPostExecutionEvent => -LOGGER.info("Load post status event-listener called for merge bloom index") +// For rebuild datamap process, datamap is disabled when rebuilding +if (!datamapPostEvent.isFromRebuild || null == datamapPostEvent.dmName) { + // ignore datamapPostEvent from loading and compaction + return +} + val carbonTableIdentifier = datamapPostEvent.identifier val carbonTable = DataMapStoreManager.getInstance().getCarbonTable(carbonTableIdentifier) -val tableDataMaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable) -val sparkSession = SparkSession.getActiveSession.get -// filter out bloom datamap -var bloomDatamaps = tableDataMaps.asScala.filter( - _.getDataMapSchema.getProviderName.equalsIgnoreCase( +// filter out current rebuilt bloom datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( DataMapClassProvider.BLOOMFILTER.getShortName)) - -if (datamapPostEvent.isFromRebuild) { - if (null != datamapPostEvent.dmName) { -// for rebuild process -bloomDatamaps = bloomDatamaps.filter( - _.getDataMapSchema.getDataMapName.equalsIgnoreCase(datamapPostEvent.dmName)) - } -} else { - // for load process, skip lazy datamap - bloomDatamaps = bloomDatamaps.filter(!_.getDataMapSchema.isLazy) -} + .filter(_.getDataMapSchema.getDataMapName.equalsIgnoreCase(datamapPostEvent.dmName)) + .toList
[GitHub] carbondata pull request #3023: [CARBONDATA-3197][BloomDataMap] Merge bloom i...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3023#discussion_r243928449 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/events/MergeBloomIndexEventListener.scala --- @@ -24,59 +24,88 @@ import org.apache.spark.internal.Logging import org.apache.spark.sql.SparkSession import org.apache.carbondata.common.logging.LogServiceFactory -import org.apache.carbondata.core.datamap.DataMapStoreManager +import org.apache.carbondata.core.datamap.{DataMapStoreManager, TableDataMap} import org.apache.carbondata.core.metadata.schema.datamap.DataMapClassProvider import org.apache.carbondata.core.metadata.schema.table.CarbonTable import org.apache.carbondata.datamap.CarbonMergeBloomIndexFilesRDD import org.apache.carbondata.events._ +import org.apache.carbondata.processing.loading.events.LoadEvents.LoadTablePreStatusUpdateEvent +import org.apache.carbondata.processing.merger.CarbonDataMergerUtil class MergeBloomIndexEventListener extends OperationEventListener with Logging { val LOGGER = LogServiceFactory.getLogService(this.getClass.getCanonicalName) override def onEvent(event: Event, operationContext: OperationContext): Unit = { +val sparkSession = SparkSession.getActiveSession.get event match { + case loadPreStatusUpdateEvent: LoadTablePreStatusUpdateEvent => +// For loading process, segment can not be accessed at this time +val loadModel = loadPreStatusUpdateEvent.getCarbonLoadModel +val carbonTable = loadModel.getCarbonDataLoadSchema.getCarbonTable +val segmentId = loadModel.getSegmentId + +// filter out bloom datamap, skip lazy datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( +DataMapClassProvider.BLOOMFILTER.getShortName)) + .filter(!_.getDataMapSchema.isLazy).toList + +mergeBloomIndex(sparkSession, carbonTable, bloomDatamaps, Seq(segmentId)) + + case compactPreStatusUpdateEvent: AlterTableCompactionPreStatusUpdateEvent => +// For compact process, segment can not be accessed at this time +val carbonTable = compactPreStatusUpdateEvent.carbonTable +val mergedLoadName = compactPreStatusUpdateEvent.mergedLoadName +val segmentId = CarbonDataMergerUtil.getLoadNumberFromLoadName(mergedLoadName) + +// filter out bloom datamap, skip lazy datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( +DataMapClassProvider.BLOOMFILTER.getShortName)) + .filter(!_.getDataMapSchema.isLazy).toList + +mergeBloomIndex(sparkSession, carbonTable, bloomDatamaps, Seq(segmentId)) + case datamapPostEvent: BuildDataMapPostExecutionEvent => -LOGGER.info("Load post status event-listener called for merge bloom index") +// For rebuild datamap process, datamap is disabled when rebuilding +if (!datamapPostEvent.isFromRebuild || null == datamapPostEvent.dmName) { + // ignore datamapPostEvent from loading and compaction --- End diff -- Added ---
[GitHub] carbondata pull request #3023: [CARBONDATA-3197][BloomDataMap] Merge bloom i...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3023#discussion_r243905862 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/events/MergeBloomIndexEventListener.scala --- @@ -24,59 +24,88 @@ import org.apache.spark.internal.Logging import org.apache.spark.sql.SparkSession import org.apache.carbondata.common.logging.LogServiceFactory -import org.apache.carbondata.core.datamap.DataMapStoreManager +import org.apache.carbondata.core.datamap.{DataMapStoreManager, TableDataMap} import org.apache.carbondata.core.metadata.schema.datamap.DataMapClassProvider import org.apache.carbondata.core.metadata.schema.table.CarbonTable import org.apache.carbondata.datamap.CarbonMergeBloomIndexFilesRDD import org.apache.carbondata.events._ +import org.apache.carbondata.processing.loading.events.LoadEvents.LoadTablePreStatusUpdateEvent +import org.apache.carbondata.processing.merger.CarbonDataMergerUtil class MergeBloomIndexEventListener extends OperationEventListener with Logging { val LOGGER = LogServiceFactory.getLogService(this.getClass.getCanonicalName) override def onEvent(event: Event, operationContext: OperationContext): Unit = { +val sparkSession = SparkSession.getActiveSession.get event match { + case loadPreStatusUpdateEvent: LoadTablePreStatusUpdateEvent => +// For loading process, segment can not be accessed at this time +val loadModel = loadPreStatusUpdateEvent.getCarbonLoadModel +val carbonTable = loadModel.getCarbonDataLoadSchema.getCarbonTable +val segmentId = loadModel.getSegmentId + +// filter out bloom datamap, skip lazy datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( +DataMapClassProvider.BLOOMFILTER.getShortName)) + .filter(!_.getDataMapSchema.isLazy).toList + +mergeBloomIndex(sparkSession, carbonTable, bloomDatamaps, Seq(segmentId)) + + case compactPreStatusUpdateEvent: AlterTableCompactionPreStatusUpdateEvent => +// For compact process, segment can not be accessed at this time +val carbonTable = compactPreStatusUpdateEvent.carbonTable +val mergedLoadName = compactPreStatusUpdateEvent.mergedLoadName +val segmentId = CarbonDataMergerUtil.getLoadNumberFromLoadName(mergedLoadName) + +// filter out bloom datamap, skip lazy datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( +DataMapClassProvider.BLOOMFILTER.getShortName)) + .filter(!_.getDataMapSchema.isLazy).toList + +mergeBloomIndex(sparkSession, carbonTable, bloomDatamaps, Seq(segmentId)) + case datamapPostEvent: BuildDataMapPostExecutionEvent => -LOGGER.info("Load post status event-listener called for merge bloom index") +// For rebuild datamap process, datamap is disabled when rebuilding +if (!datamapPostEvent.isFromRebuild || null == datamapPostEvent.dmName) { + // ignore datamapPostEvent from loading and compaction --- End diff -- what information do you expect? ignore datamapPostEvent event from loading and compaction because the bloom index files is already merged by above event. ---
[GitHub] carbondata pull request #3023: [CARBONDATA-3197][BloomDataMap] Merge bloom i...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3023#discussion_r243905231 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/events/MergeBloomIndexEventListener.scala --- @@ -24,59 +24,88 @@ import org.apache.spark.internal.Logging import org.apache.spark.sql.SparkSession import org.apache.carbondata.common.logging.LogServiceFactory -import org.apache.carbondata.core.datamap.DataMapStoreManager +import org.apache.carbondata.core.datamap.{DataMapStoreManager, TableDataMap} import org.apache.carbondata.core.metadata.schema.datamap.DataMapClassProvider import org.apache.carbondata.core.metadata.schema.table.CarbonTable import org.apache.carbondata.datamap.CarbonMergeBloomIndexFilesRDD import org.apache.carbondata.events._ +import org.apache.carbondata.processing.loading.events.LoadEvents.LoadTablePreStatusUpdateEvent +import org.apache.carbondata.processing.merger.CarbonDataMergerUtil class MergeBloomIndexEventListener extends OperationEventListener with Logging { val LOGGER = LogServiceFactory.getLogService(this.getClass.getCanonicalName) override def onEvent(event: Event, operationContext: OperationContext): Unit = { +val sparkSession = SparkSession.getActiveSession.get event match { + case loadPreStatusUpdateEvent: LoadTablePreStatusUpdateEvent => +// For loading process, segment can not be accessed at this time +val loadModel = loadPreStatusUpdateEvent.getCarbonLoadModel +val carbonTable = loadModel.getCarbonDataLoadSchema.getCarbonTable +val segmentId = loadModel.getSegmentId + +// filter out bloom datamap, skip lazy datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( +DataMapClassProvider.BLOOMFILTER.getShortName)) + .filter(!_.getDataMapSchema.isLazy).toList + +mergeBloomIndex(sparkSession, carbonTable, bloomDatamaps, Seq(segmentId)) + + case compactPreStatusUpdateEvent: AlterTableCompactionPreStatusUpdateEvent => +// For compact process, segment can not be accessed at this time +val carbonTable = compactPreStatusUpdateEvent.carbonTable +val mergedLoadName = compactPreStatusUpdateEvent.mergedLoadName +val segmentId = CarbonDataMergerUtil.getLoadNumberFromLoadName(mergedLoadName) + +// filter out bloom datamap, skip lazy datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( +DataMapClassProvider.BLOOMFILTER.getShortName)) + .filter(!_.getDataMapSchema.isLazy).toList + +mergeBloomIndex(sparkSession, carbonTable, bloomDatamaps, Seq(segmentId)) + case datamapPostEvent: BuildDataMapPostExecutionEvent => -LOGGER.info("Load post status event-listener called for merge bloom index") +// For rebuild datamap process, datamap is disabled when rebuilding +if (!datamapPostEvent.isFromRebuild || null == datamapPostEvent.dmName) { + // ignore datamapPostEvent from loading and compaction + return +} + val carbonTableIdentifier = datamapPostEvent.identifier val carbonTable = DataMapStoreManager.getInstance().getCarbonTable(carbonTableIdentifier) -val tableDataMaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable) -val sparkSession = SparkSession.getActiveSession.get -// filter out bloom datamap -var bloomDatamaps = tableDataMaps.asScala.filter( - _.getDataMapSchema.getProviderName.equalsIgnoreCase( +// filter out current rebuilt bloom datamap +val bloomDatamaps = DataMapStoreManager.getInstance().getAllDataMap(carbonTable).asScala + .filter(_.getDataMapSchema.getProviderName.equalsIgnoreCase( DataMapClassProvider.BLOOMFILTER.getShortName)) - -if (datamapPostEvent.isFromRebuild) { - if (null != datamapPostEvent.dmName) { -// for rebuild process -bloomDatamaps = bloomDatamaps.filter( - _.getDataMapSchema.getDataMapName.equalsIgnoreCase(datamapPostEvent.dmName)) - } -} else { - // for load process, skip lazy datamap - bloomDatamaps = bloomDatamaps.filter(!_.getDataMapSchema.isLazy) --- End diff -- original implementation deal all three scenes(load/compact/rebuild), so use if-else to distinguish. Now, thre
[GitHub] carbondata pull request #3023: [CARBONDATA-3197][BloomDataMap] Merge bloom i...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/3023 [CARBONDATA-3197][BloomDataMap] Merge bloom index before accessible **Problem** Currently carbon allows to query when bloom index files are merging, but this will cause problems when the index files state change from multiple shards to merged shard. Timeline to explain problem: - load data for table with bloom datamap, data is loaded, bloom index files are generated along loading, bloom index file merging is under action - query fired - `BloomCoarseGrainDataMapFactory.getAllShardPaths` found multiple shards, and bloom index file merging in progress, so `BloomCoarseGrainDataMap` with detailed shard name created - bloom index file merging done, folders with detailed shard name are deleted - Exception will occur when `BloomCoarseGrainDataMap` wants to read bloom index file from folders with detailed shard name to prune **Analyse** Root cause is that we allow query on datamap which is not in stable state. one solution is to disable datamap when merging bloom index file, but this will affect all the segments many times. Another solution is to take the bloom index files merging as part of loading, such that query can not access unstable bloom index files until it is ready **Solution** Change the events to watch for `MergeBloomIndexEventListener`, do the merging staff before segment status is updated for access Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata mergeBloomIndexEvent Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/3023.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3023 commit 881a510de38e30cfa4ae8a84be1f003c6254d9ab Author: Manhua Date: 2018-12-25T08:21:40Z merge bloom index before accessible ---
[GitHub] carbondata pull request #2161: [CARBONDATA-2218] AlluxioCarbonFile while try...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2161#discussion_r243819959 --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java --- @@ -960,7 +960,7 @@ private CarbonCommonConstants() { * If set to GLOBAL_SORT, the sorting scope is bigger and one index tree per task will be * created, thus loading is slower but query is faster. */ - public static final String LOAD_SORT_SCOPE_DEFAULT = "LOCAL_SORT"; + public static final String LOAD_SORT_SCOPE_DEFAULT = "NO_SORT"; --- End diff -- OK ---
[GitHub] carbondata pull request #2161: [CARBONDATA-2218] AlluxioCarbonFile while try...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2161#discussion_r243732085 --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java --- @@ -960,7 +960,7 @@ private CarbonCommonConstants() { * If set to GLOBAL_SORT, the sorting scope is bigger and one index tree per task will be * created, thus loading is slower but query is faster. */ - public static final String LOAD_SORT_SCOPE_DEFAULT = "LOCAL_SORT"; + public static final String LOAD_SORT_SCOPE_DEFAULT = "NO_SORT"; --- End diff -- why should change this default value? Some more advices: 1. blank changes in this file is unnecessary. 2. too many commits in this PR, better to merge them into if history does not matter. refer to command `git rebase` to merge the commits into one Hope this PR can be merged soon ---
[GitHub] carbondata issue #2161: [CARBONDATA-2218] AlluxioCarbonFile while trying to ...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2161 > Since user normally uses Alluxio as a read cache, I think we can firstly verify carbon on alluxio for the query scenario. As I am still not very sure what is the correct way to implement rename for Alluxio, in the meantime, we can merge this PR first. So please rebase it. @chandrasaripaka @jackylk My first try wants to use alluxio for query only as you said. But hive stores table location and carbon blocks `alter location` command, I have to create table after carbon runs on alluxio. Please mind this problem. ---
[GitHub] carbondata pull request #2161: [CARBONDATA-2218] AlluxioCarbonFile while try...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2161#discussion_r243254949 --- Diff: core/src/main/java/org/apache/carbondata/core/locks/AlluxioFileLock.java --- @@ -0,0 +1,112 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.core.locks; + +import java.io.DataOutputStream; +import java.io.IOException; + +import org.apache.carbondata.common.logging.LogServiceFactory; +import org.apache.carbondata.core.datastore.filesystem.AlluxioCarbonFile; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.util.path.CarbonTablePath; + +import org.apache.log4j.Logger; + +/** + * This class is used to handle the S3 File locking. + * This is acheived using the concept of acquiring the data out stream using Append option. + */ +public class AlluxioFileLock extends AbstractCarbonLock { + + /** + * LOGGER + */ + private static final Logger LOGGER = + LogServiceFactory.getLogService(AlluxioCarbonFile.class.getName()); + /** + * lockFilePath is the location of the lock file. + */ + private String lockFilePath; + + /** + * lockFileDir is the directory of the lock file. + */ + private String lockFileDir; + + private DataOutputStream dataOutputStream; + + /** + * @param tableIdentifier + * @param lockFile + */ + public AlluxioFileLock(AbsoluteTableIdentifier tableIdentifier, String lockFile) { +this(tableIdentifier.getTablePath(), lockFile); + } + + /** + * @param lockFileLocation + * @param lockFile + */ + public AlluxioFileLock(String lockFileLocation, String lockFile) { +this.lockFileDir = CarbonTablePath.getLockFilesDirPath(lockFileLocation); +this.lockFilePath = CarbonTablePath.getLockFilePath(lockFileLocation, lockFile); +LOGGER.info("Alluxio lock path:" + this.lockFilePath); +initRetry(); + } + + /* (non-Javadoc) + * @see org.apache.carbondata.core.locks.ICarbonLock#unlock() + */ + @Override + public boolean unlock() { +boolean status = false; +if (null != dataOutputStream) { + try { +dataOutputStream.close(); +status = true; + } catch (IOException e) { +status = false; + } +} +return status; + } + + /* (non-Javadoc) + * @see org.apache.carbondata.core.locks.ICarbonLock#lock() + */ + @Override + public boolean lock() { +try { + if (!FileFactory.isFileExist(lockFileDir)) { +FileFactory.mkdirs(lockFileDir, FileFactory.getFileType(lockFileDir)); + } + if (!FileFactory.isFileExist(lockFilePath)) { +FileFactory.createNewLockFile(lockFilePath, FileFactory.getFileType(lockFilePath)); + } + dataOutputStream = + FileFactory.getDataOutputStreamUsingAppend(lockFilePath, + FileFactory.getFileType(lockFilePath)); --- End diff -- @xubo245 "tablestatus.lock already exists" is caused by this ---
[GitHub] carbondata pull request #2161: [CARBONDATA-2218] AlluxioCarbonFile while try...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2161#discussion_r243168473 --- Diff: core/src/main/java/org/apache/carbondata/core/locks/AlluxioFileLock.java --- @@ -0,0 +1,112 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.core.locks; + +import java.io.DataOutputStream; +import java.io.IOException; + +import org.apache.carbondata.common.logging.LogServiceFactory; +import org.apache.carbondata.core.datastore.filesystem.AlluxioCarbonFile; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.util.path.CarbonTablePath; + +import org.apache.log4j.Logger; + +/** + * This class is used to handle the S3 File locking. + * This is acheived using the concept of acquiring the data out stream using Append option. + */ +public class AlluxioFileLock extends AbstractCarbonLock { + + /** + * LOGGER + */ + private static final Logger LOGGER = + LogServiceFactory.getLogService(AlluxioCarbonFile.class.getName()); + /** + * lockFilePath is the location of the lock file. + */ + private String lockFilePath; + + /** + * lockFileDir is the directory of the lock file. + */ + private String lockFileDir; + + private DataOutputStream dataOutputStream; + + /** + * @param tableIdentifier + * @param lockFile + */ + public AlluxioFileLock(AbsoluteTableIdentifier tableIdentifier, String lockFile) { +this(tableIdentifier.getTablePath(), lockFile); + } + + /** + * @param lockFileLocation + * @param lockFile + */ + public AlluxioFileLock(String lockFileLocation, String lockFile) { +this.lockFileDir = CarbonTablePath.getLockFilesDirPath(lockFileLocation); +this.lockFilePath = CarbonTablePath.getLockFilePath(lockFileLocation, lockFile); +LOGGER.info("Alluxio lock path:" + this.lockFilePath); +initRetry(); + } + + /* (non-Javadoc) + * @see org.apache.carbondata.core.locks.ICarbonLock#unlock() + */ + @Override + public boolean unlock() { +boolean status = false; +if (null != dataOutputStream) { + try { +dataOutputStream.close(); +status = true; + } catch (IOException e) { +status = false; + } +} +return status; + } + + /* (non-Javadoc) + * @see org.apache.carbondata.core.locks.ICarbonLock#lock() + */ + @Override + public boolean lock() { +try { + if (!FileFactory.isFileExist(lockFileDir)) { +FileFactory.mkdirs(lockFileDir, FileFactory.getFileType(lockFileDir)); + } + if (!FileFactory.isFileExist(lockFilePath)) { +FileFactory.createNewLockFile(lockFilePath, FileFactory.getFileType(lockFilePath)); + } + dataOutputStream = + FileFactory.getDataOutputStreamUsingAppend(lockFilePath, + FileFactory.getFileType(lockFilePath)); --- End diff -- Can you insert data into table? I found that the implementation of `append` function in Alluxio required to create a new file. It will fail if file already exists. For line 100, carbon makes sure the file exists, and finally we always fail when locking. The code I mention in alluxio is this: https://github.com/Alluxio/alluxio/blob/21a3a87747010f087d1937f48bba3caa883885f8/core/client/hdfs/src/main/java/alluxio/hadoop/AbstractFileSystem.java#L127-L138 ---
[GitHub] carbondata pull request #2161: [CARBONDATA-2218] AlluxioCarbonFile while try...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2161#discussion_r243167033 --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AlluxioCarbonFile.java --- @@ -94,14 +93,9 @@ public CarbonFile getParentFile() { public boolean renameForce(String changeToName) { FileSystem fs; try { - fs = fileStatus.getPath().getFileSystem(FileFactory.getConfiguration()); - if (fs instanceof DistributedFileSystem) { -((DistributedFileSystem) fs).rename(fileStatus.getPath(), new Path(changeToName), -org.apache.hadoop.fs.Options.Rename.OVERWRITE); -return true; - } else { -return false; - } + fs = fileStatus.getPath().getFileSystem(hadoopConf); + fs.delete(new Path(changeToName), true); + return fs.rename(fileStatus.getPath(), new Path(changeToName)); --- End diff -- This problem is also for `S3CarbonFile` , `ViewFSCarbonFile`, `LocalCarbonFile` ---
[GitHub] carbondata pull request #3005: [CARBONDATA-3185] Fix alluxio file rename
Github user kevinjmh closed the pull request at: https://github.com/apache/carbondata/pull/3005 ---
[GitHub] carbondata issue #3005: [CARBONDATA-3185] Fix alluxio file rename
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/3005 > There is a PR to fix the same issue: #2161 I see. It fixed LockFile problem I faced now too. Hope it got fix soon. And I am closing this PR now ---
[GitHub] carbondata pull request #3005: [CARBONDATA-3185] Fix alluxio file rename
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3005#discussion_r243150578 --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/filesystem/AlluxioCarbonFile.java --- @@ -95,15 +94,13 @@ public boolean renameForce(String changeToName) { FileSystem fs; try { fs = fileStatus.getPath().getFileSystem(FileFactory.getConfiguration()); - if (fs instanceof DistributedFileSystem) { -((DistributedFileSystem) fs).rename(fileStatus.getPath(), new Path(changeToName), -org.apache.hadoop.fs.Options.Rename.OVERWRITE); -return true; - } else { -return false; + Path targetPath = new Path(changeToName); + if (fs.exists(targetPath)) { +fs.delete(targetPath, true); --- End diff -- This should be managed by developer. If developer wants a simple rename, he should call `renameTo` instead of `renameForce`. Codes in this PR implements the OVERWRITE(delete before rename), as `S3CarbonFile` , `ViewFSCarbonFile`, `LocalCarbonFile` do ---
[GitHub] carbondata pull request #3005: [CARBONDATA-3185] Fix alluxio file rename
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/3005 [CARBONDATA-3185] Fix alluxio file rename **Problem** Exception thrown when create table on alluxio because rename schema file failed **Analyse** Re-run the command after adding some logs, I found that the file system object it uses is `alluxio.hadoop.FileSystem`, and it is not an instanceof `DistributedFileSystem`, such that the renameForce method return false. **Solution** By checking hierarchy of class [`alluxio.hadoop.FileSystem`](https://github.com/Alluxio/alluxio/blob/branch-1.8/core/client/hdfs/src/main/java/alluxio/hadoop/FileSystem.java), it is ok to use method of hadoop FileSystem directly. We can fix it by remove instant type check and rewrite the overwrite file code by hand. ``` FileSystem (org.apache.hadoop.fs) |__ AbstractFileSystem (alluxio.hadoop) |__FileSystem (alluxio.hadoop) ``` *Log* ``` 18/12/20 09:31:50 ERROR thriftserver.SparkExecuteStatementOperation: Error executing query, currentState RUNNING, org.apache.carbondata.spark.exception.ProcessMetaDataException: operation failed for default.alluxio: Create table'alluxio' in database 'default' failed, temporary file renaming failed, src=alluxio://localhost:19998/user/hive/warehouse/carbon.store/default/alluxio/Metadata/schema.write, dest=alluxio://localhost:19998/user/hive/warehouse/carbon.store/default/alluxio/Metadata/schema at org.apache.spark.sql.execution.command.MetadataProcessOpeation$class.throwMetadataException(package.scala:55) at org.apache.spark.sql.execution.command.MetadataCommand.throwMetadataException(package.scala:120) at org.apache.spark.sql.execution.command.table.CarbonCreateTableCommand.processMetadata(CarbonCreateTableCommand.scala:179) at org.apache.spark.sql.execution.command.MetadataCommand$$anonfun$run$1.apply(package.scala:122) at org.apache.spark.sql.execution.command.MetadataCommand$$anonfun$run$1.apply(package.scala:122) at org.apache.spark.sql.execution.command.Auditable$class.runWithAudit(package.scala:104) at org.apache.spark.sql.execution.command.MetadataCommand.runWithAudit(package.scala:120) at org.apache.spark.sql.execution.command.MetadataCommand.run(package.scala:122) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.CarbonSession$$anonfun$sql$1.apply(CarbonSession.scala:91) at org.apache.spark.sql.CarbonSession$$anonfun$sql$1.apply(CarbonSession.scala:90) at org.apache.spark.sql.CarbonSession.withProfiler(CarbonSession.scala:136) at org.apache.spark.sql.CarbonSession.sql(CarbonSession.scala:88) ``` Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata alluxio_fs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/3005.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3005 commit 78f5d72a5b169b38d7d6bd0f7fbd9ebf8bfa6d55 Author: Manhua Date: 2018-12-20T02:34:48Z fix alluxio rename fail ---
[GitHub] carbondata pull request #3000: [CARBONDATA-3181][BloomDataMap] Fix access fi...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/3000#discussion_r243130074 --- Diff: datamap/bloom/src/main/java/org/apache/hadoop/util/bloom/CarbonBloomFilter.java --- @@ -49,27 +49,23 @@ public CarbonBloomFilter(int vectorSize, int nbHash, int hashType, boolean compr @Override public boolean membershipTest(Key key) { -if (key == null) { - throw new NullPointerException("key cannot be null"); -} - -int[] h = hash.hash(key); -hash.clear(); if (compress) { // If it is compressed check in roaring bitmap + if (key == null) { --- End diff -- The reason I code it like this is for consistent procedure. And I don't think it can reduce call for GENERAL cases. Null is only one special case, and all the others are not-null cases. If change it as you advised, all non-null cases will have to do double null check when bloom filter is not compressed. ---
[GitHub] carbondata pull request #3000: [CARBONDATA-3181][BloomDataMap] Fix access fi...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/3000 [CARBONDATA-3181][BloomDataMap] Fix access field error for BitSet in bloom filter **Problem** java.lang.IllegalAccessError is thrown when query on bloom filter without compress on CarbonThriftServer. Detailed log ss pasted at bottom. **Analyse** similar problem was occur when get/set BitSet in CarbonBloomFilter, it uses reflection to solve. We can do it like this. Since we have set the BitSet already, another easier way is to call super class method to avoid accessing it from CarbonBloomFilter **Solution** if bloom filter is not compressed, call super method to test membership --- *Detail log* ``` 18/12/19 11:16:07 ERROR thriftserver.SparkExecuteStatementOperation: Error executing query, currentState RUNNING, java.lang.IllegalAccessError: tried to access field org.apache.hadoop.util.bloom.BloomFilter.bits from class org.apache.hadoop.util.bloom.CarbonBloomFilter at org.apache.hadoop.util.bloom.CarbonBloomFilter.membershipTest(CarbonBloomFilter.java:70) at org.apache.carbondata.datamap.bloom.BloomCoarseGrainDataMap.prune(BloomCoarseGrainDataMap.java:202) at org.apache.carbondata.core.datamap.TableDataMap.pruneWithFilter(TableDataMap.java:185) at org.apache.carbondata.core.datamap.TableDataMap.prune(TableDataMap.java:160) at org.apache.carbondata.core.datamap.dev.expr.DataMapExprWrapperImpl.prune(DataMapExprWrapperImpl.java:53) at org.apache.carbondata.hadoop.api.CarbonInputFormat.getPrunedBlocklets(CarbonInputFormat.java:517) at org.apache.carbondata.hadoop.api.CarbonInputFormat.getDataBlocksOfSegment(CarbonInputFormat.java:412) at org.apache.carbondata.hadoop.api.CarbonTableInputFormat.getSplits(CarbonTableInputFormat.java:529) at org.apache.carbondata.hadoop.api.CarbonTableInputFormat.getSplits(CarbonTableInputFormat.java:220) at org.apache.carbondata.spark.rdd.CarbonScanRDD.internalGetPartitions(CarbonScanRDD.scala:127) at org.apache.carbondata.spark.rdd.CarbonRDD.getPartitions(CarbonRDD.scala:66) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) ``` Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata bloom_bits_access Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/3000.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3000 commit b7df6abe73a633c0df79c5b426df9416d7a4c1bc Author: Manhua Date: 2018-12-19T03:30:46Z fix access error for bits in bloom filter ---
[GitHub] carbondata pull request #2713: [WIP][CARBONDATA-2931][BloomDataMap] Optimize...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2713#discussion_r242507682 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java --- @@ -218,56 +218,46 @@ public DataMapBuilder createBuilder(Segment segment, String shardName, this.bloomFilterSize, this.bloomFilterFpp, bloomCompress); } - /** - * returns all shard directories of bloom index files for query - * if bloom index files are merged we should get only one shard path - */ - private Set getAllShardPaths(String tablePath, String segmentId) { -String dataMapStorePath = CarbonTablePath.getDataMapStorePath( -tablePath, segmentId, dataMapName); -CarbonFile[] carbonFiles = FileFactory.getCarbonFile(dataMapStorePath).listFiles(); -Set shardPaths = new HashSet<>(); + + private boolean isAllShardsMerged(String dmSegmentPath) { +boolean mergeShardExist = false; boolean mergeShardInprogress = false; -CarbonFile mergeShardFile = null; +CarbonFile[] carbonFiles = FileFactory.getCarbonFile(dmSegmentPath).listFiles(); for (CarbonFile carbonFile : carbonFiles) { - if (carbonFile.getName().equals(BloomIndexFileStore.MERGE_BLOOM_INDEX_SHARD_NAME)) { -mergeShardFile = carbonFile; - } else if (carbonFile.getName().equals(BloomIndexFileStore.MERGE_INPROGRESS_FILE)) { + String fileName = carbonFile.getName(); + if (fileName.equals(BloomIndexFileStore.MERGE_BLOOM_INDEX_SHARD_NAME)) { +mergeShardExist = true; + } else if (fileName.equals(BloomIndexFileStore.MERGE_INPROGRESS_FILE)) { mergeShardInprogress = true; --- End diff -- One more idea is that we can delay the deletion of original shards in query, referring to segment management. That is when mergeShard exists and no merge inprogress file in a query, we can assure to delete original shards safely. ---
[GitHub] carbondata pull request #2917: [WIP]Show load/insert/update/delete row numbe...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2917#discussion_r242500505 --- Diff: integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/iud/TestShowIUDRowCount.scala --- @@ -0,0 +1,60 @@ +package org.apache.carbondata.spark.testsuite.iud + +import org.apache.spark.sql.Row +import org.apache.spark.sql.test.util.QueryTest +import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach} + +class TestShowIUDRowCount extends QueryTest with BeforeAndAfterEach with BeforeAndAfterAll { + + override protected def beforeAll(): Unit = { +dropTable("iud_rows") + } + + override protected def beforeEach(): Unit = { +dropTable("iud_rows") + } + + override protected def afterEach(): Unit = { +dropTable("iud_rows") + } + + test("Test show load row count") { +sql("""create table iud_rows (c1 string,c2 int,c3 string,c5 string) +|STORED BY 'org.apache.carbondata.format'""".stripMargin) +checkAnswer( + sql(s"""LOAD DATA LOCAL INPATH '$resourcesPath/IUD/dest.csv' INTO table iud_rows"""), --- End diff -- I found that this test will pass if I add OPTIONS(), without/with detail config, in load command. And the logical plan will change from `LoadDataCommand`( class in spark) to `CarbonLoadDataCommand`(class in carbon). If we want to show proceeded row count, we need to change the value `output` in these command/query plan, but we cannot do that for `LoadDataCommand` in spark. ---
[GitHub] carbondata pull request #2713: [WIP][CARBONDATA-2931][BloomDataMap] Optimize...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2713#discussion_r242454233 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java --- @@ -218,56 +218,46 @@ public DataMapBuilder createBuilder(Segment segment, String shardName, this.bloomFilterSize, this.bloomFilterFpp, bloomCompress); } - /** - * returns all shard directories of bloom index files for query - * if bloom index files are merged we should get only one shard path - */ - private Set getAllShardPaths(String tablePath, String segmentId) { -String dataMapStorePath = CarbonTablePath.getDataMapStorePath( -tablePath, segmentId, dataMapName); -CarbonFile[] carbonFiles = FileFactory.getCarbonFile(dataMapStorePath).listFiles(); -Set shardPaths = new HashSet<>(); + + private boolean isAllShardsMerged(String dmSegmentPath) { +boolean mergeShardExist = false; boolean mergeShardInprogress = false; -CarbonFile mergeShardFile = null; +CarbonFile[] carbonFiles = FileFactory.getCarbonFile(dmSegmentPath).listFiles(); for (CarbonFile carbonFile : carbonFiles) { - if (carbonFile.getName().equals(BloomIndexFileStore.MERGE_BLOOM_INDEX_SHARD_NAME)) { -mergeShardFile = carbonFile; - } else if (carbonFile.getName().equals(BloomIndexFileStore.MERGE_INPROGRESS_FILE)) { + String fileName = carbonFile.getName(); + if (fileName.equals(BloomIndexFileStore.MERGE_BLOOM_INDEX_SHARD_NAME)) { +mergeShardExist = true; + } else if (fileName.equals(BloomIndexFileStore.MERGE_INPROGRESS_FILE)) { mergeShardInprogress = true; --- End diff -- Yes, you are right. We need to fix this. If we allow to use bloom filter when the index files are merging, maybe any IO Exception will occur in following steps when the merging is done. Some simple ideas for this: 1. datamap do not choose bloom when merging is under action 2. change the pruning logic to segment independent, any datamap excepts default datamap can reject or fail the segment pruning ( by return null or ?), and no more result blocklet intersection for this datamap, such that this does not affect final result ---
[GitHub] carbondata pull request #2713: [WIP][CARBONDATA-2931][BloomDataMap] Optimize...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2713#discussion_r242394373 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java --- @@ -178,15 +178,9 @@ private String getAncestorTablePath(CarbonTable currentTable) { for (BloomQueryModel bloomQueryModel : bloomQueryModels) { Set tempHitBlockletsResult = new HashSet<>(); LOGGER.debug("prune blocklet for query: " + bloomQueryModel); - BloomCacheKeyValue.CacheKey cacheKey = new BloomCacheKeyValue.CacheKey( - this.indexPath.toString(), bloomQueryModel.columnName); - BloomCacheKeyValue.CacheValue cacheValue = cache.get(cacheKey); - List bloomIndexList = cacheValue.getBloomFilters(); - for (CarbonBloomFilter bloomFilter : bloomIndexList) { -if (needShardPrune && !filteredShard.contains(bloomFilter.getShardName())) { --- End diff -- usage of this info is moved to `getBloomFilters` ---
[GitHub] carbondata issue #2878: [CARBONDATA-3107] Optimize error/exception coding fo...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2878 conflicts fixed. ---
[GitHub] carbondata pull request #2917: [WIP]Show load/insert/update/delete row numbe...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2917#discussion_r235654816 --- Diff: integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/iud/TestShowIUDRowCount.scala --- @@ -0,0 +1,60 @@ +package org.apache.carbondata.spark.testsuite.iud + +import org.apache.spark.sql.Row +import org.apache.spark.sql.test.util.QueryTest +import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach} + +class TestShowIUDRowCount extends QueryTest with BeforeAndAfterEach with BeforeAndAfterAll { + + override protected def beforeAll(): Unit = { +dropTable("iud_rows") + } + + override protected def beforeEach(): Unit = { +dropTable("iud_rows") + } + + override protected def afterEach(): Unit = { +dropTable("iud_rows") + } + + test("Test show load row count") { +sql("""create table iud_rows (c1 string,c2 int,c3 string,c5 string) +|STORED BY 'org.apache.carbondata.format'""".stripMargin) +checkAnswer( + sql(s"""LOAD DATA LOCAL INPATH '$resourcesPath/IUD/dest.csv' INTO table iud_rows"""), --- End diff -- why `sql()` function in QueryTest get different plan to self made spark context or beeline ---
[GitHub] carbondata pull request #2917: [WIP]Show load/insert/update/delete row numbe...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2917 [WIP]Show load/insert/update/delete row number Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata ProceededRowCount Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2917.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2917 commit f7d199e7d6d28c72124a2ad45635949976816d04 Author: Manhua Date: 2018-11-13T09:27:10Z show load/insert/update/delete row number ---
[GitHub] carbondata pull request #2900: [CARBONDATA-3078] Disable explain collector f...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2900#discussion_r231014871 --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java --- @@ -575,6 +576,8 @@ private BitSet setMatchedPartitions(String partitionIds, Expression filter, */ public BlockMappingVO getBlockRowCount(Job job, CarbonTable table, List partitions) throws IOException { +// no useful information for count star query without filter, so disable explain collector +ExplainCollector.remove(); --- End diff -- OK ---
[GitHub] carbondata pull request #2900: [CARBONDATA-3078] Disable explain collector f...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2900#discussion_r231010531 --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java --- @@ -575,6 +576,8 @@ private BitSet setMatchedPartitions(String partitionIds, Expression filter, */ public BlockMappingVO getBlockRowCount(Job job, CarbonTable table, List partitions) throws IOException { +// no useful information for count star query without filter, so disable explain collector +ExplainCollector.remove(); --- End diff -- No remove. Its implementation is disable. ---
[GitHub] carbondata pull request #2900: [CARBONDATA-3078] Disable explain collector f...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2900#discussion_r230972236 --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java --- @@ -575,6 +576,8 @@ private BitSet setMatchedPartitions(String partitionIds, Expression filter, */ public BlockMappingVO getBlockRowCount(Job job, CarbonTable table, List partitions) throws IOException { +// no useful information for count star query without filter, so disable explain collector +ExplainCollector.remove(); --- End diff -- You are right. Normal query flow goes to `CarbonInputFormat#getPrunedBlocklets` and initialize the pruning info for table we queried. But count star query without filter use a different query plan, it does not go into that method, so no pruning info does not init. When it calls default data map to prune(with a null filter), exception will occur during settingg pruning info. One solution is to init the pruning info for this type of query in mrthod `getBlockRowCount`. But considering no useful information about block/blocklet pruning for such query(actually no pruning), I choose to disable the expalin collector instead. ---
[GitHub] carbondata issue #2900: [CARBONDATA-3078] Disable explain collector for coun...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2900 retest this please ---
[GitHub] carbondata pull request #2900: [CARBONDATA-3078] Disable explain collector f...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2900 [CARBONDATA-3078] Disable explain collector for count star query without filter An issue is found about count star query without filter in explain command. It is a special case. It uses different plan. Considering no useful information about block/blocklet pruning for count star query without filter, so disable explain collector and avoid the exception in https://issues.apache.org/jira/browse/CARBONDATA-3078 Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata explain_countstar Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2900.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2900 commit 46f07a25924e98d43dd9dcfad3a1c7cb0cd4d895 Author: Manhua Date: 2018-11-05T12:17:59Z disable explain collector for count star query without filter ---
[GitHub] carbondata issue #2894: [CARBONDATA-3074] Change default sort temp compresso...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2894 retest this please ---
[GitHub] carbondata pull request #2894: [CARBONDATA-3074] Change default sort temp co...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2894 [CARBONDATA-3074] Change default sort temp compressor to SNAPPY Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata sorttempDefault Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2894.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2894 commit 7877c797230b58fd4c51ffa21a026bc13493ac1f Author: Manhua Date: 2018-11-05T09:28:07Z change default sort temp compressor from empty to snappy ---
[GitHub] carbondata issue #2886: [CARBONDATA-3065]make inverted index false by defaut
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2886 @akashrn5 thanks for reply. 1. Let's take a detail case. you can check whether it is right. In `DictDimensionIndexCodec#createEncoder`, as the setting I said above `isSort`=false `isDoInvertedIndex` = true `isInvertedIndex`=`isSort`&&`isDoInvertedIndex` = false so, it will go to `indexStorage = new BlockIndexerStorageForNoInvertedIndexForShort(data, false);`. In the construction method, we can see that it only assigns the dataPage value. No RLE. ``` public BlockIndexerStorageForNoInvertedIndexForShort(byte[][] dataPage, boolean applyRLE) { this.dataPage = dataPage; if (applyRLE) { List actualDataList = new ArrayList<>(); for (int i = 0; i < dataPage.length; i++) { actualDataList.add(dataPage[i]); } rleEncodeOnData(actualDataList); } } ``` 2. If isInvertedIndex is TRUE, then the isSort check must be TRUE ``` isInvertedIndex= isSort&&isDoInvertedIndex; ^ ^ ^ | | | internalUsed SORT_COLUMNS INVERTED_INDEX ``` ---
[GitHub] carbondata issue #2886: [CARBONDATA-3065]make inverted index false by defaut
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2886 The InvertedIndex/NoInvertedIndex setting is confusing. 1. the value `isInvertedIndex` assigned to different IndexCodec in `createEncoderForDimensionLegacy` requires us to set the column both SortColumns and use InvertedIndex. What if I set it in INVERTED_INDEX but not in SORT_COLUMNS? 2. what the boolean value `isInvertedIndex` in IndexCodec do is to control whether to do RLE on datapage? These make the setting not a direct switch to control how the data proceed ---
[GitHub] carbondata pull request #2886: [CARBONDATA-3065]make inverted index false by...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2886#discussion_r230260481 --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java --- @@ -359,8 +359,13 @@ private CarbonCommonConstants() { public static final String TABLE_BLOCKSIZE = "table_blocksize"; // table blocklet size in MB public static final String TABLE_BLOCKLET_SIZE = "table_blocklet_size"; - // set in column level to disable inverted index + /** + * set in column level to disable inverted index + * @Deprecated :This property is deprecated, it is kep just for compatibility --- End diff -- spelling: kept ---
[GitHub] carbondata issue #2862: [HOTFIX] Enable Local dictionary by default
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2862 please remember to update the doc too. ---
[GitHub] carbondata pull request #2879: [CARBONDATA-3058] Fix some exception coding i...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2879#discussion_r229287888 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/steps/CarbonRowDataWriterProcessorStepImpl.java --- @@ -308,7 +312,7 @@ private void processBatch(CarbonRowBatch batch, CarbonFactHandler dataHandler, i } writeCounter[iteratorIndex] += batch.getSize(); } catch (Exception e) { - throw new CarbonDataLoadingException("unable to generate the mdkey", e); + throw new CarbonDataLoadingException(e); --- End diff -- The KeyGenException extend Exception, it needs CarbonDataLoadingException(RuntimeException) to wrap and throw. ---
[GitHub] carbondata pull request #2879: [CARBONDATA-3058] Fix some exception coding i...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2879#discussion_r229285339 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/steps/CarbonRowDataWriterProcessorStepImpl.java --- @@ -212,7 +212,11 @@ private void finish(CarbonFactHandler dataHandler, int iteratorIndex) { try { processingComplete(dataHandler); } catch (CarbonDataLoadingException e) { - exception = new CarbonDataWriterException(e.getMessage(), e); + // only assign when exception is null + // else it will erase original root cause + if (null == exception) { --- End diff -- not for the statistics. better to read the whole method. It has two stages: finish the handler and close the handler. the exception could be assigned in either stage. ---
[GitHub] carbondata issue #2879: [CARBONDATA-3058] Fix some exception coding in data ...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2879 retest this please ---
[GitHub] carbondata pull request #2879: [CARBONDATA-3058] Fix some exception coding i...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2879 [CARBONDATA-3058] Fix some exception coding in data loading 1. when exception occur in `dataHandler.finish();`, carbon does not proceed it immediately. Carbon keeps the exception and calls method to close the datahandler. But the exception would be overwrite if another exception occur when closing the dataHandler. This makes us lost the root cause. Refer to `AbstractFactDataWriter.closeExecutorService()` and `CarbonFactDataWriterImplV3.closeWriter()`, we add null check before the second time assignment in `CarbonRowDataWriterProcessorStepImpl.finish()` and `DataWriterBatchProcessorStepImpl` to avoid exception overwrite. 2. remove irrelevant exception message "unable to generate the mdkey", use the exception itself directly. Message in the exception will be retrieved automatically when logging. Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata exceptionFix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2879.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2879 commit a36f9c98338e1ae38a9c4d9afb4bcd5ab1eb9b23 Author: Manhua Date: 2018-10-29T11:05:03Z fix exception ---
[GitHub] carbondata pull request #2878: [WIP] Modification of error/exception for bet...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2878 [WIP] Modification of error/exception for better debugging Changes in this PR follows these rules: | Code Case| Problem | Suggest Modification | | | -: | :: | | `LOGGER.error(e);` | no stack trace | `LOGGER.error(e.getMessage(), e);` | | `LOGGER.error(e.getMessage());` | no stack trace | `LOGGER.error(e.getMessage(), e);` | | `catch ... throw new Exception("Error occur")`|useless message, no stack trace| `throw new Exception(e)` | | `catch ... throw new Exception(e.getMessage())`|no stack trace| `throw new Exception(e)` | | `catch ... throw new Exception(e.getMessage(), e)`| no need to call `getMessage()` | `throw new Exception(e)` | | `catch ... throw new Exception("Error occur: " + e.getMessage(), e)` |useless message| `throw new Exception(e)` | | `catch ... throw new Exception("DataLoad fail: " + e.getMessage())` |no stack trace| `throw new Exception("DataLoad fail: " + e.getMessage(), e)` | Some exceptions, such as MalformedCarbonCommandException, InterruptException, NumberFormatException, InvalidLoadOptionException and NoRetryException, do not have Constructor using `Throwable`, so we do not change it Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata error_exception Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2878.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2878 commit 78792ba19a0b09ca02581890e708ca0ce4cc Author: Manhua Date: 2018-10-30T01:44:59Z LOGGER.error(e); ==> LOGGER.error(e.getMessage(), e); commit 4a7c58cab1f30178003e55b47978daf8d474fb31 Author: Manhua Date: 2018-10-30T01:48:33Z LOGGER.error(e.getMessage()); ==> LOGGER.error(e.getMessage(), e); commit dd42cba52500725265fbb0577702c79c2d138f6f Author: Manhua Date: 2018-10-30T01:50:24Z LOG.error(e); ==> LOG.error(e.getMessage(), e); commit b8716e8d33c7535d1698409bd91802f27d72890d Author: Manhua Date: 2018-10-30T02:01:29Z Exception(e.getMessage(), e); ==> Exception(e); commit f137aa7da667dcbfe867b5c83ebd6735b8c5f0f1 Author: Manhua Date: 2018-10-30T02:09:31Z Exception(e.getMessage()); ==> Exception(e); commit da0df82a9a43f54f7cf7072a4c3497a731ca261a Author: Manhua Date: 2018-10-30T02:18:37Z + e.getMessage()); ==> + e.getMessage(), e); commit 4c1ecaf636391327a57621721e2a5318fb8cc11f Author: Manhua Date: 2018-10-30T02:36:17Z fix style commit 26fa06c96aeaa7418424df6f5089315d95cfadbb Author: Manhua Date: 2018-10-30T02:53:22Z some more fix for .getMessage() ---
[GitHub] carbondata issue #2732: [CARBONDATA-3020] support lz4 as column compressor
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2732 retest this please ---
[GitHub] carbondata pull request #2866: [CARBONDATA-3050][Doc] Remove unused paramete...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2866 [CARBONDATA-3050][Doc] Remove unused parameter doc Remove documentation of parameter 'carbon.use.multiple.temp.dir' *Related PR:* > #2824 - removed the parameter 'carbon.use.multiple.temp.dir' Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata doc_fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2866.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2866 commit d9f9a3bf694837f43817689a5122bfcc2ba8ae65 Author: Manhua Date: 2018-10-29T02:08:15Z remove unused parameter doc ---
[GitHub] carbondata pull request #2732: [CARBONDATA-3020] support lz4 as column compr...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2732#discussion_r228778578 --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/compression/Lz4Compressor.java --- @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.core.datastore.compression; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.DoubleBuffer; +import java.nio.FloatBuffer; +import java.nio.IntBuffer; +import java.nio.LongBuffer; +import java.nio.ShortBuffer; +import java.util.Arrays; + +import org.apache.carbondata.core.util.ByteUtil; + +import net.jpountz.lz4.LZ4Compressor; +import net.jpountz.lz4.LZ4Factory; +import net.jpountz.lz4.LZ4FastDecompressor; + + +public class Lz4Compressor implements Compressor { --- End diff -- added ---
[GitHub] carbondata issue #2851: [CARBONDATA-3040][BloomDataMap] Add checking before ...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2851 description updated ---
[GitHub] carbondata pull request #2851: [CARBONDATA-3040][BloomDataMap] Add checking ...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2851 [CARBONDATA-3040][BloomDataMap] Add checking before merging bloom index *Scene* There is a bug which causes query failure when we create two bloom datamaps on same table with data. *Analyse* Since we already have data, each create datamap will trigger rebuild datamap task and then trigger bloom index file merging. By debuging, we found the first datamap's bloom index files would be merged two times and the second time made bloom index file empty. *Solution* Send the datamap name in rebuild event for filter. And add file check when merging. Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata fix_multi_bloom Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2851.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2851 commit bcab5ac630e39a7dadee09d5b9157642d061b5e1 Author: Manhua Date: 2018-10-24T08:20:13Z only rebuild target datamap and add file check ---
[GitHub] carbondata pull request #2665: [CARBONDATA-2897][DataMap] Assign to datamap ...
Github user kevinjmh closed the pull request at: https://github.com/apache/carbondata/pull/2665 ---
[GitHub] carbondata issue #2781: [CARBONDATA-2983][BloomDataMap] Change bloom query m...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2781 retest this please ---
[GitHub] carbondata pull request #2781: [CARBONDATA-2983][BloomDataMap] Change bloom ...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2781 [CARBONDATA-2983][BloomDataMap] Change bloom query model to proceed multiple filter values currently, bloom generates multiple query model for each value of InExpression. This PR changes the query model to proceed multiple filter values. Also, intersect result between query model/expression to support AndExpression optimize Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata query_model Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2781.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2781 commit 25ab0270021ec1c58219690c7389003c553768bf Author: Manhua Date: 2018-09-28T03:39:39Z build 1 query model for InExp; intersect result between query model/expression ---
[GitHub] carbondata pull request #2767: [CARBONDATA-2974] Fixed multiple expressions ...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2767#discussion_r221123795 --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/DataMapChooser.java --- @@ -282,7 +280,9 @@ private void extractColumnExpression(Expression expression, List children = expression.getChildren(); if (children != null && children.size() > 0) { for (Expression exp : children) { - extractColumnExpression(exp, columnExpressions); + if (exp != null && exp.getFilterExpressionType() != ExpressionType.UNKNOWN) { --- End diff -- I'am not sure whether any **other** expression exists like SparkUnknownExpression( which is under EqualToExpression/InExpression and has a child of ColumnExpression ) ---
[GitHub] carbondata pull request #2767: [CARBONDATA-2974] Fixed multiple expressions ...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2767#discussion_r221117943 --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/DataMapChooser.java --- @@ -177,34 +180,35 @@ private ExpressionTuple selectDataMap(Expression expression, List // If both left and right has datamap then we can either merge both datamaps to single // datamap if possible. Otherwise apply AND expression. if (left.dataMapExprWrapper != null && right.dataMapExprWrapper != null) { -filterExpressionTypes.add( - left.dataMapExprWrapper.getFilterResolverIntf().getFilterExpression() -.getFilterExpressionType()); -filterExpressionTypes.add( - right.dataMapExprWrapper.getFilterResolverIntf().getFilterExpression() -.getFilterExpressionType()); +filterExpressionTypes.addAll(left.filterExpressionTypes); +filterExpressionTypes.addAll(right.filterExpressionTypes); List columnExpressions = new ArrayList<>(); columnExpressions.addAll(left.columnExpressions); columnExpressions.addAll(right.columnExpressions); // Check if we can merge them to single datamap. TableDataMap dataMap = chooseDataMap(allDataMap, columnExpressions, filterExpressionTypes); +TrueConditionalResolverImpl resolver = new TrueConditionalResolverImpl( +new AndExpression(left.expression, right.expression), false, +true); --- End diff -- so this resolver is used for re-organizing the required expressions? ---
[GitHub] carbondata issue #2767: [CARBONDATA-2974] Fixed multiple expressions issue o...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2767 Hi @ravipesala , I have some question about this PR 1. It is a little confusing to only deal with AND case and ignoring OR case if we want the framework do expression merging in general. 2. This PR still failed when querying on STRING column with bloom filter like "column1 = 123". Please refer to case 1 in description of PR #2665 for details 3. One more fail case found if apply this PR: 3.1 query with filter alike `(Col1='a' or Col2 ='b') and (Col3='c' or Col4='d' ) ` 3.2 ensure test result without bloom should not be empty 3.3 create bloom filter on multi columns in same datamap 3.4 test result with bloom would be empty. Because OrExpression is not proceeded when creating bloom query model now. Please check it. One more thing I think we could do to improve datamap pruning efficiency is to reuse pruning result between filter conditions. Such as we query with two filter conditions on segment with 5 blocklets, we get 2 hit blocklets after applying first filter condition, and then the second filter condition can only apply on those 2 blocklets instead of 5. ---
[GitHub] carbondata issue #2765: [CARBONDATA-2971] Add shard info of blocklet for deb...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2765 retest this please ---
[GitHub] carbondata pull request #2765: [CARBONDATA-2971] Add shard info of blocklet ...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2765#discussion_r220490591 --- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/Blocklet.java --- @@ -92,7 +95,13 @@ public String getFilePath() { blocklet.blockletId == null; } - @Override public int hashCode() { + @Override + public String toString() { +return String.format("[shard:%s, blockletId:%s]", filePath, blockletId); --- End diff -- updated ---
[GitHub] carbondata pull request #2765: [CARBONDATA-2971] Add shard info of blocklet ...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2765#discussion_r220455196 --- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/Blocklet.java --- @@ -92,7 +95,13 @@ public String getFilePath() { blocklet.blockletId == null; } - @Override public int hashCode() { + @Override + public String toString() { +return String.format("[shard:%s, blockletId:%s]", filePath, blockletId); --- End diff -- The shard name in `toString()` is exactly the one in `Blocklet` itself ---
[GitHub] carbondata pull request #2765: [CARBONDATA-2971] Add shard info of blocklet ...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2765 [CARBONDATA-2971] Add shard info of blocklet for debugging add `toString` method to print both shard name and blocklet id for debugging. Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata addDebugInfo Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2765.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2765 commit 225ea72f16fafe5d16192763d01614822f31e2dd Author: Manhua Date: 2018-09-26T02:34:54Z add toString ---
[GitHub] carbondata pull request #2665: [CARBONDATA-2897][DataMap] Optimize datamap c...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2665#discussion_r219378413 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java --- @@ -123,6 +129,42 @@ public BloomCoarseGrainDataMapFactory(CarbonTable carbonTable, DataMapSchema dat } } + @Override + public boolean isSupport(Expression expression) { +ColumnExpression filterColumn = null; +// First check type of child nodes, and get filter column name +switch (expression.getFilterExpressionType()) { + case IN: +InExpression inExpr = (InExpression) expression; +if (inExpr.getLeft() instanceof ColumnExpression && +inExpr.getRight() instanceof ListExpression) { + filterColumn = (ColumnExpression) inExpr.getLeft(); +} else if (inExpr.getRight() instanceof ColumnExpression && +inExpr.getLeft() instanceof ListExpression) { + filterColumn = (ColumnExpression) inExpr.getRight(); +} +break; + case EQUALS: +EqualToExpression equalToExpr = (EqualToExpression) expression; +if (equalToExpr.getLeft() instanceof ColumnExpression && +equalToExpr.getRight() instanceof LiteralExpression) { + filterColumn = (ColumnExpression) equalToExpr.getLeft(); +} else if (equalToExpr.getRight() instanceof ColumnExpression && +equalToExpr.getLeft() instanceof LiteralExpression) { + filterColumn = (ColumnExpression) equalToExpr.getRight(); +} +break; + default: +break; --- End diff -- ok ---
[GitHub] carbondata pull request #2665: [CARBONDATA-2897][DataMap] Optimize datamap c...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2665#discussion_r219378217 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java --- @@ -123,6 +129,42 @@ public BloomCoarseGrainDataMapFactory(CarbonTable carbonTable, DataMapSchema dat } } + @Override + public boolean isSupport(Expression expression) { +ColumnExpression filterColumn = null; +// First check type of child nodes, and get filter column name +switch (expression.getFilterExpressionType()) { + case IN: +InExpression inExpr = (InExpression) expression; +if (inExpr.getLeft() instanceof ColumnExpression && +inExpr.getRight() instanceof ListExpression) { + filterColumn = (ColumnExpression) inExpr.getLeft(); +} else if (inExpr.getRight() instanceof ColumnExpression && +inExpr.getLeft() instanceof ListExpression) { + filterColumn = (ColumnExpression) inExpr.getRight(); +} +break; + case EQUALS: +EqualToExpression equalToExpr = (EqualToExpression) expression; +if (equalToExpr.getLeft() instanceof ColumnExpression && +equalToExpr.getRight() instanceof LiteralExpression) { + filterColumn = (ColumnExpression) equalToExpr.getLeft(); +} else if (equalToExpr.getRight() instanceof ColumnExpression && +equalToExpr.getLeft() instanceof LiteralExpression) { + filterColumn = (ColumnExpression) equalToExpr.getRight(); +} +break; + default: +break; +} +// Then check if filter column is in index columns +if (null != filterColumn && dataMapMeta.getIndexedColumnNames().contains( --- End diff -- for safety, convert to lowercase in both side to check ---
[GitHub] carbondata pull request #2665: [CARBONDATA-2897][DataMap] Optimize datamap c...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2665#discussion_r219376540 --- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapFactory.java --- @@ -364,6 +365,13 @@ public DataMapMeta getMeta() { return null; } + @Override + public boolean isSupport(Expression expression) { +// this method is used for choosing cg/fg datamap +// for default datamap always return false +return false; --- End diff -- I get your point. I will change it to return true for better code reading. Also will add checking meta info( meta of default datamap is NULLnow) to avoid null point exception ---
[GitHub] carbondata issue #2734: [CARBONDATA-2946] Bloom filter backward compatibilit...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2734 Test case which found this bug pass. Please refer last PR #2654 in description for tracking ---
[GitHub] carbondata issue #2732: [WIP] lz4 as column compressor in final store
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2732 retest this please ---
[GitHub] carbondata issue #2665: [CARBONDATA-2897][DataMap] Optimize datamap chooser
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2665 retest this please ---
[GitHub] carbondata issue #2654: [CARBONDATA-2896] Adaptive Encoding for Primitive da...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2654 OK ---
[GitHub] carbondata pull request #2665: [CARBONDATA-2897][DataMap] Optimize datamap c...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2665#discussion_r218679293 --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/DataMapChooser.java --- @@ -268,23 +238,38 @@ private ExpressionTuple selectDataMap(Expression expression, List private void extractColumnExpression(Expression expression, List columnExpressions) { -if (expression instanceof ColumnExpression) { - columnExpressions.add((ColumnExpression) expression); -} else if (expression instanceof MatchExpression) { - // this is a special case for lucene - // build a fake ColumnExpression to filter datamaps which contain target column - // a Lucene query string is alike "column:query term" - String[] queryItems = expression.getString().split(":", 2); - if (queryItems.length == 2) { -columnExpressions.add(new ColumnExpression(queryItems[0], null)); - } -} else if (expression != null) { - List children = expression.getChildren(); - if (children != null && children.size() > 0) { -for (Expression exp : children) { - extractColumnExpression(exp, columnExpressions); +switch (expression.getFilterExpressionType()) { --- End diff -- Change to check by method `isSupport` in datamap factory ---
[GitHub] carbondata issue #2654: [CARBONDATA-2896] Adaptive Encoding for Primitive da...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2654 I ran a test on table with bloom datamap created before applying this PR, and query it after this PR merged, but the answer is not correct. Can you check it? Procedure to reproduce: - switch master code before this PR merged - create table with no-dict measure column (set the measure column as sort column) - create bloom datamap on the measure column - load some data into table - query on the measure column, get a result - switch to code after this PR merged - do the same query and compare the result ---
[GitHub] carbondata pull request #2732: [WIP] lz4 as column compressor in final store
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2732#discussion_r218645316 --- Diff: core/src/main/java/net/jpountz/lz4/LZ4DecompressorWithLength.java --- @@ -0,0 +1,191 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +// code ported from https://github.com/lz4/lz4-java/issues/119 +// remove this class when new version > 1.4.1 released +// this is only for test + +/* + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package net.jpountz.lz4; + +import java.nio.ByteBuffer; + +// code ported from https://github.com/lz4/lz4-java/issues/119 +// remove this when new version > 1.4.1 released + +/** + * Convenience class to decompress data compressed by {@link LZ4CompressorWithLength}. + * This decompressor is NOT compatible with any other compressors in lz4-java + * or any other lz4 tools. + * The user does not need to specify the length of the compressed data or + * original data because the length of the original decompressed data is + * included in the compressed data. + */ + +public class LZ4DecompressorWithLength { + + private final LZ4FastDecompressor decompressor; + + /** + * Returns the decompressed length of compressed data in src. + * + * @param src the compressed data + * @return the decompressed length + */ + public static int getDecompressedLength(byte[] src) { +return getDecompressedLength(src, 0); + } + + /** + * Returns the decompressed length of compressed data in src[srcOff:]. + * + * @param src the compressed data + * @param srcOff the start offset in src + * @return the decompressed length + */ + public static int getDecompressedLength(byte[] src, int srcOff) { +return (src[srcOff] & 0xFF) | (src[srcOff + 1] & 0xFF) << 8 | +(src[srcOff + 2] & 0xFF) << 16 | src[srcOff + 3] << 24; + } + + /** + * Returns the decompressed length of compressed data in src. + * + * @param src the compressed data + * @return the decompressed length + */ + public static int getDecompressedLength(ByteBuffer src) { +return getDecompressedLength(src, src.position()); + } + + /** + * Returns the decompressed length of compressed data in src[srcOff:]. + * + * @param src the compressed data + * @param srcOff the start offset in src + * @return the decompressed length + */ + public static int getDecompressedLength(ByteBuffer src, int srcOff) { +return (src.get(srcOff) & 0xFF) | (src.get(srcOff + 1) & 0xFF) << 8 | +(src.get(srcOff + 2) & 0xFF) << 16 | src.get(srcOff + 3) << 24; + } + + /** + * Creates a new decompressor to decompress data compressed by {@link LZ4CompressorWithLength}. + * + * @param decompressor decompressor to use + */ + public LZ4DecompressorWithLength(LZ4FastDecompressor decompressor) { +this.decompressor = decompressor; + } + + /** + * Convenience method, equivalent to calling + * {@link #decompress(byte[], int, byte[], int) decompress(src, 0, dest, 0)}. + * + * @param src the compressed data + * @param dest the destination buffer to store the decompressed data + * @return the number o
[GitHub] carbondata pull request #2732: [WIP] lz4 as column compressor in final store
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2732#discussion_r218635891 --- Diff: core/src/main/java/net/jpountz/lz4/LZ4CompressorWithLength.java --- @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +// code ported from https://github.com/lz4/lz4-java/issues/119 +// remove this class when new version > 1.4.1 released +// this is only for test + +/* + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package net.jpountz.lz4; + +import java.nio.ByteBuffer; +import java.util.Arrays; + +/** + * Covenience class to include the length of the original decompressed data + * in the output compressed data, so that the user does not need to save + * the length at anywhere else. The compressed data must be decompressed by + * {@link LZ4DecompressorWithLength} and is NOT compatible with any other + * decompressors in lz4-java or any other lz4 tools. This class deliberately + * does not extend {@link LZ4Compressor} because they are not interchangable. + */ + +public class LZ4CompressorWithLength { --- End diff -- yes. These codes didn't not packed in a released jar. Here we cope it only for test. See comment in L18-20 ---
[GitHub] carbondata pull request #2732: [WIP] lz4 as column compressor in final store
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2732 [WIP] lz4 as column compressor in final store Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata lz4 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2732.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2732 commit b5a4353c9f7536973f8aa1900757e2266cde31ee Author: Manhua Date: 2018-09-18T11:41:51Z lz4 test ---
[GitHub] carbondata pull request #2706: [CARBONDATA-2927] multiple issue fixes for va...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r218403711 --- Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java --- @@ -239,6 +239,7 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException { * @return false if any varchar column page cannot add one more value(2MB) */ private boolean isVarcharColumnFull(CarbonRow row) { +//TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB --- End diff -- Original implementation use `2MB` to ensure next varchar column value can be filled safely, because size of value of single column won't exceed size of a row. If UnsafeSortDataRows can exceed 2MB(growing dynamically), then we cannot check whether we have enough space for next value because we are not sure how many space next value will take. So the column page size check should be run before adding row to `dataRows` ---
[GitHub] carbondata pull request #2723: [CARBONDATA-2938][DataMap] Update comment of ...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2723 [CARBONDATA-2938][DataMap] Update comment of blockletId in IndexDataMapRebuildRDD **Background**: #2539 Tried to make use of blocklet id information from query when rebuilding datamap. #2565 Revert PR2539 since it would cause wrong answer when testing on table with large mount of data **What's in this PR**: Update comment to explain why blocklet id information from query does not work Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata rebuild_blockletid Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2723.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2723 commit 786f743703d3dfb824b12b9a29f9c16d3ef3 Author: Manhua Date: 2018-09-15T02:30:00Z update comment ---
[GitHub] carbondata issue #2713: [CARBONDATA-2931][BloomDataMap] Optimize bloom datam...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2713 retest this please ---
[GitHub] carbondata pull request #2713: [CARBONDATA-2931][BloomDataMap] Optimize bloo...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2713 [CARBONDATA-2931][BloomDataMap] Optimize bloom datamap pruning 1. re-use shard pruning info from default datamap 2. create one BloomCoarseGrainDataMap object per segment instead of per shard. (This is also preparation for parallel segment pruning) Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata bloom_shard_op Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2713.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2713 commit 8544ed699424ff6089d28b8e07f582a2f9e25b78 Author: Manhua Date: 2018-08-30T01:43:10Z all shard of one segment use one datamap ---
[GitHub] carbondata pull request #2696: [CARBONDATA-2902][DataMap] Fix showing negati...
Github user kevinjmh closed the pull request at: https://github.com/apache/carbondata/pull/2696 ---
[GitHub] carbondata pull request #2711: [CARBONDATA-2929][DataMap] Add block skipped ...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2711 [CARBONDATA-2929][DataMap] Add block skipped info for explain command This pr will add block skipped info by counting distinct file path from hit blocklet. It shows like below: ``` |== CarbonData Profiler == Table Scan on test - total: 125 blocks, 250 blocklets - filter: (l_partkey <> null and l_partkey = 1006) - pruned by Main DataMap - skipped: 119 blocks, 238 blocklets - pruned by CG DataMap - name: dm - provider: bloomfilter - skipped: 6 blocks, 12 blocklets ``` ``` |== CarbonData Profiler == Table Scan on test - total: 125 blocks, 250 blocklets - filter: TEXT_MATCH('l_shipmode:AIR') - pruned by Main DataMap - skipped: 0 blocks, 0 blocklets - pruned by FG DataMap - name: dm - provider: lucene - skipped: 12 blocks, 80 blocklets ``` Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata explain_block_skip Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2711.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2711 commit 0828b4d3f366b02a3f9db89e862fc9bc0b89 Author: Manhua Date: 2018-09-12T03:29:46Z add block skip info ---
[GitHub] carbondata issue #2696: [CARBONDATA-2902][DataMap] Fix showing negative prun...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2696 I found some material which may be helpful. > A PR can be closed without committing it like so: https://mahout.apache.org/developers/github.html#closing-a-pr-without-committing-for-committers > Here's an example that closed https://github.com/apache/incubator-airflow/pull/2440 : > git commit --allow-empty -m "Closes apache/incubator-airflow#2440 *Already Merged*" Ref: the first comment by Chris Riccomini in https://cwiki.apache.org/confluence/display/AIRFLOW/Committers%27+Guide ---
[GitHub] carbondata issue #2696: [CARBONDATA-2902][DataMap] Fix showing negative prun...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2696 This PR is already merged in this [commit](https://github.com/apache/carbondata/commit/f04850f39d8c42b96ee419140c9506f7df988075), but asfgit did not close this. Is it OK if I close this PR directly? @jackylk ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r215829032 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java --- @@ -331,8 +332,18 @@ private BloomQueryModel buildQueryModelInternal(CarbonColumn carbonColumn, // for dictionary/date columns, convert the surrogate key to bytes internalFilterValue = CarbonUtil.getValueAsBytes(DataTypes.INT, convertedValue); } else { - // for non dictionary dimensions, is already bytes, - internalFilterValue = (byte[]) convertedValue; + // for non dictionary dimensions, numeric columns will be of original data, + // so convert the data to bytes + if (DataTypeUtil.isPrimitiveColumn(carbonColumn.getDataType())) { +if (convertedValue == null) { + convertedValue = DataConvertUtil.getNullValueForMeasure(carbonColumn.getDataType(), + carbonColumn.getColumnSchema().getScale()); +} +internalFilterValue = +CarbonUtil.getValueAsBytes(carbonColumn.getDataType(), convertedValue); --- End diff -- The above problem is similar to your second commit for minmax ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r215827023 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java --- @@ -331,8 +332,18 @@ private BloomQueryModel buildQueryModelInternal(CarbonColumn carbonColumn, // for dictionary/date columns, convert the surrogate key to bytes internalFilterValue = CarbonUtil.getValueAsBytes(DataTypes.INT, convertedValue); } else { - // for non dictionary dimensions, is already bytes, - internalFilterValue = (byte[]) convertedValue; + // for non dictionary dimensions, numeric columns will be of original data, + // so convert the data to bytes + if (DataTypeUtil.isPrimitiveColumn(carbonColumn.getDataType())) { +if (convertedValue == null) { + convertedValue = DataConvertUtil.getNullValueForMeasure(carbonColumn.getDataType(), + carbonColumn.getColumnSchema().getScale()); +} +internalFilterValue = +CarbonUtil.getValueAsBytes(carbonColumn.getDataType(), convertedValue); --- End diff -- Result of `getValueAsBytes` conflicts with existing bloom index data which will affect query result. For measure in 'NoDict', original implementation used `NonDictionaryFieldConverterImpl` to convert value, but now it uses `MeasureFieldConverterImpl` to convert value ---
[GitHub] carbondata pull request #2696: [CARBONDATA-2902][DataMap] Fix showing negati...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2696 [CARBONDATA-2902][DataMap] Fix showing negative pruning result for explain command #2676 used method `ByteBuffer.getShort(int index)` to get number of blocklets in block, but it used wrong parameter. The `index` is index of byte instead of index of short. So it needs to multiply bytes of short type Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata explain_fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2696.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2696 commit 1cdfc5c8e592f53170cf942bfbb45cc5e5d67719 Author: Manhua Date: 2018-09-06T09:09:21Z fix negative blocklet skipped when cache level is block ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r215153728 --- Diff: integration/spark2/src/main/scala/org/apache/carbondata/datamap/IndexDataMapRebuildRDD.scala --- @@ -264,8 +264,17 @@ class RawBytesReadSupport(segmentProperties: SegmentProperties, indexColumns: Ar rtn(i) = if (indexCol2IdxInDictArray.contains(col.getColName)) { surrogatKeys(indexCol2IdxInDictArray(col.getColName)).toInt.asInstanceOf[Integer] } else if (indexCol2IdxInNoDictArray.contains(col.getColName)) { -data(0).asInstanceOf[ByteArrayWrapper].getNoDictionaryKeyByIndex( +val bytes = data(0).asInstanceOf[ByteArrayWrapper].getNoDictionaryKeyByIndex( indexCol2IdxInNoDictArray(col.getColName)) +// no dictionary primitive columns are expected to be in original data while loading, +// so convert it to original data +if (DataTypeUtil.isPrimitiveColumn(col.getDataType)) { + val dataFromBytes = DataTypeUtil +.getDataBasedOnDataTypeForNoDictionaryColumn(bytes, col.getDataType) + dataFromBytes --- End diff -- if isPrimitiveColumn, need null check and get null value for measure ---
[GitHub] carbondata pull request #2685: [CARBONDATA-2910] Support backward compatabil...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2685#discussion_r214886972 --- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockDataMap.java --- @@ -630,6 +638,24 @@ private boolean useMinMaxForExecutorPruning(FilterResolverIntf filterResolverInt return useMinMaxForPruning; } + @Override + public List prune(Expression expression, SegmentProperties segmentProperties, + List partitions, AbsoluteTableIdentifier identifier) throws IOException { +FilterResolverIntf filterResolverIntf = null; +if (expression != null) { + SegmentProperties properties = getSegmentProperties(); + QueryModel.FilterProcessVO processVO = + new QueryModel.FilterProcessVO(properties.getDimensions(), properties.getMeasures(), + new ArrayList()); + QueryModel.processFilterExpression(processVO, expression, null, null); + // Optimize Filter Expression and fit RANGE filters is conditions apply. + FilterOptimizer rangeFilterOptimizer = new RangeFilterOptmizer(expression); + rangeFilterOptimizer.optimizeFilter(); + filterResolverIntf = CarbonTable.resolveFilter(expression, identifier); --- End diff -- can we pull up the transformation from expression to filterResolverIntf so that we can reuse most code instead of adding another `prune` method everywhere ---
[GitHub] carbondata pull request #2688: [CARBONDATA-2911] Remove unused BTree related...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2688 [CARBONDATA-2911] Remove unused BTree related code 1. BTree related code is only used by a test class called`BTreeBlockFinderTest`. 2. BTreeDataRefNodeFinder in AbstractDetailQueryResultIterator never run. All dataRefNode are actually instanceof BlockletDataRefNode Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata remove_btree_related Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2688.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2688 commit 47f18af7277acaec8de4e166d6a268b2fa8b2e7e Author: Manhua Date: 2018-09-04T02:20:07Z remove btree related code ---
[GitHub] carbondata pull request #2676: [CARBONDATA-2902][DataMap] Fix showing negati...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2676 [CARBONDATA-2902][DataMap] Fix showing negative pruning result for explain command Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata explain_block_level Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2676.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2676 commit 50a15da1e1ab8165a1674e5ed6778ef18316628b Author: Manhua Date: 2018-08-31T03:44:10Z show different explain result for block/blocklet pruning ---
[GitHub] carbondata issue #2598: [CARBONDATA-2811][BloomDataMap] Add query test case ...
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2598 retest this please ---
[GitHub] carbondata issue #2665: [CARBONDATA-2897][DataMap] Optimize datamap chooser
Github user kevinjmh commented on the issue: https://github.com/apache/carbondata/pull/2665 retest this please ---
[GitHub] carbondata pull request #2665: [CARBONDATA-2897][DataMap] Optimize datamap c...
GitHub user kevinjmh opened a pull request: https://github.com/apache/carbondata/pull/2665 [CARBONDATA-2897][DataMap] Optimize datamap chooser In this PR, 1. Remove code for merging into one datamap when some datamap hits both child nodes of And/Or expression in DataMapChooser. This aims to make datamap focus on pruning single index column without any logic process. Leave logic stuff to be done by AndDataMapExprWrapper and OrDataMapExprWrapper 2. Only extract ColumnExpression of Expression which our datamap can handle in DataMapChooser. 3. Add short circuit to return pruned result when result of left node is empty in AndDataMapExprWrapper. Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kevinjmh/carbondata dmChooser Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2665.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2665 commit 2c03e75f7d77df7609f2b054dcf93bdd7256bfa8 Author: Manhua Date: 2018-08-28T11:51:52Z datamap chooser ---