Hi David,
It seems there is dirty data (null values, for example) in your Hive data source.

Best regards,
Ni Chunen / George

On 06/18/2019 11:33, [email protected] <[email protected]> wrote:

Hi,

I am using Kylin 2.6.2 with Hadoop 2.7 (hive-2.1, hbase 1.1.8), and I encountered the following problem in the third phase (Extract Distinct Columns):

"2019-06-17 17:22:30,021 INFO [main] org.apache.kylin.engine.mr.steps.FactDistinctColumnsMapper: Sample output: TEST.RECORD_AGGREG.TS '1558681200' => reducer 0
2019-06-17 17:22:30,025 ERROR [Thread-8] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-8,5,main] threw an Exception.
java.lang.NullPointerException
    at org.apache.kylin.engine.mr.steps.FactDistinctColumnsMapper$CuboidStatCalculator.putRowKeyToHLLNew(FactDistinctColumnsMapper.java:385)
"

I have a large input (521 836 260 rows) and I want to create one cube with 2 metrics + 1 dimension. At first, I thought it might fail because of a null value for a dimension, but after checking the code it seems that scenario is handled:

    String colValue = row[rowkeyColIndex[i]];
    if (colValue == null)
        colValue = "0";
    byte[] bytes = hc.putString(colValue).hash().asBytes();

Could you please help me find the root cause of this failure? Below are the logs for one container and the cube config:

"
2019-06-17 17:22:28,323 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2019-06-17 17:22:28,373 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2019-06-17 17:22:28,373 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2019-06-17 17:22:28,414 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2019-06-17 17:22:28,414 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1560186768967_16121, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@4e096385)
2019-06-17 17:22:28,488 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2019-06-17 17:22:28,670 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /local/yarn/hadoop-mapr/nm-local-dir/usercache/root/appcache/application_1560186768967_16121
2019-06-17 17:22:28,854 INFO [main] org.apache.hadoop.mapred.Task: mapOutputFile class: org.apache.hadoop.mapred.MapRFsOutputFile
2019-06-17 17:22:28,854 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2019-06-17 17:22:28,865 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
2019-06-17 17:22:28,865 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2019-06-17 17:22:28,874 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2019-06-17 17:22:28,949 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: org.apache.hive.hcatalog.mapreduce.HCatSplit@1b8a29df
2019-06-17 17:22:29,154 INFO [main] org.apache.hadoop.mapred.MapRFsOutputBuffer: mapreduce.task.io.sort.mb: 480
2019-06-17 17:22:29,154 INFO [main] org.apache.hadoop.mapred.MapRFsOutputBuffer: soft limit at 413575168
2019-06-17 17:22:29,155 INFO [main] org.apache.hadoop.mapred.MapRFsOutputBuffer: bufstart = 0; bufvoid = 417752688
2019-06-17 17:22:29,155 INFO [main] org.apache.hadoop.mapred.MapRFsOutputBuffer: kvstart = 0; length = 26109543
2019-06-17 17:22:29,163 INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapRFsOutputBuffer
2019-06-17 17:22:29,168 INFO [main] org.apache.kylin.engine.mr.common.AbstractHadoopJob: The absolute path for meta dir is /local/yarn/hadoop-mapr/nm-local-dir/usercache/root/appcache/application_1560186768967_16121/container_e02_1560186768967_16121_01_000016/meta
2019-06-17 17:22:29,181 INFO [main] org.apache.kylin.common.KylinConfig: Loading kylin-defaults.properties from file:/local/yarn/hadoop-mapr/nm-local-dir/usercache/root/appcache/application_1560186768967_16121/filecache/10/job.jar/job.jar!/kylin-defaults.properties
2019-06-17 17:22:29,185 INFO [main] org.apache.kylin.common.KylinConfig: Use KYLIN_CONF=/local/yarn/hadoop-mapr/nm-local-dir/usercache/root/appcache/application_1560186768967_16121/container_e02_1560186768967_16121_01_000016/meta
2019-06-17 17:22:29,187 INFO [main] org.apache.kylin.common.KylinConfig: Initialized a new KylinConfig from getInstanceFromEnv : 1097619701
2019-06-17 17:22:29,203 INFO [main] org.apache.kylin.common.KylinConfigBase: Kylin Config was updated with kylin.metadata.url : kylin_metadata@ifile,path=/local/yarn/hadoop-mapr/nm-local-dir/usercache/root/appcache/application_1560186768967_16121/container_e02_1560186768967_16121_01_000016/meta
2019-06-17 17:22:29,342 INFO [main] org.apache.kylin.common.KylinConfig: Creating new manager instance of class org.apache.kylin.cube.CubeManager
2019-06-17 17:22:29,363 INFO [main] org.apache.kylin.cube.CubeManager: Initializing CubeManager with config kylin_metadata@ifile,path=/local/yarn/hadoop-mapr/nm-local-dir/usercache/root/appcache/application_1560186768967_16121/container_e02_1560186768967_16121_01_000016/meta
2019-06-17 17:22:29,364 INFO [main] org.apache.kylin.common.persistence.ResourceStore: Using metadata url kylin_metadata@ifile,path=/local/yarn/hadoop-mapr/nm-local-dir/usercache/root/appcache/application_1560186768967_16121/container_e02_1560186768967_16121_01_000016/meta for resource store
2019-06-17 17:22:29,673 INFO [main] org.apache.kylin.common.KylinConfig: Creating new manager instance of class org.apache.kylin.cube.CubeDescManager
2019-06-17 17:22:29,674 INFO [main] org.apache.kylin.cube.CubeDescManager: Initializing CubeDescManager with config kylin_metadata@ifile,path=/local/yarn/hadoop-mapr/nm-local-dir/usercache/root/appcache/application_1560186768967_16121/container_e02_1560186768967_16121_01_000016/meta
2019-06-17 17:22:29,715 INFO [main] org.apache.kylin.common.KylinConfig: Creating new manager instance of class org.apache.kylin.metadata.project.ProjectManager
2019-06-17 17:22:29,716 INFO [main] org.apache.kylin.metadata.project.ProjectManager: Initializing ProjectManager with metadata url kylin_metadata@ifile,path=/local/yarn/hadoop-mapr/nm-local-dir/usercache/root/appcache/application_1560186768967_16121/container_e02_1560186768967_16121_01_000016/meta
2019-06-17 17:22:29,726 INFO [main] org.apache.kylin.common.KylinConfig: Creating new manager instance of class org.apache.kylin.metadata.cachesync.Broadcaster
2019-06-17 17:22:29,733 INFO [main] org.apache.kylin.common.KylinConfig: Creating new manager instance of class org.apache.kylin.metadata.model.DataModelManager
2019-06-17 17:22:29,738 INFO [main] org.apache.kylin.common.KylinConfig: Creating new manager instance of class org.apache.kylin.metadata.TableMetadataManager
2019-06-17 17:22:29,754 INFO [main] org.apache.kylin.measure.MeasureTypeFactory: Checking custom measure types from kylin config
2019-06-17 17:22:29,755 INFO [main] org.apache.kylin.measure.MeasureTypeFactory: registering COUNT_DISTINCT(hllc), class org.apache.kylin.measure.hllc.HLLCMeasureType$Factory
2019-06-17 17:22:29,760 INFO [main] org.apache.kylin.measure.MeasureTypeFactory: registering COUNT_DISTINCT(bitmap), class org.apache.kylin.measure.bitmap.BitmapMeasureType$Factory
2019-06-17 17:22:29,767 INFO [main] org.apache.kylin.measure.MeasureTypeFactory: registering TOP_N(topn), class org.apache.kylin.measure.topn.TopNMeasureType$Factory
2019-06-17 17:22:29,769 INFO [main] org.apache.kylin.measure.MeasureTypeFactory: registering RAW(raw), class org.apache.kylin.measure.raw.RawMeasureType$Factory
2019-06-17 17:22:29,771 INFO [main] org.apache.kylin.measure.MeasureTypeFactory: registering EXTENDED_COLUMN(extendedcolumn), class org.apache.kylin.measure.extendedcolumn.ExtendedColumnMeasureType$Factory
2019-06-17 17:22:29,772 INFO [main] org.apache.kylin.measure.MeasureTypeFactory: registering PERCENTILE_APPROX(percentile), class org.apache.kylin.measure.percentile.PercentileMeasureType$Factory
2019-06-17 17:22:29,774 INFO [main] org.apache.kylin.measure.MeasureTypeFactory: registering COUNT_DISTINCT(dim_dc), class org.apache.kylin.measure.dim.DimCountDistinctMeasureType$Factory
2019-06-17 17:22:29,789 INFO [main] org.apache.kylin.metadata.model.DataModelManager: Model flat_single is missing or unloaded yet
2019-06-17 17:22:29,789 INFO [main] org.apache.kylin.metadata.model.DataModelManager: Model record_aggr is missing or unloaded yet
2019-06-17 17:22:29,789 INFO [main] org.apache.kylin.metadata.model.DataModelManager: Model tester is missing or unloaded yet
2019-06-17 17:22:29,836 INFO [main] org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2019-06-17 17:22:29,837 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.deflate]
2019-06-17 17:22:29,849 INFO [main] org.apache.hive.hcatalog.mapreduce.InternalUtil: Initializing org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe with properties {name=default.kylin_intermediate_record_aggr_cube_9c444e0a_98c7_2646_52cc_3d74b6058d18, numFiles=70, columns.types=bigint,bigint,bigint, auto.purge=true, serialization.format=1, columns=record_aggreg_ts,record_aggreg_page_visit_sum,record_aggreg_image_load_sum, rawDataSize=10549866066, columns.comments=nullnullnull, last_modified_time=1560812812, numRows=521836260, serialization.lib=org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, EXTERNAL=TRUE, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, totalSize=21571825655, last_modified_by=root, serialization.null.format=\N, transient_lastDdlTime=1560817330}
2019-06-17 17:22:29,972 INFO [main] org.apache.kylin.engine.mr.KylinMapper: Do setup, available memory: 5712m
2019-06-17 17:22:29,972 INFO [main] org.apache.kylin.engine.mr.KylinMapper: The conf for current mapper will be 2047526627
2019-06-17 17:22:29,981 INFO [main] org.apache.kylin.common.KylinConfig: Creating new manager instance of class org.apache.kylin.source.SourceManager
2019-06-17 17:22:29,993 INFO [main] org.apache.kylin.common.KylinConfig: Creating new manager instance of class org.apache.kylin.cube.cuboid.CuboidManager
2019-06-17 17:22:30,012 INFO [main] org.apache.kylin.engine.mr.steps.FactDistinctColumnsMapper: Found KylinVersion : 2.6.2.0. Use new algorithm for cuboid sampling. About the details of the new algorithm, please refer to KYLIN-2518
2019-06-17 17:22:30,013 INFO [main] org.apache.kylin.engine.mr.steps.FactDistinctColumnsMapper: cuboid stats calculator:0 started, handle cuboids number:187
2019-06-17 17:22:30,017 INFO [main] org.apache.kylin.engine.mr.KylinMapper: Accepting Mapper Key with ordinal: 1
2019-06-17 17:22:30,017 INFO [main] org.apache.kylin.engine.mr.KylinMapper: Do map, available memory: 5701m
2019-06-17 17:22:30,021 INFO [main] org.apache.kylin.engine.mr.steps.FactDistinctColumnsMapper: Sample output: TEST.RECORD_AGGREG.TS '1558681200' => reducer 0
2019-06-17 17:22:30,025 ERROR [Thread-8] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-8,5,main] threw an Exception.
java.lang.NullPointerException
    at org.apache.kylin.engine.mr.steps.FactDistinctColumnsMapper$CuboidStatCalculator.putRowKeyToHLLNew(FactDistinctColumnsMapper.java:385)
    at org.apache.kylin.engine.mr.steps.FactDistinctColumnsMapper$CuboidStatCalculator.run(FactDistinctColumnsMapper.java:411)
    at java.lang.Thread.run(Thread.java:748)"

Cube Config:

{
  "uuid": "c532d208-cd50-4aaf-06a6-6023f61a3050",
  "last_modified": 1560810508988,
  "version": "2.6.2.0",
  "name": "record_aggr_cube",
  "is_draft": false,
  "model_name": "record_aggr",
  "description": "",
  "null_string": null,
  "dimensions": [
    {
      "name": "TS",
      "table": "record_AGGREG",
      "column": "TS",
      "derived": null
    }
  ],
  "measures": [
    {
      "name": "_COUNT_",
      "function": {
        "expression": "COUNT",
        "parameter": {
          "type": "constant",
          "value": "1"
        },
        "returntype": "bigint"
      }
    },
    {
      "name": "SUM_PAGE_VISIT",
      "function": {
        "expression": "SUM",
        "parameter": {
          "type": "column",
          "value": "record_AGGREG.PAGE_VISIT_SUM"
        },
        "returntype": "bigint"
      }
    },
    {
      "name": "SUM_IMAGE_LOAD",
      "function": {
        "expression": "SUM",
        "parameter": {
          "type": "column",
          "value": "record_AGGREG.IMAGE_LOAD_SUM"
        },
        "returntype": "bigint"
      }
    }
  ],
  "dictionaries": [],
  "rowkey": {
    "rowkey_columns": [
      {
        "column": "record_AGGREG.TS",
        "encoding": "dict",
        "encoding_version": 1,
        "isShardBy": false
      }
    ]
  },
  "hbase_mapping": {
    "column_family": [
      {
        "name": "F1",
        "columns": [
          {
            "qualifier": "M",
            "measure_refs": [
              "_COUNT_",
              "SUM_PAGE_VISIT",
              "SUM_IMAGE_LOAD"
            ]
          }
        ]
      }
    ]
  },
  "aggregation_groups": [
    {
      "includes": [
        "record_AGGREG.TS"
      ],
      "select_rule": {
        "hierarchy_dims": [],
        "mandatory_dims": [],
        "joint_dims": []
      }
    }
  ],
  "signature": "+8tNqJZYWGtkbx7AAZhJCg==",
  "notify_list": [],
  "status_need_notify": [
    "ERROR",
    "DISCARDED",
    "SUCCEED"
  ],
  "partition_date_start": 0,
  "partition_date_end": 3153600000000,
  "auto_merge_time_ranges": [
    604800000,
    2419200000
  ],
  "volatile_range": 0,
  "retention_range": 0,
  "engine_type": 4,
  "storage_type": 2,
  "override_kylin_properties": {
    "kylin.engine.mr.config-override.mapreduce.map.memory.mb": "20480",
    "kylin.engine.mr.config-override.mapreduce.reduce.memory.mb": "20480",
    "kylin.engine.mr.config-override.mapreduce.map.cpu.vcores": "4",
    "kylin.engine.mr.config-override.mapreduce.map.reduce.vcores": "4",
    "kylin.source.hive.config-override.mapreduce.reduce.memory.mb": "20480",
    "kylin.engine.mr.config-override.mapreduce.reduce.cpu.vcores": "4",
    "kylin.engine.mr.config-override.mapreduce.map.java.opts": "-Xmx7g",
    "kylin.engine.mr.config-override.mapreduce.reduce.java.opts": "-Xmx7g",
    "kylin.source.hive.config-override.mapreduce.reduce.cpu.vcores": "2",
    "kylin.source.hive.config-override.mapreduce.map.cpu.vcores": "2",
    "kylin.source.hive.config-override.mapreduce.map.memory.mb": "20480"
  },
  "cuboid_black_list": [],
  "parent_forward": 3,
  "mandatory_dimension_set_list": [],
  "snapshot_table_desc_list": []
}

Thanks,
David
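
A note on the null check quoted in the question: that guard only covers a null column value. If the row array itself were null, the array access on the line before the guard would already throw. Below is a minimal sketch of that distinction, not Kylin's code. The class and method names (RowKeyHashSketch, hashColumn, hashRowKey) are hypothetical, and the JDK's MD5 MessageDigest stands in for the hasher `hc` from the snippet, which is an assumption made only to keep the example self-contained. It does not establish what is actually null at FactDistinctColumnsMapper.java:385.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch; names are illustrative and not taken from Kylin.
class RowKeyHashSketch {

    // Hashes one column value. MD5 from the JDK is an assumed stand-in
    // for the hasher used in the quoted snippet.
    static long hashColumn(String colValue) {
        if (colValue == null) {
            colValue = "0"; // same default as in the quoted snippet
        }
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] bytes = md.digest(colValue.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(bytes).getLong(); // first 8 bytes of the digest
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hashes the row key columns of one row. The array access row[...]
    // happens before any null check on the column value.
    static long[] hashRowKey(String[] row, int[] rowkeyColIndex) {
        long[] hashes = new long[rowkeyColIndex.length];
        for (int i = 0; i < rowkeyColIndex.length; i++) {
            hashes[i] = hashColumn(row[rowkeyColIndex[i]]);
        }
        return hashes;
    }

    public static void main(String[] args) {
        int[] rowkeyColIndex = {0};

        // Null column value: covered by the guard, hashes like "0".
        long[] h = hashRowKey(new String[] {null, "12", "7"}, rowkeyColIndex);
        System.out.println(h[0] == hashColumn("0"));

        // Null row: not covered by the guard, the array access throws.
        try {
            hashRowKey(null, rowkeyColIndex);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException for a null row");
        }
    }
}
```

In this sketch a null dimension value is harmless, while a null row fails before the guard is reached, so it may be worth checking which reference is null on the failing line rather than only the column values.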
