Hi Alexander,

That makes sense. Using S3 for cube build and storage is required for a cloud Hadoop environment.

I tried to reproduce this problem. I created an EMR cluster with S3 as the HBase storage and, in kylin.properties, set "kylin.env.hdfs-working-dir" and "kylin.storage.hbase.cluster-fs" to the S3 bucket. But in the "Convert Cuboid Data to HFile" step, Kylin still writes to the local HDFS. Did you modify core-site.xml to make S3 the default FS?

2017-08-10 22:53 GMT+08:00 Alexander Sterligov <[email protected]>:

> Yes, I worked around this problem that way and it works.
>
> One problem with this solution is that I have to use a pretty large HDFS and
> it's expensive. I also have to garbage collect it manually, because the data is
> not moved to S3, but copied. The Kylin cleanup job doesn't work for it, because
> the main metadata folder is on S3. So it would be really nice to put everything
> on S3.
>
> Another problem is that I had to raise the HBase RPC timeout, because bulk
> loading from HDFS takes a long time. That was not trivial. 3 minutes works well,
> but with the drawback of queries or metadata writes hanging for 3 minutes if
> something bad happens. But that's a rare event.
>
> On Aug 10, 2017 at 17:42, "ShaoFeng Shi" <[email protected]> wrote:
>
>> How about leaving "kylin.hbase.cluster.fs" empty? This property is
>> for two-cluster deployment (one Hadoop for cube build, the other for
>> query).
>>
>> When it is empty, the HFile will be written to the default FS (HDFS in EMR), and
>> then loaded into HBase. I'm not sure whether EMR HBase (using S3 as storage)
>> can bulk load files from HDFS or not. If it can, that would be great, as the
>> write performance of HDFS would be better than S3.
>>
>> 2017-08-10 22:29 GMT+08:00 Alexander Sterligov <[email protected]>:
>>
>>> I also thought about it, but no, it's not consistency.
>>>
>>> Consistent view is enabled. I use the same S3 for my own map-reduce jobs
>>> and it's ok.
>>>
>>> I also checked whether it lost consistency (emrfs diff). No problems.
>>>
>>> In case of inconsistency, S3 files disappear right after they are
>>> written and appear some time later. The HFiles didn't appear after a day, but
>>> _template is there.
>>>
>>> It's 100% reproducible. I think I'll investigate this problem by running
>>> the conversion job manually.
>>>
>>> On Aug 10, 2017 at 17:18, "ShaoFeng Shi" <[email protected]> wrote:
>>>
>>>> Did you enable the Consistent View? This article explains the challenge
>>>> when using S3 directly for ETL processes:
>>>> https://aws.amazon.com/cn/blogs/big-data/ensuring-consistency-when-using-amazon-s3-and-amazon-elastic-mapreduce-for-etl-workflows/
>>>>
>>>> 2017-08-09 18:19 GMT+08:00 Alexander Sterligov <[email protected]>:
>>>>
>>>>> Yes, it's empty. Also I see this message in the log:
>>>>>
>>>>> 2017-08-09 09:02:35,947 WARN [Job 1e436685-7102-4621-a4cb-6472b866126d-7608] mapreduce.LoadIncrementalHFiles:234 : Skipping non-directory s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile/_SUCCESS
>>>>> 2017-08-09 09:02:36,009 WARN [Job 1e436685-7102-4621-a4cb-6472b866126d-7608] mapreduce.LoadIncrementalHFiles:252 : Skipping non-file FileStatusExt{path=s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile/_temporary/1; isDirectory=true; modification_time=0; access_time=0; owner=; group=; permission=rwxrwxrwx; isSymlink=false}
>>>>> 2017-08-09 09:02:36,014 WARN [Job 1e436685-7102-4621-a4cb-6472b866126d-7608] mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did not find any files to load in directory s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile. Does it contain files in subdirectories that correspond to column family names?
>>>>>
>>>>> On Wed, Aug 9, 2017 at 1:15 PM, ShaoFeng Shi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> The HFile will be moved to the HBase data folder when the bulk load finishes.
>>>>>> Did you check whether the HTable has data?
>>>>>>
>>>>>> 2017-08-09 17:54 GMT+08:00 Alexander Sterligov <[email protected]>:
>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> I set kylin.hbase.cluster.fs to the S3 bucket where HBase lives.
>>>>>>>
>>>>>>> The step "Convert Cuboid Data to HFile" finished without errors.
>>>>>>> Statistics at the end of the job said that it had written lots of data
>>>>>>> to S3.
>>>>>>>
>>>>>>> But there are no HFiles in the kylin_metadata folder
>>>>>>> (kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/<table name>/hfile),
>>>>>>> only a _temporary folder and a _SUCCESS file.
>>>>>>>
>>>>>>> _temporary contains HFiles inside attempt folders. It looks like
>>>>>>> they were not copied from _temporary to the result dir. But there are no
>>>>>>> errors in either the Kylin log or the reducers' logs.
>>>>>>>
>>>>>>> Then loading empty HFiles produces empty segments.
>>>>>>>
>>>>>>> Is that a bug, or am I doing something wrong?

--
Best regards,

Shaofeng Shi 史少锋
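
P.S. The symptom reported at the bottom of the thread (only _temporary and _SUCCESS under the hfile output directory, with no column-family subdirectories) can be checked before the bulk load runs. Below is a minimal sketch, assuming the directory has already been listed recursively (for example with `hadoop fs -ls -R`) and the paths made relative to the hfile directory. The function name and the column family names F1 and F2 are hypothetical and used only for illustration; this is a simplified check, not the logic HBase itself uses.

```python
from typing import Iterable, List


def find_column_family_dirs(relative_paths: Iterable[str]) -> List[str]:
    """Return top-level directories that look like column-family folders.

    Paths are relative to the hfile output directory. Entries whose first
    component starts with '_' or '.' (e.g. _SUCCESS, _temporary) are ignored,
    as are entries that are not nested inside a directory.
    """
    families = set()
    for path in relative_paths:
        parts = [p for p in path.strip("/").split("/") if p]
        # A candidate HFile sits at <family>/<file>, so two components are needed.
        if len(parts) < 2:
            continue
        top = parts[0]
        if top.startswith("_") or top.startswith("."):
            continue
        families.add(top)
    return sorted(families)


# Listing shaped like the one described in the thread: nothing loadable.
broken = ["_SUCCESS", "_temporary/1/_temporary/attempt_0/F1/abc"]
# Listing with files committed into (hypothetical) column-family folders.
healthy = ["_SUCCESS", "F1/abc", "F2/def", "F1/ghi"]

if not find_column_family_dirs(broken):
    print("No column-family directories found; bulk load would find nothing to load.")
print(find_column_family_dirs(healthy))
```

An empty result for a finished "Convert Cuboid Data to HFile" step would suggest that the job output was never committed out of _temporary, which matches the warnings in the log above.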
