On Wed, May 18, 2016 at 3:42 PM, Abhi Basu <9000r...@gmail.com> wrote:
> Todd:
>
> Thanks for the update. So Kudu is not designed to be a common storage
> system for long-term and streaming data/random access? Just curious.
>
I'd say it is, but right now we are focusing on more common use cases that
one might have in a relational columnar database. Having 1000 ~30-byte
columns is a relatively rare type of table in my experience, so we haven't
focused our testing and tuning on that use case.

-Todd

> On Wed, May 18, 2016 at 3:38 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> Hm, so each of the strings is about 27 bytes, so each row is about 27KB.
>> So, a batch size of 500 is still >13MB. I'd start with something very
>> low like 10, and work your way up. That said, this is definitely not in
>> the "standard" use cases for which Kudu has been designed.
>>
>> I'd also recommend using compression and/or dictionary coding for a
>> table if you have many repeated values. Unfortunately, it's not
>> currently possible to do this when creating a table using Impala.
>>
>> -Todd
>>
>> On Wed, May 18, 2016 at 10:51 AM, Abhi Basu <9000r...@gmail.com> wrote:
>>
>>> Query: describe kudu_db.chr22_kudu
>>>
>>> +-------------+--------+---------+
>>> | name        | type   | comment |
>>> +-------------+--------+---------+
>>> | pos         | int    |         |
>>> | id          | string |         |
>>> | chrom       | string |         |
>>> | ref         | string |         |
>>> | alt         | string |         |
>>> | qual        | string |         |
>>> | filter      | string |         |
>>> | info        | string |         |
>>> | format_type | string |         |
>>> | hg00096     | string |         |
>>> | hg00097     | string |         |
>>> | hg00099     | string |         |
>>> | hg00100     | string |         |
>>> | hg00101     | string |         |
>>> | hg00102     | string |         |
>>> | hg00103     | string |         |
>>> | hg00104     | string |         |
>>>
>>> ..........
>>>
>>> and so on, all the way to column na20828 (string).
>>>
>>> Each hg and na column has values like:
>>>
>>> | hg00096                    |
>>> +----------------------------+
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>>
>>> On Wed, May 18, 2016 at 10:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>>> What are the types of your 1000 columns? Maybe an even smaller batch
>>>> size is necessary.
>>>>
>>>> -Todd
>>>>
>>>> On Wed, May 18, 2016 at 10:41 AM, Abhi Basu <9000r...@gmail.com> wrote:
>>>>
>>>>> I have tried with batch_size=500 and still get the same error. For
>>>>> your reference, attached is info that may help diagnose.
>>>>>
>>>>> Error: Error while applying Kudu session.: Incomplete: not enough
>>>>> space remaining in buffer for op (required 46.7K, 7.00M already used)
>>>>>
>>>>> Config settings:
>>>>>
>>>>> Kudu Tablet Server Block Cache Capacity: 1 GB
>>>>> Kudu Tablet Server Hard Memory Limit: 16 GB
>>>>>
>>>>> On Wed, May 18, 2016 at 8:26 AM, William Berkeley <
>>>>> wdberke...@cloudera.com> wrote:
>>>>>
>>>>>> Both options are more or less the same idea: the point is that you
>>>>>> need fewer rows going in per batch so you don't go over the batch
>>>>>> size limit. Follow what Todd said, as he explained it more clearly
>>>>>> and suggested a better way.
>>>>>>
>>>>>> -Will
>>>>>>
>>>>>> On Wed, May 18, 2016 at 10:45 AM, Abhi Basu <9000r...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the updates. I will give both options a try and report
>>>>>>> back.
>>>>>>>
>>>>>>> If you are interested in testing with such datasets, I can help.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Abhi
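A minimal sketch of the batch-size workaround discussed above, as Impala
SQL. The target table name kudu_db.chr22_kudu is taken from this thread;
the HDFS source table name hdfs_db.chr22 is hypothetical:

    -- Lower Impala's per-insert row batch so that each buffered batch of
    -- wide (~27KB) rows stays under the Kudu client's 7MiB buffer limit.
    set batch_size=10;

    -- Then run the insert (or CTAS) as usual; names are illustrative.
    insert into kudu_db.chr22_kudu
    select * from hdfs_db.chr22;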
>>>>>>>
>>>>>>> On Wed, May 18, 2016 at 6:25 AM, Todd Lipcon <t...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Abhi,
>>>>>>>>
>>>>>>>> Will is right that the error is client-side, and it is probably
>>>>>>>> happening because your rows are so wide. Impala typically batches
>>>>>>>> 1000 rows at a time when inserting into Kudu, so if each of your
>>>>>>>> rows is 7-8KB, that will overflow the max buffer size that Will
>>>>>>>> mentioned. This seems quite probable if your data is 1000 columns
>>>>>>>> of doubles or int64s (which are 8 bytes each).
>>>>>>>>
>>>>>>>> I don't think his suggested workaround will help, but you can try
>>>>>>>> running 'set batch_size=500' before running the create table or
>>>>>>>> insert query.
>>>>>>>>
>>>>>>>> In terms of the maximum supported number of columns, most of the
>>>>>>>> workloads we are focusing on are more like typical data-warehouse
>>>>>>>> tables, on the order of a couple hundred columns. Crossing into
>>>>>>>> the 1000+ range enters "uncharted territory" where it's much more
>>>>>>>> likely you'll hit problems like this, and quite possibly others as
>>>>>>>> well. I'll be interested to hear about your experiences, though
>>>>>>>> you should probably be prepared for some rough edges.
>>>>>>>>
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>> On Tue, May 17, 2016 at 8:32 PM, William Berkeley <
>>>>>>>> wdberke...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Abhi.
>>>>>>>>>
>>>>>>>>> I believe that error is actually coming from the client, not the
>>>>>>>>> server. See, e.g.,
>>>>>>>>> https://github.com/apache/incubator-kudu/blob/master/src/kudu/client/batcher.cc#L787
>>>>>>>>> (NB: that link is to the master branch, not the exact release you
>>>>>>>>> are using).
>>>>>>>>>
>>>>>>>>> If you look around there, you'll see that the max is set by
>>>>>>>>> something called max_buffer_size_, which appears to be hardcoded
>>>>>>>>> to 7 * 1024 * 1024 bytes = 7MiB (and this is consistent with
>>>>>>>>> 6.96 + 0.0467 > 7).
>>>>>>>>>
>>>>>>>>> I think the simple workaround would be to do the CTAS as a CTAS
>>>>>>>>> plus an insert-as-select. Pick a condition that bipartitions the
>>>>>>>>> table, so you don't get errors from trying to insert the same
>>>>>>>>> rows twice.
>>>>>>>>>
>>>>>>>>> -Will
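To illustrate the bipartition workaround Will describes here (note that
Todd's reply above recommends the batch_size route instead), a sketch in
Impala SQL. The split value 25000000 on pos is hypothetical, and the Kudu
DDL clauses follow current Impala syntax rather than the 2016 beta syntax:

    -- First half via CTAS: only rows below the split point.
    create table kudu_db.chr22_kudu
      primary key (pos)
      partition by hash (pos) partitions 16
      stored as kudu
    as select * from hdfs_db.chr22 where pos < 25000000;

    -- Second half via insert-as-select: the complementary predicate
    -- guarantees no row is inserted twice.
    insert into kudu_db.chr22_kudu
    select * from hdfs_db.chr22 where pos >= 25000000;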
>>>>>>>>>
>>>>>>>>> On Tue, May 17, 2016 at 4:45 PM, Abhi Basu <9000r...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> What is the limit on the number of columns in Kudu?
>>>>>>>>>>
>>>>>>>>>> I am using the 1000 Genomes dataset, specifically the chr22
>>>>>>>>>> table, which has 500,000 rows x 1101 columns. This table has
>>>>>>>>>> been built in Impala/HDFS. I am trying to create a new Kudu
>>>>>>>>>> table as a select from that table, and I get the following
>>>>>>>>>> error:
>>>>>>>>>>
>>>>>>>>>> Error while applying Kudu session.: Incomplete: not enough space
>>>>>>>>>> remaining in buffer for op (required 46.7K, 6.96M already used)
>>>>>>>>>>
>>>>>>>>>> When looking at http://pcsd-cdh2.local.com:8051/mem-trackers, I
>>>>>>>>>> see the following. What configuration needs to be tweaked?
>>>>>>>>>>
>>>>>>>>>> Memory usage by subsystem:
>>>>>>>>>>
>>>>>>>>>> Id                                       Parent                                   Limit   Current Consumption  Peak Consumption
>>>>>>>>>> root                                     none                                     50.12G  4.97M                6.08M
>>>>>>>>>> block_cache-sharded_lru_cache            root                                     none    937.9K               937.9K
>>>>>>>>>> code_cache-sharded_lru_cache             root                                     none    1B                   1B
>>>>>>>>>> server                                   root                                     none    2.3K                 201.4K
>>>>>>>>>> tablet-00000000000000000000000000000000  server                                   none    530B                 200.1K
>>>>>>>>>> MemRowSet-6                              tablet-00000000000000000000000000000000  none    265B                 265B
>>>>>>>>>> txn_tracker                              tablet-00000000000000000000000000000000  64.00M  0B                   28.5K
>>>>>>>>>> DeltaMemStores                           tablet-00000000000000000000000000000000  none    265B                 87.8K
>>>>>>>>>> log_block_manager                        server                                   none    1.8K                 2.7K
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> --
>>>>>>>>>> Abhi Basu

--
Todd Lipcon
Software Engineer, Cloudera
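On Todd's compression and dictionary-coding suggestion earlier in the
thread: it could not be expressed through Impala at the time, but later
Impala releases added per-column ENCODING and COMPRESSION attributes for
Kudu tables. A hypothetical sketch, with the table name invented and the
1101-column list abbreviated to two of the thread's genotype columns:

    -- Dictionary-encode and compress the highly repetitive genotype
    -- strings (e.g. "0|0:0.000:0.00,-5.00,-5.00"); column list abbreviated.
    create table kudu_db.chr22_dict (
      pos int,
      hg00096 string encoding dict_encoding compression lz4,
      hg00097 string encoding dict_encoding compression lz4,
      primary key (pos)
    )
    partition by hash (pos) partitions 16
    stored as kudu;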