Todd: Thanks for the update. So Kudu is not designed to be a common storage system for long-term and streaming data/random access? Just curious.
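Before the quoted thread: the batch arithmetic Todd works through below can be sanity-checked with a short sketch. This is a hedged back-of-the-envelope calculation, not Kudu code; the ~27-byte cell size, 1,101-column width, and 7 MiB client buffer are figures quoted later in the thread.

```python
# Sanity check of the batch sizing discussed in this thread (assumed
# figures from the thread: ~27-byte string cells, 1,101 columns, and a
# 7 MiB client-side mutation buffer).
BYTES_PER_CELL = 27
NUM_COLUMNS = 1101
BUFFER_BYTES = 7 * 1024 * 1024  # the client-side 7 MiB limit

row_bytes = BYTES_PER_CELL * NUM_COLUMNS   # ~29 KiB per row ("about 27KB")
batch_500 = 500 * row_bytes                # what batch_size=500 would buffer

print(f"row size: {row_bytes / 1024:.1f} KiB")               # ~29.0 KiB
print(f"batch of 500: {batch_500 / (1024 * 1024):.1f} MiB")  # ~14.2 MiB, still >13MB

# Largest batch that stays under the buffer:
max_batch = BUFFER_BYTES // row_bytes
print(f"rows per batch that fit in 7 MiB: {max_batch}")
```

This matches Todd's suggestion to drop well below 500: with rows this wide, only a couple hundred rows fit in one 7 MiB batch.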
On Wed, May 18, 2016 at 3:38 PM, Todd Lipcon <t...@cloudera.com> wrote:
> Hm, so each of the strings is about 27 bytes, so with ~1,100 string columns
> each row is about 27KB. A batch size of 500 is therefore still >13MB. I'd
> start with something very low like 10, and work your way up. That said, this
> is definitely not in the "standard" use cases for which Kudu has been
> designed.
>
> I'd also recommend using compression and/or dictionary encoding for a table
> if you have many repeated values. Unfortunately, it's not currently possible
> to do this when creating a table using Impala.
>
> -Todd
>
> On Wed, May 18, 2016 at 10:51 AM, Abhi Basu <9000r...@gmail.com> wrote:
>> Query: describe kudu_db.chr22_kudu
>>
>> +-------------+--------+---------+
>> | name        | type   | comment |
>> +-------------+--------+---------+
>> | pos         | int    |         |
>> | id          | string |         |
>> | chrom       | string |         |
>> | ref         | string |         |
>> | alt         | string |         |
>> | qual        | string |         |
>> | filter      | string |         |
>> | info        | string |         |
>> | format_type | string |         |
>> | hg00096     | string |         |
>> | hg00097     | string |         |
>> | hg00099     | string |         |
>> | hg00100     | string |         |
>> | hg00101     | string |         |
>> | hg00102     | string |         |
>> | hg00103     | string |         |
>> | hg00104     | string |         |
>>
>> ..........
>>
>> and so on, all the way to column na20828 (string).
>>
>> Each of the hg and na columns has values like:
>>
>> | hg00096                    |
>> +----------------------------+
>> | 0|0:0.000:0.00,-5.00,-5.00 |
>> | 0|0:0.000:0.00,-5.00,-5.00 |
>> | 0|0:0.000:0.00,-5.00,-5.00 |
>> | 0|0:0.000:0.00,-5.00,-5.00 |
>> | 0|0:0.000:0.00,-5.00,-5.00 |
>> | 0|0:0.000:0.00,-5.00,-5.00 |
>> | 0|0:0.000:0.00,-5.00,-5.00 |
>> | 0|0:0.000:0.00,-5.00,-5.00 |
>> | 0|0:0.000:0.00,-5.00,-5.00 |
>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>
>> On Wed, May 18, 2016 at 10:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>> What are the types of your 1000 columns? Maybe an even smaller batch
>>> size is necessary.
>>>
>>> -Todd
>>>
>>> On Wed, May 18, 2016 at 10:41 AM, Abhi Basu <9000r...@gmail.com> wrote:
>>>> I have tried with batch_size=500 and still get the same error. For
>>>> reference, attached is info that may help diagnose.
>>>>
>>>> Error: Error while applying Kudu session.: Incomplete: not enough space
>>>> remaining in buffer for op (required 46.7K, 7.00M already used
>>>>
>>>> Config settings:
>>>> Kudu Tablet Server Block Cache Capacity: 1 GB
>>>> Kudu Tablet Server Hard Memory Limit: 16 GB
>>>>
>>>> On Wed, May 18, 2016 at 8:26 AM, William Berkeley <wdberke...@cloudera.com> wrote:
>>>>> Both options are more or less the same idea: the point is that you need
>>>>> fewer rows going in per batch so you don't go over the batch size limit.
>>>>> Follow what Todd said, as he explained it more clearly and suggested a
>>>>> better way.
>>>>>
>>>>> -Will
>>>>>
>>>>> On Wed, May 18, 2016 at 10:45 AM, Abhi Basu <9000r...@gmail.com> wrote:
>>>>>> Thanks for the updates. I will give both options a try and report
>>>>>> back.
>>>>>>
>>>>>> If you are interested in testing with such datasets, I can help.
>>>>>>
>>>>>> Thanks,
>>>>>> Abhi
>>>>>>
>>>>>> On Wed, May 18, 2016 at 6:25 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>>>>>> Hi Abhi,
>>>>>>>
>>>>>>> Will is right that the error is client-side, and it's probably
>>>>>>> happening because your rows are so wide. Impala typically batches
>>>>>>> 1000 rows at a time when inserting into Kudu, so if each of your
>>>>>>> rows is 7-8KB, that will overflow the max buffer size that Will
>>>>>>> mentioned. This seems quite probable if your data is 1000 columns
>>>>>>> of doubles or int64s (which are 8 bytes each).
>>>>>>>
>>>>>>> I don't think his suggested workaround will help, but you can try
>>>>>>> running 'set batch_size=500' before running the create table or
>>>>>>> insert query.
>>>>>>>
>>>>>>> In terms of max supported columns, most of the workloads we are
>>>>>>> focusing on are more like typical data-warehouse tables, on the
>>>>>>> order of a couple hundred columns. Crossing into the 1000+ range
>>>>>>> enters "uncharted territory", where it's much more likely you'll
>>>>>>> hit problems like this and quite possibly others as well. I'll be
>>>>>>> interested to hear your experiences, though you should probably be
>>>>>>> prepared for some rough edges.
>>>>>>>
>>>>>>> -Todd
>>>>>>>
>>>>>>> On Tue, May 17, 2016 at 8:32 PM, William Berkeley <wdberke...@cloudera.com> wrote:
>>>>>>>> Hi Abhi.
>>>>>>>>
>>>>>>>> I believe that error is actually coming from the client, not the
>>>>>>>> server. See e.g.
>>>>>>>> https://github.com/apache/incubator-kudu/blob/master/src/kudu/client/batcher.cc#L787
>>>>>>>> (note that link is to the master branch, not the exact release you
>>>>>>>> are using).
>>>>>>>>
>>>>>>>> If you look around there, you'll see that the max is set by
>>>>>>>> something called max_buffer_size_, which appears to be hardcoded to
>>>>>>>> 7 * 1024 * 1024 bytes = 7MiB (and this is consistent with
>>>>>>>> 6.96 + 0.0467 > 7).
>>>>>>>>
>>>>>>>> I think the simple workaround would be to split the CTAS into a
>>>>>>>> CTAS plus an insert-as-select. Pick a condition that bipartitions
>>>>>>>> the table, so you don't get errors from trying to insert rows
>>>>>>>> twice.
>>>>>>>>
>>>>>>>> -Will
>>>>>>>>
>>>>>>>> On Tue, May 17, 2016 at 4:45 PM, Abhi Basu <9000r...@gmail.com> wrote:
>>>>>>>>> What is the limit on the number of columns in Kudu?
>>>>>>>>>
>>>>>>>>> I am using the 1000 Genomes dataset, specifically the chr22 table,
>>>>>>>>> which has 500,000 rows x 1101 columns. This table has been built
>>>>>>>>> in Impala/HDFS. I am trying to create a new Kudu table as a select
>>>>>>>>> from that table.
>>>>>>>>> I get the following error:
>>>>>>>>>
>>>>>>>>> Error while applying Kudu session.: Incomplete: not enough space
>>>>>>>>> remaining in buffer for op (required 46.7K, 6.96M already used
>>>>>>>>>
>>>>>>>>> When looking at http://pcsd-cdh2.local.com:8051/mem-trackers, I
>>>>>>>>> see the following. What configuration needs to be tweaked?
>>>>>>>>>
>>>>>>>>> Memory usage by subsystem:
>>>>>>>>>
>>>>>>>>> Id | Parent | Limit | Current Consumption | Peak Consumption
>>>>>>>>> root | none | 50.12G | 4.97M | 6.08M
>>>>>>>>> block_cache-sharded_lru_cache | root | none | 937.9K | 937.9K
>>>>>>>>> code_cache-sharded_lru_cache | root | none | 1B | 1B
>>>>>>>>> server | root | none | 2.3K | 201.4K
>>>>>>>>> tablet-00000000000000000000000000000000 | server | none | 530B | 200.1K
>>>>>>>>> MemRowSet-6 | tablet-00000000000000000000000000000000 | none | 265B | 265B
>>>>>>>>> txn_tracker | tablet-00000000000000000000000000000000 | 64.00M | 0B | 28.5K
>>>>>>>>> DeltaMemStores | tablet-00000000000000000000000000000000 | none | 265B | 87.8K
>>>>>>>>> log_block_manager | server | none | 1.8K | 2.7K
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> --
>>>>>>>>> Abhi Basu
>>>>>>>
>>>>>>> --
>>>>>>> Todd Lipcon
>>>>>>> Software Engineer, Cloudera

--
Abhi Basu
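For readers following the numbers: the failure is the client-side buffer check Will points at in batcher.cc. Below is a minimal sketch of that condition; the function and variable names are illustrative, not Kudu's actual client API, and only the 7 MiB limit and the figures from the error message come from the thread.

```python
# Hedged sketch of the client-side check behind the error
# "Incomplete: not enough space remaining in buffer for op".
# Modeled loosely on the max_buffer_size_ logic in
# kudu/client/batcher.cc; names here are illustrative.
MAX_BUFFER_SIZE = 7 * 1024 * 1024  # hardcoded 7 MiB in the client

def try_add_op(buffer_used: int, op_size: int) -> tuple[bool, int]:
    """Return (accepted, new_buffer_used) for one buffered write op."""
    if buffer_used + op_size > MAX_BUFFER_SIZE:
        # This branch corresponds to the "not enough space remaining
        # in buffer for op" error seen in the thread.
        return False, buffer_used
    return True, buffer_used + op_size

# The figures from the error message: a 46.7K op with 6.96M already used.
used = int(6.96 * 1024 * 1024)
op = int(46.7 * 1024)
ok, _ = try_add_op(used, op)
print("op accepted?", ok)  # 6.96M + 46.7K > 7M, so the op is rejected
```

This also explains why shrinking batch_size helps: fewer buffered rows per flush keeps buffer_used comfortably below the 7 MiB ceiling even with very wide rows.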