On Wed, May 18, 2016 at 3:42 PM, Abhi Basu <9000r...@gmail.com> wrote:

> Todd:
>
> Thanks for the update. So Kudu is not designed to be a common storage
> system for long-term and streaming data/random access? Just curious.
>

I'd say it is, but right now we are focusing on more common use cases that
one might have in a relational columnar database. Having 1000 ~30-byte
columns is a relatively rare type of table in my experience, so we haven't
focused our testing and tuning on that use case.

-Todd


>
> On Wed, May 18, 2016 at 3:38 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> Hm, so each of the strings is about 27 bytes, and with ~1000 columns
>> that makes each row about 27KB. So a batch size of 500 is still >13MB.
>> I'd start with something very low, like 10, and work your way up. That
>> said, this is definitely not in the "standard" use cases for which Kudu
>> has been designed.
>>
>> I'd also recommend using compression and/or dictionary coding for a table
>> if you have many repeated values. Unfortunately, it's not currently
>> possible to do this when creating a table using Impala.
>>
>> -Todd
>>
>> On Wed, May 18, 2016 at 10:51 AM, Abhi Basu <9000r...@gmail.com> wrote:
>>
>>> Query: describe kudu_db.chr22_kudu
>>> +-------------+--------+---------+
>>> | name        | type   | comment |
>>> +-------------+--------+---------+
>>> | pos         | int    |         |
>>> | id          | string |         |
>>> | chrom       | string |         |
>>> | ref         | string |         |
>>> | alt         | string |         |
>>> | qual        | string |         |
>>> | filter      | string |         |
>>> | info        | string |         |
>>> | format_type | string |         |
>>> | hg00096     | string |         |
>>> | hg00097     | string |         |
>>> | hg00099     | string |         |
>>> | hg00100     | string |         |
>>> | hg00101     | string |         |
>>> | hg00102     | string |         |
>>> | hg00103     | string |         |
>>> | hg00104     | string |         |
>>>
>>> ..........
>>>
>>> all the way to column na20828 (string).
>>>
>>> Each of the hg and na columns has values like:
>>> | hg00096                    |
>>> +----------------------------+
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>> | 0|0:0.000:0.00,-5.00,-5.00 |
>>>
>>>
>>>
>>> On Wed, May 18, 2016 at 10:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>>> What are the types of your 1000 columns? Maybe an even smaller batch
>>>> size is necessary.
>>>>
>>>> -Todd
>>>>
>>>> On Wed, May 18, 2016 at 10:41 AM, Abhi Basu <9000r...@gmail.com> wrote:
>>>>
>>>>> I have tried with batch_size=500 and still get the same error. For
>>>>> your reference, I have attached info that may help diagnose the issue.
>>>>>
>>>>> Error: Error while applying Kudu session.: Incomplete: not enough
>>>>> space remaining in buffer for op (required 46.7K, 7.00M already used)
>>>>>
>>>>>
>>>>> Config settings:
>>>>>
>>>>> Kudu Tablet Server Block Cache Capacity   1 GB
>>>>> Kudu Tablet Server Hard Memory Limit     16 GB
>>>>>
>>>>>
>>>>> On Wed, May 18, 2016 at 8:26 AM, William Berkeley <
>>>>> wdberke...@cloudera.com> wrote:
>>>>>
>>>>>> Both options are more or less the same idea: the point is that you
>>>>>> need fewer rows going in per batch so you don't go over the buffer
>>>>>> size limit. Follow what Todd said, as he explained it more clearly
>>>>>> and suggested a better way.
>>>>>>
>>>>>> -Will
>>>>>>
>>>>>> On Wed, May 18, 2016 at 10:45 AM, Abhi Basu <9000r...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the updates. I will give both options a try and report
>>>>>>> back.
>>>>>>>
>>>>>>> If you are interested in testing with such datasets, I can help.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Abhi
>>>>>>>
>>>>>>> On Wed, May 18, 2016 at 6:25 AM, Todd Lipcon <t...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Abhi,
>>>>>>>>
>>>>>>>> Will is right that the error is client-side, and it's probably
>>>>>>>> happening because your rows are so wide. Impala typically batches
>>>>>>>> 1000 rows at a time when inserting into Kudu, so if each of your
>>>>>>>> rows is 7-8KB, that will overflow the max buffer size that Will
>>>>>>>> mentioned. This seems quite probable if your data is 1000 columns
>>>>>>>> of doubles or int64s (which are 8 bytes each).
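>>>>>>>>
>>>>>>>> As a back-of-envelope check: 1000 columns x 8 bytes is ~8KB per
>>>>>>>> row, and 1000 rows per batch x ~8KB is ~8MB, which already exceeds
>>>>>>>> the 7MiB client buffer.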
>>>>>>>>
>>>>>>>> I don't think his suggested workaround will help, but you can try
>>>>>>>> running 'set batch_size=500' before running the create table or insert
>>>>>>>> query.
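>>>>>>>>
>>>>>>>> For example, in impala-shell (a sketch; the source table name
>>>>>>>> 'chr22' is assumed, and the Kudu-specific CTAS clauses are elided):
>>>>>>>>
>>>>>>>>   set batch_size=500;
>>>>>>>>   create table kudu_db.chr22_kudu
>>>>>>>>   ...  -- primary key / distribution clauses elided
>>>>>>>>   as select * from chr22;
>>>>>>>>
>>>>>>>> The SET applies for the rest of the impala-shell session, so it
>>>>>>>> takes effect on the CTAS or insert that follows.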
>>>>>>>>
>>>>>>>> In terms of max supported columns, most of the workloads we are
>>>>>>>> focusing on are more like typical data-warehouse tables, on the
>>>>>>>> order of a couple hundred columns. Crossing into the 1000+ range
>>>>>>>> enters "uncharted territory" where it's much more likely you'll hit
>>>>>>>> problems like this, and quite possibly others as well. I'll be
>>>>>>>> interested to hear about your experiences, though you should
>>>>>>>> probably be prepared for some rough edges.
>>>>>>>>
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>> On Tue, May 17, 2016 at 8:32 PM, William Berkeley <
>>>>>>>> wdberke...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Abhi.
>>>>>>>>>
>>>>>>>>> I believe that error is actually coming from the client, not the
>>>>>>>>> server. See, e.g.,
>>>>>>>>> https://github.com/apache/incubator-kudu/blob/master/src/kudu/client/batcher.cc#L787
>>>>>>>>> (NB: that link is to the master branch, not the exact release you
>>>>>>>>> are using).
>>>>>>>>>
>>>>>>>>> If you look around there, you'll see that the max is set by
>>>>>>>>> something called max_buffer_size_, which appears to be hardcoded
>>>>>>>>> to 7 * 1024 * 1024 bytes = 7MiB (and this is consistent with
>>>>>>>>> 6.96 + 0.0467 > 7).
>>>>>>>>>
>>>>>>>>> I think the simple workaround would be to split the CTAS into a
>>>>>>>>> CTAS plus an INSERT ... SELECT. Pick a condition that bipartitions
>>>>>>>>> the table, so you don't get errors from trying to insert rows
>>>>>>>>> twice.
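>>>>>>>>>
>>>>>>>>> For example (a sketch; splitting on 'pos' and the cutoff value are
>>>>>>>>> assumptions, and the Kudu-specific CTAS clauses are elided):
>>>>>>>>>
>>>>>>>>>   create table chr22_kudu ... as
>>>>>>>>>   select * from chr22 where pos < 25000000;
>>>>>>>>>
>>>>>>>>>   insert into chr22_kudu
>>>>>>>>>   select * from chr22 where pos >= 25000000;
>>>>>>>>>
>>>>>>>>> The two predicates are disjoint and together cover every row, so
>>>>>>>>> no row is inserted twice.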
>>>>>>>>>
>>>>>>>>> -Will
>>>>>>>>>
>>>>>>>>> On Tue, May 17, 2016 at 4:45 PM, Abhi Basu <9000r...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> What is the limit of columns in Kudu?
>>>>>>>>>>
>>>>>>>>>> I am using the 1000 Genomes dataset, specifically the chr22
>>>>>>>>>> table, which has 500,000 rows x 1101 columns. This table has been
>>>>>>>>>> built in Impala/HDFS. I am trying to create a new Kudu table as a
>>>>>>>>>> select from that table, and I get the following error:
>>>>>>>>>>
>>>>>>>>>> Error while applying Kudu session.: Incomplete: not enough space
>>>>>>>>>> remaining in buffer for op (required 46.7K, 6.96M already used)
>>>>>>>>>>
>>>>>>>>>> When looking at http://pcsd-cdh2.local.com:8051/mem-trackers, I
>>>>>>>>>> see the following. What configuration needs to be tweaked?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Memory usage by subsystem:
>>>>>>>>>>
>>>>>>>>>> Id                                       Parent                                   Limit   Current consumption  Peak consumption
>>>>>>>>>> root                                     none                                     50.12G  4.97M                6.08M
>>>>>>>>>> block_cache-sharded_lru_cache            root                                     none    937.9K               937.9K
>>>>>>>>>> code_cache-sharded_lru_cache             root                                     none    1B                   1B
>>>>>>>>>> server                                   root                                     none    2.3K                 201.4K
>>>>>>>>>> tablet-00000000000000000000000000000000  server                                   none    530B                 200.1K
>>>>>>>>>> MemRowSet-6                              tablet-00000000000000000000000000000000  none    265B                 265B
>>>>>>>>>> txn_tracker                              tablet-00000000000000000000000000000000  64.00M  0B                   28.5K
>>>>>>>>>> DeltaMemStores                           tablet-00000000000000000000000000000000  none    265B                 87.8K
>>>>>>>>>> log_block_manager                        server                                   none    1.8K                 2.7K
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> --
>>>>>>>>>> Abhi Basu
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Todd Lipcon
>>>>>>>> Software Engineer, Cloudera
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Abhi Basu
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Abhi Basu
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>>
>>>
>>> --
>>> Abhi Basu
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> Abhi Basu
>



-- 
Todd Lipcon
Software Engineer, Cloudera
