Re: CDH 5.5 - Kudu error not enough space remaining in buffer for op

2016-05-18 Thread Abhi Basu
I have tried with batch_size=500 and still get the same error. For your
reference, I have attached info that may help diagnose the issue.

Error: Error while applying Kudu session.: Incomplete: not enough space
remaining in buffer for op (required 46.7K, 7.00M already used


Config settings:

Kudu Tablet Server Block Cache Capacity   1 GB
Kudu Tablet Server Hard Memory Limit  16 GB


On Wed, May 18, 2016 at 8:26 AM, William Berkeley 
wrote:

> Both options are more or less the same idea: the point is that you need fewer
> rows going in per batch so you don't go over the batch size limit. Follow
> what Todd said as he explained it more clearly and suggested a better way.
>
> -Will
>
> On Wed, May 18, 2016 at 10:45 AM, Abhi Basu <9000r...@gmail.com> wrote:
>
>> Thanks for the updates. I will give both options a try and report back.
>>
>> If you are interested in testing with such datasets, I can help.
>>
>> Thanks,
>>
>> Abhi
>>
>> On Wed, May 18, 2016 at 6:25 AM, Todd Lipcon  wrote:
>>
>>> Hi Abhi,
>>>
>>> Will is right that the error is client-side, and probably happening
>>> because your rows are so wide. Impala typically will batch 1000 rows at a
>>> time when inserting into Kudu, so if each of your rows is 7-8KB, that will
>>> overflow the max buffer size that Will mentioned. This seems quite probable
>>> if your data is 1000 columns of doubles or int64s (which are 8 bytes each).
>>>
>>> I don't think his suggested workaround will help, but you can try
>>> running 'set batch_size=500' before running the create table or insert
>>> query.
>>>
>>> In terms of max supported columns, most of the workloads we are focusing
>>> on are more like typical data-warehouse tables, on the order of a couple
>>> hundred columns. Crossing into the 1000+ range enters "uncharted territory"
>>> where it's much more likely you'll hit problems like this and quite
>>> possibly others as well. Will be interested to hear your experiences,
>>> though you should probably be prepared for some rough edges.
>>>
>>> -Todd
>>>
>>> On Tue, May 17, 2016 at 8:32 PM, William Berkeley <
>>> wdberke...@cloudera.com> wrote:
>>>
 Hi Abhi.

 I believe that error is actually coming from the client, not the
 server. See, e.g.,
 https://github.com/apache/incubator-kudu/blob/master/src/kudu/client/batcher.cc#L787
 (NB that link is to the master branch, not the exact release you are using).

 If you look around there, you'll see that the max is set by something
 called max_buffer_size_, which appears to be hardcoded to 7 * 1024 * 1024
 bytes = 7MiB (and this is consistent with 6.96 + 0.0467 > 7).

 I think the simple workaround would be to do the CTAS as a CTAS +
 insert as select. Pick a condition that bipartitions the table, so you
 don't get errors trying to double insert rows.

 -Will

 On Tue, May 17, 2016 at 4:45 PM, Abhi Basu <9000r...@gmail.com> wrote:

> What is the limit of columns in Kudu?
>
> I am using the 1000 Genomes dataset, specifically the chr22 table, which has
> 500,000 rows x 1101 columns. This table has been built in Impala/HDFS. I am
> trying to create a new Kudu table as select from that table. I get the
> following error:
>
> Error while applying Kudu session.: Incomplete: not enough space
> remaining in buffer for op (required 46.7K, 6.96M already used
>
> When looking at http://pcsd-cdh2.local.com:8051/mem-trackers, I see
> the following. What configuration needs to be tweaked?
>
>
> Memory usage by subsystem
>
> Id                             Parent    Limit    Current Consumption  Peak Consumption
> root                           none      50.12G   4.97M                6.08M
> block_cache-sharded_lru_cache  root      none     937.9K               937.9K
> code_cache-sharded_lru_cache   root      none     1B                   1B
> server                         root      none     2.3K                 201.4K
> tablet-                        server    none     530B                 200.1K
> MemRowSet-6                    tablet-   none     265B                 265B
> txn_tracker                    tablet-   64.00M   0B                   28.5K
> DeltaMemStores                 tablet-   none     265B                 87.8K
> log_block_manager              server    none     1.8K                 2.7K
>
> Thanks,
> --
> Abhi Basu
>


>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>>
>> --
>> Abhi Basu
>>
>
>


-- 
Abhi Basu
Attached query profile (excerpt):

  Query (id=314c0c293bf25601:284c797fc6e9409d)
    Summary
      Session ID: 2d4c2924eae92415:db543427a839ba9d
      Session Type: BEESWAX
      Start Time: 2016-05-18 10:36:15.681951000
      End Time: 2016-05-18 10:36:30.06357
      Query Type: DDL
      Query State: EXCEPTION
      Query Status: Error while applying Kudu session.: Incomplete: not enough space remaining in buffer for op (required 46.7K, 7.00M already used
      Impala Version: impalad version 2.6.0-IMPALA_KUDU-cdh5 RELEASE (build 82d950143cff09ee21a22c88d3f5f0d676f6bb83)
      User: root
      Connected User: root
      Delegated User:

Re: Spark on Kudu

2016-05-18 Thread Chris George
There is some code in review that needs some more refinement.
It will allow upsert/insert from a dataframe using the datasource api. It will 
also allow the creation and deletion of tables from a dataframe.
http://gerrit.cloudera.org:8080/#/c/2992/

Example usages will look something like:
http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
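
To give a rough idea (this sketch is not taken from the patch or the linked doc;
the format name and option keys are only guesses at what the datasource might
accept), writing a DataFrame to a Kudu table could look something like:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Hypothetical usage of the Kudu datasource under review. The format string
    // and option keys below are assumptions, not the committed API.
    def writeToKudu(df: DataFrame): Unit = {
      df.write
        .format("org.kududb.spark.kudu")            // assumed datasource package
        .option("kudu.master", "kudu-master:7051")  // assumed option key
        .option("kudu.table", "my_table")           // assumed option key
        .mode(SaveMode.Append)                      // insert; upsert behavior would be datasource-specific
        .save()
    }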

-Chris George


On 5/18/16, 9:45 AM, "Benjamin Kim" wrote:

Can someone tell me what the state is of this Spark work?

Also, does anyone have any sample code on how to update/insert data in Kudu 
using DataFrames?

Thanks,
Ben


On Apr 13, 2016, at 8:22 AM, Chris George wrote:

SparkSQL cannot support these types of statements but we may be able to 
implement similar functionality through the api.
-Chris

On 4/12/16, 5:19 PM, "Benjamin Kim" wrote:

It would be nice to adhere to the SQL:2003 standard for an “upsert” if it were 
to be implemented.

MERGE INTO table_name USING table_reference ON (condition)
 WHEN MATCHED THEN
 UPDATE SET column1 = value1 [, column2 = value2 ...]
 WHEN NOT MATCHED THEN
 INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])

Cheers,
Ben

On Apr 11, 2016, at 12:21 PM, Chris George wrote:

I have a wip kuduRDD that I made a few months ago. I pushed it into gerrit if 
you want to take a look. http://gerrit.cloudera.org:8080/#/c/2754/
It does push down predicates, which the existing InputFormat-based RDD does not.
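
To make "push down" concrete: the predicate is handed to Kudu and evaluated on
the tablet servers, so only matching rows come back over the wire. A rough
sketch against the Kudu Java client (package and class names are from the
incubating client as I remember them, so treat them as assumptions; the table
and column names are made up):

    import org.kududb.client.{KuduClient, KuduPredicate}
    import scala.collection.JavaConverters._

    object ScanWithPushdown {
      def main(args: Array[String]): Unit = {
        val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
        try {
          val table  = client.openTable("my_table")
          val schema = table.getSchema

          // "key > 100" is evaluated server-side instead of filtering rows in Spark.
          val pred = KuduPredicate.newComparisonPredicate(
            schema.getColumn("key"), KuduPredicate.ComparisonOp.GREATER, 100L)

          val scanner = client.newScannerBuilder(table)
            .setProjectedColumnNames(Seq("key", "value").asJava)
            .addPredicate(pred)
            .build()

          while (scanner.hasMoreRows) {
            val rows = scanner.nextRows()
            while (rows.hasNext) {
              val row = rows.next()
              println(s"${row.getLong("key")} -> ${row.getDouble("value")}")
            }
          }
        } finally {
          client.shutdown()
        }
      }
    }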

Within the next two weeks I’m planning to implement a datasource for spark that 
will have pushdown predicates and insertion/update functionality (I need to look 
more at the cassandra and hbase datasources for the best way to do this). I agree 
that server-side upsert would be helpful.
Having a datasource would give us useful data frames and also make spark sql 
usable for kudu.
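
For example (again, the format string and option keys here are assumptions about
what the datasource might look like, not the committed API), the read side could
be something like:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object KuduSparkSqlSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kudu-sparksql"))
        val sqlContext = new SQLContext(sc)

        // Expose a Kudu table as a DataFrame and query it with Spark SQL.
        val df = sqlContext.read
          .format("org.kududb.spark.kudu")            // assumed datasource package
          .option("kudu.master", "kudu-master:7051")  // assumed option key
          .option("kudu.table", "my_table")           // assumed option key
          .load()

        df.registerTempTable("my_table")
        sqlContext.sql("SELECT key, value FROM my_table WHERE key > 100").show()

        sc.stop()
      }
    }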

My reasoning for having a spark datasource and not using Impala is: 1. We have 
had trouble getting impala to run fast with high concurrency when compared to 
spark 2. We interact with datasources which do not integrate with impala. 3. We 
have custom sql query planners for extended sql functionality.

-Chris George


On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" wrote:

You guys make a convincing point, although on the upsert side we'll need more 
support from the servers. Right now all you can do is an INSERT then, if you 
get a dup key, do an UPDATE. I guess we could at least add an API on the client 
side that would manage it, but it wouldn't be atomic.
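
Something like this, roughly (a sketch only, using the Kudu Java client from
Scala; the table, columns, and master address are made up, and the error check
is simplified):

    import org.kududb.client.{KuduClient, SessionConfiguration}

    object InsertOrUpdateSketch {
      def main(args: Array[String]): Unit = {
        val client  = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
        val table   = client.openTable("my_table")
        val session = client.newSession()
        session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC)

        val insert = table.newInsert()
        insert.getRow.addLong("key", 42L)
        insert.getRow.addDouble("value", 3.14)

        val resp = session.apply(insert)
        if (resp.hasRowError) {
          // Simplification: assume the only failure mode is "row already present"
          // and redo the write as an UPDATE. A real helper would inspect the error.
          val update = table.newUpdate()
          update.getRow.addLong("key", 42L)
          update.getRow.addDouble("value", 3.14)
          session.apply(update)
        }

        session.close()
        client.shutdown()
      }
    }

And as noted above, the insert and the update are two separate operations, so
another writer can slip in between them.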

J-D

On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra wrote:
It's pretty simple, actually.  I need to support versioned datasets in a Spark 
SQL environment.  Instead of a hack on top of a Parquet data store, I'm hoping 
(among other reasons) to be able to use Kudu's write and timestamp-based read 
operations to support not only appending data, but also updating existing data, 
and even some schema migration.  The most typical use case is a dataset that is 
updated periodically (e.g., weekly or monthly) in which the preliminary 
data in the previous window (week or month) is updated with values that are 
expected to remain unchanged from then on, and a new set of preliminary values 
for the current window need to be added/appended.

Using Kudu's Java API and developing additional functionality on top of what 
Kudu has to offer isn't too much to ask, but the ease of integration with Spark 
SQL will gate how quickly we would move to using Kudu and how seriously we'd 
look at alternatives before making that decision.

On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans wrote:
Mark,

Thanks for taking some time to reply in this thread, glad it caught the 
attention of other folks!

On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra wrote:
Do they care being able to insert into Kudu with SparkSQL

I care about insert into Kudu with Spark SQL.  I'm currently delaying a 
refactoring of some Spark SQL-oriented insert functionality while trying to 
evaluate what to expect from Kudu.  Whether Kudu does a good job supporting 
inserts with Spark SQL will be a key consideration as to whether we adopt Kudu.

I'd like to know more about why SparkSQL inserts are necessary for you. Is it 
just that you currently do it that way into some database or parquet, so with 
minimal refactoring you'd be able to use Kudu? Would re-writing those SQL lines 
into Scala and directly using the Java API's KuduSession be too much work?
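
For what it's worth, the shape of that rewrite would probably be something like
the sketch below (illustrative only; the table name, columns, and master address
are made up, and error handling is omitted):

    import org.apache.spark.sql.DataFrame
    import org.kududb.client.{KuduClient, SessionConfiguration}

    // Write each partition of a DataFrame to Kudu with a KuduSession.
    def insertIntoKudu(df: DataFrame): Unit = {
      df.foreachPartition { rows =>
        val client  = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
        val table   = client.openTable("my_table")
        val session = client.newSession()
        // Let the client batch writes in the background instead of flushing per row.
        session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND)

        rows.foreach { row =>
          val insert = table.newInsert()
          insert.getRow.addLong("key", row.getLong(0))
          insert.getRow.addDouble("value", row.getDouble(1))
          session.apply(insert)
        }

        session.close()   // flushes any buffered operations
        client.shutdown()
      }
    }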

Additionally, what do you expect to gain from using Kudu VS your current 
solution? If it's not completely clear, I'd love to help you think through it.



Re: Spark on Kudu

2016-05-18 Thread Benjamin Kim
Can someone tell me what the state is of this Spark work?

Also, does anyone have any sample code on how to update/insert data in Kudu 
using DataFrames?

Thanks,
Ben


> On Apr 13, 2016, at 8:22 AM, Chris George  wrote:
> 
> SparkSQL cannot support these types of statements but we may be able to 
> implement similar functionality through the api.
> -Chris
> 
> On 4/12/16, 5:19 PM, "Benjamin Kim" wrote:
> 
> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it 
> were to be implemented.
> 
> MERGE INTO table_name USING table_reference ON (condition)
>  WHEN MATCHED THEN
>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>  WHEN NOT MATCHED THEN
>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
> 
> Cheers,
> Ben
> 
>> On Apr 11, 2016, at 12:21 PM, Chris George wrote:
>> 
>> I have a wip kuduRDD that I made a few months ago. I pushed it into gerrit 
>> if you want to take a look. http://gerrit.cloudera.org:8080/#/c/2754/ 
>> 
>> It does push down predicates, which the existing InputFormat-based RDD does not.
>> 
>> Within the next two weeks I’m planning to implement a datasource for spark 
>> that will have pushdown predicates and insertion/update functionality (I need 
>> to look more at the cassandra and hbase datasources for the best way to do 
>> this). I agree that server-side upsert would be helpful.
>> Having a datasource would give us useful data frames and also make spark sql 
>> usable for kudu.
>> 
>> My reasoning for having a spark datasource and not using Impala is: 1. We 
>> have had trouble getting impala to run fast with high concurrency when 
>> compared to spark 2. We interact with datasources which do not integrate 
>> with impala. 3. We have custom sql query planners for extended sql 
>> functionality.
>> 
>> -Chris George
>> 
>> 
>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" wrote:
>> 
>> You guys make a convincing point, although on the upsert side we'll need 
>> more support from the servers. Right now all you can do is an INSERT then, 
>> if you get a dup key, do an UPDATE. I guess we could at least add an API on 
>> the client side that would manage it, but it wouldn't be atomic.
>> 
>> J-D
>> 
>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra wrote:
>> It's pretty simple, actually.  I need to support versioned datasets in a 
>> Spark SQL environment.  Instead of a hack on top of a Parquet data store, 
>> I'm hoping (among other reasons) to be able to use Kudu's write and 
>> timestamp-based read operations to support not only appending data, but also 
>> updating existing data, and even some schema migration.  The most typical 
>> use case is a dataset that is updated periodically (e.g., weekly or monthly) 
>> in which the preliminary data in the previous window (week or month) is 
>> updated with values that are expected to remain unchanged from then on, and 
>> a new set of preliminary values for the current window need to be 
>> added/appended.
>> 
>> Using Kudu's Java API and developing additional functionality on top of what 
>> Kudu has to offer isn't too much to ask, but the ease of integration with 
>> Spark SQL will gate how quickly we would move to using Kudu and how 
>> seriously we'd look at alternatives before making that decision. 
>> 
>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans wrote:
>> Mark,
>> 
>> Thanks for taking some time to reply in this thread, glad it caught the 
>> attention of other folks!
>> 
>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra wrote:
>> Do they care being able to insert into Kudu with SparkSQL
>> 
>> I care about insert into Kudu with Spark SQL.  I'm currently delaying a 
>> refactoring of some Spark SQL-oriented insert functionality while trying to 
>> evaluate what to expect from Kudu.  Whether Kudu does a good job supporting 
>> inserts with Spark SQL will be a key consideration as to whether we adopt 
>> Kudu.
>> 
>> I'd like to know more about why SparkSQL inserts are necessary for you. Is it 
>> just that you currently do it that way into some database or parquet, so with 
>> minimal refactoring you'd be able to use Kudu? Would re-writing those SQL 
>> lines into Scala and directly using the Java API's KuduSession be too much 
>> work?
>> 
>> Additionally, what do you expect to gain from using Kudu VS your current 
>> solution? If it's not completely clear, I'd love to help you think through 
>> it.
>>  
>> 
>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans wrote:
>> Yup, starting to get a good idea.
>> 
>> 

Re: CDH 5.5 - Kudu error not enough space remaining in buffer for op

2016-05-18 Thread Abhi Basu
Thanks for the updates. I will give both options a try and report back.

If you are interested in testing with such datasets, I can help.

Thanks,

Abhi

On Wed, May 18, 2016 at 6:25 AM, Todd Lipcon  wrote:

> Hi Abhi,
>
> Will is right that the error is client-side, and probably happening
> because your rows are so wide. Impala typically will batch 1000 rows at a
> time when inserting into Kudu, so if each of your rows is 7-8KB, that will
> overflow the max buffer size that Will mentioned. This seems quite probable
> if your data is 1000 columns of doubles or int64s (which are 8 bytes each).
>
> I don't think his suggested workaround will help, but you can try running
> 'set batch_size=500' before running the create table or insert query.
>
> In terms of max supported columns, most of the workloads we are focusing
> on are more like typical data-warehouse tables, on the order of a couple
> hundred columns. Crossing into the 1000+ range enters "uncharted territory"
> where it's much more likely you'll hit problems like this and quite
> possibly others as well. Will be interested to hear your experiences,
> though you should probably be prepared for some rough edges.
>
> -Todd
>
> On Tue, May 17, 2016 at 8:32 PM, William Berkeley wrote:
>
>> Hi Abhi.
>>
>> I believe that error is actually coming from the client, not the server.
>> See, e.g.,
>> https://github.com/apache/incubator-kudu/blob/master/src/kudu/client/batcher.cc#L787
>> (NB that link is to the master branch, not the exact release you are using).
>>
>> If you look around there, you'll see that the max is set by something
>> called max_buffer_size_, which appears to be hardcoded to 7 * 1024 * 1024
>> bytes = 7MiB (and this is consistent with 6.96 + 0.0467 > 7).
>>
>> I think the simple workaround would be to do the CTAS as a CTAS + insert
>> as select. Pick a condition that bipartitions the table, so you don't get
>> errors trying to double insert rows.
>>
>> -Will
>>
>> On Tue, May 17, 2016 at 4:45 PM, Abhi Basu <9000r...@gmail.com> wrote:
>>
>>> What is the limit of columns in Kudu?
>>>
>>> I am using the 1000 Genomes dataset, specifically the chr22 table, which has
>>> 500,000 rows x 1101 columns. This table has been built in Impala/HDFS. I am
>>> trying to create a new Kudu table as select from that table. I get the
>>> following error:
>>>
>>> Error while applying Kudu session.: Incomplete: not enough space
>>> remaining in buffer for op (required 46.7K, 6.96M already used
>>>
>>> When looking at http://pcsd-cdh2.local.com:8051/mem-trackers, I see the
>>> following. What configuration needs to be tweaked?
>>>
>>>
>>> Memory usage by subsystem
>>>
>>> Id                             Parent    Limit    Current Consumption  Peak Consumption
>>> root                           none      50.12G   4.97M                6.08M
>>> block_cache-sharded_lru_cache  root      none     937.9K               937.9K
>>> code_cache-sharded_lru_cache   root      none     1B                   1B
>>> server                         root      none     2.3K                 201.4K
>>> tablet-                        server    none     530B                 200.1K
>>> MemRowSet-6                    tablet-   none     265B                 265B
>>> txn_tracker                    tablet-   64.00M   0B                   28.5K
>>> DeltaMemStores                 tablet-   none     265B                 87.8K
>>> log_block_manager              server    none     1.8K                 2.7K
>>>
>>> Thanks,
>>> --
>>> Abhi Basu
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Abhi Basu