Re: Re: Recommended maximum amount of stored data per tablet server

2018-08-02 Thread Todd Lipcon
On Thu, Aug 2, 2018 at 4:54 PM, Quanlong Huang wrote:

> Thanks, Adar and Todd! We'd like to contribute when we can.
>
> Are there any concerns if we share the machines with HDFS DataNodes and
> YARN NodeManagers? The network bandwidth is 10 Gbps. I think it's OK if they
> don't share the same disks, e.g. 4 disks for Kudu and the other 11 disks
> for the DataNode and NodeManager, leaving enough CPU & memory for Kudu. Is that
> right?
>

That should be fine. Typically we actually recommend sharing all the disks
across all of the services. There is a trade-off between static partitioning
(exclusive access to a smaller number of disks) and dynamic sharing
(potential contention but more available resources). Unless your workload
is very latency-sensitive, I usually think it's better to have the bigger
pool of resources available, even if it needs to be shared with other systems.

One recommendation, though, is to consider using a dedicated disk for the
Kudu WAL and metadata, which can help performance, since the WAL can be
sensitive to other heavy workloads monopolizing bandwidth on the same
spindle.
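
As a concrete illustration of that layout, here is a hedged sketch of how the
relevant tablet server flags might be arranged. The paths and disk counts are
assumptions for illustration, not a tested deployment: the WAL gets its own
spindle via --fs_wal_dir, and --fs_data_dirs spans the remaining disks.

```python
# Sketch: assemble a hypothetical kudu-tserver command line that gives the
# WAL a dedicated spindle. Paths are made up for illustration.
wal_disk = "/data/0"
data_disks = ["/data/%d" % i for i in range(1, 15)]  # remaining 14 spindles

tserver_args = [
    "kudu-tserver",
    "--fs_wal_dir=%s/kudu/wal" % wal_disk,
    "--fs_data_dirs=%s" % ",".join(d + "/kudu/data" for d in data_disks),
]

# Sanity check: the WAL spindle is not shared with any data directory.
assert all(not d.startswith(wal_disk) for d in data_disks)
print(tserver_args[1])  # --fs_wal_dir=/data/0/kudu/wal
```

Recent Kudu releases also have a separate --fs_metadata_dir flag; whether it
applies depends on the version in play.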

-Todd

>
> At 2018-08-03 02:26:37, "Todd Lipcon"  wrote:
>
> +1 to what Adar said.
>
> One tension we have currently for scaling is that we don't want to scale
> individual tablets too large, because of problems like the superblock that
> Adar mentioned. However, the solution of just having more tablets is also
> not a great one, since many of our startup time problems are primarily
> affected by the number of tablets more than their size (see KUDU-38 as the
> prime, ancient, example). Additionally, having lots of tablets increases
> Raft heartbeat traffic, and you may need to dial back those heartbeat
> intervals to keep things stable.
>
> All of these things can be addressed in time and with some work. If you
> are interested in working on these areas to improve density, that would be a
> great contribution.
>
> -Todd
>
>
>
> On Thu, Aug 2, 2018 at 11:17 AM, Adar Lieber-Dembo 
> wrote:
>
>> The 8TB limit isn't a hard one, it's just a reflection of the scale
>> that Kudu developers commonly test. Beyond 8TB we can't vouch for
>> Kudu's stability and performance. For example, we know that as the
>> amount of on-disk data grows, node restart times get longer and longer
>> (see KUDU-2014 for some ideas on how to improve that). Furthermore, as
>> tablets accrue more data blocks, their superblocks become larger,
>> raising the minimum amount of I/O for any operation that rewrites a
>> superblock (such as a flush or compaction). Lastly, the tablet copy
>> protocol used in rereplication tries to copy the entire superblock in
>> one RPC message; if the superblock is too large, it'll run up against
>> the default 50 MB RPC transfer size (see src/kudu/rpc/transfer.cc).
>>
>> These examples are just off the top of my head; there may be others
>> lurking. So this goes back to what I led with: beyond the recommended
>> limit we aren't quite sure how Kudu's performance and stability are
>> affected.
>>
>> All that said, you're welcome to try it out and report back with your
>> findings.
>>
>>
>> On Thu, Aug 2, 2018 at 7:23 AM Quanlong Huang 
>> wrote:
>> >
>> > Hi all,
>> >
>> > In the "Known Issues and Limitations" document, it's recommended
>> that the "maximum amount of stored data, post-replication and post-compression,
>> per tablet server is 8TB". How is the 8TB calculated?
>> >
>> > We have some machines, each with 15 * 4TB spinning disk drives, 256GB
>> RAM, and 48 CPU cores. Does that mean the other 52 TB (= 15 * 4 - 8) of space
>> should be left for other systems? We'd prefer to make the machine
>> dedicated to Kudu. Can the tablet server leverage the whole space efficiently?
>> >
>> > Thanks,
>> > Quanlong
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
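
To put rough numbers on Adar's superblock point quoted above: the tablet copy
path ships the whole superblock in a single RPC, so the number of data blocks a
tablet can hold is bounded by the default 50 MiB transfer size. The
bytes-per-block figure below is an assumed illustration, not a measured Kudu
constant.

```python
# Back-of-the-envelope bound on superblock size vs. the RPC transfer cap.
RPC_MAX_BYTES = 50 * 1024 * 1024   # default RPC transfer size (50 MiB)
BYTES_PER_BLOCK_ENTRY = 80         # assumed protobuf bytes per block record

def max_blocks_under_cap(entry_bytes=BYTES_PER_BLOCK_ENTRY):
    """Roughly how many block entries fit in one superblock RPC."""
    return RPC_MAX_BYTES // entry_bytes

print(max_blocks_under_cap())  # 655360 with the assumed entry size
```

Past that point, a tablet copy would trip the limit in
src/kudu/rpc/transfer.cc unless the cap is raised.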



Re: poor performance on insert into range partitions and scaling

2018-08-02 Thread farkas
Found the reason in the profiles. It is again about the exchange; the noshuffle
hint helped a lot. When you do CREATE TABLE parq AS SELECT * FROM kudu180M,
Impala scans Kudu and writes directly to HDFS. When you do INSERT INTO parq
PARTITION (year) SELECT * FROM kudu180M WHERE partition = 2018, it only reads
45M rows, but the exchange hashes the rows, so it is slower.
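
A toy model of why that exchange serializes the write (the writer count and
hash are illustrative, not Impala's actual plan): hashing on the partition
column sends every row with the same year to the same writer.

```python
# Sketch: hash repartitioning before a partitioned insert. With a single
# distinct partition-key value, one writer receives every row.
NUM_WRITERS = 10

def writer_for(year):
    return hash(year) % NUM_WRITERS

rows = [2018] * 1000                      # WHERE partition = 2018
targets = {writer_for(y) for y in rows}
print(len(targets))  # 1 -- all rows funnel through a single writer
```

Skipping the exchange with a noshuffle hint lets each scanning node write its
own rows, which matches the speedup observed here.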

On 2018/07/31 20:59:28, Mike Percy  wrote: 
> Can you post a query profile from Impala for one of the slow insert jobs?
> 
> Mike
> 
> On Tue, Jul 31, 2018 at 12:56 PM Tomas Farkas  wrote:
> 
> > Hi,
> > wanted to share with you the preliminary results of my Kudu testing on AWS.
> > I created a set of performance tests to evaluate different instance
> > types in AWS and different configurations (Kudu separated from Impala; Kudu
> > and Impala on the same nodes) and different drive settings (st1 and gp2).
> > Here are my results:
> >
> > I was quite disappointed by the inserts in Step 3; see the attached SQLs.
> >
> > Any hints, ideas, why this does not scale?
> > Thanks
> >
> >
> >
> 


Re: swap data in Kudu table

2018-08-02 Thread farkas
Thanks Boris for a great article!
Tomas

On 2018/07/25 19:56:10, Boris Tyukin  wrote: 
> Hi guys,
> 
> thanks again for your help!  I just blogged about this
> https://boristyukin.com/how-to-hot-swap-apache-kudu-tables-with-apache-impala/
> 
> BTW, I did not have to invalidate or refresh metadata - it just worked with
> the ALTER TABLE TBLPROPERTIES idea. We have one Kudu master on our dev cluster,
> so I'm not sure if it is because of that, but the Impala/Kudu docs also do not
> mention anything about a metadata refresh. It looks like Impala keeps a
> reference to the UUID of the Kudu table, not its actual name.
> 
> One thing I am still puzzled by is how Impala was able to finish my
> long-running SELECT statement that I had kicked off right before the swap.
> I did not get any error messages, and I could clearly see that the Kudu tables
> were getting renamed and dropped while the query was still running in a
> different session; it completed 10 seconds after the swap. This is still a
> mystery to me. The only explanation I have is that the data was already in
> the Impala daemons' memory and the query did not need the Kudu tables by then.
> 
> Boris
> 
> 
> 
> On Fri, Feb 23, 2018 at 5:13 PM Boris Tyukin  wrote:
> 
> > you are guys are awesome, thanks!
> >
> > Todd, I like ALTER TABLE TBLPROPERTIES idea - will test it next week.
> > Views might work as well but for a number of reasons want to keep it as my
> > last resort :)
> >
> > On Fri, Feb 23, 2018 at 4:32 PM, Todd Lipcon  wrote:
> >
> >> A couple other ideas from the Impala side:
> >>
> >> - could you use a view and alter the view to point to a different table?
> >> Then all readers would be pointed at the view, and security permissions
> >> could be on that view rather than the underlying tables?
> >>
> >> - I think if you use an external table in Impala you could use an ALTER
> >> TABLE TBLPROPERTIES ... statement to change kudu.table_name to point to a
> >> different table. Then issue a 'refresh' on the impalads so that they load
> >> the new metadata. Subsequent queries would hit the new underlying Kudu
> >> table, but permissions and stats would be unchanged.
> >>
> >> -Todd
> >>
> >> On Fri, Feb 23, 2018 at 1:16 PM, Mike Percy  wrote:
> >>
> >>> Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk
> >>> load capabilities or staging abilities. Theoretically renaming a partition
> >>> atomically shouldn't be that hard to implement, since it's just a master
> >>> metadata operation which can be done atomically, but it's not yet
> >>> implemented.
> >>>
> >>> There is a JIRA to track a generic bulk load API here:
> >>> https://issues.apache.org/jira/browse/KUDU-1370
> >>>
> >>> Since I couldn't find anything to track the specific features you
> >>> mentioned, I just filed the following improvement JIRAs so we can track 
> >>> it:
> >>>
> >>>- KUDU-2326: Support atomic bulk load operation
> >>>
> >>>- KUDU-2327: Support atomic swap of tables or partitions
> >>>
> >>>
> >>> Mike
> >>>
> >>> On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin 
> >>> wrote:
> >>>
>  Hello,
> 
>  I am trying to figure out the best and safest way to swap data in a
>  production Kudu table with data from a staging table.
> 
>  Basically, once in a while we need to perform a full reload of some
>  tables (once in a few months). These tables are pretty large with 
>  billions
>  of rows and we want to minimize the risk and downtime for users if
>  something bad happens in the middle of that process.
> 
>  With Hive and Impala on HDFS, we can use a very cool handy command LOAD
>  DATA INPATH. We can prepare data for reload in a staging table upfront 
>  and
>  this process might take many hours. Once staging table is ready, we can
>  issue LOAD DATA INPATH command which will move underlying HDFS files to a
>  production table - this operation is almost instant and the very last 
>  step
>  in our pipeline.
> 
>  Alternatively, we can swap partitions using ALTER TABLE EXCHANGE
>  PARTITION command.
> 
>  Now with Kudu, I cannot seem to find a good strategy. The only thing
>  came to my mind is to drop the production table and rename a staging 
>  table
>  to production table as the last step of the job, but in this case we are
>  going to lose statistics and security permissions.
> 
>  Any other ideas?
> 
>  Thanks!
>  Boris
> 
> >>>
> >>>
> >>
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
> >>
> >
> >
> 
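
The external-table repointing Todd describes can be sketched as plain SQL
strings. The table names here are hypothetical, and the statements are only
assembled, not run against a cluster:

```python
# Sketch: build the two statements for swapping an Impala external table's
# underlying Kudu table. Names are illustrative assumptions.
def build_swap_statements(impala_table, staging_kudu_table):
    return [
        "ALTER TABLE %s SET TBLPROPERTIES("
        "'kudu.table_name' = '%s')" % (impala_table, staging_kudu_table),
        "REFRESH %s" % impala_table,
    ]

stmts = build_swap_statements("prod_events", "impala::default.staging_events")
for s in stmts:
    print(s)
```

Boris reported above that the swap worked even without a refresh, so the
REFRESH step may be belt-and-braces depending on the setup.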


Re:Re: Re: Re: Why RowSet size is much smaller than flush_threshold_mb

2018-08-02 Thread Quanlong Huang
No, I failed to tune other flags... That's why I started this thread...


I understand it's a trade-off whether to expose the design docs. Not exposing
them keeps the documentation clearer. The downside is that users may bother you
more when they encounter problems, since there are no answers they can find
themselves. However, that's not a problem, since you guys are quite helpful :)


Thanks,
Quanlong


At 2018-08-02 10:18:00,"Todd Lipcon"  wrote:

On Wed, Aug 1, 2018 at 4:52 PM, Quanlong Huang  wrote:

In my experience, when I find that performance is below my expectations, I tend
to tune the flags listed in
https://kudu.apache.org/docs/configuration_reference.html, which requires a clear
understanding of Kudu internals. Maybe we can add the link there?




Any particular flags that you found you had to tune? I almost never advise 
tuning anything other than the number of maintenance threads. If you have some 
good guidance on how tuning those flags can improve performance, maybe we can 
consider changing the defaults or giving some more prescriptive advice?


I'm a little nervous that saying "here are all the internals, and here are 100 
config flags to study" will scare users more than help them :)


-Todd
 

At 2018-08-02 01:06:40,"Todd Lipcon"  wrote:

On Wed, Aug 1, 2018 at 6:28 AM, Quanlong Huang  wrote:

Hi Todd and William,


I really appreciate your help, and I'm sorry for my late reply. I was going to
respond with some follow-up questions but was assigned to focus on some other
work... Now I'm back to this.


The design docs are really helpful. Now I understand flushes and compactions.
I think we could add a link to these design docs on the Kudu documentation page,
so users who want to dig deeper can learn more about Kudu internals.


Personally, since starting the project, I have had the philosophy that the 
user-facing documentation should remain simple and not discuss internals too 
much. I found in some other open source projects that there isn't a clear 
difference between user documentation and developer documentation, and users 
can easily get confused by all of the internal details. Or, users may start to 
believe that Kudu is very complex and they need to understand knapsack problem 
approximation algorithms in order to operate it. So, normally we try to avoid 
exposing too much of the details.


That said, I think it is a good idea to add a small note in the documentation 
somewhere that links to the design docs, maybe with some sentence explaining 
that understanding internals is not necessary to operate Kudu, but that expert 
users may find the internal design useful as a reference? I would be curious to 
hear what other users think about how best to make this trade-off.


-Todd
 
At 2018-06-15 23:41:17, "Todd Lipcon"  wrote:

Also, keep in mind that when the MRS flushes, it flushes into a bunch of 
separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by 
default). This is set by --budgeted_compaction_target_rowset_size


However, increasing this size isn't likely to decrease the number of 
compactions, because each of these 32MB rowsets is non-overlapping. In other 
words, if your MRS contains rows A-Z, the output RowSets will include [A-C], 
[D-G], [H-P], [Q-Z]. Since these ranges do not overlap, they will never need to 
be compacted with each other. The net result, here, is that compaction becomes 
more fine-grained and only needs to operate on sub-ranges of the tablet where 
there is a lot of overlap.


You can read more about this in docs/design-docs/compaction-policy.md, in 
particular the section "Limiting RowSet Sizes"


Hope that helps
-Todd
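
The rolling behaviour described above can be sketched in a few lines (toy key
ranges and a toy roll size, not Kudu's actual 32 MB threshold):

```python
# Sketch: flushing a sorted in-memory rowset in fixed-size chunks yields
# consecutive, non-overlapping disk rowsets.
def roll_flush(sorted_keys, rows_per_rowset):
    return [sorted_keys[i:i + rows_per_rowset]
            for i in range(0, len(sorted_keys), rows_per_rowset)]

mrs = [chr(c) for c in range(ord('A'), ord('Z') + 1)]  # MRS holds keys A..Z
rowsets = roll_flush(mrs, 7)
ranges = [(rs[0], rs[-1]) for rs in rowsets]
print(ranges)  # four disjoint ranges: A-G, H-N, O-U, V-Z

# Disjoint: each rowset's min key is above the previous rowset's max key,
# so these outputs never need compacting against one another.
assert all(ranges[i][1] < ranges[i + 1][0] for i in range(len(ranges) - 1))
```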


On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley  wrote:

The op seen in the logs is a rowset compaction, which takes existing 
diskrowsets and rewrites them. It's not a flush, which writes data in memory to 
disk, so I don't think the flush_threshold_mb is relevant. Rowset compaction is 
done to reduce the amount of overlap of rowsets in primary key space, i.e. 
reduce the number of rowsets that might need to be checked to enforce the 
primary key constraint or find a row. Having lots of rowset compaction 
indicates that rows are being written in a somewhat random order w.r.t the 
primary key order. Kudu will perform much better as writes scale when rows are 
inserted roughly in increasing order per tablet.
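
A small simulation of that effect (toy batch sizes, not Kudu's real flush path)
shows why insertion order matters: flushes of randomly ordered keys produce
rowsets whose key ranges nearly all overlap, while sequential batches produce
disjoint ones.

```python
# Sketch: compare key-range overlap between flushes of sequential vs.
# randomly ordered inserts.
import random

def flush_ranges(keys, batch):
    """Each flush covers the min..max key range of one batch of inserts."""
    return [(min(keys[i:i + batch]), max(keys[i:i + batch]))
            for i in range(0, len(keys), batch)]

def overlapping_pairs(ranges):
    return sum(1 for i, (lo1, hi1) in enumerate(ranges)
               for lo2, hi2 in ranges[i + 1:]
               if lo1 <= hi2 and lo2 <= hi1)

seq = list(range(1000))
rnd = seq[:]
random.seed(42)
random.shuffle(rnd)

print(overlapping_pairs(flush_ranges(seq, 100)))  # 0: disjoint ranges
print(overlapping_pairs(flush_ranges(rnd, 100)))  # nearly all 45 pairs overlap
```

Every overlapping pair is another rowset a key lookup or compaction may have to
consider, which is the cost being described here.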


Also, because you are using the log block manager (the default and only one 
suitable for production deployments), there isn't a 1-1 relationship between 
cfiles or diskrowsets and files on the filesystem. Many cfiles and diskrowsets 
will be put together in a container file.


Config parameters that might be relevant here:
--maintenance_manager_num_threads
--fs_data_dirs (how many)
--fs_wal_dir (is it shared on a device with the data dir?)


The metrics from the compact rowsets op indicate that the time is spent in
fdatasync and in reading (likely reading the original rowsets). The overall
compaction time is