Re: How can I look up the total disk space occupied by the kudu table

2017-05-31 Thread William Berkeley
Since a Kudu table is distributed across tablet servers, the total size of
the table is the sum of the sizes of its tablets. The /metrics endpoint on
each tablet server has an entry for each tablet replica, which lists the
table it belongs to and the replica's on-disk size, so you can roll up all
these numbers to compute the total size of the table. Kudu does not do this
for you, but metrics software should be able to.
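
If your metrics system doesn't do the roll-up for you, a minimal sketch in
Java could look like the one below. It assumes each tablet server's /metrics
endpoint (default port 8050) returns a JSON array of entities in which
tablet entries carry a "table_name" attribute and an "on_disk_size" metric;
please verify those names against your version and substitute your own
tablet server hostnames. Jackson is used for the JSON parsing.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.HashMap;
import java.util.Map;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TableSizeRollup {
  public static void main(String[] args) throws Exception {
    // Tablet servers to poll; replace with your own hosts.
    String[] tservers = {"tserver1:8050", "tserver2:8050", "tserver3:8050"};
    HttpClient http = HttpClient.newHttpClient();
    ObjectMapper mapper = new ObjectMapper();
    Map<String, Long> bytesByTable = new HashMap<>();
    for (String ts : tservers) {
      HttpRequest req =
          HttpRequest.newBuilder(URI.create("http://" + ts + "/metrics")).build();
      String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
      for (JsonNode entity : mapper.readTree(body)) {
        if (!"tablet".equals(entity.path("type").asText())) continue;
        // Attribute and metric names are assumptions; check your /metrics output.
        String table = entity.path("attributes").path("table_name").asText();
        for (JsonNode metric : entity.path("metrics")) {
          if ("on_disk_size".equals(metric.path("name").asText())) {
            bytesByTable.merge(table, metric.path("value").asLong(), Long::sum);
          }
        }
      }
    }
    bytesByTable.forEach((t, b) -> System.out.println(t + ": " + b + " bytes"));
  }
}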

Also, keep in mind that the on-disk size metric for tablets doesn't capture
the whole size of the tablet. See KUDU-1755 and KUDU-2001.

-Will

On Tue, May 30, 2017 at 8:09 PM, lizhong0...@qq.com 
wrote:

> How can I see the total disk space occupied by the kudu table
>
> I did not find this by checking the Kudu web UI; I can only find the size
> of each tablet, not the total size. What should I do?
>
> Thanks
> --
> lizhong0...@qq.com
>


Re: Why RowSet size is much smaller than flush_threshold_mb

2018-06-15 Thread William Berkeley
The op seen in the logs is a rowset compaction, which takes existing
diskrowsets and rewrites them. It's not a flush, which writes data in
memory to disk, so I don't think flush_threshold_mb is relevant. Rowset
compaction is done to reduce the amount of overlap of rowsets in primary
key space, i.e. to reduce the number of rowsets that might need to be
checked to enforce the primary key constraint or to find a row. A lot of
rowset compaction indicates that rows are being written in a somewhat
random order w.r.t. the primary key order. Kudu will perform much better as
writes scale when rows are inserted roughly in increasing primary key order
per tablet.

Also, because you are using the log block manager (the default and only one
suitable for production deployments), there isn't a 1-1 relationship
between cfiles or diskrowsets and files on the filesystem. Many cfiles and
diskrowsets will be put together in a container file.

Config parameters that might be relevant here:
--maintenance_manager_num_threads
--fs_data_dirs (how many)
--fs_wal_dir (is it shared on a device with the data dir?)

The metrics from the CompactRowSetsOp indicate the time is spent in
fdatasync and in reading (likely reading the original rowsets). The overall
compaction time is kinda long but not crazy long. What's the performance
you are seeing and what is the performance you would like to see?

-Will

On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang 
wrote:

> Hi all,
>
> I'm running kudu 1.6.0-cdh5.14.2. When looking into the logs of tablet
> server, I find most of the compactions are compacting small files (~40MB
> for each). For example:
>
> I0615 07:22:42.637351 30614 tablet.cc:1661] T
> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
> Compaction: stage 1 complete, picked 4 rowsets to compact
> I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to
> compact:
> I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current size
> on disk: ~4000 bytes)
> I0615 07:22:42.637401 30614 compaction.cc:906] RowSet(1563)(current size
> on disk: ~34720852 bytes)
> I0615 07:22:42.637408 30614 compaction.cc:906] RowSet(1645)(current size
> on disk: ~29914833 bytes)
> I0615 07:22:42.637415 30614 compaction.cc:906] RowSet(1870)(current size
> on disk: ~29007249 bytes)
> I0615 07:22:42.637428 30614 tablet.cc:1447] T
> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
> Compaction: entering phase 1 (flushing snapshot). Phase 1 snapshot:
> MvccSnapshot[committed={T|T < 6263071556616208384 or (T in
> {6263071556616208384})}]
> I0615 07:22:42.641582 30614 multi_column_writer.cc:103] Opened CFile
> writers for 124 column(s)
> I0615 07:22:43.875396 30614 multi_column_writer.cc:103] Opened CFile
> writers for 124 column(s)
> I0615 07:22:44.418421 30614 multi_column_writer.cc:103] Opened CFile
> writers for 124 column(s)
> I0615 07:22:45.114389 30614 multi_column_writer.cc:103] Opened CFile
> writers for 124 column(s)
> I0615 07:22:54.762563 30614 tablet.cc:1532] T
> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
> Compaction: entering phase 2 (starting to duplicate updates in new rowsets)
> I0615 07:22:54.773572 30614 tablet.cc:1587] T
> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
> Compaction Phase 2: carrying over any updates which arrived during Phase 1
> I0615 07:22:54.773599 30614 tablet.cc:1589] T
> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
> Phase 2 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T
> in {6263071556616208384})}]
> I0615 07:22:55.189757 30614 tablet.cc:1631] T
> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
> Compaction successful on 82987 rows (123387929 bytes)
> I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent
> running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba): real 12.628s user
> 1.460s sys 0.410s
> I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P
> 70f3e54fe0f3490cbf0371a6830a33a7: CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba) metrics:
> {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfile_cache_miss":2730,
> "cfile_cache_miss_bytes":251298442,"cfile_init":496,"data dirs.queue_time_us":6646,
> "data dirs.run_cpu_time_us":2188,"data dirs.run_wall_time_us":101717,"fdatasync":315,
> "fdatasync_us":9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms":32,
> "lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_time_us":122520,
> "lbm_writes_lt_1ms":2799,"mutex_wait_us":25,"spinlock_wait_cycles":155264,
> "tcmalloc_contention_cycles":768,"thread_start_us":677,"threads_started":14,
> "wal-append.queue_time_us":300}
>
> The flush_threshold_mb is set in the default value (1024). Wouldn't the
> flushed file size be ~1GB?
>
> I think increasing the initial RowSet size can reduce compactions and then
> reduce the impact of other ongoing operations. It may also improve the
> flush performance. Is that 

Re: How to migrate kudu tablets

2018-01-02 Thread William Berkeley
move_replica isn't available in CDH 5.12 / Kudu 1.4. It's first available
in CDH 5.13 / Kudu 1.5. There are a few solutions:

1. Running move_replica against a 5.12 cluster using a 5.13 or later kudu
tool should work.
2. You can move replicas manually: use add_replica to add the new replica,
wait for the new replica to finish its tablet copy and bootstrap, then
remove the unwanted replica with remove_replica. The easiest way to do this
is to monitor the status of the tablet with ksck while you're moving
replicas (this is essentially what the move tool does internally).
3. Upgrade to 5.13 or later.

-Will


On Tue, Jan 2, 2018 at 4:59 AM, Beata Jursza  wrote:

> Hi,
>
>
>
>
>
> I need to migrate kudu tablets from one tablet server to another.
>
> I have run `kudu remote_replica list` to determine which tablets I want
> to move. After that I tried to use the kudu tablet change_config
> move_replica tool for the migration, however I got the below message:
>
> Invalid argument: unknown command 'move_replica'
> Usage: 
> /opt/cloudera/parcels/KUDU-1.4.0-1.cdh5.12.1.p0.10/bin/../lib/kudu/bin/kudu
> tablet change_config <command> [<args>]
>
> <command> can be one of the following:
> add_replica Add a new replica to a tablet's Raft configuration
> change_replica_type Change the type of an existing replica in a tablet's
> Raft configuration
> remove_replica Remove an existing replica from a tablet's Raft
> configuration
>
> Do you have any ideas how can I migrate kudu tablets?
>
> Thank you in advance
>
>
>
> Met vriendelijke groet,
>
> Best regards,
>
>
>
> Beata Jursza
>
> E: beata.jur...@onmarc.nl
>
> T: +31 (0)30 636 3900
>
> M: +31 (0)6 83673664
>
>
>


Re: Segmentation Fault when running kudu ksck

2018-08-20 Thread William Berkeley
That looks like KUDU-2113, which was fixed in 1.6.0.

It happens if the tablet servers report peers in their config that are not
known to the master. Probably you have removed servers from the cluster and
some of the tablets are in a bad state as a result. These sorts of problems
were unfortunately common on earlier Kudu releases; every new version since
5.12 has made significant improvements to prevent these sorts of
situations. I'd recommend upgrading to 1.6, or at least taking a 1.6 kudu
tool and running it against the 1.5 cluster to see what the issues are.

-Will

On Mon, Aug 20, 2018 at 10:57 AM, Vincent Kooijman <
vincent.kooij...@onmarc.nl> wrote:

> Hi all,
>
>
>
> We're running into a few Kudu issues with the first being the Kudu cluster
> check utility (sudo -u kudu /opt/cloudera/parcels/CDH/lib/kudu/bin-debug/kudu
> cluster ksck) showing:
>
>
>
> Connected to the Master
>
> Fetched info from all 10 Tablet Servers
>
>
>
> Tablet 41bf41e4127a46c69242f707298cf4ba of table 'xxx' is
> under-replicated: 1 replica(s) not RUNNING
>
>   1b3d49dd6ce64acda32f97a89d7de193: TS unavailable
>
>   1a05af887edf4ba7b5c1731ce3508b19 (pdn05:7050): RUNNING [LEADER]
>
>   4028533287964369928034c3616a0a16 (pdn01:7050): RUNNING
>
>
>
> 2 replicas' active configs differ from the master's.
>
>   All the peers reported by the master and tablet servers are:
>
>   A = 1a05af887edf4ba7b5c1731ce3508b19
>
>   B = 1b3d49dd6ce64acda32f97a89d7de193
>
>   C = 4028533287964369928034c3616a0a16
>
>
>
> *The consensus matrix is:*
>
> *Segmentation fault*
>
>
>
> There is some mention of segmentation fault in combination with ksck in
> the Kudu release notes for 1.4.0, but we are running 1.5.0 on a CDH cluster.
>
>
>
> Some notes:
>
>
>
>- All masters (we have 3) are up with one leader being elected
>- All tablet servers (10) are live and visible in the master web UI
>- We've ran kudu fs check ... -repair on all servers (master & tablet)
>- Master logs are filled with errors like:
>
>Previously reported cstate for tablet 5977f01cea8a908bb56f97b46d9e
>(table 'xxx' [id=bb359f4b89dd46e797e2e24f9efac971]) gave a different
>leader for term 2007 than the current cstate. Previous cstate:
>current_term: 2007 leader_uuid: ""
>
>- And tablet server logs contain a lot of:
>
>Couldn't send request to peer 228515616baf44a99561c2b72dfb3bab for
>tablet 138854a04f804f4ebf42df657c22b995. Error code:
>TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING:
>INITIALIZED. Retrying in the next heartbeat period. Already tried 12813
>times.
>
>
>
> We're a bit lost as to where to look next.
>
>
>
> If anyone can point us in the right direction, that would be great!
>
>
> Thanks,
>
>
>
> Vincent
>


Re: Kudu hashes and Java hashes

2018-08-28 Thread William Berkeley
> 1. We have multiple Kudu clients (Reducers).

Would it be better if each one has a single session to a single tablet
writing large number of records,

or multiple sessions writing to different tablets (total number of records
is the same)?


The advantage I see in writing to a single tablet from a single reducer is
that if the reducer is scheduled locally to the leader replica of the
tablet then one network hop is eliminated. However, the client, AFAIK,
doesn't offer a general mechanism to know where a write will go. If the
table is purely range partitioned it is possible, but not if the table has
hash partitioning. Since leadership can change at any time, it wouldn't be
a reliable mechanism anyway. To compare, Kudu does offer a way to split a
scan into scan tokens, which can be serialized and rehydrated into a
scanner on a machine where a suitable replica lives.
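
For reference, here is a rough sketch of that scan token flow with the Java
client; the master addresses and the table name are made up, and error
handling is omitted.

import java.util.List;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduScanToken;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;

public class ScanTokenSketch {
  public static void main(String[] args) throws Exception {
    try (KuduClient client =
             new KuduClient.KuduClientBuilder("master1:7051,master2:7051,master3:7051").build()) {
      KuduTable table = client.openTable("my_table");  // hypothetical table name
      // Split the scan into tokens, one or more per tablet. The serialized bytes
      // can be shipped to tasks scheduled next to the tablet replicas.
      List<KuduScanToken> tokens = client.newScanTokenBuilder(table).build();
      for (KuduScanToken token : tokens) {
        byte[] serialized = token.serialize();
        // On the remote task: rehydrate the token into a scanner and read rows.
        KuduScanner scanner = KuduScanToken.deserializeIntoScanner(serialized, client);
        while (scanner.hasMoreRows()) {
          for (RowResult row : scanner.nextRows()) {
            System.out.println(row.rowToString());
          }
        }
        scanner.close();
      }
    }
  }
}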


So it doesn't really matter, as long as there is a large enough number of
rows for most of the tablets being written to that the execution time isn't
dominated by round trips to n servers (vs. to 1 server).


Are you seeing a specific problem where Kudu isn't as fast as you
anticipated, or where reducers writing to many tablets is a bottleneck?


> 2. Assuming it is preferable to have 1-to-1 relationship, i.e. 1 Reducers
should write to 1 Tablet. What would be the proper implementation to reduce
amount of connections between reducers to different tablets, i.e. if there
are 128 reducers (each gathers its own set of unique hashes) and 128
tablets, then ideally each reducer should write to 1 tablet, but not to
each of 128 tablets.


I don't think the 1-1 relationship is preferable if, for each reducer, the
number of rows written per tablet is large enough to fill multiple batches
(10,000, say, is good enough).
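
To make the batching concrete, here is a hedged sketch of what a reducer's
write path could look like with the Java client, using background flushing
so the client batches operations per tablet server automatically. The
master addresses, table, and columns are hypothetical.

import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.SessionConfiguration;

public class ReducerWriteSketch {
  public static void main(String[] args) throws Exception {
    try (KuduClient client =
             new KuduClient.KuduClientBuilder("master1:7051,master2:7051,master3:7051").build()) {
      KuduTable table = client.openTable("my_table");  // hypothetical
      KuduSession session = client.newSession();
      // Let the client batch operations and route each batch to the right tablet server.
      session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
      session.setMutationBufferSpace(10000);  // rows buffered before an automatic flush
      for (long i = 0; i < 100000; i++) {
        Insert insert = table.newInsert();
        PartialRow row = insert.getRow();
        row.addLong("id", i);           // hypothetical columns
        row.addString("val", "v" + i);
        session.apply(insert);
      }
      session.flush();
      // Remember to check session.getPendingErrors() here; with background
      // flushing, row errors are reported asynchronously.
      session.close();
    }
  }
}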


> Why have the questions arised: there is a hash implementation in Java and
another one in Kudu. Is there any chance to ensure Java Reducers use the
same hash function as Kudu partitioning hash?


The current Java implementation of the encodings of the primary and
partition keys can be found at
https://github.com/apache/kudu/blob/master/java/kudu-client/src/main/java/org/apache/kudu/client/KeyEncoder.java,
so you could adjust your code to match that and, possibly with a bit of
additional custom code, be able to tell ahead of time which tablet a row
belongs to. However, it is implementation, not interface. I don't imagine
it changing, but it could, and there's no guarantee it won't.


As for adding the ability to determine the row-to-tablet mapping as a
feature request, I think it might not be a good idea: writes must go to the
tablet leader, and leadership can change at any time, so such a feature
still wouldn't provide a reliable way to determine the row-to-tablet-server
mapping.


-Will

On Tue, Aug 28, 2018 at 1:08 AM Sergejs Andrejevs 
wrote:

> Hi there,
>
>
>
> We're running Map-Reduce jobs in java and Reducers write to Kudu.
>
> In java we use hashCode() function to send results from Mappers to
> Reducers, e.g.
>
> public int getPartition(ArchiveKey key, Object value, int
> numReduceTasks) {
>
> int hash = key.getCaseId().hashCode();
>
> return (hash & Integer.MAX_VALUE) % numReduceTasks;
>
> }
>
> There is also a partitioning hash function in Kudu tables.
>
>
>
> Therefore, there are 2 questions:
>
> 1. We have multiple Kudu clients (Reducers).
>
> Would it be better if each one has a single session to a single tablet
> writing large number of records,
>
> or multiple sessions writing to different tablets (total number of records
> is the same)?
>
> 2. Assuming it is preferable to have 1-to-1 relationship, i.e. 1 Reducers
> should write to 1 Tablet. What would be the proper implementation to reduce
> amount of connections between reducers to different tablets, i.e. if there
> are 128 reducers (each gathers its own set of unique hashes) and 128
> tablets, then ideally each reducer should write to 1 tablet, but not to
> each of 128 tablets.
>
>
>
> Why have the questions arised: there is a hash implementation in Java and
> another one in Kudu. Is there any chance to ensure Java Reducers use the
> same hash function as Kudu partitioning hash?
>
>
>
>
>
> Best regards,
>
> Sergejs
>


Re: Kudu's data pagination

2018-09-04 Thread William Berkeley
Hi Irtiza. What do you mean by paginate? I'm guessing you mean doing
something like taking the results of a query like

SELECT name, age FROM users ORDER BY age DESC

and displaying the results on some UI 10 at a time, say.

If that's the case, the answer is no. It requires additional application
code. In general, Kudu cannot return rows in order. So, if you want rows
101-110, you must retrieve *all* the rows, select the top 110, and then
display only the final 10.

In special cases when the sort is on a prefix of the primary key, scan
tokens can be used to have Kudu return sorted subsets of rows from each
tablet, which you can partially merge to get the desired result set.

With a lot of data it's best to retrieve a large amount of sorted results
and paginate from the cached results, rather than running a new query per
page.
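
As a rough sketch of that approach with the Java client (the Python client
is analogous), the example below pulls the rows, sorts them on the
application side, and slices one page out of the cached result. The table
and column names follow the hypothetical query above.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;

public class PaginationSketch {
  public static void main(String[] args) throws Exception {
    try (KuduClient client = new KuduClient.KuduClientBuilder("master1:7051").build()) {
      KuduTable table = client.openTable("users");  // hypothetical table
      KuduScanner scanner = client.newScannerBuilder(table)
          .setProjectedColumnNames(List.of("name", "age"))  // hypothetical columns
          .build();
      // Kudu returns rows in no particular order, so pull everything first...
      List<String[]> rows = new ArrayList<>();
      while (scanner.hasMoreRows()) {
        for (RowResult r : scanner.nextRows()) {
          rows.add(new String[] {r.getString("name"), Integer.toString(r.getInt("age"))});
        }
      }
      // ...then sort and page on the application side, caching `rows` between pages.
      rows.sort(Comparator.comparing((String[] r) -> Integer.parseInt(r[1])).reversed());
      int page = 10, pageSize = 10;
      int from = Math.min(page * pageSize, rows.size());
      int to = Math.min(from + pageSize, rows.size());
      for (String[] r : rows.subList(from, to)) {
        System.out.println(r[0] + "\t" + r[1]);
      }
    }
  }
}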

-Will

On Tue, Sep 4, 2018 at 9:02 AM Irtiza Ali  wrote:

> Hello everyone,
>
> Is there a way to paginate kudu's data using its python client?
>
>
> I
>


Re: Unable to Initialize catalog manager

2018-07-05 Thread William Berkeley
You need to follow the directions at
http://kudu.apache.org/docs/administration.html#migrate_to_multi_master to
migrate from 1 to 3 masters. It's not sufficient just to start up the new
masters and change the master_addresses flag.

-Will

On Thu, Jul 5, 2018 at 7:10 PM Sangeeta Gulia 
wrote:

> Hi Team,
>
> I am facing an issue in starting kudu. It is s cluster of 8 nodes where i
> have added hadoop-master-slave2 and slave3 to --master addresses
>
> Below is the error log i get when i start my kudu server:
>
> E0705 11:32:20.674165  8250 master.cc:183] Master@0.0.0.0:7051: Unable to
> init master catalog manager: Invalid argument: Unable to initialize catalog
> manager: Failed to initialize sys tables async: on-disk master list
> (hadoop-master:7051, slave2:7051, slave3:7051) and provided master list
> (:0) differ. Their symmetric difference is: :0, hadoop-master:7051,
> slave2:7051, slave3:7051
> F0705 11:32:20.674226  8184 master_main.cc:71] Check failed: _s.ok() Bad
> status: Invalid argument: Unable to initialize catalog manager: Failed to
> initialize sys tables async: on-disk master list (hadoop-master:7051,
> slave2:7051, slave3:7051) and provided master list (:0) differ. Their
> symmetric difference is: :0, hadoop-master:7051, slave2:7051, slave3:7051
>
>
> --
> Warm Regards,
> Sangeeta Gulia
> Software Consultant
> Knoldus Inc.
> m: 9650877357
> w: www.knoldus.com  e: sangeeta.gu...@knoldus.in
> 
>


Re: Any plans for supporting schemas (namespaces of tables)?

2018-04-24 Thread William Berkeley
Hi Martin. I don't see any conflicts between that and any current or
near-term work I know of happening in Kudu. There are a couple of related
JIRAs for database support: KUDU-2063 and KUDU-2362.

In fact, Impala does something similar right now. Kudu tables managed by
Impala are named like impala::database_name.table_name. It might also be a
good idea to add some kind of identifier like the initial "impala::" for
presto-managed or -created tables, since many other integrations may use a
database_name.table_name convention.

-Will

On Tue, Apr 24, 2018 at 7:01 PM, Martin Weindel 
wrote:

> Hi,
>
> are there any plans for supporting schemas in the sense of relational
> databases, i.e. namespaces of tables?
>
> I'm asking, as I have a discussion in the Presto project for the proposed
> Kudu connector. (see https://github.com/prestodb/presto/pull/10388)
>
> As Presto supports catalog > schema > table, I thought it a good idea to
> make use of these namespaces of tables also in the Kudu connector.
>
> This happens by using a dot in the Kudu table name. E.g. the Kudu table
> name "mytable" is  mapped to kudu.default.mytable in Presto and a Kudu
> table named "myschema.mytable2" to kudu.myschema.mytable2. There is nothing
> in Kudu which prevents using table names with dots, so it does not seem to
> be a problem at least at the moment.
>
> Can someone from the Kudu developer team give me a hint if you see any
> conflicts with the Kudu roadmap?
>
> Thanks,
>
> Martin Weindel
>
>


Re: Problems connecting form Spark

2018-03-06 Thread William Berkeley
In each case the problem is that some part of your application can't find
the leader master of the Kudu cluster:

org.apache.kudu.client.NoLeaderFoundException: Master config (172.17.0.43:7077)
has no leader.
org.apache.kudu.client.NoLeaderFoundException: Master config (localhost:7051)
has no leader.

I think you're seeing these errors for two reasons:

1. Are you using multi-master? The first exception shows you specified one
remote master. If your cluster has multiple masters, you should specify all
of them. If you specify only one, and it's not the leader master, then
connecting to it will fail. You can check which master is the leader by
going to the /masters page on the web ui of any master.

2. In the "standalone" case, the Spark tasks are being distributed to
executors and fail there:

Lost task 1.0 in stage 0.0 (TID 1, tt-slave-2.novalocal, executor 1)

You've specified the master address as localhost. That address is passed
as-is to executors. Any task on an executor that doesn't have the leader
master locally at port 7051 will fail to connect to the leader master.
Getting the column names doesn't fail as that doesn't generate tasks sent
to remote executors.

I make this mistake all the time while playing with kudu-spark :)
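
For what it's worth, here's a hedged sketch of point 1 with the plain Java
client; the same comma-separated list of all masters is what belongs in
kudu-spark's kudu.master option (the hostnames below are made up).

import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;

public class ConnectSketch {
  public static void main(String[] args) throws KuduException {
    // List every master, and don't use "localhost" unless every executor really
    // has the leader master running locally on that port.
    String masters = "kudu-master-1:7051,kudu-master-2:7051,kudu-master-3:7051";
    try (KuduClient client = new KuduClient.KuduClientBuilder(masters).build()) {
      System.out.println("tables: " + client.getTablesList().getTablesList());
    }
  }
}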

-Will





On Mon, Mar 5, 2018 at 4:14 PM, Mac Noland  wrote:

> Any chance you can try spark2-shell with Kudu 1.6 and then re-try your
> tests?
>
> spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.6.0
>
> On Fri, Mar 2, 2018 at 5:02 AM, Saúl Nogueras  wrote:
>
>> I cannot properly connect to Kudu from Spark, error says “Kudu master has
>> no leader”
>>
>>- CDH 5.14
>>- Kudu 1.6
>>- Spark 1.6.0 standalone and 2.2.0
>>
>> When I use Impala in HUE to create and query kudu tables, it works
>> flawlessly.
>>
>> However, connecting from Spark throws some errors I cannot decipher.
>>
>> I have tried using both pyspark and spark-shell. With spark shell I had
>> to use spark 1.6 instead of 2.2 because some maven dependencies problems,
>> that I have localized but not been able to fix. More info here.
>> --
>> Case 1: using pyspark2 (Spark 2.2.0)
>>
>> $ pyspark2 --master yarn --jars 
>> /opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/kudu/kudu-spark2_2.11.jar
>>
>> > df = 
>> > sqlContext.read.format('org.apache.kudu.spark.kudu').options(**{"kudu.master":"172.17.0.43:7077",
>> >  "kudu.table":"impala::default.test"}).load()
>>
>> 18/03/02 10:23:27 WARN client.ConnectToCluster: Error receiving response 
>> from 172.17.0.43:7077
>> org.apache.kudu.client.RecoverableException: [peer master-172.17.0.43:7077] 
>> encountered a read timeout; closing the channel
>> at 
>> org.apache.kudu.client.Connection.exceptionCaught(Connection.java:412)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
>> at 
>> org.apache.kudu.client.Connection.handleUpstream(Connection.java:239)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.exceptionCaught(SimpleChannelUpstreamHandler.java:153)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:536)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.readTimedOut(ReadTimeoutHandler.java:236)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler$ReadTimeoutTask$1.run(ReadTimeoutHandler.java:276)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.channel.socket.ChannelRunnableWrapper.run(ChannelRunnableWrapper.java:40)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
>> at 
>> org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>> at 
>> 

Re: kudu use impala query question!

2018-10-18 Thread William Berkeley
I think those messages are harmless in themselves, but they indicate an
underlying issue with leader elections, perhaps caused by load. The full
ksck output would be helpful to understand more.

This shouldn't cause writes to be lost. More likely the data was not
written at all. Are you checking for the success of the write operations in
your code?
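
For example, with the Java client, here are a couple of hedged ways to
check, depending on the flush mode you use:

import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.Operation;
import org.apache.kudu.client.OperationResponse;
import org.apache.kudu.client.RowError;

public class WriteCheckSketch {
  // In the synchronous flush modes, apply() returns a response per operation;
  // fail loudly if the row was rejected.
  static void applyChecked(KuduSession session, Operation op) throws Exception {
    OperationResponse resp = session.apply(op);
    if (resp != null && resp.hasRowError()) {
      throw new RuntimeException("write failed: " + resp.getRowError());
    }
  }

  // With AUTO_FLUSH_BACKGROUND, row errors are reported asynchronously;
  // check the session's pending errors after flushing.
  static void checkPendingErrors(KuduSession session) throws Exception {
    session.flush();
    if (session.countPendingErrors() > 0) {
      for (RowError e : session.getPendingErrors().getRowErrors()) {
        System.err.println("row error: " + e);
      }
      throw new RuntimeException("some writes failed");
    }
  }
}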

-Will

On Wed, Oct 10, 2018 at 8:12 PM fengba...@uce.cn  wrote:

> Hi:
>
>I used CDH5.14.0,kudu version is 1.6.0 (3 kudu master,7 kudu ts), Kudu
> ts and yarn mix and deploy on the same machine.
> 8 SAS on each TS machine (1T capacity)
>
>
> When I query the kudu table with impala, the kudu tsserver appears 
> unavailable. The tsserver log:
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>
> E0919 16:40:20.339299 139726 consensus_queue.cc:618] T 
> 92d642ac75fe4fa0bdb85fd879a1e725 P 162f275784fa4fbfa49ad8a2639f87c4 [LEADER]: 
> Error trying to read ahead of the log while preparing peer request: 
> Incomplete: Op with index 400 is ahead of the local log (next sequential op: 
> 400). Destination peer: Peer: 8907a006b28a4d52afcc66ff48e11faf, Status: 
> INVALID_TERM, Last received: 455.400, Next index: 401, Last known committed 
> idx: 400, Time since last communication: 0.084s
>
> E0919 16:40:20.630578 140247 consensus_queue.cc:618] T 
> 7f65df8fb13e467493a257aae698f219 P 162f275784fa4fbfa49ad8a2639f87c4 
> [NON_LEADER]: Error trying to read ahead of the log while preparing peer 
> request: Incomplete: Op with index 7925 is ahead of the local log (next 
> sequential op: 7925). Destination peer: Peer: 
> 739b33f810844ebcaef489e0b83c3eba, Status: INVALID_TERM, Last received: 
> 1.7924, Next index: 7926, Last known committed idx: 7925, Time since last 
> communication: 0.001s
>
> E0925 10:45:18.919159 33765 consensus_queue.cc:618] T 
> a3854738cf284426bea0943184c0a4d5 P 162f275784fa4fbfa49ad8a2639f87c4 [LEADER]: 
> Error trying to read ahead of the log while preparing peer request: 
> Incomplete: Op with index 380 is ahead of the local log (next sequential op: 
> 380). Destination peer: Peer: 739b33f810844ebcaef489e0b83c3eba, Status: 
> INVALID_TERM, Last received: 443.380, Next index: 381, Last known committed 
> idx: 380, Time since last communication: 0.042s
>
> E0925 10:45:18.908238 33741 consensus_queue.cc:618] T 
> be346255d6344e6ba12f18b1084dd2f4 P 162f275784fa4fbfa49ad8a2639f87c4 [LEADER]: 
> Error trying to read ahead of the log while preparing peer request: 
> Incomplete: Op with index 74419 is ahead of the local log (next sequential 
> op: 74419). Destination peer: Peer: 739b33f810844ebcaef489e0b83c3eba, Status: 
> INVALID_TERM, Last received: 346.74419, Next index: 74420, Last known 
> committed idx: 74419, Time since last communication: 0.019s
>
> E0925 10:45:18.918596 33759 consensus_queue.cc:618] T 
> 98faf1901e2848a985afb974167e0582 P 162f275784fa4fbfa49ad8a2639f87c4 [LEADER]: 
> Error trying to read ahead of the log while preparing peer request: 
> Incomplete: Op with index 324 is ahead of the local log (next sequential op: 
> 324). Destination peer: Peer: 739b33f810844ebcaef489e0b83c3eba, Status: 
> INVALID_TERM, Last received: 375.324, Next index: 325, Last known committed 
> idx: 324, Time since last communication: 0.043s
>
> E0929 16:41:43.993786 191493 consensus_queue.cc:618] T 
> 142db0b84a2d4920a2ce5248319da3b1 P 162f275784fa4fbfa49ad8a2639f87c4 [LEADER]: 
> Error trying to read ahead of the log while preparing peer request: 
> Incomplete: Op with index 49664 is ahead of the local log (next sequential 
> op: 49664). Destination peer: Peer: 739b33f810844ebcaef489e0b83c3eba, Status: 
> INVALID_TERM, Last received: 1120.49663, Next index: 49665, Last known 
> committed idx: 49664, Time since last communication: 0.536s
>
> E0929 16:43:05.334996 196311 consensus_queue.cc:618] T 
> 142db0b84a2d4920a2ce5248319da3b1 P 162f275784fa4fbfa49ad8a2639f87c4 [LEADER]: 
> Error trying to read ahead of the log while preparing peer request: 
> Incomplete: Op with index 49664 is ahead of the local log (next sequential 
> op: 49664). Destination peer: Peer: 739b33f810844ebcaef489e0b83c3eba, Status: 
> INVALID_TERM, Last received: 1120.49663, Next index: 49665, Last known 
> committed idx: 49664, Time since last communication: 81.877s
>
> E0929 16:43:05.337278 196314 consensus_queue.cc:618] T 
> 142db0b84a2d4920a2ce5248319da3b1 P 162f275784fa4fbfa49ad8a2639f87c4 [LEADER]: 
> Error trying to read ahead of the log while preparing peer request: 
> Incomplete: Op with index 49664 is ahead of the local log (next sequential 
> op: 49664). Destination peer: Peer: 739b33f810844ebcaef489e0b83c3eba, Status: 
> INVALID_TERM, Last received: 1120.49663, Next index: 49665, Last known 
> committed idx: 49664, Time since last communication: 81.879s
>
> E0929 16:47:05.313355  7456 consensus_queue.cc:618] T 
> 83105158520f487d9122666b06ff8d34 P 162f275784fa4fbfa49ad8a2639f87c4 [LEADER]: 
> Error trying to read ahead of the log while preparing peer request: 
> Incomplete: 

Re: Re[2]: [KUDU] Rebalancing tool

2018-12-04 Thread William Berkeley
Yeah, it's worth looking into. If leadership tends to migrate to a subset
of the servers it indicates some kind of instability or communication
problems with the other servers, since leadership will move away from
servers that have trouble sending timely heartbeats to followers.

The leader does do more work for a tablet than the followers, and writes
must go into the leader's log as well as the log of a majority, so there's
some penalty on write latency and throughput to have leadership
concentrated.

The next version of Kudu will have an enhanced leader_step_down tool that
implements KUDU-2245, so leadership can be transferred with minimal
disruption to the tablet and it can be transferred to a specific replica
(as long as that replica is almost caught up to the leader).

-Will

On Tue, Dec 4, 2018 at 5:04 AM Дмитрий Павлов  wrote:

> Thank Will
>
> So i have one more question.
>
> is this (i mean leaders skew) something i should be concerned about?
> In terms of load balancing for example in case if i use kudu 1.8.0 with
> spark
>
> Regards Dmitry Pavlov
>
>
> Tuesday, December 4, 2018, 0:10 +03:00 from William Berkeley <
> wdberke...@cloudera.com>:
>
> Yes, it's expected. The rebalancing tool does not balance leadership. In
> fact, it tries to avoid relocating leader replicas because the tool wants
> to minimize the disturbance to the cluster. It will relocate leader
> replicas if it has to, but it won't make any attempt to balance them.
>
> You can use 'kudu tablet leader_step_down' to force the leader replica of
> a tablet to step down. A new leader will be elected after the election
> timeout passes. It's not guaranteed that the old leader won't be reelected.
>
> -Will
>
> On Mon, Dec 3, 2018 at 4:11 AM Дмитрий Павлов wrote:
>
>
> Hi guys
>
> I have a question about Kudu rebalancing tool (*kudu cluster rebalance*).
>
> So my situation is following:
>
> When our kudu cluster has been updated to version 1.8.0 i ran *kudu
> cluster rebalance *for one table.
> After that i checked per-table replica distribution and it looks fine but
> leaders distributions per nodes still not good for my table
>
> Is it expected behaviour for rebalancing tool?
>
> Regards, Dmitry Pavlov
>
>
>
>
>
> --
> Дмитрий Павлов
>


Re: [KUDU] Rebalancing tool

2018-12-03 Thread William Berkeley
Yes, it's expected. The rebalancing tool does not balance leadership. In
fact, it tries to avoid relocating leader replicas because the tool wants
to minimize the disturbance to the cluster. It will relocate leader
replicas if it has to, but it won't make any attempt to balance them.

You can use 'kudu tablet leader_step_down' to force the leader replica of a
tablet to step down. A new leader will be elected after the election
timeout passes. It's not guaranteed that the old leader won't be reelected.

-Will

On Mon, Dec 3, 2018 at 4:11 AM Дмитрий Павлов  wrote:

>
> Hi guys
>
> I have a question about Kudu rebalancing tool (*kudu cluster rebalance*).
>
> So my situation is following:
>
> When our kudu cluster has been updated to version 1.8.0 i ran *kudu
> cluster rebalance *for one table.
> After that i checked per-table replica distribution and it looks fine but
> leaders distributions per nodes still not good for my table
>
> Is it expected behaviour for rebalancing tool?
>
> Regards, Dmitry Pavlov
>
>
>
>


Re: Slow queries after massive deletions. Is it due to compaction?

2018-11-25 Thread William Berkeley
Hi Sergejs. You are correct. Kudu tracks deletes as base data plus a
"redo" that contains delete operations. The base data and the redos are
stored on disk separately and are logically reconciled on scan.

Brock is right that this situation is improved greatly for certain deletion
patterns with the fix to KUDU-2429. In particular, if deletions come in
large contiguous blocks (where contiguity is determined by the primary key
ordering), then the KUDU-2429 improvement will greatly increase the speed
of scans over deleted data. Your problem might be solved by upgrading to a
version that contains that improvement.

Unfortunately, as implied by the description for the
--tablet_delta_store_major_compact_min_ratio flag, major delta compaction
will not eliminate delete operations in the redo files. Those are only
compacted with the base data when a merge compaction (also known as a
rowset compaction) occurs. However, right now that sort of compaction is
triggered only by having rowsets whose minimum and maximum key bounds
overlap. So, for example, if you are inserting data in increasing or
decreasing primary key order, there won't be any merge compactions.
KUDU-1625 tracks the
improvement to trigger rowset compaction based on having a high percentage
of deleted data.

I can't think of a great workaround without potentially changing the
partitioning or schema of the table, unfortunately. It's possible to coax
Kudu into doing merge compactions by inserting rows in the same approximate
key range as the deleted data and deleting them quickly. This would
hopefully cause the merge compaction to compact away a lot of the older
deleted data, but it would leave the newly inserted and deleted data. Plus,
if there are concurrent queries, this sort of workaround could cause wrong
results, and filtering the extra rows out efficiently, at least Kudu-side,
would mean a change to the primary key.

There is a good solution if you can change your partitioning scheme. If you
use range partitioning to group together the blocks of rows that will be
deleted, for example by range partitioning with one partition by day and
deleting a day at a time, then deletes can be done efficiently by dropping
range partitions. See the range partitioning documentation.
Besides the restrictions this places on your schema and on the manner in
which deletes can be done efficiently, keep in mind that dropping a range
partition is not transactional and scans concurrent with drops of range
partitions do not have consistency guarantees.
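
As a hedged sketch of that pattern with the Java client, dropping the range
partition that covers one day's data might look like the following; the
table name, the INT64 "day" range column, and the bounds are made up for
illustration.

import org.apache.kudu.Schema;
import org.apache.kudu.client.AlterTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;

public class DropDayPartitionSketch {
  public static void main(String[] args) throws Exception {
    try (KuduClient client = new KuduClient.KuduClientBuilder("master1:7051").build()) {
      KuduTable table = client.openTable("events");  // hypothetical table
      Schema schema = table.getSchema();
      // Assuming the table is range partitioned on an INT64 "day" column with one
      // partition per day, drop the partition covering day 17900, i.e. [17900, 17901).
      PartialRow lower = schema.newPartialRow();
      lower.addLong("day", 17900);
      PartialRow upper = schema.newPartialRow();
      upper.addLong("day", 17901);
      client.alterTable("events", new AlterTableOptions().dropRangePartition(lower, upper));
    }
  }
}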

-Will

On Sun, Nov 25, 2018 at 3:40 PM Brock Noland  wrote:

> Hi,
>
> I believe you are hitting a known issue and I think if you upgrade to
> 5.15.1 you'll see the fix:
>
>
> https://www.cloudera.com/documentation/enterprise/release-notes/topics/kudu_fixed_issues.html#fixed-5-15-1
>
> "Greatly improved the performance of many types of queries on tables from
> which many rows have been deleted."
>
> I think there might have been more than one JIRA, but here is one of the
> fixes:
>
> https://issues.apache.org/jira/browse/KUDU-2429
>
> On Thu, Nov 22, 2018 at 9:57 AM Sergejs Andrejevs 
> wrote:
>
>> Hi,
>>
>>
>>
>> Is there a way to call of MajorDeltaCompactionOp for a
>> table/tablet/rowset?
>>
>>
>>
>> We’ve faced with an issue:
>>
>> 0.   Kudu table is created
>>
>> 1.   Data is inserted there
>>
>> 2.   Run select query - it goes fast (matter of a few seconds)
>>
>> 3.   Delete all data from the table (but not dropping the table)
>>
>> 4.   Run select query - it goes slow (4-6 minutes)
>>
>>
>>
>> Investigating and reading documentation of Kudu has leaded to a thought
>> that delete operations are done logically, but physically the table
>> contains written data and deletes are applied each time on top of it.
>>
>> I had a look at kudu tablet and there are quite large “redo” blocks (see
>> one of rowset examples below).
>>
>> There was a thought that compression and encoding play their role
>> (reducing the chances to run compaction), but removing them (keeping column
>> defaults) hasn’t helped as well.
>>
>> We run tservers
>>
>> -  maintenance_manager_num_threads=10 (increased comparing to
>> default)
>>
>> -  tablet_delta_store_major_compact_min_ratio=0.1000149011612
>> (default value)
>>
>> -  kudu 1.7.0-cdh5.15.0
>>
>>
>>
>> From documentation and comments in code I saw the description of
>> tablet_delta_store_major_compact_min_ratio: “Minimum ratio of
>> sizeof(deltas) to sizeof(base data) before a major compaction.”
>>
>> And “Major compactions: the score will be the result of
>> sizeof(deltas)/sizeof(base data), unless it is smaller than
>> tablet_delta_store_major_compact_min_ratio or if the delta files are only
>> composed of deletes, in which case the score is brought down to zero.”
>>
>> So basically the table stays in such state for more than a day.
>>
>>
>>
>> While 

Re: Re: kuduissue!

2019-03-12 Thread William Berkeley
I'm sorry, I'm not sure what you are trying to say. The tablet replicas
should pass through an INITIALIZED state as part of starting: INITIALIZED
-> BOOTSTRAPPING -> RUNNING. If some replicas are staying for a while as
INITIALIZED, it's probably because they are waiting for other replicas to
bootstrap. That state shouldn't last very long, though. It's hard to say
anything more concrete without more information.

-Will

On Mon, Mar 11, 2019 at 5:09 PM 冯宝利  wrote:

> My kudu version is 1.8.0. I think the problem is an Impala issue. When I
> created the kudu table with Impala, I found that the tablets of the newly
> created kudu table were in the initialization state.
>
> ------
> From: William Berkeley
> Date: 2019-03-12 05:46:07
> To:
> Cc: Attila Bukor
> Subject: Re: Re: kuduissue!
>
> What Kudu version are you running? The master logs would be useful here,
> otherwise we can't understand why the table took so long to create.
>
> I'm not sure what timeout to set as that is an Impala configuration, not a
> Kudu one. You want the configuration that changes the default Kudu
> operation timeout for DDL operations issued by Impala.
>
> -Will
>
> On Thu, Mar 7, 2019 at 5:37 PM fengba...@uce.cn  wrote:
>
>> This table has 32 partitions; the master logs have been deleted. I want
>> to know which timeout parameter it is. I want to modify this timeout
>> parameter to see if it can solve this problem, because my kudu cluster is
>> small but the tablet format is already very large.
>>
>>
>>   thanks!
>> --
>> UCE Logistics Technology Co., Ltd.
>> Big Data Center, Feng Baoli
>> Mobile: 15050552430
>> Email: fengba...@uce.cn
>>
>>
>> *From:* Attila Bukor
>> *Sent:* 2019-03-07 18:24
>> *To:* fengba...@uce.cn
>> *Cc:* user
>> *Subject:* Re: kuduissue!
>> Hi,
>>
>> From Impala 2.12 it's not allowed to set the kudu.table_name explicity
>> on CREATE TABLE[1], so that part is expected.
>>
>> It's interesting that the table creation times out though. How many
>> partitions does the table have? Can you share the master logs with us
>> from the relevant time period?
>>
>> Thanks,
>> Attila
>>
>> [1] https://issues.apache.org/jira/browse/IMPALA-5654
>>
>> On Thu, Mar 07, 2019 at 06:19:05PM +0800, fengba...@uce.cn wrote:
>> > Hi:
>> >When I upgraded from cm 5.14.0 to cm 6.1.0, there was a problem that
>> I failed to create kudu tables using imapla. The specific impala client's
>> error was as follows:
>> > ERROR: ImpalaRuntimeException: Error creating Kudu table
>> 'impala::tmp_kudu.t_prs_acc_bill_prep'
>> > CAUSED BY: NonRecoverableException: can not complete before timeout:
>> KuduRpc(method=IsCreateTableDone, tablet=null, attempt=95,
>> DeadlineTracker(timeout=18, elapsed=178986), Traces: [0ms] sending RPC
>> to server master-hadoop2.uce.cn:7051, [0ms] received from server
>> master-hadoop2.uce.cn:7051 response OK, [20ms] sending RPC to server
>> master-hadoop2.uce.cn:7051, [20ms] received from server
>> master-hadoop2.uce.cn:7051 response OK, [40ms] sending RPC to server
>> master-hadoop2.uce.cn:7051, [41ms] received from server
>> master-hadoop2.uce.cn:7051 response OK, [59ms] sending RPC to server
>> master-hadoop2.uce.cn:7051, [60ms] received from server
>> master-hadoop2.uce.cn:7051 response OK, [80ms] sending RPC to server
>> master-hadoop2.uce.cn:7051, [80ms] received from server
>> master-hadoop2.uce.cn:7051 response OK, [120ms] sending RPC to server
>> master-hadoop2.uce.cn:7051, [121ms] received from server
>> master-hadoop2.uce.cn:7051 response OK, [180ms] sending RPC to server
>> master-hadoop2.uce.cn:7051, [180ms] received from server
>> master-hadoop2.uce.cn:7051 response OK, [260ms] sending RPC to server
>> master-hadoop2.uce.cn:7051, [261ms] received from server
>> master-hadoop2.uce.cn:7051 response OK, [440ms] sending RPC to server
>> master-hadoop2.uce.cn:7051, [489ms] received from server
>> master-hadoop2.uce.cn:7051 response OK, [580ms] sending RPC to server
>> master-hadoop2.uce.cn:7051, [580ms] received from server
>> master-hadoop2.uce.cn:7051 response OK, [980ms] sending RPC to server
>> master-hadoop2.uce.cn:7051, [981ms] received from server
>> master-hadoop2.uce.cn:7051 response OK, [1300ms] sending RPC to server
>> master-hadoop2.uce.cn:7051, [1301ms] received from server
>> master-hadoop2.uce.cn:7051 response OK, [2260ms] sending RPC to server
>> master-hadoop2.uce.cn:7051, [2261ms] received from server
>> master