Re: File descriptor limit for WAL

2017-02-24 Thread Todd Lipcon
believe uses epoll(2) under the hood. There's one other place where we > use ppoll() (in RPC negotiation), but no select(). > > A bit of historical curiosity: we actually had this bug a few years back and fixed it, see 82cf3724077a8fb639a44dd86f04d10ecbedabf4 -- Todd Lipcon Software Engineer, Cloudera

Re: Adding examples to docs?

2017-02-13 Thread Todd Lipcon
gn.html are >>> missing SQL examples. >>> >>> I can not find the exact SQL syntax for partition management. >>> >>> can this be added? >>> >>> Thanks in advance. >>> >>> >>> >>> >>> >> > -- Todd Lipcon Software Engineer, Cloudera

Re: Feature request for Kudu 1.3.0

2017-02-10 Thread Todd Lipcon
rocess is along the line of: > > 1) copy software to target machine > > 2) shut down services on machine > > 3) expand software to final location > > 4) reboot (if new kernel) > > 5) restart services. > OK, hopefully that happens quickly us

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

2017-02-12 Thread Todd Lipcon
all nodes an appropriate way >> to bring spark in a position to work with kudu? >> What about the beeline-shell from hive and the possibility to read from >> kudu? >> >> My Environment: Cloudera 5.7 with kudu and impala-kudu from installed >> parcels. Build a working python-kudu library successfully from scratch (git) >> >> Thanks a lot! >> Frank >> > > -- Todd Lipcon Software Engineer, Cloudera

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

2017-02-16 Thread Todd Lipcon
f their inserts start failing with "data out of range for int32" errors or whatever. Forcing people to evaluate the column sizes up front avoids nasty surprises later. But, maybe you can see my biases towards static-typed languages leaking through here ;-) -Todd > 2017-02-14 19:44 GMT+01:0

Re: Fetch row on the basis of composite primary key from table using Java API

2017-02-15 Thread Todd Lipcon
fy the sender. Any use of this email is prohibited when > received in error. Impetus does not represent, warrant and/or guarantee, > that the integrity of this communication has been maintained nor that the > communication is free of errors, virus, interception or interference. > -- Todd Lipcon Software Engineer, Cloudera

Re: Error when creating partitions with BIGINT

2017-01-16 Thread Todd Lipcon
ne > > Can you please help us on this, if you have any idea about this issue or > any impact of this error on the functionality. > > Thank you so much for your help. > > Thanks, > Amit > > On Tue, Jan 3, 2017 at 8:14 AM, Todd Lipcon <t...@cloudera.com> wrote: >

Re: mixing range and hash partitioning

2017-02-28 Thread Todd Lipcon
.table_name("test_table") > .schema() > .add_hash_partitions({"key"}, 2) > .set_range_partition_columns({"time"}) > .num_replicas(1) > .Create() > > I later try to add a partition: > > auto timesplit(KuduSchema & schema, std::int64_t t) { > auto split = schema.NewRow(); > check_ok(split->SetInt64("time", t)); > return split; > } > > alterer->AddRangePartition( > timesplit(schema, date_start), > timesplit(schema, next_date_start)); > > check_ok(alterer->Alter()); > > But I get an error "Invalid argument: New range partition conflicts with > existing range partition". > > How are hash and range partitioning intended to be mixed? > > > > > > > > -- Todd Lipcon Software Engineer, Cloudera

Re: No order by in kudu java api

2016-09-01 Thread Todd Lipcon
ou parallelism on the client side. -Todd > Thanks, > Amit > > On Aug 31, 2016 10:36 PM, "Todd Lipcon" <t...@cloudera.com> wrote: > >> Hi Amit, >> >> That's correct, there is no "order by" support in the Java API, because >> this i

Re: No order by in kudu java api

2016-08-31 Thread Todd Lipcon
issemination, > distribution or copying of this communication or its contents is > strictly prohibited. If you have received this message > by mistake, please notify us by e-mail immediately and > delete it from your system. Many thanks. > > -- Todd Lipcon Software Engineer, Cloudera

Re: Question on performance of Kudu

2016-09-14 Thread Todd Lipcon
others? > > With your innovative design, these tests should show some good numbers. > > > >Regards, > >Roberta Marton > > > -- Todd Lipcon Software Engineer, Cloudera

Re: Casual meetup/happy hour at Strata?

2016-09-26 Thread Todd Lipcon
let's aim to finish up the happy hour by around that time. If you can't find us, feel free to ping me via Slack ( https://getkudu-slack.herokuapp.com/ if you don't already have an account) Thanks -Todd On Tue, Sep 20, 2016 at 10:28 AM, Todd Lipcon <t...@cloudera.com> wrote: > Sounds

Casual meetup/happy hour at Strata?

2016-09-17 Thread Todd Lipcon
and whoever's around can drop by and put some faces to names. Let me know if you're interested - if not enough people are around, I'll can the idea, but if it seems there are at least a few people in town it might be fun. -Todd -- Todd Lipcon Software Engineer, Cloudera

[ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Todd Lipcon
The Apache Kudu team is happy to announce the release of Kudu 1.0.0! Kudu is an open source storage engine for structured data which supports low-latency random access together with efficient analytical access patterns. It is designed within the context of the Apache Hadoop ecosystem and supports

Re: Casual meetup/happy hour at Strata?

2016-09-20 Thread Todd Lipcon
crowded during the conference. -Todd On Sat, Sep 17, 2016 at 7:12 PM, Clifford Resnick <cresn...@mediamath.com> wrote: > +1. We're just starting with Kudu, but it would be nice to meet other > users, and a casual Q & A would be great if you're up for it! > > On Sep 17, 2016 9

Re: [ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Todd Lipcon
predicates on your data frames. (though I haven't personally verified it) -Todd > On Sep 20, 2016, at 12:11 AM, Todd Lipcon <t...@apache.org> wrote: > > The Apache Kudu team is happy to announce the release of Kudu 1.0.0! > > Kudu is an open source storage engine for struc

Re: Casual meetup/happy hour at Strata?

2016-09-28 Thread Todd Lipcon
Hrm, looks like there may not be sufficient interest after all (or too many drinks available at the conference itself?) Unless someone texts/slacks me, I'll plan to stick around here at the conference. Todd On Sep 28, 2016 6:00 PM, "Todd Lipcon" <t...@cloudera.com> wrote:

Re: About data file size and on-disk size

2016-11-23 Thread Todd Lipcon
> I have a table with 16 buckets over 3 physical machines. The tablet > only > >>> has > >>> one replica. > >>> > >>> > >>> Tablets Web UI shows that each tablet has around ~4.5G on-disk size. > >>> > >>> In one machine, there are total 8 tablets, so the on-disk size is > about > >>> 4.5*8 = 36G. > >>> > >>> however, in the same machine, the disk actually used is about 211G. > >>> > >>> > >>> # du -sh /data/kudu/tserver/data/ > >>> > >>> 210G /data/kudu/tserver/data/ > >>> > >>> > >>> # find /data/kudu/tserver/data/ -name "*.data" | wc -l > >>> > >>> 8133 > >>> > >>> > >>> > >>> What’s the difference between data file and on-disk size? > >>> > >>> Can files in /data/kudu/tserver/data/ be compacted, purged, or some of > >>> them > >>> be deleted? > >>> > >>> > >>> Thanks very much. > >>> > >>> > >>> BR > >>> > >>> Brooks > >>> > >>> > >>> > -- Todd Lipcon Software Engineer, Cloudera

Re: kudu master crashes

2016-10-11 Thread Todd Lipcon
o <darren@gmail.com> wrote: > >> kudu master seldom crashes, but starting with yesterday, one of our >> two kudu masters crashes very often >> >> Can anyone help to see what's going on? >> >> you can get the core file here : http://167.88.124.211:8000/core.22459.xz >> >> >> > -- Todd Lipcon Software Engineer, Cloudera

Re: Schema Normalization

2016-10-10 Thread Todd Lipcon
u use Impala - this should help a lot with joins where one side of the join has selective predicates on a large table. -Todd > > On Oct 10, 2016, at 4:15 PM, Todd Lipcon <t...@cloudera.com> wrote: > > Hey Ben, > > Yea, we currently don't do great with very wide tables. For e

Re: Kudu on kerberos enabled cluster

2016-12-11 Thread Todd Lipcon
r communication? > Should not have any negative effect. There's no "conversion" or anything to worry about. > > Do you have any reference or details that what could be the precaution > that needs to be taken or how can we do it? > It ought to "just work"

Re: Good way to find "Real" size of the tables

2016-12-11 Thread Todd Lipcon
> > On Nov 30, 2016, at 4:29 PM, Todd Lipcon <t...@cloudera.com> wrote: > > On Wed, Nov 30, 2016 at 6:26 AM, Weber, Richard <riwe...@akamai.com> wrote: > >> Hi All, >> >> I'm trying to figure out the right/best/easiest way to find out how much

Re: About data file size and on-disk size

2017-01-09 Thread Todd Lipcon
ce from the container. > I will try it later this month. > > By the way, when will kudu's next release come out? Will 1.2 release in > mid-January include this fix? > > Thanks. > BR > -GU > > > ------ Original Message -- > *From:* "Todd Lipcon

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

2017-01-09 Thread Todd Lipcon
org/jira/browse/KUDU-1603 a while back. Hopefully he will chime in with a better answer than I can give :) -Todd 2016-12-13 16:05 GMT+01:00 Frank Heimerzheim <fh.or...@gmail.com>: > >> Hello Todd, >> >> thanks a lot for the clarification. >> >> Greetings

Re: Good way to find "Real" size of the tables

2016-11-30 Thread Todd Lipcon
egate as you prefer. Unfortunately this would give you only the physical size and not the logical, since you'd have to scan the actual data to know its uncompressed sizes. If you have any interest in helping to build such a tool I'd be happy to point you in the right direction. Otherwise let's file

Adding some guard rails to Kudu

2016-11-30 Thread Todd Lipcon
re being too conservative? Thanks -Todd -- Todd Lipcon Software Engineer, Cloudera

Re: Error when creating partitions with BIGINT

2017-01-02 Thread Todd Lipcon
munication, or any of its contents, > is strictly prohibited. If you have received it by mistake please let us > know by e-mail immediately and delete it from your system. Many thanks. > > > > The information contained in this message may be confidential. It has been > sent for the exclusive use of the intended recipient(s). If the reader of > this message is not the intended recipient, you are hereby > notified that any reading, use, publication, dissemination, > distribution or copying of this communication or its contents is > strictly prohibited. If you have received this message > by mistake, please notify us by e-mail immediately and > delete it from your system. Many thanks. > > -- Todd Lipcon Software Engineer, Cloudera

Re: Kudu on top of Alluxio

2017-03-27 Thread Todd Lipcon
es > depending on the workload. Anything else is untested AFAIK. > I would amend this and say that SSD for the WAL is nice to have, but not a requirement. We do lots of testing on non-SSD test clusters and I'm aware of many production clusters which also do not have SSD. -Todd -- Todd Lipcon Software Engineer, Cloudera

Re: How to calculate the optimal value of `maintenance_manager_num_threads`

2017-03-27 Thread Todd Lipcon
ssibly better performance. The tradeoff may be non-linear, though (i.e., doubling MM threads won't double performance!). As Kudu is still a young project, we're still gathering operational experience from users around topics like this. It would be great if you can share back any results you find with the community. Thanks -Todd -- Todd Lipcon Software Engineer, Cloudera

[ANNOUNCE] Apache Kudu 1.3.0 released

2017-03-20 Thread Todd Lipcon
The Apache Kudu team is happy to announce the release of Kudu 1.3.0. Kudu is an open source storage engine for structured data which supports low-latency random access together with efficient analytical access patterns. It is designed within the context of the Apache Hadoop ecosystem and supports

Re: How to flush `block_cache_capacity_mb` easily?

2017-04-10 Thread Todd Lipcon
...@gmail.com> >>> wrote: >>> >>>> Hi. >>>> >>>> I'm using Apache Kudu 1.2 on CDH 5.10. >>>> >>>> Currently, I'm doing a performance test of Kudu. >>>> >>>> Flushing OS Page Cache is easy, but I

Re: How to flush `block_cache_capacity_mb` easily?

2017-04-10 Thread Todd Lipcon
Then I'll try in my spare time. > > 2017-04-11 7:46 GMT+09:00 Todd Lipcon <t...@cloudera.com>: > >> On Sun, Apr 9, 2017 at 6:38 PM, Jason Heo <jason.heo@gmail.com> >> wrote: >> >>> Hi Todd. >>> >>> I hope you had a good week

Re: Question about redistributing tablets on failure of a tserver.

2017-04-12 Thread Todd Lipcon
sure there aren't additional problems in the cluster (admin >> guide >> on the ksck tool >> <https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck> >> ). >> >> >>> Q3. `--follower_unavailable_considered_failed_sec` can be changed >>> without restarting cluster? >>> >> >> The flag can be changed, but it comes with the same caveats as above: >> >> 'kudu tserver set-flag >> follower_unavailable_considered_failed_sec >> 900 --force' >> >> >> - Dan >> >> > -- Todd Lipcon Software Engineer, Cloudera

[ANNOUNCE] Apache Kudu 1.3.1 released

2017-04-19 Thread Todd Lipcon
The Apache Kudu team is happy to announce the release of Kudu 1.3.1. Kudu is an open source storage engine for structured data which supports low-latency random access together with efficient analytical access patterns. It is designed within the context of the Apache Hadoop ecosystem and supports

Re: Building from Source fails on my CentOS 7.2

2017-04-17 Thread Todd Lipcon
ed Hat 4.8.5-11) > Copyright (C) 2015 Free Software Foundation, Inc. > This is free software; see the source for copying conditions. There is NO > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. > > > Thanks, > > Jason. > -- Todd Lipcon Software Engineer, Cloudera

Re: How to flush `block_cache_capacity_mb` easily?

2017-04-17 Thread Todd Lipcon
t familiar with the > contributing process <https://kudu.apache.org/docs/contributing.html>. > > Thanks, > > Jason > > 2017-04-11 12:55 GMT+09:00 Todd Lipcon <t...@cloudera.com>: > >> Sure. Here's a high-level overview of the approach: >> >

Re: How to reuse tablet server UUID, or removing old one

2017-03-09 Thread Todd Lipcon
> How can I tell Kudu to completely remove the dead tabletserver5 UUID and >>> populate the new tabletserver5 UUID instead? >>> >>> The `kudu` command line tool does not seem to allow deleting a tablet >>> server UUID, or decommissioning it, >>> so how? >>> >>> Or, the other way, how can I recreate an empty Kudu tablet server reusing my >>> old UUID? >>> >> >> > -- Todd Lipcon Software Engineer, Cloudera

Re: How to flush `block_cache_capacity_mb` easily?

2017-04-07 Thread Todd Lipcon
rily table to evict >> cached block of testing table. >> >> It is cumbersome, so I'd like to know is there a command for flushing >> block caches (or another kudu's caches which I don't know yet) >> >> Thanks. >> >> Regards, >> Jason >> > > -- Todd Lipcon Software Engineer, Cloudera

Re: Building from Source fails on my CentOS 7.2

2017-04-17 Thread Todd Lipcon
y > libk5crypto.so.3 => /usr/path/to/lib/libk5crypto.so.3 (0x7f4f17b23000) > > Thanks, > > Jason. > > 2017-04-18 4:00 GMT+09:00 Todd Lipcon <t...@cloudera.com>: > >> Hi Jason, >> >> This is interesting. It seems like for some reason your libkrb5.so isn't

Re: tserver died during bulk indexing and dies again after restarting

2017-04-24 Thread Todd Lipcon
o get it in more detail? >>>> >>>> I tried what I did again and again to reproduce same error, but it >>>> didn't happen again. >>>> >>>> Please feel free to ask me for anything what you need to resolve. >>>> >>>> Regards, >>>> >>>> Jason >>>> >>>> 2017-04-23 1:56 GMT+09:00 <davidral...@gmail.com>: >>>> >>>>> Hi Jason >>>>> >>>>> Anything else of interest in those logs? Can you share them (with >>>>> just me, if you prefer)? Would it be possible to also get the WAL with >>>>> the corrupted entry? >>>>> Did this happen on a single server? >>>>> >>>>> Best >>>>> David >>>>> >>>> >>>> >>> >> > -- Todd Lipcon Software Engineer, Cloudera

Re: Configure Impala for Kudu on Separate Cluster

2017-08-14 Thread Todd Lipcon
erableException: [Peer > master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed, > [33361ms] trace too long, truncated) > CAUSED BY: NoLeaderFoundException: Master config ( > prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader. Exceptions > received: or

Re: Any plans for "Aggregation Push down" or integrating Impala + Kudu more tightly?

2017-06-29 Thread Todd Lipcon
th setting up a shared memory region and also found another small speedup over the domain socket. However, there was a lot of complexity involved in this code (particularly the shared memory approach) relative to the gain that we saw, so we didn't end up merging it before his internship ended :) -Todd -- Todd Lipcon Software Engineer, Cloudera

Re: Table size is not decreasing after large amount of rows deleted.

2017-04-24 Thread Todd Lipcon
y compacted cleanup is >> more unlikely) >> In Kudu 1.3 we added a background task to clean up old data even in >> the absence of compactions. Could you upgrade? >> >> Best >> David >> > > -- Todd Lipcon Software Engineer, Cloudera

Re: Some bulk requests are missing when a tserver stopped

2017-04-24 Thread Todd Lipcon
(RDD.scala:920) >> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkC >> ontext.scala:1869) >> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkC >> ontext.scala:1869) >> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) >> at org.apache.spark.scheduler.Task.run(Task.scala:89) >> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool >> Executor.java:1142) >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo >> lExecutor.java:617) >> at java.lang.Thread.run(Thread.java:745) >> -- >> > > -- Todd Lipcon Software Engineer, Cloudera

Re: Bad insert performance of java kudu-client

2017-04-25 Thread Todd Lipcon
e-getbyinetaddress-takes-way-too-long confirmation so far). See >> profiler screenshot http://pasteboard.co/8uHil3I5H.png (kudu-client >> v1.3.1), every call take 53 ms (!) on average. >> Also, could you recheck logic, why this function recalls 88 times in 12 >> seconds

Re: RPC and Service difference?

2017-04-28 Thread Todd Lipcon
rease input throughput then should i increase >> '--rpc_num_service_threads' right? >> >> 3. Why '--rpc_num_acceptors_per_address' has so small value compared >> to --rpc_num_service_threads? Because I'm going to increase that value >> too, do you think this is a bad idea? if so can you plz describe >> reason? >> >> Thanks for replying me! >> >> Have a nice day~ :) >> > > -- Todd Lipcon Software Engineer, Cloudera

Re: Apache Apex supports kudu as a high throughput sink

2017-05-30 Thread Todd Lipcon
atrato.io/ > blog/2017/05/28/apex-kudu-output/ . Please use the comments section to > provide any feedback. > > Regards, > Ananth > -- Todd Lipcon Software Engineer, Cloudera

Re: How to manage yearly range partition efficiently

2017-06-08 Thread Todd Lipcon
suggest plan 1, plus also put it on several people's calendars to verify :) Alternatively, something like in 2017 add the partitions for 2018 and 2019, so you always maintain one extra year ahead and you are less likely to "not notice" if the new one is not created in time. -Todd -- Todd Lipcon Software Engineer, Cloudera
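The keep-an-extra-year-ahead suggestion above can be expressed as a small check. This Java sketch only computes which years should already have a range partition; the class name, the method, and the `yearsAhead` parameter are hypothetical, and actually creating the partitions (through whatever DDL or client the table is managed with) is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Kudu or Impala.
class YearlyPartitionPlan {
    // Returns the years after currentYear that should already have a
    // range partition, so a missed run is noticed long before it matters.
    static List<Integer> yearsToEnsure(int currentYear, int yearsAhead) {
        List<Integer> years = new ArrayList<>();
        for (int i = 1; i <= yearsAhead; i++) {
            years.add(currentYear + i);
        }
        return years;
    }

    public static void main(String[] args) {
        // In 2017, make sure the 2018 and 2019 partitions exist.
        System.out.println(yearsToEnsure(2017, 2)); // [2018, 2019]
    }
}
```

A scheduled job could compare this list against the partitions that actually exist and alert on any gap, which complements the calendar reminder rather than replacing it.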

[ANNOUNCE] Apache Kudu 1.4.0 released

2017-06-15 Thread Todd Lipcon
The Apache Kudu team is happy to announce the release of Kudu 1.4.0. Kudu is an open source storage engine for structured data which supports low-latency random access together with efficient analytical access patterns. It is designed within the context of the Apache Hadoop ecosystem and supports

Re: What does "Failed RPC negotiation" in kudu-tserver.WARNING

2017-06-17 Thread Todd Lipcon
No problem. We are here to help! We are glad to see your team using Kudu. Todd On Jun 17, 2017 7:24 PM, "Jason Heo" wrote: > Hi Jean-Daniel, Todd, and Alexey > > Thank you for the replies. > > Recently, I've experienced many issues but successfully resolved them with

Re: Time travel reads in Kudu

2017-06-18 Thread Todd Lipcon
skew, you'll have to use the more advanced APIs to retrieve propagated timestamps from the server side after each write. -Todd On Sun, Jun 18, 2017 at 1:36 PM, Todd Lipcon <t...@cloudera.com> wrote: > Hi Ananth, > > Answers inline below > > On Sat, Jun 17, 2017 at 1:40 PM

Re: Time travel reads in Kudu

2017-06-18 Thread Todd Lipcon
t 15 minutes). You can bump this to a longer amount of time. > > If it is otherwise , does the model hold good after a compaction is > performed ? > > Yes, as of version 1.2 (I think) the full history is properly retained regardless of any compactions, etc, subject to the above mentioned history limit. -Todd -- Todd Lipcon Software Engineer, Cloudera

Re: Actual encoding and compression

2017-05-05 Thread Todd Lipcon
ed here > https://kudu.apache.org/docs/schema_design.html#encoding Kudu may > "transparently fall back to plain encoding" from dictionary encoding. > I think it would be useful to the user to see actual used encoding & > compression. > > -- > with best regards, Pav

Re: Apache Kudu, Spark and StreamSets

2017-05-04 Thread Todd Lipcon
r firms, each > of which is a legally separate and independent entity. Please see > www.deloitte.com.au/about <http://www.deloitte.com/au/about> for a > detailed description of the legal structure of Deloitte Touche Tohmatsu > Limited and its member firms. Nothing in this e-mail, nor any related > attachments or communications or services, have any capacity to bind any > other entity under the ‘Deloitte’ network of member firms (including those > operating in Australia). > -- Todd Lipcon Software Engineer, Cloudera

Re: Please tell me about License regarding kudu logo usage

2017-09-19 Thread Todd Lipcon
Oops, adding the original poster in case he or she is not subscribed to the list. On Sep 19, 2017 10:46 PM, "Todd Lipcon" <t...@cloudera.com> wrote: > Hi Yuya, > > There should be no problem to use the Apache Kudu logo in your conference > slides, assuming yo

Re: Please tell me about License regarding kudu logo usage

2017-09-19 Thread Todd Lipcon
Hi Yuya, There should be no problem using the Apache Kudu logo in your conference slides, assuming you are just using it as intended to describe or refer to the project itself. This is considered "nominative use" under trademark laws. You can read more about nominative use at:

Re: Composite primary key

2017-09-05 Thread Todd Lipcon
tside of the actual table data? > > -- > Br. > Janne Keskitalo, > Database Architect, PAF.COM > For support: dbdsupp...@paf.com > > -- Todd Lipcon Software Engineer, Cloudera

Re: Question about per server data upper limit.

2017-09-05 Thread Todd Lipcon
d kudu version: > 32 cpu Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz 128G memory 6*16T hdd > for data and 3T for wal. kudu 1.4.0 5 master + 5 tserver. > if more interesting things happened, I will replay here. > thanks again. > -- Todd Lipcon Software Engineer, Cloudera

Re: Change Data Capture (CDC) with Kudu

2017-09-29 Thread Todd Lipcon
e >>>> primary has up to say an hour before (or something like that). >>>> >>>> >>>> So far we considered a couple of options: >>>> - refreshing the secondary instance with a full copy of the primary one >>>> every so often, but that would mean having to transfer say 50TB of data >>>> between the two locations every time, and our network bandwidth constraints >>>> would prevent to do that even on a daily basis >>>> - having a column that contains the most recent time a row was updated, >>>> however this column couldn't be part of the primary key (because the >>>> primary key in Kudu is immutable), and therefore finding which rows have >>>> been changed every time would require a full scan of the table to be >>>> sync'd. It would also rely on the "last update timestamp" column to be >>>> always updated by the application (an assumption that we would like to >>>> avoid), and would need some other process to take into account the rows >>>> that are deleted. >>>> >>>> >>>> Since many of today's RDBMS (Oracle, MySQL, etc) allow for some sort of >>>> 'Change Data Capture' mechanism where only the 'deltas' are captured and >>>> applied to the secondary instance, we were wondering if there's any way in >>>> Kudu to achieve something like that (possibly mining the WALs, since my >>>> understanding is that each change gets applied to the WALs first). >>>> >>>> >>>> Thanks, >>>> Franco Venturi >>>> >>> >> > > -- Todd Lipcon Software Engineer, Cloudera

Re: [DISCUSS] Move Slack discussions to ASF official slack?

2017-10-23 Thread Todd Lipcon
on the ASF slack in case we decide to go > forward with this. If we don't decide to go forward with it, it's a good > idea to hold onto the channel and pin a message in there about how to get > to the "official" Kudu slack. > > On Mon, Oct 23, 2017 at 3:00 PM, Todd Lipcon <t...@cl

Re: [DISCUSS] Move Slack discussions to ASF official slack?

2017-10-23 Thread Todd Lipcon
channels on the official ASF slack (http://the-asf.slack.com/ > ) > and migrate our discussions there. What does everyone think? > -- Todd Lipcon Software Engineer, Cloudera

Re: INT128 Column Support Interest

2017-11-20 Thread Todd Lipcon
need for anywhere near that range. -Todd > > On Thu, Nov 16, 2017 at 5:30 PM, Dan Burkert <danburk...@apache.org> > wrote: > > > Aren't we going to need efficient encodings in order to make decimal work > > well, anyway? > > > > - Dan

Re: INT128 Column Support Interest

2017-11-16 Thread Todd Lipcon
sses, MD5 hashes and other similar types >> of data. >> >> Is there any interest or uses for a INT128 column type? Is anyone >> currently using a STRING or BINARY column for 128 bit data? >> >> Thank you, >> Grant >> -- >> Grant Henke >> Software Engineer | Cloudera >> gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke >> > > -- Todd Lipcon Software Engineer, Cloudera

Re: Low ingestion rate from Kafka

2017-11-01 Thread Todd Lipcon
1.3 it was called "kudu test loadgen" and may have fewer options available. -Todd On Wed, Nov 1, 2017 at 12:23 AM, Todd Lipcon <t...@cloudera.com> wrote: > >> On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon <t...@cloudera.com> wrote: >> >>>

Re: Low ingestion rate from Kafka

2017-11-01 Thread Todd Lipcon
erloaded (it's a torture-test cluster of sorts that is always way out of balance, re-replicating stuff, etc) -Todd > > > > On Wed, Nov 1, 2017 at 1:40 PM, Todd Lipcon <t...@cloudera.com> wrote: > >> On Wed, Nov 1, 2017 at 1:23 PM, Chao Sun <sunc...@uber.com> wrote:

Re: Kudu background tasks

2017-11-01 Thread Todd Lipcon
find information about these background > operations? I want to understand what happens in situations when some node > is offline and then comes back up after a while. What is tablet > initialization and bootstrapping, etc. > > -- > Br. > Janne Keskitalo, > Database Architect,

Re: Error message: 'Tried to update clock beyond the max. error.'

2017-11-01 Thread Todd Lipcon
What's the full log line where you're seeing this crash? Is it coming from tablet_bootstrap.cc, raft_consensus.cc, or elsewhere? -Todd 2017-11-01 15:45 GMT-07:00 Franco Venturi <fvent...@comcast.net>: > Our version is kudu 1.5.0-cdh5.13.0. > > Franco > > > >

Re: Low ingestion rate from Kafka

2017-11-01 Thread Todd Lipcon
--table-num-buckets=32 There are also a bunch of options to tune buffer sizes, flush options, etc. But with the default settings above on an 8-node cluster I have, I was able to insert 8M rows in 44 seconds (180k/sec). Adding --buffer-size-bytes=1000 almost doubled the above throughput (330k r

Re: Error message: 'Tried to update clock beyond the max. error.'

2017-11-01 Thread Todd Lipcon
and even time values up to 1000 seconds in the future (we read 1 billion nanoseconds as 1 billion microseconds (=1000 seconds)). I'll work on reproducing this and a patch, to backport to previous versions. -Todd On Wed, Nov 1, 2017 at 5:00 PM, Todd Lipcon <t...@cloudera.com> wrote:
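The unit mix-up described in this message is easy to reproduce in isolation. The following is a minimal Java sketch, not Kudu's actual code (the class and method names are hypothetical); it only shows how a raw value of 1 billion, meant as nanoseconds, becomes 1000 seconds when it is misread as microseconds.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from the Kudu source tree.
class ClockUnitMixup {
    // Correct interpretation: the raw value is a count of nanoseconds.
    static long asNanosToSeconds(long raw) {
        return TimeUnit.NANOSECONDS.toSeconds(raw);
    }

    // Buggy interpretation: the same raw value is treated as microseconds.
    static long asMicrosToSeconds(long raw) {
        return TimeUnit.MICROSECONDS.toSeconds(raw);
    }

    public static void main(String[] args) {
        long raw = 1_000_000_000L; // 1 billion
        System.out.println("read as ns: " + asNanosToSeconds(raw) + " s"); // 1 s
        System.out.println("read as us: " + asMicrosToSeconds(raw) + " s"); // 1000 s
    }
}
```

The factor of 1000 between the two readings matches the "1000 seconds in the future" figure quoted above.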

Re: The service queue is full; it has 400 items.. Retrying in the next heartbeat period.

2017-11-03 Thread Todd Lipcon
One thing you might try is to update the consensus rpc timeout to 30 seconds instead of 1. We changed the default in later versions. I'd also recommend upgrading to 1.4 or 1.5 for other related fixes to consensus stability. I think I recall you were on 1.3 still? Todd On Nov 3, 2017 7:47 PM,

Re: kudu 1.4 kerberos

2017-10-24 Thread Todd Lipcon
> > On Mon, Oct 16, 2017 at 2:29 PM, Matteo Durighetto < > m.durighe...@miriade.it> wrote: > > the "abcdefgh1234" it's an example of the string created by the > cloudera manager during the enable kerberos. > > ... > > On Mon, Oct 16, 2017 at 11:57

Re: kudu 1.4 kerberos

2017-10-24 Thread Todd Lipcon
On Tue, Oct 24, 2017 at 12:41 PM, Todd Lipcon <t...@cloudera.com> wrote: > I've filed https://issues.apache.org/jira/browse/KUDU-2198 to provide a > workaround for systems like this. I should have a patch up shortly since > it's relatively simple. > > ... and here's the patc

Re: Low ingestion rate from Kafka

2017-10-30 Thread Todd Lipcon
Hey Chao, Nice to hear you are checking out Kudu. What are you using to consume from Kafka and write to Kudu? Is it possible that it is Java code and you are using the SYNC flush mode? That would result in a separate round trip for each record and thus very low throughput. Todd On Oct 30, 2017
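To see why one round trip per record caps throughput, here is a back-of-the-envelope Java sketch. The 1 ms round-trip time and the batch size of 1000 are assumed numbers for illustration, not measurements from this thread, and the model (a hypothetical class, not part of the Kudu client) ignores server-side processing time entirely.

```java
// Hypothetical model: rows/sec for one client thread that waits for each
// round trip to complete before sending the next request.
class FlushModeEstimate {
    static double rowsPerSecond(double roundTripMillis, int rowsPerRoundTrip) {
        double roundTripsPerSecond = 1000.0 / roundTripMillis;
        return roundTripsPerSecond * rowsPerRoundTrip;
    }

    public static void main(String[] args) {
        double rttMillis = 1.0; // assumed network round-trip time
        System.out.println("1 row per round trip:     " + rowsPerSecond(rttMillis, 1) + " rows/s");
        System.out.println("1000 rows per round trip: " + rowsPerSecond(rttMillis, 1000) + " rows/s");
    }
}
```

Under these assumptions a synchronous writer tops out at about a thousand rows per second per thread no matter how fast the server is, while batching moves the ceiling by the batch size.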

Re: Low ingestion rate from Kafka

2017-10-31 Thread Todd Lipcon
sert insert = kuduTable.newInsert(); > PartialRow row = insert.getRow(); > // fill the columns > kuduSession.apply(insert) > } > > I didn't specify the flushing mode, so it will pick up the AUTO_FLUSH_SYNC > as default? > should I use MANUAL_FLUSH? > > Thanks, > Ch

Re: Low ingestion rate from Kafka

2017-10-31 Thread Todd Lipcon
is case (only upsert)? >> >> Thanks again, >> Chao >> >> On Mon, Oct 30, 2017 at 11:42 PM, Todd Lipcon <t...@cloudera.com> wrote: >> >>> If you want to manage batching yourself you can use the manual flush >>> mode. Easiest would be the auto

Re: scan performance super bad

2018-05-14 Thread Todd Lipcon
ON "78" <= VALUES < "785000", > PARTITION "785000" <= VALUES < "79", > PARTITION "79" <= VALUES < "795000", > PARTITION "795000" <= VALUES < "80", > PARTITION "80" <= VALUES < "805000", > PARTITION "805000" <= VALUES < "81", > PARTITION "81" <= VALUES < "815000", > PARTITION "815000" <= VALUES < "82", > PARTITION "82" <= VALUES < "825000", > PARTITION "825000" <= VALUES < "83", > PARTITION "83" <= VALUES < "835000", > PARTITION "835000" <= VALUES < "84", > PARTITION "84" <= VALUES < "845000", > PARTITION "845000" <= VALUES < "85", > PARTITION "85" <= VALUES < "855000", > PARTITION "855000" <= VALUES < "86", > PARTITION "86" <= VALUES < "865000", > PARTITION "865000" <= VALUES < "87", > PARTITION "87" <= VALUES < "875000", > PARTITION "875000" <= VALUES < "88", > PARTITION "88" <= VALUES < "885000", > PARTITION "885000" <= VALUES < "89", > PARTITION "89" <= VALUES < "895000", > PARTITION "895000" <= VALUES < "90", > PARTITION "90" <= VALUES < "905000", > PARTITION "905000" <= VALUES < "91", > PARTITION "91" <= VALUES < "915000", > PARTITION "915000" <= VALUES < "92", > PARTITION "92" <= VALUES < "925000", > PARTITION "925000" <= VALUES < "93", > PARTITION "93" <= VALUES < "935000", > PARTITION "935000" <= VALUES < "94", > PARTITION "94" <= VALUES < "945000", > PARTITION "945000" <= VALUES < "95", > PARTITION "95" <= VALUES < "955000", > PARTITION "955000" <= VALUES < "96", > PARTITION "96" <= VALUES < "965000", > PARTITION "965000" <= VALUES < "97", > PARTITION "97" <= VALUES < "975000", > PARTITION "975000" <= VALUES < "98", > PARTITION "98" <= VALUES < "985000", > PARTITION "985000" <= VALUES < "99", > PARTITION "99" <= VALUES < "995000", > PARTITION VALUES >= "995000" > ) > > > So it looks like you have a numeric value being stored here in the string column. Are you sure that you are properly zero-padding when creating your key? For example if you accidentally scan from "50_..." to "80_..." you will end up scanning a huge portion of your table. 
> i did not delete rows in this table ever.
>
> my scanner code is below:
> buildKey method will build the lower bound and the upper bound, the unique
> id is same, the startRow offset(third part) is 0, and the endRow offset is
> , startRow and endRow only differs from time.
> though the max offset is big(999), generally it is less than 100.
>
> private KuduScanner buildScanner(Metric startRow, Metric endRow,
>         List dimensionIds, List dimensionFilterList) {
>     KuduTable kuduTable =
>             kuduService.getKuduTable(BizConfig.parseFrom(startRow.getBizId()));
>
>     PartialRow lower = kuduTable.getSchema().newPartialRow();
>     lower.addString("key", buildKey(startRow));
>     PartialRow upper = kuduTable.getSchema().newPartialRow();
>     upper.addString("key", buildKey(endRow));
>
>     LOG.info("build scanner. lower = {}, upper = {}", buildKey(startRow),
>             buildKey(endRow));
>
>     KuduScanner.KuduScannerBuilder builder =
>             kuduService.getKuduClient().newScannerBuilder(kuduTable);
>     builder.setProjectedColumnNames(COLUMNS);
>     builder.lowerBound(lower);
>     builder.exclusiveUpperBound(upper);
>     builder.prefetching(true);
>     builder.batchSizeBytes(MAX_BATCH_SIZE);
>
>     if (CollectionUtils.isNotEmpty(dimensionFilterList)) {
>         for (int i = 0; i < dimensionIds.size() && i < MAX_DIMENSION_NUM; i++) {
>             for (DimensionFilter dimensionFilter : dimensionFilterList) {
>                 if (!Objects.equals(dimensionFilter.getDimensionId(),
>                         dimensionIds.get(i))) {
>                     continue;
>                 }
>                 ColumnSchema columnSchema =
>                         kuduTable.getSchema().getColumn(String.format("dimension_%02d", i));
>                 KuduPredicate predicate = buildKuduPredicate(columnSchema,
>                         dimensionFilter);
>                 if (predicate != null) {
>                     builder.addPredicate(predicate);
>                     LOG.info("add predicate. predicate = {}", predicate.toString());
>                 }
>             }
>         }
>     }
>     return builder.build();
> }
>

What client version are you using? 1.7.0?

> i checked the metrics, only get content below, it seems no relationship
> with my table.
>

Looks like you got the metrics from the kudu master, not a tablet server. You need to figure out which tablet server you are scanning and grab the metrics from that one.

-Todd

-- Todd Lipcon Software Engineer, Cloudera
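
The zero-padding concern raised in this reply can be illustrated with a minimal Java sketch. It is not the poster's buildKey method, which is not shown in the thread: the class name PaddedKey, the field widths (6, 10 and 3 digits) and the id_time_offset layout are all assumptions chosen for illustration. The point is only that string keys compare lexicographically, so numeric parts need a fixed width for range bounds to behave as expected.

```java
import java.util.Locale;

// Hypothetical helper, not taken from the thread: shows why numeric parts of a
// string key should be zero-padded to a fixed width before building scan bounds.
final class PaddedKey {
    private PaddedKey() {}

    // Left-pads a non-negative number with zeros to the given width.
    static String pad(long value, int width) {
        if (value < 0 || width <= 0) {
            throw new IllegalArgumentException("value must be >= 0 and width > 0");
        }
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    // Assumed key layout: <id>_<epochSeconds>_<offset>. The widths are illustrative only.
    static String buildKey(long id, long epochSeconds, int offset) {
        return pad(id, 6) + "_" + pad(epochSeconds, 10) + "_" + pad(offset, 3);
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "10" because strings compare character by character.
        System.out.println("9".compareTo("10") > 0);
        // Padded: "000009" sorts before "000010", matching numeric order.
        System.out.println(pad(9, 6).compareTo(pad(10, 6)) < 0);
        System.out.println(buildKey(50, 1526256000L, 0));
    }
}
```

With unpadded ids, a scan from an id of 50 to an id of 80 would also cover every key starting with "500", "51234" and so on, which is the kind of accidental wide scan described above.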

Re: scan performance super bad

2018-05-13 Thread Todd Lipcon
should call hundreds > times nextRows() to fetch all data, and it finally cost several minutes. > > I don't know why this happened or how to resolve it. Maybe the final > solution is that I should give up Kudu and use HBase instead... > -- Todd Lipcon Software Engineer, Cloudera

Re: Kudu read - performance issue

2018-05-11 Thread Todd Lipcon
st-values). So, if you had for example: pre-chunk in-list: 1,2,3,4,5,6 chunk 1: col2 IN (1,6) chunk 2: col2 IN (2,5) chunk 3: col2 IN (3,4) then you will actually scan over the middle portion of that table 3 times. If you sort the in-list before chunking you'll avoid the multiple-scan effect here. -Todd -- Todd Lipcon Software Engineer, Cloudera
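
The advice about sorting the in-list before chunking can be sketched in Java as follows. This is a minimal illustration, not code from the thread: the class InListChunker and its method name are hypothetical, and the chunks it returns would still need to be turned into IN-list predicates by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: sorts the IN-list values first, then splits them into
// consecutive chunks so each chunk covers a narrow, non-overlapping value range.
final class InListChunker {
    private InListChunker() {}

    static List<List<Integer>> sortedChunks(List<Integer> values, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Integer> sorted = new ArrayList<>(values);
        Collections.sort(sorted);

        List<List<Integer>> chunks = new ArrayList<>();
        for (int start = 0; start < sorted.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, sorted.size());
            // Copy the sublist so each chunk is independent of the sorted buffer.
            chunks.add(new ArrayList<>(sorted.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Values arriving in an order that would otherwise yield the chunks (1,6), (2,5), (3,4).
        System.out.println(sortedChunks(Arrays.asList(1, 6, 2, 5, 3, 4), 2));
    }
}
```

After sorting, the chunks are (1,2), (3,4) and (5,6), so no chunk's minimum-to-maximum range overlaps another's and the middle of the table is not scanned repeatedly.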

Re: will upsert have bad effect on scan performance?

2018-05-21 Thread Todd Lipcon
> will it have a bad effect even though these data were firstly loaded. > i do not know compaction mechanism of kudu, will it lead to many > compaction, thus lead to bad scan performance. > > Best regards. > -- Todd Lipcon Software Engineer, Cloudera

Re: kudu Insert、Update、Delete operating data lost

2018-06-15 Thread Todd Lipcon
contains Insert, Update, Delete operations, if > the database does not exist in the data there will be > some new data loss, how to avoid such problems. > -- Todd Lipcon Software Engineer, Cloudera

Re: 答复: 答复: How kudu synchronize real-time records?

2017-10-26 Thread Todd Lipcon
ay node1 load record1 from WAL at t1, node2 t2, node3 t3 (t1 < > t2 < t3) then a reading client attached to node1 can see the record, but other > reading clients attached to node2 or node3 may miss > record1. > > > > I think that does not happen in Kudu, and I wonder how Kudu synchronizes > real-time data. > > > > Thanks! > > > > -- Todd Lipcon Software Engineer, Cloudera

Re: new Kudu benchmarks

2018-01-05 Thread Todd Lipcon
Oh, one other piece of feedback: maybe worth editing the title to say "vs Apache Parquet" instead of "vs Apache Impala" since in all cases you are using Impala as the query engine? -Todd On Fri, Jan 5, 2018 at 11:06 AM, Todd Lipcon <t...@cloudera.com> wrote

Re: new Kudu benchmarks

2018-01-05 Thread Todd Lipcon
evelopers for such an amazing and much-needed product. > > Boris > > > -- Todd Lipcon Software Engineer, Cloudera

Re: Data inconsistency after restart

2018-01-04 Thread Todd Lipcon
ache.org/docs/command_line_tools_referenc >>>>>> e.html#cluster-ksck for more details. For restarting a cluster, I >>>>>> would recommend taking down all tablet servers at once, otherwise >>>>>> tablet >>>>>> replicas may try to replicate data from the server that was taken >>>>>> down. >>>>>> >>>>>> Hope this helped, >>>>>> Andrew >>>>>> >>>>>> On Tue, Dec 5, 2017 at 10:42 AM, Petter von Dolwitz (Hem) < >>>>>> petter.von.dolw...@gmail.com> wrote: >>>>>> >>>>>> Hi Kudu users, >>>>>>> >>>>>>> We just started to use Kudu (1.4.0+cdh5.12.1). To make a baseline for >>>>>>> evaluation we ingested 3 months' worth of data. During ingestion we >>>>>>> were >>>>>>> facing messages from the maintenance threads that a soft memory >>>>>>> limit was >>>>>>> reached. It seems like the background maintenance threads stopped >>>>>>> performing their tasks at this point in time. It also seems like >>>>>>> the >>>>>>> memory was never recovered even after stopping ingestion so I guess >>>>>>> there >>>>>>> was a large backlog being built up. I guess the root cause here is >>>>>>> that we >>>>>>> were a bit too conservative when giving Kudu memory. After a >>>>>>> restart a >>>>>>> lot of maintenance tasks were started (i.e. compaction). >>>>>>> >>>>>>> When we verified that all data was inserted we found that some data >>>>>>> was missing. We added this missing data and on some chunks we got the >>>>>>> information that all rows were already present, i.e. Impala says >>>>>>> something >>>>>>> like Modified: 0 rows, nnn errors. Doing the verification again >>>>>>> now >>>>>>> shows that the Kudu table is complete. So, even though we did not >>>>>>> insert >>>>>>> any data on some chunks, a count(*) operation over these chunks now >>>>>>> returns >>>>>>> a different value. >>>>>>> >>>>>>> Now to my question. Will data be inconsistent if we recycle Kudu >>>>>>> after >>>>>>> seeing soft memory limit warnings? 
>>>>>>> >>>>>>> Is there a way to tell when it is safe to restart Kudu to avoid these >>>>>>> issues? Should we use any special procedure when restarting (e.g. >>>>>>> only >>>>>>> restart the tablet servers, only restart one tablet server at a time >>>>>>> or >>>>>>> something like that)? >>>>>>> >>>>>>> The table design uses 50 tablets per day (times 90 days). It is 8 TB >>>>>>> of data after 3xreplication over 5 tablet servers. >>>>>>> >>>>>>> Thanks, >>>>>>> Petter >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> Andrew Wong >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Andrew Wong >>>>> >>>>> >>>> >>>> >>> > > -- > David Alves > -- Todd Lipcon Software Engineer, Cloudera

Re: new Kudu benchmarks

2018-01-05 Thread Todd Lipcon
ittle clearer. Thanks -Todd > > On Fri, Jan 5, 2018 at 11:13 AM, Todd Lipcon <t...@cloudera.com> wrote: > >> Oh, one other piece of feedback: maybe worth editing the title to say "vs >> Apache Parquet" instead of "vs Apache Impala" since in all cases

Re: new Kudu benchmarks

2018-01-05 Thread Todd Lipcon
estamp representation with microsecond precision, so that's what Kudu implemented internally. With 64 bits there is still enough range to store dates for 584,554 years at microsecond precision. I think https://impala.apache.org/docs/build/html/topics/impala_timestamp.html has some info about Kudu compatibility and limitations. -Todd -- Todd Lipcon Software Engineer, Cloudera
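
The range figure quoted in this message can be checked with a short Java calculation. This is a sketch under one stated assumption: a year is taken to be the average Gregorian year of 365.2425 days, which is not stated in the message but reproduces its number. The class name TimestampRange is illustrative.

```java
import java.math.BigInteger;

// Checks how many years a 64-bit count of microseconds can span.
// Assumption: one year = 365.2425 days (average Gregorian year) = 31,556,952 seconds.
final class TimestampRange {
    private TimestampRange() {}

    static final long MICROS_PER_GREGORIAN_YEAR = 31_556_952L * 1_000_000L;

    static long yearsCoveredBy64BitMicros() {
        // 2^64 distinct values, one per microsecond; BigInteger avoids long overflow.
        BigInteger distinctValues = BigInteger.ONE.shiftLeft(64);
        return distinctValues
                .divide(BigInteger.valueOf(MICROS_PER_GREGORIAN_YEAR))
                .longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(yearsCoveredBy64BitMicros()); // prints 584554
    }
}
```

In other words, 2^64 microseconds divided by the microseconds in an average Gregorian year gives 584,554 whole years, matching the figure above.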

Re: new Kudu benchmarks

2018-01-08 Thread Todd Lipcon
our feedback! look forward to new releases coming up! > > Boris > > On Fri, Jan 5, 2018 at 9:08 PM, Todd Lipcon <t...@cloudera.com> wrote: > >> On Fri, Jan 5, 2018 at 5:50 PM, Boris Tyukin <bo...@boristyukin.com> >> wrote: >> >>> Hi Todd, >

Re: Bulk / Initial load of large tables into Kudu using Spark

2018-01-29 Thread Todd Lipcon
some_kudu_table >> SELECT * FROM some_csv_table does the trick. >> >> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS, >> HBase, or any other data store that has an InputFormat. >> >> No tool is provided to load data directly into Kudu’s on-d

Re: Bulk / Initial load of large tables into Kudu using Spark

2018-01-30 Thread Todd Lipcon
-Todd > > On Mon, Jan 29, 2018 at 2:22 PM, Todd Lipcon <t...@cloudera.com> wrote: > >> On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles <patr...@cloudera.com> >> wrote: >> >>> Hi Boris. >>> >>> 1) I would like to bypass Impa

Re: Using Kudu to Handle Huge amount of Data

2018-02-04 Thread Todd Lipcon
e run some basic smoke tests of Kudu on ~800 nodes before. > > Looking forward to your inputs on any organisation using Kudu where data > volumes of more than 10 TB is ingested everyday. > Hope some other users can chime in. -Todd -- Todd Lipcon Software Engineer, Cloudera

Re: Recommended maximum amount of stored data per tablet server

2018-08-02 Thread Todd Lipcon
> RAM, 48 cpu cores. Does it mean the other 52(= 15 * 4 - 8) TB space is > recommended to leave for other systems? We prefer to make the machine > dedicated to Kudu. Can tablet server leverage the whole space efficiently? > > > > Thanks, > > Quanlong > -- Todd Lipcon Software Engineer, Cloudera

Re: Re: Recommended maximum amount of stored data per tablet server

2018-08-02 Thread Todd Lipcon
systems. One recommendation, though, is to consider using a dedicated disk for the Kudu WAL and metadata, which can help performance, since the WAL can be sensitive to other heavy workloads monopolizing bandwidth on the same spindle. -Todd > > At 2018-08-03 02:26:37, "Todd Lipcon" wrot

Re: Re: Why RowSet size is much smaller than flush_threshold_mb

2018-08-01 Thread Todd Lipcon
-off. -Todd > At 2018-06-15 23:41:17, "Todd Lipcon" wrote: > > Also, keep in mind that when the MRS flushes, it flushes into a bunch of > separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by > default). This is set by --budgeted_compacti
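
The rolling behaviour described in the quoted text (a flush starts a new RowSet every N MB, with N=32 by default) implies a simple ceiling division for a rough RowSet count. The Java sketch below is only that arithmetic: the class and method names are hypothetical, the flushed size is an input the caller must supply, and it ignores the fact that encoded and compressed on-disk data differs in size from the in-memory MemRowSet.

```java
// Rough estimate only: how many RowSets a single flush would roll into if a new
// RowSet is started every rollSizeMb megabytes (32 MB per the quoted message).
final class RowSetRollEstimate {
    private RowSetRollEstimate() {}

    static long estimateRowSets(long flushedMb, long rollSizeMb) {
        if (flushedMb < 0 || rollSizeMb <= 0) {
            throw new IllegalArgumentException("flushedMb must be >= 0 and rollSizeMb > 0");
        }
        // Ceiling division: a partially filled last RowSet still counts as one.
        return (flushedMb + rollSizeMb - 1) / rollSizeMb;
    }

    public static void main(String[] args) {
        // Assumed example input: a 1024 MB flush with the 32 MB roll size.
        System.out.println(estimateRowSets(1024, 32));
    }
}
```

This is why individual RowSets can be much smaller than the flush threshold: one flush is split across many RowSets rather than written as a single one.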

Re: Re: Re: Why RowSet size is much smaller than flush_threshold_mb

2018-08-01 Thread Todd Lipcon
lts or giving some more prescriptive advice? I'm a little nervous that saying "here are all the internals, and here are 100 config flags to study" will scare users more than help them :) -Todd > > At 2018-08-02 01:06:40,"Todd Lipcon" wrote: > > On Wed, Aug 1, 2018

Re: Dictionary encoding

2018-08-06 Thread Todd Lipcon
> Does anybody know the maximum number of distinct values in a String column > that Kudu considers in order to set its encoding to Dictionary? Many thanks > :) > > br, > > -- Todd Lipcon Software Engineer, Cloudera

Re: "broadcast" tablet replication for kudu?

2018-07-23 Thread Todd Lipcon
>>>>> perhaps partition kudu table, even if small, into multiple tablets), it >>>>> was >>>>> to speed up joins/exchanges, not to parallelize the scan. >>>>> >>>>> For example recently we ran into this slow query where the

Re: "broadcast" tablet replication for kudu?

2018-07-23 Thread Todd Lipcon
Impala 2.12. The external RPC protocol is still Thrift. Todd On Mon, Jul 23, 2018, 7:02 AM Clifford Resnick wrote: > Is this impala 3.0? I’m concerned about breaking changes and our RPC to > Impala is thrift-based. > > From: Todd Lipcon > Reply-To: "user@kudu.apache.org"

Re: "broadcast" tablet replication for kudu?

2018-07-23 Thread Todd Lipcon
gh replication count.* > > *I could see bumping the replication count to 5 for these tables since the > extra storage cost is low and it will ensure higher availability of the > important central tables, but I'd be surprised if there is any measurable > perf impact.* > "

Re: cannot import kudu.client

2018-08-31 Thread Todd Lipcon
@boot2docker:~/kudu# python > Python 2.7.12 (default, Dec 4 2017, 14:50:18) > [GCC 5.4.0 20160609] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import kudu > >>> import kudu.client >

Re: spark on kudu performance!

2018-07-05 Thread Todd Lipcon
appen automatically so long as the filter predicate has been pushed down. Using 'explain()' and showing us the results, along with the code you used to create your table, will help understand what might be the problem with performance. -Todd -- Todd Lipcon Software Engineer, Cloudera
