Fwd: Hive Compatibility

2020-09-26 Thread Boris Tyukin
I am a Kudu user, not a dev, but here are my 2 cents. I would not use that Hive/Kudu integration for any sort of production/important work. I think it was a quick POC, and I remember seeing a Sqoop Kudu prototype too, but be warned... you are probably better off using the Kudu Spark client as I see it is being

real-time pipeline with Kudu

2020-05-12 Thread Boris Tyukin
Hi guys, there are not a lot of real-life experiences with Kudu and I wanted to share with you my blog post where I described our use-case - near real-time data lake in a large healthcare system https://boristyukin.com/building-near-real-time-big-data-lake-part-2/ Our real-time infra was pretty

Re: Partitioning Rules of Thumb

2020-04-25 Thread Boris Tyukin
, 2020 at 2:54 PM Boris Tyukin wrote: > Cliff, i would be extremely interested to see a blog post to compare > Snowflake, Redshift and Impala/Kudu since you tried all of them. > > would love to get some details how you set up Kudu/Impala cluster on AWS > as well as my company m

Re: Partitioning Rules of Thumb

2020-04-25 Thread Boris Tyukin
ng. This is despite the fact that RedShift has built-in cache. We > also use streaming ingestion which, aside from being impossible with > RedShift, removes the added cost of staging. > > Getting back to Snowflake, there's no way we could use it the same way we > use Kudu, and

Re: Kudu - Dremio

2020-03-29 Thread Boris Tyukin
when I was looking at Dremio some time ago (very interesting technology and I love the idea of query rewrites and materialized views federated from different sources), it did not support Impala which you have to use currently to get SQL support with Kudu. On Sun, Mar 29, 2020 at 11:44 AM pino

Re: Partitioning Rules of Thumb

2020-03-17 Thread Boris Tyukin
tion in the next > release after 3.4. We saw huge speedups on a lot of queries (like 10x or > more). Some queries didn't benefit much, if they were limited by the scan > perf (including if the runtime filters pushed into the scans were filtering > most data before joins). > >

Re: Partitioning Rules of Thumb

2020-03-16 Thread Boris Tyukin
at for your problem space. But if > analytics are also more centrally integrated in pipelines then parquet is > hard to beat for the price and flexibility, as is Kudu for dashboards or > other intelligence that leverages upsert/key semantics. Ultimately, like > many of us, you m

Re: Partitioning Rules of Thumb

2020-03-13 Thread Boris Tyukin
Hive depending on what version of Hive you > are using. Feel free to reach out on Slack if you have issues. The details > on the integration here: > https://cwiki.apache.org/confluence/display/Hive/Kudu+Integration > > On Fri, Mar 13, 2020 at 8:22 AM Boris Tyukin > wrote: > >

Re: Partitioning Rules of Thumb

2020-03-13 Thread Boris Tyukin
Separately, you mentioned having to manage Kudu's compaction process. > Could you go into more detail here? > > On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin > wrote: > >> thanks Cliff, this is really good info. I am tempted to do the benchmarks >> myself but nee

Re: Partitioning Rules of Thumb

2020-03-11 Thread Boris Tyukin
Shift, removes the added cost of staging. > > Getting back to Snowflake, there's no way we could use it the same way we > use Kudu, and even if we could, the cost would probably put us out of > business! > > On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin wrote: > >

Re: Partitioning Rules of Thumb

2020-03-10 Thread Boris Tyukin
s, and each of those might land > on a different tablet server running on isolated hardware. For a > significantly larger table into which you expect highly concurrent > workloads, the recommendation serves as a lower bound -- I'd recommend > having more partitions, and if your data is

Re: Partitioning Rules of Thumb

2020-03-07 Thread Boris Tyukin
to stream 100s of tables and we use the PK from the RDBMS and need to come up with an automated way to pick the number of partitions/tablets. So far I was using the 1Gb rule but am rethinking this now for another project. On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin wrote: > forgot to post results of my quick test: >
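A minimal sketch of how the "1Gb rule" mentioned above could be automated. It assumes the rule means roughly one tablet per 1 GiB of table data, which the message does not spell out; the function name `tablets_for_size` and the sample sizes are hypothetical.

```python
GIB = 1024 ** 3  # bytes in one GiB

def tablets_for_size(table_bytes, target_bytes=GIB, minimum=1):
    """Pick a tablet count so each tablet holds at most target_bytes.

    Assumption: the "1Gb rule" is read as one tablet per ~1 GiB of data.
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    needed = -(-table_bytes // target_bytes)
    return max(minimum, needed)

# Hypothetical table sizes, not taken from the thread.
print(tablets_for_size(300 * 1024 ** 2))  # a 300 MiB table -> 1
print(tablets_for_size(25 * GIB + 1))     # just over 25 GiB -> 26
```

A rule like this only sizes tablets by data volume; it says nothing about concurrency or the number of tablet servers, which the rest of the thread discusses.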

Re: Partitioning Rules of Thumb

2020-03-07 Thread Boris Tyukin
servers. If I have 20 tablet servers and I have two tables - one with 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 because of replication)? On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin wrote: > hey guys, > > I asked the same question
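The "20 / 3" arithmetic in the question above can be worked through as a quick sketch. This only illustrates the proposal in the message (servers divided by replication factor); it is not a recommendation from the Kudu team, and the function name is hypothetical.

```python
def partitions_by_servers(tablet_servers, replication_factor=3):
    """Return (whole partitions, total replicas) for the servers / RF proposal."""
    whole_partitions = tablet_servers // replication_factor
    total_replicas = whole_partitions * replication_factor
    return whole_partitions, total_replicas

# 20 tablet servers with the default replication factor of 3.
print(partitions_by_servers(20))  # (6, 18)
```

With 6 partitions the table has 18 tablet replicas, so on a 20-server cluster two servers would hold no replica of that table. Note that the result is the same for the 1MM-row and the 100MM-row table, which is exactly why the message asks whether table size should matter.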

Re: Please please add bloom filter support

2019-10-21 Thread Boris Tyukin
This explains why some of our heavy queries against billion row tables are so much slower than Impala on HDFS. Surprised it has not been addressed yet, as the performance difference based on the numbers in the Kudu JIRA is staggering. On Mon, Oct 21, 2019, 00:28 Adar Lieber-Dembo wrote: > I commented on

Incorta vs Kudu

2019-10-15 Thread Boris Tyukin
Hi guys, I was just reading about Incorta. They have been getting a lot of traction and buzz recently. While they do not explain how it actually works, I got a feeling their "secret" technology is very similar to Kudu. Just curious if you looked at it and compared it to the Kudu/Impala combo. They mentioned their

Re: Long text and complex data types support

2019-09-09 Thread Boris Tyukin
exists yet. Do you have any sample schemas with complex types > you could send me to help inform designs and trade offs? > > Thank you, > Grant > > On Sat, Sep 7, 2019 at 11:43 AM Boris Tyukin > wrote: > >> Hi guys, >> >> Any plans to support lon

Long text and complex data types support

2019-09-07 Thread Boris Tyukin
Hi guys, Any plans to support a long text type in Kudu? We would love to use Kudu with other projects but unfortunately long text data is pretty common in the healthcare industry and we have to use Hive/Impala/HDFS instead, which is quite painful since we cannot do in-place updates and deletes. Same

Re: "broadcast" tablet replication for kudu?

2019-04-24 Thread Boris Tyukin
node to improve performance and avoid broadcasting this table every time? On Mon, Jul 23, 2018 at 10:52 AM Todd Lipcon wrote: > > > On Mon, Jul 23, 2018, 7:21 AM Boris Tyukin wrote: > >> Hi Todd, >> >> Are you saying that your earlier comment below is not lo

Re: Kudu table api

2019-03-22 Thread Boris Tyukin
Hi Dmitry, check Java Kudu API examples if you have not done it yet https://github.com/apache/kudu/tree/master/examples I remember it had a helper class that counts rows. Like Adar said, I do not think there is a better / faster way - you just create a Kudu scanner, get rows back and iterate over

Re: Changing number of Kudu worker threads

2019-02-14 Thread Boris Tyukin
s (which I'm guessing > was your case). Or should we have some recipes like "here's how you should > write to Kudu from Nifi"? Any thoughts? > > In any case, thanks for reporting back! > > J-D > > On Thu, Feb 14, 2019 at 1:56 PM Boris Tyukin > wrote: > >>

Re: Changing number of Kudu worker threads

2019-02-14 Thread Boris Tyukin
duration of the tests. > > - Can you share your table schema and partitions schema? For the columns > I'm mostly interested in the row keys and the cardinality of each column. > > Thanks, > > J-D > > On Thu, Feb 14, 2019 at 5:41 AM Boris Tyukin > wrote: > >> Hi

Re: Changing number of Kudu worker threads

2019-02-12 Thread Boris Tyukin
Can someone point us to documentation or explain what these parameters really mean or how they should be set on production cluster? I will greatly appreciate it! Boris On Fri, Feb 8, 2019 at 3:40 PM Boris Tyukin wrote: > Hi guys, > > we need to process 1000s of operations p

Changing number of Kudu worker threads

2019-02-08 Thread Boris Tyukin
Hi guys, we need to process 1000s of operations per second and noticed that our Kudu 1.5 cluster was only using 10 threads while our application spins up 50 clients/threads. We observed in the web UI that only 10 threads are working and the other 40 are waiting in the queue. We found

Re: close Kudu client on timeout

2019-01-18 Thread Boris Tyukin
We started as well with our > custom kudu client implementation in NiFi, but at the end we switched over > to the existing processors as it was much easier to handle… > > > > Cheers Josef > > > > > > > > *From: *Boris Tyukin > *Reply-To: *"user@kudu.a

Re: close Kudu client on timeout

2019-01-17 Thread Boris Tyukin
> > Thanks, > > Alexey > > On Wed, Jan 16, 2019 at 1:31 PM Boris Tyukin > wrote: > >> sorry it is Java >> >> On Wed, Jan 16, 2019 at 3:32 PM Mike Percy wrote: >> >>> Java or C++ / Python client? >>>

Re: close Kudu client on timeout

2019-01-16 Thread Boris Tyukin
sorry it is Java On Wed, Jan 16, 2019 at 3:32 PM Mike Percy wrote: > Java or C++ / Python client? > > Mike > > Sent from my iPhone > > > On Jan 16, 2019, at 12:27 PM, Boris Tyukin > wrote: > > > > Hi guys, > > > > is there a setting on Ku

kudu-client dependencies

2019-01-02 Thread Boris Tyukin
Hi guys, sorry for a dumb question but why does kudu-client.jar not include the async, slf4j-api and slf4j-simple libs? I need to call the Kudu API from a simple Groovy script and had to add 3 other jars explicitly. I see these libs were excluded on purpose:

Re: getRowErrors and operation type (insert, delete or update)

2018-12-28 Thread Boris Tyukin
never mind, figured it out. I can do RowError.getOperation().getClass() or even simpler RowError.getOperation().getChangeType(). I love how clean Kudu API is! On Fri, Dec 28, 2018 at 9:11 AM Boris Tyukin wrote: > Hi guys, > > I need to write some custom logic to handle Kudu e

getRowErrors and operation type (insert, delete or update)

2018-12-28 Thread Boris Tyukin
Hi guys, I need to write some custom logic to handle Kudu exceptions in AUTO_FLUSH_BACKGROUND mode and I can get what I need from session.getPendingErrors().getRowErrors() except operation type (insert, delete or update). getRowErrors returns an array of RowError

Re: KuduScanner with multiple sets of compound primary keys

2018-12-11 Thread Boris Tyukin
sjunctions (i.e. OR predicates); > if this is something you'd be interested in working on, your patches > would be welcome. > > On Tue, Dec 11, 2018 at 1:00 PM Boris Tyukin > wrote: > > > > Hi guys, > > > > my Kudu table has several PK columns and I need to create a

KuduScanner with multiple sets of compound primary keys

2018-12-11 Thread Boris Tyukin
Hi guys, my Kudu table has several PK columns and I need to create a scanner to pull multiple rows for these primary keys. If I used Impala, it would be something like SELECT pk1, pk2, col1 FROM table1 WHERE (pk1 = 1 and pk2 = 11) OR (pk1 = 2 and pk2 = 22) OR (pk1 = 3 and pk2 = 33) I
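One way to picture the query above is as a set of independent point lookups, one per (pk1, pk2) pair, whose results are combined. The sketch below shows only that decomposition against an in-memory dictionary; it does not use the Kudu client API, and the table contents and the `lookup_keys` helper are hypothetical.

```python
# In-memory stand-in for table1, keyed by the compound primary key (pk1, pk2).
# This is NOT the Kudu client API; the rows are made up for illustration.
table1 = {
    (1, 11): "a",
    (2, 22): "b",
    (3, 33): "c",
    (4, 44): "d",
}

def lookup_keys(table, keys):
    """Return (pk1, pk2, col1) rows, doing one point lookup per requested key."""
    rows = []
    for pk1, pk2 in keys:
        if (pk1, pk2) in table:
            rows.append((pk1, pk2, table[(pk1, pk2)]))
    return rows

# Equivalent of: (pk1 = 1 and pk2 = 11) OR (pk1 = 2 and pk2 = 22) OR (pk1 = 3 and pk2 = 33)
wanted = [(1, 11), (2, 22), (3, 33)]
print(lookup_keys(table1, wanted))
```

Each OR branch is a full-key equality match, so the branches never overlap and the combined result needs no de-duplication.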

Re: Tablets Per Tablet Server

2018-11-29 Thread Boris Tyukin
Mac, I asked the same question some time ago if you want to check out the comments from the Kudu team. There is also an umbrella JIRA below to support high density nodes.

Re: strange behavior of getPendingErrors

2018-11-17 Thread Boris Tyukin
happen that only 2 >>> operations would be rejected, in case of 30 partitions -- just the single >>> key==2 row could be rejected. >>> >>> BTW, that might also happen if using the MANUAL_FLUSH mode. However, >>> with the AUTO_FLUSH_SYNC m

Re: strange behavior of getPendingErrors

2018-11-16 Thread Boris Tyukin
> server? > > -Todd > > On Fri, Nov 16, 2018 at 1:12 PM Boris Tyukin > wrote: > >> Hey guys, >> >> I am playing with Kudu Java client (wow it is fast), using mostly code >> from Kudu Java example. >> >> While learning about exceptions dur

strange behavior of getPendingErrors

2018-11-16 Thread Boris Tyukin
Hey guys, I am playing with the Kudu Java client (wow it is fast), using mostly code from the Kudu Java example. While learning about exceptions during row inserts, I stumbled upon something I could not explain. If I insert 10 rows into a brand new Kudu table (AUTO_FLUSH_BACKGROUND mode) and I make

Re: clarification on Partitioning Guidelines and CPU cores

2018-10-17 Thread Boris Tyukin
fterwards. > > On Wed, Oct 17, 2018 at 6:00 PM Boris Tyukin > wrote: > >> thanks for replying, Adar. Did some math and in our case we are hitting >> another Kudu limit - 60 tablets per node. We use high density nodes with 2 >> 24-core CPUs so we have 88 hyperthre

Re: clarification on Partitioning Guidelines and CPU cores

2018-10-17 Thread Boris Tyukin
thanks for replying, Adar. Did some math and in our case we are hitting another Kudu limit - 60 tablets per node. We use high density nodes with 2 24-core CPUs so we have 88 hyperthreaded cores total per node or 88*24=2112 cores total. But I cannot create more than 60*24=1440 tablets per table.
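The two products in the message above can be reproduced with a short sketch. It takes the figures exactly as quoted (88 hyperthreaded cores per node, 60 tablets per node) and assumes the factor of 24 in both products is the number of tablet servers, which the message does not state explicitly.

```python
# Figures quoted in the message; reading 24 as the node count is an assumption.
HYPERTHREADED_CORES_PER_NODE = 88
MAX_TABLETS_PER_NODE = 60
NODES = 24

def cluster_totals(cores_per_node, tablets_per_node, nodes):
    """Return (total cores, cluster-wide tablet cap) for a uniform cluster."""
    return cores_per_node * nodes, tablets_per_node * nodes

total_cores, tablet_cap = cluster_totals(
    HYPERTHREADED_CORES_PER_NODE, MAX_TABLETS_PER_NODE, NODES
)
print(total_cores, tablet_cap)  # 2112 1440
```

The gap between the two numbers is the point of the message: a one-tablet-per-core guideline would ask for more tablets than the per-node tablet limit allows.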

Re: is it worth to have partitions on very small tables?

2018-10-15 Thread Boris Tyukin
ne can take advantage of the size disparity in the tables. > > - Dan > > On Mon, Oct 15, 2018 at 10:44 AM Boris Tyukin > wrote: > >> Out of 300 tables I need to ingest into Kudu, 250 are really small - less >> than 500k rows and will fit in a single 1Gb partition. Does it

is it worth to have partitions on very small tables?

2018-10-15 Thread Boris Tyukin
Out of 300 tables I need to ingest into Kudu, 250 are really small - less than 500k rows and will fit in a single 1Gb partition. Does it still make sense to create 3 partitions or have no partitions at all? Some of these tables are frequently joined to very large 1-10B row tables... Thanks,

Re: Multi-level partitions question

2018-10-11 Thread Boris Tyukin
is that the hotspotting resistance isn't as good. If > the shop_id and customer_id columns aren't skewed to begin with that's not > a concern, though. > > - Dan > > On Thu, Oct 11, 2018 at 12:14 PM Boris Tyukin > wrote: > >> Hi guys, >> Read this doc >> https

Multi-level partitions question

2018-10-11 Thread Boris Tyukin
Hi guys, I read this doc https://kudu.apache.org/docs/schema_design.html#multilevel-partitioning and I have a question on this particular statement: "Scans on multilevel partitioned tables can take advantage of partition pruning on any of the levels independently". Does it mean that both strategies

Re: clarification on Partitioning Guidelines and CPU cores

2018-10-10 Thread Boris Tyukin
Also, when they say tablets - I assume this is before replication? so in reality, it is number of nodes x cpu cores / replication factor? If this is the case, it is not looking good... On Wed, Oct 10, 2018 at 5:02 PM Boris Tyukin wrote: > Hi all, > > can someone clarify if this recom

clarification on Partitioning Guidelines and CPU cores

2018-10-10 Thread Boris Tyukin
Hi all, can someone clarify whether the recommendation below refers to physical or hyper-threaded CPU cores? Quite a big difference... Thanks, Boris Partitioning Guidelines (https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb) - For large tables, such as fact

Re: Re: Recommended maximum amount of stored data per tablet server

2018-08-04 Thread Boris Tyukin
How much space is typically allocated just for WAL and metadata? We have 2 400GB SSDs in RAID5 for OS and 12 12TB HDDs. Is it still a good idea to carve out maybe 100GB on SSD or use a dedicated HDD? On Thu, Aug 2, 2018, 20:36 Todd Lipcon wrote: > On Thu, Aug 2, 2018 at 4:54 PM, Quanlong Huang >

Re: swap data in Kudu table

2018-07-25 Thread Boris Tyukin
have is that data was already in Impala daemons memory and did not need Kudu tables at that point. Boris On Fri, Feb 23, 2018 at 5:13 PM Boris Tyukin wrote: > you guys are awesome, thanks! > > Todd, I like ALTER TABLE TBLPROPERTIES idea - will test it next week. > Views might

Re: "broadcast" tablet replication for kudu?

2018-07-23 Thread Boris Tyukin
on, Jul 23, 2018, 6:43 AM Boris Tyukin wrote: > >> sorry to revive the old thread but I am curious if there is a good way to >> speed up requests to frequently used tables in Kudu. >> >> On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin >> wrote: >> >>> b

Re: [ANNOUNCE] Apache Kudu 1.7.0 released

2018-03-23 Thread Boris Tyukin
Great news! Yay for decimals!! Grant, I wonder if you have done any benchmarking of decimals vs float or string. On Fri, Mar 23, 2018, 14:44 Grant Henke wrote: > The Apache Kudu team is happy to announce the release of Kudu 1.7.0. > > Kudu is an open source storage engine

Re: "broadcast" tablet replication for kudu?

2018-03-16 Thread Boris Tyukin
I'm new to Kudu but we are also going to use Impala mostly with Kudu. We have a few tables that are small but used a lot. My plan is to replicate them more than 3 times. When you create a Kudu table, you can specify the number of replicated copies (3 by default) and I guess you can put there a number,

Re: Impala Parquet to Kudu 1.5 - severe ingest performance degradation

2018-02-28 Thread Boris Tyukin
them, I and other employees participate here in the Apache Kudu project as > individuals and it's important to keep the distinction separate. Kudu is a > product of the ASF non-profit organization, not a product of any commercial > vendor. > > -Todd > > > On Wed, Feb 28, 2018 at

Re: Impala Parquet to Kudu 1.5 - severe ingest performance degradation

2018-02-28 Thread Boris Tyukin
er_table; > > > Hope that helps > -Todd > > On Thu, Feb 22, 2018 at 11:02 AM, Hao Hao <hao@cloudera.com> wrote: > >> Did you happen to check the health of the cluster after the upgrade by 'kudu >> cluster ksck'? >> >> Best, >> Hao >>

Impala Parquet to Kudu 1.5 - severe ingest performance degradation

2018-02-22 Thread Boris Tyukin
Hello, we just upgraded our dev cluster from Kudu 1.3 to Kudu 1.5.0-cdh5.13.1 and noticed quite severe performance degradation. We did CTAS from an Impala Parquet table, which has not changed a bit since the upgrade (even the same # of rows), to Kudu using the query below. It used to take

Bulk / Initial load of large tables into Kudu using Spark

2018-01-26 Thread Boris Tyukin
I found this in the FAQ but I am wondering if Spark Kudu library can be used for efficient bulk loads from HDFS to Kudu directly. By a large table, I mean 5-10B row tables. I do not really like the options described below because 1) I would like to bypass Impala as data for my bulk load coming

Re: new Kudu benchmarks

2018-01-08 Thread Boris Tyukin
awesome, thanks Todd! On Mon, Jan 8, 2018 at 12:53 PM, Todd Lipcon <t...@cloudera.com> wrote: > Thanks for making the updates. I tweeted it from my account and from > @ApacheKudu. feel free to retweet! > > -Todd > > On Sat, Jan 6, 2018 at 1:10 PM, Boris Tyukin <bo.

Re: new Kudu benchmarks

2018-01-06 Thread Boris Tyukin
thanks Todd, updated my post with that info and also changed the title a bit. thanks again for your feedback! look forward to new releases coming up! Boris On Fri, Jan 5, 2018 at 9:08 PM, Todd Lipcon <t...@cloudera.com> wrote: > On Fri, Jan 5, 2018 at 5:50 PM, Boris Tyukin <bo...@bor

new Kudu benchmarks

2018-01-05 Thread Boris Tyukin
Hi guys, we just finished testing Kudu, mostly comparing Kudu to Impala on HDFS/Parquet. I wanted to share my blog post and results. We used typical (and real) healthcare data for the test, not synthetic data, which I think makes it a bit more interesting. I welcome any feedback!

Re: first and second run 2x query time difference

2018-01-03 Thread Boris Tyukin
it is possible but I thought Kudu keeps its stuff in its own folders On Wed, Jan 3, 2018 at 1:45 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote: > Hey Boris, > > Thanks for reporting back with results! > > On Wed, Jan 3, 2018 at 10:38 AM, Boris Tyukin <bo...@boristyuk

Re: first and second run 2x query time difference

2018-01-03 Thread Boris Tyukin
"went down" means. > > On Sat, Dec 16, 2017 at 12:50 PM, Boris Tyukin <bo...@boristyukin.com> > wrote: > >> yep it is really weird since Kudu does not use either one. I'll get with >> him on Monday to gather more details >> >> On Sat, Dec 16, 20

Re: first and second run 2x query time difference

2017-12-16 Thread Boris Tyukin
e > but I don't know how that can cause things like DataNodes to fail. > > J-D > > On Sat, Dec 16, 2017 at 11:45 AM, Boris Tyukin <bo...@boristyukin.com> > wrote: > >> well our admin had a fun two days - it was the first time we restarted Kudu >> on our DEV clust

Re: first and second run 2x query time difference

2017-12-16 Thread Boris Tyukin
more details next week so not asking for help as I do not know all the details. What is obvious though is that it has to do something with Kudu :) On Thu, Dec 14, 2017 at 9:40 AM, Boris Tyukin <bo...@boristyukin.com> wrote: > thanks for your suggestions, J-D, I am sure you are right m

Re: decimals support anytime soon?

2017-12-14 Thread Boris Tyukin
. Please don't > hesitate to give feedback to ensure the solution fits your needs. > > Additionally what integrations are most important to you? Are you using > the clients directly? Impala? Spark? > > Thank you, > Grant > > On Thu, Dec 14, 2017 at 8:25

Re: first and second run 2x query time difference

2017-12-14 Thread Boris Tyukin
for decimals On Wed, Dec 13, 2017 at 5:07 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote: > On Wed, Dec 13, 2017 at 11:30 AM, Boris Tyukin <bo...@boristyukin.com> > wrote: > >> thanks J-D! we are going to try that and see how it impacts the runtime. >> >>

Re: first and second run 2x query time difference

2017-12-13 Thread Boris Tyukin
times again, without restarting Kudu, to > understand the effect of the page cache itself. There's currently no way > to purge the cached metadata in Kudu though. > > Hope this helps a bit, > > J-D > > On Wed, Dec 13, 2017 at 8:07 AM, Boris Tyukin <bo...@boristyuk

first and second run 2x query time difference

2017-12-13 Thread Boris Tyukin
Hi guys, I am doing some benchmarks with Kudu and Impala/Parquet and hope to share it soon but there is one thing that bugs me. This is perhaps Impala question but since I am using Kudu with Impala I am going to try and ask anyway. One of my queries takes 120 seconds to run the very first time.

Re: Data inconsistency after restart

2017-12-06 Thread Boris Tyukin
can > rebuild Kudu-tables upon errors. We are still in the early learning phase. > > Br, > Petter > > > > 2017-12-06 14:35 GMT+01:00 Boris Tyukin <bo...@boristyukin.com>: > >> this is definitely concerning thread for us looking to use Impala for >> stori

Re: Data inconsistency after restart

2017-12-06 Thread Boris Tyukin
this is definitely a concerning thread for us looking to use Impala for storing mission-critical company data. Petter, are you a paying Cloudera customer btw? I wonder if you opened a support ticket as well. On Wed, Dec 6, 2017 at 7:26 AM, Petter von Dolwitz (Hem) < petter.von.dolw...@gmail.com> wrote: >