Help me understand Kudu scalability limitations

2017-11-28 Thread Boris Tyukin
Hi guys, I was really excited about Kudu until I saw this: https://kudu.apache.org/docs/known_issues.html - Recommended maximum amount of stored data, post-replication and post-compression, per tablet server is 8TB. - Recommended maximum number of tablets per tablet server is 2

Confused where to post user type questions

2017-11-29 Thread Boris Tyukin
Hi folks, as a new user to Kudu, it is confusing what is the best venue to post user type questions about Kudu which is important for any thriving open source project. I have posted some questions on slack and got a feeling they were not welcome there as discussions on slack seem to be focused on

Re: Help me understand Kudu scalability limitations

2017-11-29 Thread Boris Tyukin
ng. > Although given the number of HDDs per node, it sounds like a lot would go > unused. If you meant that you have 3 nodes, that's a different story. Would > you mind clarifying? > > > Andrew > > On Tue, Nov 28, 2017 at 7:25 AM, Boris Tyukin > wrote: > >>

Re: Help me understand Kudu scalability limitations

2017-11-29 Thread Boris Tyukin
ing we're tracking and hoping to improve in the near > future. There has already been some pretty drastic bumps in this area (see > here <https://issues.apache.org/jira/browse/KUDU-1967>), although I don't > think there's an exact timeline. > > > Andre

Re: Confused where to post user type questions

2017-11-29 Thread Boris Tyukin
scussions in the past to migrate those >> discussions to a #kudu-dev or something similar. Would be interested in >> seeing whether others think it's time to bring this to fruition. >> >> I should also point out that the Cloudera Community forums are also a >>

Re: Data inconsistency after restart

2017-12-06 Thread Boris Tyukin
this is definitely concerning thread for us looking to use Impala for storing mission-critical company data. Petter, are you paid Cloudera customer btw? I wonder if you opened support ticket as well On Wed, Dec 6, 2017 at 7:26 AM, Petter von Dolwitz (Hem) < petter.von.dolw...@gmail.com> wrote: >

Re: Data inconsistency after restart

2017-12-06 Thread Boris Tyukin
can > rebuild Kudu-tables upon errors. We are still in the early learning phase. > > Br, > Petter > > > > 2017-12-06 14:35 GMT+01:00 Boris Tyukin : > >> this is definitely concerning thread for us looking to use Impala for >> storing mission-critical company data

first and second run 2x query time difference

2017-12-13 Thread Boris Tyukin
Hi guys, I am doing some benchmarks with Kudu and Impala/Parquet and hope to share it soon but there is one thing that bugs me. This is perhaps Impala question but since I am using Kudu with Impala I am going to try and ask anyway. One of my queries takes 120 seconds to run the very first time. I

Re: first and second run 2x query time difference

2017-12-13 Thread Boris Tyukin
2 times again, without restarting Kudu, to > understand the effect of the page cache itself. There's currently no way > to purge the cached metadata in Kudu though. > > Hope this helps a bit, > > J-D > > On Wed, Dec 13, 2017 at 8:07 AM, Boris Tyukin > wrote: >

decimals support anytime soon?

2017-12-14 Thread Boris Tyukin
https://issues.apache.org/jira/browse/KUDU-721 Hi guys, I wonder if someone is actively working on this Jira, which has been open for a while and looks abandoned. We would really like to know if this is even on the roadmap anytime soon, as this is pretty important for our healthcare and financial data applic

Re: first and second run 2x query time difference

2017-12-14 Thread Boris Tyukin
decimals On Wed, Dec 13, 2017 at 5:07 PM, Jean-Daniel Cryans wrote: > On Wed, Dec 13, 2017 at 11:30 AM, Boris Tyukin > wrote: > >> thanks J-D! we are going to try that and see how it impacts the runtime. >> >> is there any way to load this metadata upfront? a lot of

Re: decimals support anytime soon?

2017-12-14 Thread Boris Tyukin
't > hesitate to give feedback to ensure the solution fits your needs. > > Additionally what integrations are most important to you? Are you using > the clients directly? Impala? Spark? > > Thank you, > Grant > > On Thu, Dec 14, 2017 at 8:25 AM, Boris Tyukin >

Re: first and second run 2x query time difference

2017-12-16 Thread Boris Tyukin
more details next week so not asking for help as I do not know all the details. What is obvious, though, is that it has something to do with Kudu :) On Thu, Dec 14, 2017 at 9:40 AM, Boris Tyukin wrote: > thanks for your suggestions, J-D, I am sure you are right more often than > that! :))

Re: first and second run 2x query time difference

2017-12-16 Thread Boris Tyukin
't know how that can cause things like DataNodes to fail. > > J-D > > On Sat, Dec 16, 2017 at 11:45 AM, Boris Tyukin > wrote: > >> well our admin had fun two days - it was the first time we restarted Kudu >> on our DEV cluster and it did not go well. He is still tro

Re: first and second run 2x query time difference

2018-01-03 Thread Boris Tyukin
s. > > On Sat, Dec 16, 2017 at 12:50 PM, Boris Tyukin > wrote: > >> yep it is really weird since Kudu does not use either one. I'll get with >> him on Monday to gather more details >> >> On Sat, Dec 16, 2017 at 3:28 PM, Jean-Daniel Cryans >> wrote:

Re: first and second run 2x query time difference

2018-01-03 Thread Boris Tyukin
it is possible but I thought Kudu keeps its stuff in its own folders On Wed, Jan 3, 2018 at 1:45 PM, Jean-Daniel Cryans wrote: > Hey Boris, > > Thanks for reporting back with results! > > On Wed, Jan 3, 2018 at 10:38 AM, Boris Tyukin > wrote: > >> so it was the

new Kudu benchmarks

2018-01-05 Thread Boris Tyukin
Hi guys, we just finished testing Kudu, mostly comparing Kudu to Impala on HDFS/parquet. I wanted to share my blog post and results. We used typical (and real) healthcare data for the test, not synthetic data, which I think makes it a bit more interesting. I welcome any feedback! http://bori

Re: new Kudu benchmarks

2018-01-05 Thread Boris Tyukin
Hi Todd, thanks for your feedback! sure will be happy to update my post with your suggestions. I am not sure Apache Parquet will be clear though as some might understand it as using parquet files with Hive or Spark. What do you think about "Impala on Kudu vs Impala on Parquet"? Realistically, for

Re: new Kudu benchmarks

2018-01-06 Thread Boris Tyukin
thanks Todd, updated my post with that info and also changed the title a bit. thanks again for your feedback! look forward to new releases coming up! Boris On Fri, Jan 5, 2018 at 9:08 PM, Todd Lipcon wrote: > On Fri, Jan 5, 2018 at 5:50 PM, Boris Tyukin > wrote: > >> Hi Todd,

Re: new Kudu benchmarks

2018-01-08 Thread Boris Tyukin
awesome, thanks Todd! On Mon, Jan 8, 2018 at 12:53 PM, Todd Lipcon wrote: > Thanks for making the updates. I tweeted it from my account and from > @ApacheKudu. feel free to retweet! > > -Todd > > On Sat, Jan 6, 2018 at 1:10 PM, Boris Tyukin > wrote: > >> thanks T

Bulk / Initial load of large tables into Kudu using Spark

2018-01-26 Thread Boris Tyukin
I found this in the FAQ but I am wondering if Spark Kudu library can be used for efficient bulk loads from HDFS to Kudu directly. By a large table, I mean 5-10B row tables. I do not really like the options described below because 1) I would like to bypass Impala as data for my bulk load coming fr

Re: Bulk / Initial load of large tables into Kudu using Spark

2018-01-29 Thread Boris Tyukin
thank you both. Does it make a difference from a performance perspective though if I do a bulk load through Impala versus Spark? Will the Kudu client with Spark be faster than Impala? On Mon, Jan 29, 2018 at 2:22 PM, Todd Lipcon wrote: > On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles > wrot

Impala Parquet to Kudu 1.5 - severe ingest performance degradation

2018-02-22 Thread Boris Tyukin
Hello, we just upgraded our dev cluster from Kudu 1.3 to Kudu 1.5.0-cdh5.13.1 and noticed quite severe performance degradation. We did CTAS from an Impala parquet table, which has not changed a bit since the upgrade (even the same # of rows), to Kudu using the query below. It used to take 11-11

swap data in Kudu table

2018-02-22 Thread Boris Tyukin
Hello, I am trying to figure out the best and safest way to swap data in a production Kudu table with data from a staging table. Basically, once in a while we need to perform a full reload of some tables (once in a few months). These tables are pretty large with billions of rows and we want to mi

Re: swap data in Kudu table

2018-02-23 Thread Boris Tyukin
ince I couldn't find anything to track the specific features you >> mentioned, I just filed the following improvement JIRAs so we can track it: >> >>- KUDU-2326: Support atomic bulk load operation >><https://issues.apache.org/jira/browse/KUDU-2326> >

Re: Impala Parquet to Kudu 1.5 - severe ingest performance degradation

2018-02-28 Thread Boris Tyukin
noshuffle */ select * from > my_other_table; > > > Hope that helps > -Todd > > On Thu, Feb 22, 2018 at 11:02 AM, Hao Hao wrote: > >> Did you happen to check the health of the cluster after the upgrade by 'kudu >> cluster ksck'? >> >>

Re: Impala Parquet to Kudu 1.5 - severe ingest performance degradation

2018-02-28 Thread Boris Tyukin
employees participate here in the Apache Kudu project as > individuals and it's important to keep the distinction separate. Kudu is a > product of the ASF non-profit organization, not a product of any commercial > vendor. > > -Todd > > > On Wed, Feb 28, 2018 at 6:17

Re: "broadcast" tablet replication for kudu?

2018-03-16 Thread Boris Tyukin
I'm new to Kudu but we are also going to use Impala mostly with Kudu. We have a few tables that are small but used a lot. My plan is to replicate them more than 3 times. When you create a Kudu table, you can specify the number of replicated copies (3 by default) and I guess you can put there a number, cor

Re: [ANNOUNCE] Apache Kudu 1.7.0 released

2018-03-23 Thread Boris Tyukin
Great news! Yay for decimals!! Grant, I wonder if you have done any benchmarking of decimals vs float or strong. On Fri, Mar 23, 2018, 14:44 Grant Henke wrote: > The Apache Kudu team is happy to announce the release of Kudu 1.7.0. > > Kudu is an open source storage engine for structured data tha

Re: "broadcast" tablet replication for kudu?

2018-07-23 Thread Boris Tyukin
> > Todd > > On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin wrote: > >> sorry to revive the old thread but I am curious if there is a good way to >> speed up requests to frequently used tables in Kudu. >> >> On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin >>

Re: swap data in Kudu table

2018-07-25 Thread Boris Tyukin
have is that data was already in Impala daemons memory and did not need Kudu tables at that point. Boris On Fri, Feb 23, 2018 at 5:13 PM Boris Tyukin wrote: > you guys are awesome, thanks! > > Todd, I like ALTER TABLE TBLPROPERTIES idea - will test it next week. > Views might

Re: Re: Recommended maximum amount of stored data per tablet server

2018-08-04 Thread Boris Tyukin
How much space is typically allocated just for WAL and metadata? We have 2 400GB SSDs in raid5 for OS and 12 12TB HDDs. Is it still a good idea to carve out maybe 100GB on SSD or use a dedicated HDD? On Thu, Aug 2, 2018, 20:36 Todd Lipcon wrote: > On Thu, Aug 2, 2018 at 4:54 PM, Quanlong Huang > wr

clarification on Partitioning Guidelines and CPU cores

2018-10-10 Thread Boris Tyukin
Hi all, can someone clarify whether the recommendation below means physical or hyper-threaded CPU cores? quite a big difference... Thanks, Boris Partitioning Guidelines (https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb) - For large tables, such as fact t

Re: clarification on Partitioning Guidelines and CPU cores

2018-10-10 Thread Boris Tyukin
Also, when they say tablets - I assume this is before replication? so in reality, it is number of nodes x cpu cores / replication factor? If this is the case, it is not looking good... On Wed, Oct 10, 2018 at 5:02 PM Boris Tyukin wrote: > Hi all, > > can someone clarify if this recom
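
A minimal sketch of the reading being asked about in this message: tablets per table = nodes x CPU cores per node / replication factor. Whether the guideline counts tablets before or after replication is the open question in the thread, so the formula is an assumption rather than confirmed Kudu behavior, and the function name and input values are hypothetical.

```python
def tablets_before_replication(nodes: int, cores_per_node: int,
                               replication_factor: int = 3) -> int:
    """Tablet budget under the assumption that every replica occupies a core.

    Hypothetical helper: it only encodes the formula from the question,
    nodes * cores / replication factor, rounded down.
    """
    if nodes <= 0 or cores_per_node <= 0 or replication_factor <= 0:
        raise ValueError("all inputs must be positive")
    return (nodes * cores_per_node) // replication_factor


# Illustrative values only (not taken from the thread).
budget = tablets_before_replication(nodes=10, cores_per_node=16)
print(budget)
```

If the guideline instead counts tablets after replication, the division by the replication factor drops out and the budget is simply nodes x cores.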

Multi-level partitions question

2018-10-11 Thread Boris Tyukin
Hi guys, Read this doc https://kudu.apache.org/docs/schema_design.html#multilevel-partitioning and I have a question on this particular statement "Scans on multilevel partitioned tables can take advantage of partition pruning on any of the levels independently" Does it mean, that both strategies b

Re: Multi-level partitions question

2018-10-11 Thread Boris Tyukin
trade-off is that the hotspotting resistance isn't as good. If > the shop_id and customer_id columns aren't skewed to begin with that's not > a concern, though. > > - Dan > > On Thu, Oct 11, 2018 at 12:14 PM Boris Tyukin > wrote: > >> Hi guys, >> Rea

is it worth to have partitions on very small tables?

2018-10-15 Thread Boris Tyukin
Out of 300 tables I need to ingest into Kudu, 250 are really small - less than 500k rows and will fit in a single 1Gb partition. Does it still make sense to create 3 partitions or have no partitions at all? Some of these tables are frequently joined to very large 1-10B row tables... Thanks, Boris

Re: is it worth to have partitions on very small tables?

2018-10-15 Thread Boris Tyukin
her > the engine can take advantage of the size disparity in the tables. > > - Dan > > On Mon, Oct 15, 2018 at 10:44 AM Boris Tyukin > wrote: > >> Out of 300 tables I need to ingest into Kudu, 250 are really small - less >> than 500k rows and will fit in a single 1G

Re: clarification on Partitioning Guidelines and CPU cores

2018-10-17 Thread Boris Tyukin
thanks for replying, Adar. Did some math and in our case we are hitting another Kudu limit - 60 tablets per node. We use high density nodes with 2 24-core CPUs so we have 88 hyperthreaded cores total per node or 88*24=2112 cores total. But I cannot create more than 60*24=1440 tablets per table. Loo
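
The arithmetic in this message can be reproduced as a short sketch. The node count of 24 is inferred from the multiplications quoted in the message (88*24 and 60*24); the variable names are illustrative.

```python
# Figures as stated in the message; NODES is inferred from "88*24" and "60*24".
NODES = 24
HYPERTHREADED_CORES_PER_NODE = 88
MAX_TABLETS_PER_NODE = 60  # the per-node limit the poster reports hitting

total_cores = HYPERTHREADED_CORES_PER_NODE * NODES
max_tablets_per_table = MAX_TABLETS_PER_NODE * NODES
unused_cores = total_cores - max_tablets_per_table

print(f"cores in cluster:      {total_cores}")
print(f"max tablets per table: {max_tablets_per_table}")
print(f"cores without a tablet: {unused_cores}")
```

Under a one-tablet-per-core reading of the guideline, the cap leaves roughly a third of the cores without a tablet for any single table.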

Re: clarification on Partitioning Guidelines and CPU cores

2018-10-17 Thread Boris Tyukin
fterwards. > > On Wed, Oct 17, 2018 at 6:00 PM Boris Tyukin > wrote: > >> thanks for replying, Adar. Did some math and in our case we are hitting >> another Kudu limit - 60 tablets per node. We use high density nodes with 2 >> 24-core CPUs so we have 88 hyperthre

strange behavior of getPendingErrors

2018-11-16 Thread Boris Tyukin
Hey guys, I am playing with Kudu Java client (wow it is fast), using mostly code from Kudu Java example. While learning about exceptions during rows inserts, I stumbled upon something I could not explain. If I insert 10 rows into a brand new Kudu table (AUTO_FLUSH_BACKGROUND mode) and I make one

Re: strange behavior of getPendingErrors

2018-11-16 Thread Boris Tyukin
f the > server? > > -Todd > > On Fri, Nov 16, 2018 at 1:12 PM Boris Tyukin > wrote: > >> Hey guys, >> >> I am playing with Kudu Java client (wow it is fast), using mostly code >> from Kudu Java example. >> >> While learning about exceptions during r

Re: strange behavior of getPendingErrors

2018-11-17 Thread Boris Tyukin
ashed partitions it might happen that only 2 >>> operations would be rejected, in case of 30 partitions -- just the single >>> key==2 row could be rejected. >>> >>> BTW, that might also happen if using the MANUAL_FLUSH mode. However, >>> with the AUTO_FLUSH_SY

Re: Tablets Per Tablet Server

2018-11-29 Thread Boris Tyukin
Mac, I asked the same question some time ago if you want to check out the comments from the Kudu team. There is also an umbrella Jira below to support high density nodes. http://mail-archives.apache.org/mod_mbox/kudu-user/201711.mbox/%3ccanrt7t1aktogw1-1p7lz9atzrytmub1aubykcns3_-fyu1d...@mail.gmail.com%3

KuduScanner with multiple sets of compound primary keys

2018-12-11 Thread Boris Tyukin
Hi guys, my Kudu table has several PK columns and I need to create a scanner to pull multiple rows for these primary keys. If I used Impala, it would be something like SELECT pk1, pk2, col1 FROM table1 WHERE (pk1 = 1 and pk2 = 11) OR (pk1 = 2 and pk2 = 22) OR (pk1 = 3 and pk2 = 33) I tried

Re: KuduScanner with multiple sets of compound primary keys

2018-12-11 Thread Boris Tyukin
isjunctions (i.e. OR predicates); > if this is something you'd be interested in working on, your patches > would be welcome. > > On Tue, Dec 11, 2018 at 1:00 PM Boris Tyukin > wrote: > > > > Hi guys, > > > > my Kudu table has several PK columns and I need to cre

getRowErrors and operation type (insert, delete or update)

2018-12-28 Thread Boris Tyukin
Hi guys, I need to write some custom logic to handle Kudu exceptions in AUTO_FLUSH_BACKGROUND mode and I can get what I need from session.getPendingErrors().getRowErrors() except operation type (insert, delete or update). getRowErrors returns an array of RowError https://kudu.apache.org/apidocs/o

Re: getRowErrors and operation type (insert, delete or update)

2018-12-28 Thread Boris Tyukin
never mind, figured it out. I can do RowError.getOperation().getClass() or even simpler RowError.getOperation().getChangeType(). I love how clean Kudu API is! On Fri, Dec 28, 2018 at 9:11 AM Boris Tyukin wrote: > Hi guys, > > I need to write some custom logic to handle Kudu exce

kudu-client dependencies

2019-01-02 Thread Boris Tyukin
Hi guys, sorry for a dumb question but why does kudu-client.jar not include the async, slf4j-api and slf4j-simple libs? I need to call the Kudu API from a simple Groovy script and had to add 3 other jars explicitly. I see these libs were excluded on purpose: https://github.com/apache/kudu/blob/master/

Re: kudu-client dependencies

2019-01-02 Thread Boris Tyukin
are symlinked to a proper version of jars for CDH parcel. /opt/cloudera/parcels/CDH/lib/kudu/kudu-client.jar /opt/cloudera/parcels/CDH/lib/kudu/kudu-client-tools.jar /opt/cloudera/parcels/CDH/jars/slf4j-simple-1.7.5.jar On Wed, Jan 2, 2019 at 2:44 PM Boris Tyukin wrote: > Hi guys, >

Re: kudu-client dependencies

2019-01-02 Thread Boris Tyukin
management in the Groovy > world, but a quick Google search turned up Grape > <http://docs.groovy-lang.org/latest/html/documentation/grape.html>, so > maybe that's worth looking into. > > Regards, > Mike > > > On Wed, Jan 2, 2019 at 12:37 PM Boris Tyukin > wrot

close Kudu client on timeout

2019-01-16 Thread Boris Tyukin
Hi guys, is there a setting on Kudu server to close/clean-up inactive Kudu clients? we just found some rogue code that did not close client on code completion and wondering if we can prevent this in future on Kudu server level rather than relying on good developers. That code caused 22,000 threa

Re: close Kudu client on timeout

2019-01-16 Thread Boris Tyukin
sorry it is Java On Wed, Jan 16, 2019 at 3:32 PM Mike Percy wrote: > Java or C++ / Python client? > > Mike > > Sent from my iPhone > > > On Jan 16, 2019, at 12:27 PM, Boris Tyukin > wrote: > > > > Hi guys, > > > > is there a setting on Ku

Re: close Kudu client on timeout

2019-01-17 Thread Boris Tyukin
hreads? > > > Thanks, > > Alexey > > On Wed, Jan 16, 2019 at 1:31 PM Boris Tyukin > wrote: > >> sorry it is Java >> >> On Wed, Jan 16, 2019 at 3:32 PM Mike Percy wrote: >> >>> Java or C++ / Python client? >>> >>> M

Re: close Kudu client on timeout

2019-01-17 Thread Boris Tyukin
, one would normally use connection pool, that would create and dispose connections. On Thu, Jan 17, 2019 at 7:23 PM Todd Lipcon wrote: > On Thu, Jan 17, 2019 at 1:46 PM Boris Tyukin > wrote: > >> Hi Alexey, >> >> it was "single idle Kudu Java client that cre

Re: close Kudu client on timeout

2019-01-18 Thread Boris Tyukin
arted as well with our > custom kudu client implementation in NiFi, but at the end we switched over > to the existing processors as it was much easier to handle… > > > > Cheers Josef > > > > > > > > *From: *Boris Tyukin > *Reply-To: *"user@kudu.a

Changing number of Kudu worker threads

2019-02-08 Thread Boris Tyukin
Hi guys, we need to process 1000s of operations per second and noticed that our Kudu 1.5 cluster was only using 10 threads while our application spins up 50 clients/threads. We observed in the web UI that only 10 threads are working and other 40 waiting in the queue. We found rpc_num_service_thre

Re: Changing number of Kudu worker threads

2019-02-12 Thread Boris Tyukin
Can someone point us to documentation or explain what these parameters really mean or how they should be set on production cluster? I will greatly appreciate it! Boris On Fri, Feb 8, 2019 at 3:40 PM Boris Tyukin wrote: > Hi guys, > > we need to process 1000s of operations per s

Re: Changing number of Kudu worker threads

2019-02-14 Thread Boris Tyukin
Hao >>> >>>>> Hi Boris, >>>>> >>>>> Sorry for the delay, --rpc_num_service_threads sets the number of >>>>> threads in RPC service thread pool (the default is 20 for tablet >>>>> server, 10 for master). It should he

Re: Changing number of Kudu worker threads

2019-02-14 Thread Boris Tyukin
servers for the duration of the tests. > > - Can you share your table schema and partitions schema? For the columns > I'm mostly interested in the row keys and the cardinality of each column. > > Thanks, > > J-D > > On Thu, Feb 14, 2019 at 5:41 AM Boris Tyukin > wrot

Re: Changing number of Kudu worker threads

2019-02-14 Thread Boris Tyukin
on requests (which I'm guessing > was your case). Or should we have some recipes like "here's how you should > write to Kudu from Nifi"? Any thoughts? > > In any case, thanks for reporting back! > > J-D > > On Thu, Feb 14, 2019 at 1:56 PM Boris Tyukin > w

Re: Kudu table api

2019-03-22 Thread Boris Tyukin
Hi Dmitry, check Java Kudu API examples if you have not done it yet https://github.com/apache/kudu/tree/master/examples I remember it had a helper class that counts rows. Like Adar said, I do not think there is a better / faster way - you just create a Kudu scanner, get rows back and iterate over

Re: "broadcast" tablet replication for kudu?

2019-04-24 Thread Boris Tyukin
node to improve performance and avoid broadcasting this table every time? On Mon, Jul 23, 2018 at 10:52 AM Todd Lipcon wrote: > > > On Mon, Jul 23, 2018, 7:21 AM Boris Tyukin wrote: > >> Hi Todd, >> >> Are you saying that your earlier comment below is no longer va

Re: "broadcast" tablet replication for kudu?

2019-04-24 Thread Boris Tyukin
> > where f.dim_1_id = 123; > > > > This equivalent query will broadcast a filtered rowset. > > > > SELECT f.a,d1.b,d2.c > > from FACT f > > inner join DIM_1 d1 on f.dim_1_id = d1.id > > inner join DIM_2 d2 on f.dim_2_id = d2.id > > wh

Long text and complex data types support

2019-09-07 Thread Boris Tyukin
Hi guys, Any plans to support long text type in Kudu? We would love to use Kudu with other projects but unfortunately long text data are pretty common in healthcare industry and we have to use hive/Impala/hdfs instead which is quite painful since we cannot do in place updates and deletes. Same qu

Re: Long text and complex data types support

2019-09-09 Thread Boris Tyukin
exists yet. Do you have any sample schemas with complex types > you could send me to help inform designs and trade offs? > > Thank you, > Grant > > On Sat, Sep 7, 2019 at 11:43 AM Boris Tyukin > wrote: > >> Hi guys, >> >> Any plans to support long te

Re: Long text and complex data types support

2019-09-09 Thread Boris Tyukin
too small. How large would these text columns need to be? > > > > > > On Mon, Sep 9, 2019 at 10:09 AM Boris Tyukin > wrote: > >> Hi Grant, >> >> thanks for responding! >> >> Oracle has CLOBs and BLOBs, MS SQL has varchar(max) and binary. I be

Incorta vs Kudu

2019-10-15 Thread Boris Tyukin
Hi guys, I was just reading about Incorta. They have been getting a lot of traction and buzz recently. While they do not explain how it actually works, I got a feeling their "secret" technology is very similar to Kudu. Just curious if you looked at it and compared to Kudu/Impala combo. They mentioned their su

Re: Please please add bloom filter support

2019-10-21 Thread Boris Tyukin
This explains why some of our heavy queries against billion row tables are so much slower than Impala on HDFS. Surprised it has not been addressed yet, as the performance difference based on numbers in the Kudu Jira is staggering On Mon, Oct 21, 2019, 00:28 Adar Lieber-Dembo wrote: > I commented on KU

Re: Partitioning Rules of Thumb

2020-03-07 Thread Boris Tyukin
servers. if I have 20 tablet servers and I have two tables - one with 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 because of replication)? On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin wrote: > hey guys, > > I asked the same question on Sla

Re: Partitioning Rules of Thumb

2020-03-07 Thread Boris Tyukin
stream 100s of tables and we use PK from RDBMS and need to come up with an automated way to pick number of partitions/tablets. So far I was using 1Gb rule but rethinking this now for another project. On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin wrote: > forgot to post results of my quick test: >
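
One way to automate the "1Gb rule" mentioned in this message, as a minimal sketch. It assumes the rule means roughly one tablet per gigabyte of table data, which the message does not spell out; the function name, the optional cap, and the sample sizes are hypothetical.

```python
import math
from typing import Optional


def tablets_by_size(table_size_gb: float,
                    target_gb_per_tablet: float = 1.0,
                    max_tablets: Optional[int] = None) -> int:
    """Pick a tablet count from table size (assumed reading of the 1Gb rule).

    Always returns at least 1, so tables smaller than the target still get
    a single tablet. max_tablets is an optional, caller-supplied cap.
    """
    if table_size_gb < 0 or target_gb_per_tablet <= 0:
        raise ValueError("size must be >= 0 and target must be > 0")
    count = max(1, math.ceil(table_size_gb / target_gb_per_tablet))
    if max_tablets is not None:
        count = min(count, max_tablets)
    return count


# Illustrative table sizes in GB (made up for the example).
for name, size_gb in [("small_dim", 0.4), ("medium", 2.5), ("large_fact", 250)]:
    print(name, tablets_by_size(size_gb, max_tablets=60))
```

The cap is there because a pure size-based rule can exceed whatever per-table or per-server tablet limit applies to the cluster; the value 60 in the usage loop is only a placeholder.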

Re: Partitioning Rules of Thumb

2020-03-10 Thread Boris Tyukin
en up into 20 tablet scans, and each of those might land > on a different tablet server running on isolated hardware. For a > significantly larger table into which you expect highly concurrent > workloads, the recommendation serves as a lower bound -- I'd recommend > having more p

Re: Partitioning Rules of Thumb

2020-03-11 Thread Boris Tyukin
from being impossible with > RedShift, removes the added cost of staging. > > Getting back to Snowflake, there's no way we could use it the same way we > use Kudu, and even if we could, the cost would probably put us out of > business! > > On Tue, Mar 10, 2020, 10

Re: Partitioning Rules of Thumb

2020-03-13 Thread Boris Tyukin
at hotspots can move around > too quickly for a load balancer to keep up. > > Separately, you mentioned having to manage Kudu's compaction process. > Could you go into more detail here? > > On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin > wrote: > >> thanks Cliff,

Re: Partitioning Rules of Thumb

2020-03-13 Thread Boris Tyukin
ive depending on what version of Hive you > are using. Feel free to reach out on Slack if you have issues. The details > on the integration here: > https://cwiki.apache.org/confluence/display/Hive/Kudu+Integration > > On Fri, Mar 13, 2020 at 8:22 AM Boris Tyukin > wrote: >

Re: Partitioning Rules of Thumb

2020-03-15 Thread Boris Tyukin
nce in "trickling" scenarios, especially not over a long period of > time. That's also why we didn't think to advertise --flush_threshold_secs, > or even to change its default value (which is still 2 minutes: far too > short if we hadn't fixed KUDU-1400); we just

Re: Partitioning Rules of Thumb

2020-03-16 Thread Boris Tyukin
ing like Snowflake is great for your problem space. But if > analytics are also more centrally integrated in pipelines then parquet is > hard to beat for the price and flexibility, as is Kudu for dashboards or > other intelligence that leverages upsert/key semantics. Ultimately, like

Re: Partitioning Rules of Thumb

2020-03-17 Thread Boris Tyukin
ly supported option in the next > release after 3.4. We saw huge speedups on a lot of queries (like 10x or > more). Some queries didn't benefit much, if they were limited by the scan > perf (including if the runtime filters pushed into the scans were filtering > most data befo

Re: Kudu - Dremio

2020-03-29 Thread Boris Tyukin
when I was looking at Dremio some time ago (very interesting technology and I love the idea of query rewrites and materialized viewed federation from different sources), it did not support Impala which you have to use currently to get SQL support with Kudu. On Sun, Mar 29, 2020 at 11:44 AM pino pa

Re: Partitioning Rules of Thumb

2020-04-25 Thread Boris Tyukin
stly from better concurrency > handling. This is despite the fact that RedShift has built-in cache. We > also use streaming ingestion which, aside from being impossible with > RedShift, removes the added cost of staging. > > Getting back to Snowflake, there's no way we co

Re: Partitioning Rules of Thumb

2020-04-25 Thread Boris Tyukin
, 2020 at 2:54 PM Boris Tyukin wrote: > Cliff, i would be extremely interested to see a blog post to compare > Snowflake, Redshift and Impala/Kudu since you tried all of them. > > would love to get some details how you set up Kudu/Impala cluster on AWS > as well as my company might

real-time pipeline with Kudu

2020-05-12 Thread Boris Tyukin
Hi guys, there are not a lot of real-life experiences with Kudu and I wanted to share with you my blog post where I described our use-case - near real-time data lake in a large healthcare system https://boristyukin.com/building-near-real-time-big-data-lake-part-2/ Our real-time infra was pretty c

Fwd: Hive Compatibility

2020-09-26 Thread Boris Tyukin
I am Kudu user not dev but here is my 2 cents. I would not use that Hive/Kudu integration for any sort of production/important work. I think it was a quick POC and I remember seeing sqoop kudu prototype too but be warned... you are probably better off using Kudu spark client as I see it is being