I am a Kudu user, not a dev, but here are my two cents.
I would not use that Hive/Kudu integration for any sort of
production/important work. I think it was a quick POC, and I remember seeing
a Sqoop-Kudu prototype too, but be warned...
you are probably better off using the Kudu Spark client as I see it is being
Hi guys,
there are not a lot of real-life experiences with Kudu out there, and I
wanted to share with you my blog post describing our use case: a near
real-time data lake in a large healthcare system
https://boristyukin.com/building-near-real-time-big-data-lake-part-2/
Our real-time infra was pretty
, 2020 at 2:54 PM Boris Tyukin wrote:
> Cliff, i would be extremely interested to see a blog post to compare
> Snowflake, Redshift and Impala/Kudu since you tried all of them.
>
> would love to get some details how you set up Kudu/Impala cluster on AWS
> as well as my company m
ng. This is despite the fact that RedShift has built-in cache. We
> also use streaming ingestion which, aside from being impossible with
> RedShift, removes the added cost of staging.
>
> Getting back to Snowflake, there's no way we could use it the same way we
> use Kudu, and
when I was looking at Dremio some time ago (very interesting technology and
I love the idea of query rewrites and materialized view federation from
different sources), it did not support Impala, which you currently have to
use to get SQL support with Kudu.
On Sun, Mar 29, 2020 at 11:44 AM pino
tion in the next
> release after 3.4. We saw huge speedups on a lot of queries (like 10x or
> more). Some queries didn't benefit much, if they were limited by the scan
> perf (including if the runtime filters pushed into the scans were filtering
> most data before joins).
>
>
at for your problem space. But if
> analytics are also more centrally integrated in pipelines then parquet is
> hard to beat for the price and flexibility, as is Kudu for dashboards or
> other intelligence that leverages upsert/key semantics. Ultimately, like
> many of us, you m
Hive depending on what version of Hive you
> are using. Feel free to reach out on Slack if you have issues. The details
> on the integration here:
> https://cwiki.apache.org/confluence/display/Hive/Kudu+Integration
>
> On Fri, Mar 13, 2020 at 8:22 AM Boris Tyukin
> wrote:
>
>
> Separately, you mentioned having to manage Kudu's compaction process.
> Could you go into more detail here?
>
> On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin
> wrote:
>
>> thanks Cliff, this is really good info. I am tempted to do the benchmarks
>> myself but nee
Shift, removes the added cost of staging.
>
> Getting back to Snowflake, there's no way we could use it the same way we
> use Kudu, and even if we could, the cost would probably put us out of
> business!
>
> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin wrote:
>
>
s, and each of those might land
> on a different tablet server running on isolated hardware. For a
> significantly larger table into which you expect highly concurrent
> workloads, the recommendation serves as a lower bound -- I'd recommend
> having more partitions, and if your data is
to
stream 100s of tables, and we use the PK from the RDBMS and need to come up
with an automated way to pick the number of partitions/tablets. So far I was
using the 1GB rule but am rethinking this now for another project.
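Since this comes up repeatedly in the thread, here is a minimal sketch of how I would automate the 1GB rule; the `avg_row_bytes` estimate and the 3-partition floor are my own assumptions, not anything from the Kudu docs:

```python
def pick_num_partitions(row_count, avg_row_bytes,
                        target_tablet_bytes=10**9, min_partitions=3):
    """Rough heuristic: aim for ~1GB of data per tablet/partition."""
    est_table_bytes = row_count * avg_row_bytes
    needed = -(-est_table_bytes // target_tablet_bytes)  # ceiling division
    return max(min_partitions, needed)

# A 100MM-row table at ~200 bytes/row is ~20GB -> 20 partitions.
print(pick_num_partitions(100_000_000, 200))  # -> 20
# A 1MM-row table fits easily in one tablet, but keep a small floor of 3.
print(pick_num_partitions(1_000_000, 200))    # -> 3
```

The row-size estimate would have to come from the source RDBMS catalog in an automated pipeline.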
On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin wrote:
> forgot to post results of my quick test:
>
servers.
if I have 20 tablet servers and I have two tables - one with 1MM rows and
another one with 100MM rows, do I pick 20 / 3 partitions for both (divide
by 3 because of replication)?
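Spelling out the arithmetic behind my own question above (20 tablet servers and the divide-by-3 idea are from the question; RF 3 is the Kudu default, the rest is plain division, not official guidance):

```python
def replicas_per_server(num_partitions, replication_factor=3,
                        num_tservers=20):
    """Average tablet replicas each server would host for one table."""
    return num_partitions * replication_factor / num_tservers

# 20 partitions x RF3 = 60 replicas spread over 20 servers -> 3 each.
print(replicas_per_server(20))  # -> 3.0
# ~20/3 = 7 partitions x RF3 = 21 replicas -> about 1 per server.
print(replicas_per_server(7))   # -> 1.05
```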
On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin wrote:
> hey guys,
>
> I asked the same question
This explains why some of our heavy queries against billion-row tables are
so much slower than Impala on HDFS. Surprised it has not been addressed yet,
as the performance difference based on numbers in the Kudu JIRA is staggering.
On Mon, Oct 21, 2019, 00:28 Adar Lieber-Dembo wrote:
> I commented on
Hi guys, I was just reading about Incorta. They have been getting a lot of
traction and buzz recently. While they do not explain how it actually works,
I got a feeling their "secret" technology is very similar to Kudu. Just
curious if you have looked at it and compared it to the Kudu/Impala combo.
They mentioned their
exists yet. Do you have any sample schemas with complex types
> you could send me to help inform designs and trade offs?
>
> Thank you,
> Grant
>
> On Sat, Sep 7, 2019 at 11:43 AM Boris Tyukin
> wrote:
>
>> Hi guys,
>>
>> Any plans to support lon
Hi guys,
Any plans to support a long text type in Kudu? We would love to use Kudu
with other projects, but unfortunately long text data is pretty common in
the healthcare industry and we have to use Hive/Impala/HDFS instead, which
is quite painful since we cannot do in-place updates and deletes.
Same
node to improve performance
and avoid broadcasting this table every time?
On Mon, Jul 23, 2018 at 10:52 AM Todd Lipcon wrote:
>
>
> On Mon, Jul 23, 2018, 7:21 AM Boris Tyukin wrote:
>
>> Hi Todd,
>>
>> Are you saying that your earlier comment below is not lo
Hi Dmitry, check the Java Kudu API examples if you have not done so yet:
https://github.com/apache/kudu/tree/master/examples
I remember it had a helper class that counts rows. Like Adar said, I do not
think there is a better / faster way - you just create a Kudu scanner, get
rows back and iterate over
s (which I'm guessing
> was your case). Or should we have some recipes like "here's how you should
> write to Kudu from Nifi"? Any thoughts?
>
> In any case, thanks for reporting back!
>
> J-D
>
> On Thu, Feb 14, 2019 at 1:56 PM Boris Tyukin
> wrote:
>
>>
duration of the tests.
>
> - Can you share your table schema and partitions schema? For the columns
> I'm mostly interested in the row keys and the cardinality of each column.
>
> Thanks,
>
> J-D
>
> On Thu, Feb 14, 2019 at 5:41 AM Boris Tyukin
> wrote:
>
>> Hi
Can someone point us to documentation or explain what these parameters
really mean, or how they should be set on a production cluster?
I will greatly appreciate it!
Boris
On Fri, Feb 8, 2019 at 3:40 PM Boris Tyukin wrote:
> Hi guys,
>
> we need to process 1000s of operations p
Hi guys,
we need to process 1000s of operations per second and noticed that our Kudu
1.5 cluster was only using 10 threads while our application spins up 50
clients/threads. We observed in the web UI that only 10 threads are working
and the other 40 are waiting in the queue.
We found
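For anyone hitting the same queueing behavior: the knobs we ended up looking at were the tablet server's RPC service thread and queue flags. The flag names and values below are from memory, so treat them as an assumption and verify against your Kudu version's flag reference:

```
# tserver gflagfile -- hypothetical values, tune and verify for your version
--rpc_num_service_threads=50     # RPC worker threads (we saw only 10 active)
--rpc_service_queue_length=100   # RPCs allowed to wait before being rejected
```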
We started as well with our
> custom Kudu client implementation in NiFi, but in the end we switched over
> to the existing processors as it was much easier to handle…
>
>
>
> Cheers Josef
>
>
>
>
>
>
>
> *From: *Boris Tyukin
> *Reply-To: *"user@kudu.a
>
>
> Thanks,
>
> Alexey
>
> On Wed, Jan 16, 2019 at 1:31 PM Boris Tyukin
> wrote:
>
>> sorry it is Java
>>
>> On Wed, Jan 16, 2019 at 3:32 PM Mike Percy wrote:
>>
>>> Java or C++ / Python client?
>>>
sorry it is Java
On Wed, Jan 16, 2019 at 3:32 PM Mike Percy wrote:
> Java or C++ / Python client?
>
> Mike
>
> Sent from my iPhone
>
> > On Jan 16, 2019, at 12:27 PM, Boris Tyukin
> wrote:
> >
> > Hi guys,
> >
> > is there a setting on Ku
Hi guys,
sorry for a dumb question, but why does kudu-client.jar not include the
async, slf4j-api, and slf4j-simple libs? I need to call the Kudu API from a
simple Groovy script and had to add 3 other jars explicitly.
I see these libs were excluded on purpose:
never mind, figured it out. I can do RowError.getOperation().getClass() or,
even simpler, RowError.getOperation().getChangeType().
I love how clean the Kudu API is!
On Fri, Dec 28, 2018 at 9:11 AM Boris Tyukin wrote:
> Hi guys,
>
> I need to write some custom logic to handle Kudu e
Hi guys,
I need to write some custom logic to handle Kudu exceptions in
AUTO_FLUSH_BACKGROUND mode, and I can get what I need from
session.getPendingErrors().getRowErrors() except the operation type (insert,
delete, or update).
getRowErrors returns an array of RowError
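For anyone searching later: once you have the change type (per my follow-up, it is reachable via RowError.getOperation().getChangeType()), the handling logic is just a dispatch on it. A toy Python mock of that pattern; ChangeType and RowError here are local stand-ins for the Java client classes, not the real API:

```python
from enum import Enum

class ChangeType(Enum):          # stand-in for the Java client's enum
    INSERT = "insert"
    UPDATE = "update"
    DELETE = "delete"

class RowError:                  # stand-in for the Java client's RowError
    def __init__(self, change_type, status):
        self.change_type = change_type
        self.status = status

def handle_pending_errors(errors):
    """Route each failed op based on its operation type and status."""
    retries, ignored = [], []
    for err in errors:
        if err.change_type is ChangeType.INSERT and \
                err.status == "key already present":
            ignored.append(err)   # duplicate insert: safe to ignore here
        else:
            retries.append(err)   # anything else: retry or alert
    return retries, ignored

errs = [RowError(ChangeType.INSERT, "key already present"),
        RowError(ChangeType.DELETE, "key not found")]
retries, ignored = handle_pending_errors(errs)
print(len(retries), len(ignored))  # -> 1 1
```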
sjunctions (i.e. OR predicates);
> if this is something you'd be interested in working on, your patches
> would be welcome.
>
> On Tue, Dec 11, 2018 at 1:00 PM Boris Tyukin
> wrote:
> >
> > Hi guys,
> >
> > my Kudu table has several PK columns and I need to create a
Hi guys,
my Kudu table has several PK columns and I need to create a scanner to pull
multiple rows for these primary keys. If I used Impala, it would be
something like
SELECT pk1, pk2, col1 FROM table1
WHERE
(pk1 = 1 and pk2 = 11)
OR (pk1 = 2 and pk2 = 22)
OR (pk1 = 3 and pk2 = 33)
I
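Since scan predicates are ANDed together and disjunctions are not supported (per the reply above inviting patches), the workaround I ended up reasoning about is one short scan per key pair, with the results concatenated client-side. A mocked sketch of that pattern; the in-memory TABLE and scan_by_pk are stand-ins for a real KuduScanner with two equality predicates, not the actual client API:

```python
# Mock table: each dict is one row with a composite PK (pk1, pk2).
TABLE = [
    {"pk1": 1, "pk2": 11, "col1": "a"},
    {"pk1": 2, "pk2": 22, "col1": "b"},
    {"pk1": 3, "pk2": 33, "col1": "c"},
    {"pk1": 3, "pk2": 34, "col1": "d"},
]

def scan_by_pk(pk1, pk2):
    """Stand-in for one Kudu scan with pk1 == x AND pk2 == y predicates."""
    return [r for r in TABLE if r["pk1"] == pk1 and r["pk2"] == pk2]

def scan_by_pk_pairs(pairs):
    """Emulate the OR query: one scan per key pair, results concatenated."""
    rows = []
    for pk1, pk2 in pairs:
        rows.extend(scan_by_pk(pk1, pk2))
    return rows

# Rows for (1,11), (2,22), (3,33), in that order.
print(scan_by_pk_pairs([(1, 11), (2, 22), (3, 33)]))
```

Each per-pair scan is a fully-specified primary key lookup, so the scans themselves should be cheap; the cost is the extra round trips.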
Mac, I asked the same question some time ago if you want to check out the
comments from the Kudu team. There is also an umbrella JIRA below to support
high-density nodes.
happen that only 2
>>> operations would be rejected, in case of 30 partitions -- just the single
>>> key==2 row could be rejected.
>>>
>>> BTW, that might also happen if using the MANUAL_FLUSH mode. However,
>>> with the AUTO_FLUSH_SYNC m
> server?
>
> -Todd
>
> On Fri, Nov 16, 2018 at 1:12 PM Boris Tyukin
> wrote:
>
>> Hey guys,
>>
>> I am playing with Kudu Java client (wow it is fast), using mostly code
>> from Kudu Java example.
>>
>> While learning about exceptions dur
Hey guys,
I am playing with the Kudu Java client (wow, it is fast), using mostly code
from the Kudu Java example.
While learning about exceptions during row inserts, I stumbled upon
something I could not explain.
If I insert 10 rows into a brand new Kudu table (AUTO_FLUSH_BACKGROUND
mode) and I make
fterwards.
>
> On Wed, Oct 17, 2018 at 6:00 PM Boris Tyukin
> wrote:
>
>> thanks for replying, Adar. Did some math and in our case we are hitting
>> another Kudu limit - 60 tablets per node. We use high density nodes with 2
>> 24-core CPUs so we have 88 hyperthre
thanks for replying, Adar. Did some math, and in our case we are hitting
another Kudu limit - 60 tablets per node. We use high-density nodes with 2
24-core CPUs, so we have 88 hyperthreaded cores total per node, or
88*24=2112 cores total. But I cannot create more than 60*24=1440 tablets
per table.
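The math in the message above, spelled out (the 24 nodes, 88 cores, and 60-tablets-per-node figures are taken from that message; treat them as this cluster's numbers, not general guidance):

```python
# Cluster figures from the message above.
NODES = 24
CORES_PER_NODE = 88          # hyperthreaded cores per node
TABLETS_PER_NODE_LIMIT = 60  # recommended per-server tablet limit

total_cores = NODES * CORES_PER_NODE
max_tablets_per_table = NODES * TABLETS_PER_NODE_LIMIT
print(total_cores)            # -> 2112
print(max_tablets_per_table)  # -> 1440
```

So the tablet ceiling (1440), not the core count (2112), is the binding constraint on this cluster.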
ne can take advantage of the size disparity in the tables.
>
> - Dan
>
> On Mon, Oct 15, 2018 at 10:44 AM Boris Tyukin
> wrote:
>
>> Out of 300 tables I need to ingest into Kudu, 250 are really small - less
>> than 500k rows and will fit in a single 1Gb partition. Does it
Out of 300 tables I need to ingest into Kudu, 250 are really small - less
than 500k rows and will fit in a single 1Gb partition. Does it still make
sense to create 3 partitions or have no partitions at all?
Some of these tables are frequently joined to very large 1-10B row tables...
Thanks,
is that the hotspotting resistance isn't as good. If
> the shop_id and customer_id columns aren't skewed to begin with that's not
> a concern, though.
>
> - Dan
>
> On Thu, Oct 11, 2018 at 12:14 PM Boris Tyukin
> wrote:
>
>> Hi guys,
>> Read this doc
>> https
Hi guys,
Read this doc
https://kudu.apache.org/docs/schema_design.html#multilevel-partitioning
and I have a question on this particular statement
"Scans on multilevel partitioned tables can take advantage of partition
pruning on any of the levels independently"
Does it mean, that both strategies
Also, when they say tablets - I assume this is before replication? So in
reality, it is (number of nodes x CPU cores) / replication factor? If this
is the case, it is not looking good...
On Wed, Oct 10, 2018 at 5:02 PM Boris Tyukin wrote:
> Hi all,
>
> can someone clarify if this recom
Hi all,
can someone clarify this recommendation below - does it mean physical or
hyperthreaded CPU cores? Quite a big difference...
Thanks,
Boris
Partitioning Guidelines (https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb)
- For large tables, such as fact
How much space is typically allocated just for WAL and metadata? We have
two 400GB SSDs in RAID 5 for the OS and twelve 12TB HDDs. Is it still a
good idea to carve out maybe 100GB on SSD, or use a dedicated HDD?
On Thu, Aug 2, 2018, 20:36 Todd Lipcon wrote:
> On Thu, Aug 2, 2018 at 4:54 PM, Quanlong Huang
>
have is that data was already in
Impala daemons' memory and did not need Kudu tables at that point.
Boris
On Fri, Feb 23, 2018 at 5:13 PM Boris Tyukin wrote:
> you are guys are awesome, thanks!
>
> Todd, I like ALTER TABLE TBLPROPERTIES idea - will test it next week.
> Views might
on, Jul 23, 2018, 6:43 AM Boris Tyukin wrote:
>
>> sorry to revive the old thread but I am curious if there is a good way to
>> speed up requests to frequently used tables in Kudu.
>>
>> On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin
>> wrote:
>>
>>> b
Great news! Yay for decimals!! Grant, I wonder if you have done any
benchmarking of decimals vs float or string.
On Fri, Mar 23, 2018, 14:44 Grant Henke wrote:
> The Apache Kudu team is happy to announce the release of Kudu 1.7.0.
>
> Kudu is an open source storage engine
I'm new to Kudu, but we are also going to use Impala mostly with Kudu. We
have a few tables that are small but used a lot. My plan is to replicate
them more than 3 times. When you create a Kudu table, you can specify the
number of replicated copies (3 by default), and I guess you can put there a
number,
t; them, I and other employees participate here in the Apache Kudu project as
> individuals and it's important to keep the distinction separate. Kudu is a
> product of the ASF non-profit organization, not a product of any commercial
> vendor.
>
> -Todd
>
>
> On Wed, Feb 28, 2018 at
er_table;
>
>
> Hope that helps
> -Todd
>
> On Thu, Feb 22, 2018 at 11:02 AM, Hao Hao <hao@cloudera.com> wrote:
>
>> Did you happen to check the health of the cluster after the upgrade by 'kudu
>> cluster ksck'?
>>
>> Best,
>> Hao
>>
Hello,
we just upgraded our dev cluster from Kudu 1.3 to Kudu 1.5.0-cdh5.13.1 and
noticed quite severe performance degradation. We did a CTAS from an Impala
Parquet table, which has not changed a bit since the upgrade (even the same
# of rows), to Kudu using the query below.
It used to take
I found this in the FAQ, but I am wondering if the Spark Kudu library can
be used for efficient bulk loads from HDFS to Kudu directly. By a large
table, I mean 5-10B row tables.
I do not really like the options described below because
1) I would like to bypass Impala as data for my bulk load coming
awesome, thanks Todd!
On Mon, Jan 8, 2018 at 12:53 PM, Todd Lipcon <t...@cloudera.com> wrote:
> Thanks for making the updates. I tweeted it from my account and from
> @ApacheKudu. feel free to retweet!
>
> -Todd
>
> On Sat, Jan 6, 2018 at 1:10 PM, Boris Tyukin <bo.
thanks Todd, updated my post with that info and also changed the title a bit.
Thanks again for your feedback! Looking forward to the new releases coming up!
Boris
On Fri, Jan 5, 2018 at 9:08 PM, Todd Lipcon <t...@cloudera.com> wrote:
> On Fri, Jan 5, 2018 at 5:50 PM, Boris Tyukin <bo...@bor
Hi guys,
we just finished testing Kudu, mostly comparing Kudu to Impala on
HDFS/Parquet. I wanted to share my blog post and results. We used typical
(and real) healthcare data for the test, not synthetic data, which I think
makes it a bit more interesting.
I welcome any feedback!
it is possible but I thought Kudu keeps its stuff in its own folders
On Wed, Jan 3, 2018 at 1:45 PM, Jean-Daniel Cryans <jdcry...@apache.org>
wrote:
> Hey Boris,
>
> Thanks for reporting back with results!
>
> On Wed, Jan 3, 2018 at 10:38 AM, Boris Tyukin <bo...@boristyuk
"went down" means.
>
> On Sat, Dec 16, 2017 at 12:50 PM, Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> yep it is really weird since Kudu uses neither one. I'll get with
>> him on Monday to gather more details
>>
>> On Sat, Dec 16, 20
e
> but I don't know how that can cause things like DataNodes to fail.
>
> J-D
>
> On Sat, Dec 16, 2017 at 11:45 AM, Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> well our admin had fun two days - it was the first time we restarted Kudu
>> on our DEV clust
more details next week, so not asking for help as I do not know all the
details. What is obvious though is that it has something to do with Kudu :)
On Thu, Dec 14, 2017 at 9:40 AM, Boris Tyukin <bo...@boristyukin.com> wrote:
> thanks for your suggestions, J-D, I am sure you are right m
. Please don't
> hesitate to give feedback to ensure the solution fits your needs.
>
> Additionally what integrations are most important to you? Are you using
> the clients directly? Impala? Spark?
>
> Thank you,
> Grant
>
> On Thu, Dec 14, 2017 at 8:25
for decimals
On Wed, Dec 13, 2017 at 5:07 PM, Jean-Daniel Cryans <jdcry...@apache.org>
wrote:
> On Wed, Dec 13, 2017 at 11:30 AM, Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> thanks J-D! we are going to try that and see how it impacts the runtime.
>>
>>
times again, without restarting Kudu, to
> understand the effect of the page cache itself. There's currently now way
> to purge the cached metadata in Kudu though.
>
> Hope this helps a bit,
>
> J-D
>
> On Wed, Dec 13, 2017 at 8:07 AM, Boris Tyukin <bo...@boristyuk
Hi guys,
I am doing some benchmarks with Kudu and Impala/Parquet and hope to share
them soon, but there is one thing that bugs me. This is perhaps an Impala
question, but since I am using Kudu with Impala I am going to try and ask
anyway.
One of my queries takes 120 seconds to run the very first time.
can
> rebuild Kudu-tables upon errors. We are still in the early learning phase.
>
> Br,
> Petter
>
>
>
> 2017-12-06 14:35 GMT+01:00 Boris Tyukin <bo...@boristyukin.com>:
>
>> this is definitely concerning thread for us looking to use Impala for
>> stori
this is definitely a concerning thread for us, looking to use Impala for
storing mission-critical company data. Petter, are you a paying Cloudera
customer btw? I wonder if you opened a support ticket as well
On Wed, Dec 6, 2017 at 7:26 AM, Petter von Dolwitz (Hem) <
petter.von.dolw...@gmail.com> wrote:
>