Re: [ANNOUNCE] Welcoming Yingchun Lai as a Kudu committer and PMC member

2019-06-05 Thread Mike Percy
Congrats Yingchun and welcome aboard!

Regards,
Mike

On Wed, Jun 5, 2019 at 11:25 AM Todd Lipcon  wrote:

> Hi Kudu community,
>
> I'm happy to announce that the Kudu PMC has voted to add Yingchun Lai as a
> new committer and PMC member.
>
> Yingchun has been contributing to Kudu for the last 6-7 months and has
> contributed a number of bug fixes, improvements, and features, including:
> - new CLI tools (e.g. 'kudu table scan', 'kudu table copy')
> - fixes for compilation warnings, code cleanup, and usability improvements
> on the web UI
> - support for prioritization of tables for maintenance manager tasks
> - CLI support for config files to make it easier to connect to multi-master
> clusters
>
> Yingchun has also been contributing by helping new users on Slack, and
> helps operate 6 production clusters at Xiaomi, one of our larger
> installations in China.
>
> Please join me in congratulating Yingchun!
>
> -Todd
>


Re: close Kudu client on timeout

2019-01-17 Thread Mike Percy
I have a couple more questions:

 - Did you get a jstack of the process? If so I assume you saw lots of
Netty threads like "New I/O boss", "New I/O worker", etc. because of having
many KuduClient instances. Is that right?
 - Just curious: are your edge node clients in the same data center as Kudu
or are you going across the WAN with your client API writes? This should
not affect client threads but has application architecture implications
(i.e. are you buffering or dropping events at the edge node?) when the WAN
link or the Kudu service is unavailable for some reason.

In general, we recommend sharing Kudu client instances to avoid too many
threads. A single Kudu client and Netty setup should be able to handle all
the threads in the process. An example of this is the static Kudu client
cache we use for the Spark integration at
https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala#L445
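
If you're not on Spark, the same pattern is easy to replicate in plain Java.
Here is a minimal sketch of a process-wide client cache (the class and method
names are illustrative, not a Kudu API -- only KuduClient and its builder come
from kudu-client):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kudu.client.KuduClient;

// Illustrative process-wide cache: one KuduClient (and therefore one set of
// Netty I/O threads) per master address string, shared by every thread.
public final class KuduClientCache {
  private static final Map<String, KuduClient> CACHE = new ConcurrentHashMap<>();

  private KuduClientCache() {}

  public static KuduClient get(String masterAddresses) {
    return CACHE.computeIfAbsent(masterAddresses,
        m -> new KuduClient.KuduClientBuilder(m).build());
  }

  // Call once at application shutdown to release sockets and threads.
  public static synchronized void closeAll() throws Exception {
    for (KuduClient client : CACHE.values()) {
      client.close();
    }
    CACHE.clear();
  }
}

Application code then calls KuduClientCache.get(masters) everywhere instead of
constructing a new client per task, which keeps the thread count flat no
matter how many writers you have.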

Hope that helps,
Mike


On Thu, Jan 17, 2019 at 11:52 AM Alexey Serbin  wrote:

> Hi Boris,
>
> Kudu servers have a setting for connection inactivity period: idle
> connections to the servers will be automatically closed after the specified
> time (--rpc_default_keepalive_time_ms is the flag).  So, from that
> perspective, idle clients are not a big concern on the Kudu server side.
>
> As for your question, right now Kudu doesn't have a way to initiate a
> shutdown of an idle client from the server side.
>
> BTW, I'm curious about the case you reported: were there too many idle Kudu
> client objects created by the same application?  Or was it something else,
> like a single idle Kudu Java client that created that many threads?
>
>
> Thanks,
>
> Alexey
>
> On Wed, Jan 16, 2019 at 1:31 PM Boris Tyukin 
> wrote:
>
>> sorry it is Java
>>
>> On Wed, Jan 16, 2019 at 3:32 PM Mike Percy  wrote:
>>
>>> Java or C++ / Python client?
>>>
>>> Mike
>>>
>>> Sent from my iPhone
>>>
>>> > On Jan 16, 2019, at 12:27 PM, Boris Tyukin 
>>> wrote:
>>> >
>>> > Hi guys,
>>> >
>>> > is there a setting on Kudu server to close/clean-up inactive Kudu
>>> clients?
>>> >
>>> > we just found some rogue code that did not close the client on
>>> completion, and we are wondering if we can prevent this in the future at
>>> the Kudu server level rather than relying on developers to do the right thing.
>>> >
>>> > That code caused 22,000 threads to be opened on our edge node over the
>>> last few days.
>>> >
>>> > Boris
>>>
>>>


Re: close Kudu client on timeout

2019-01-16 Thread Mike Percy
Java or C++ / Python client?

Mike

Sent from my iPhone

> On Jan 16, 2019, at 12:27 PM, Boris Tyukin  wrote:
> 
> Hi guys,
> 
> is there a setting on Kudu server to close/clean-up inactive Kudu clients? 
> 
> we just found some rogue code that did not close the client on completion, 
> and we are wondering if we can prevent this in the future at the Kudu server 
> level rather than relying on developers to do the right thing.
> 
> That code caused 22,000 threads to be opened on our edge node over the last 
> few days.
> 
> Boris



Re: kudu-client dependencies

2019-01-02 Thread Mike Percy
Hi Boris,
kudu-client is a client API library designed to be embedded in a client
application, and it specifies its dependencies via a Maven pom. Typically
one would only want one version of a given dep on the classpath at runtime
and so shipping a fat jar usually isn't done for client libraries.

We shade all dependencies that are not exposed via the public API except
slf4j and related bindings since those are typically provided by the
application (e.g. slf4j-log4j). Since async appears in the public Kudu
Client API we can't shade it.

kudu-client-tools is not a library but a set of command-line tools, so it
has to carry all of its dependencies in the jar.

I'm not sure how most people handle dependency management in the Groovy
world, but a quick Google search turned up Grape, so maybe that's worth
looking into.

Regards,
Mike


On Wed, Jan 2, 2019 at 12:37 PM Boris Tyukin  wrote:

> OK, we just figured out that we need another jar - kudu-client-tools.jar.
> That one is bundled with the proper versions of the async lib and slf4j-api.
>
> slf4j-simple.jar has to be added separately, but you do not have to do that
> if it is okay to suppress Kudu client logs.
>
> kudu-client.jar and kudu-client-tools.jar are symlinked to the proper
> versions of the jars in the CDH parcel.
>
> /opt/cloudera/parcels/CDH/lib/kudu/kudu-client.jar
> /opt/cloudera/parcels/CDH/lib/kudu/kudu-client-tools.jar
> /opt/cloudera/parcels/CDH/jars/slf4j-simple-1.7.5.jar
>
>
>
> On Wed, Jan 2, 2019 at 2:44 PM Boris Tyukin  wrote:
>
>> Hi guys,
>>
>> sorry for a dumb question, but why doesn't kudu-client.jar include the async,
>> slf4j-api, and slf4j-simple libs? I need to call the Kudu API from a simple
>> Groovy script and had to add 3 other jars explicitly.
>>
>> I see these libs were excluded on purpose:
>> https://github.com/apache/kudu/blob/master/java/kudu-client/build.gradle
>>
>> Kafka client, for example, is a single jar.
>>
>> My challenge now is which versions of these libs to pick, how to support
>> them so they won't break in the future, etc.
>>
>> Even on a CDH cluster, while the Kudu client is shipped with the parcel, one
>> has to know the exact versions of the other 3 jars for the client to work.
>>
>> Maybe I am missing something here and there is an easy way, especially on
>> CDH since it ships already with Kudu.
>>
>> Boris
>>
>


Community chat on Slack on Tue Nov 13 @ 10am PDT

2018-10-24 Thread Mike Percy
Hi Kudu dev community,

I'm posting this to dev@ and BCC'ing user@ -- let's follow up on the Kudu
dev@ list.

Following up on some previous email threads on the topic of growing the
Kudu community, I would like to know if Kudu developers / interested
community members would be interested in having a real-time chat meeting
(online) to discuss progress and continue those discussions.

*What*: The agenda would be to evaluate progress on and discuss action
items in service of the following goals:

   1. Increase adoption of Kudu in general (and remove barriers to adoption)
   2. Increase the number of contributors to Kudu, especially committers

In addition to reviewing and updating the list of action items, I'd also
like to get volunteers for things that need help to get completed (or
started).

*When / Where*: Let's meet in the #kudu-general chat room on the getkudu
Slack instance for one hour starting at 10am PDT on Tuesday, November 13.

For those who can't attend in real-time, the chat history will be available
and I'll send notes to the mailing list afterward, so we can also discuss
the same topics over email after the meeting.

Please let me know if this sounds like something you'd like to take part in
or if you have a suggestion for a better way to coordinate this effort,
want to propose an alternative time, etc.

Please find below the current list of action items compiled by Grant and me.

Thanks,
Mike

--

*Being worked on:*

   - KUDU-2411: Binary
   artifacts (Linux / macOS) on Maven to enable a Kudu MiniCluster usable by
   external projects - Grant / Mike
   - KUDU-2402: Gerrit
   Sign In UI bug: we upgraded Gerrit to 2.4.15 but unfortunately it didn't
   fix the issue (we thought this was in the list of fixed issues for 2.4.6).
   We are going to try updating some RewriteRules next - Mike working with
   Cloudera IT, who hosts this infrastructure

*Not being worked on:*

*Increase number of contributors*

   - Support GitHub pull requests (forward to Gerrit?)
   - Create more contributor-focused FAQs and docs (wiki?)
   - Code overview and C++ guidelines article targeted at Java developers
   - Quarterly email to the dev/user lists with links to beginner / newbie
   jiras
   - Video walkthrough of the Kudu code base, including how to set up a dev env
   - Simplify CONTRIBUTING.adoc

*Increase adoption*

*Non-product*

   - Binary artifacts as part of the Apache Kudu release process
  - DEB / RPM packages
  - Tarball releases
  - Ports / Homebrew integration for macOS
   - Full fledged demos / application examples
   - Easy ingest tools for demos, e.g. CLI tools for CSV -> Kudu or similar
   - Schedule regular meetups / hold more talks
   - Improve client APIs to make integration easier / more powerful (need
   specific ideas)
   - More blog posts, including invited blog posts
   - More documentation / blog posts about existing integrations that
   people may not know how to use

*Product improvements*

For now, let's leave big-ticket features off this list -- most are pretty
obvious and they'll take up all the oxygen in the room. Let's reserve this
section for relatively low-effort and high-reward quality-of-life
improvements to the product.

   - TBD


Re: Locks are acquired to cost much time in transactions

2018-09-18 Thread Mike Percy
Why do you think you are spending a lot of time contending on row locks?

Have you tried configuring your clients to send smaller batches? This may
decrease throughput on a per-client basis but will likely improve latency
and reduce the likelihood of row lock contention.
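
If your writers use the Java client, "smaller batches" boils down to session
configuration. Here is a minimal sketch; the master address, buffer size, and
flush interval are illustrative starting points rather than recommendations:

import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.SessionConfiguration;

public class SmallBatchWriter {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("master1:7051").build();
    try {
      KuduSession session = client.newSession();
      // Flush in the background, but cap how many operations can accumulate
      // per flush so each write RPC stays small and holds row locks briefly.
      session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
      session.setMutationBufferSpace(500); // max buffered operations (illustrative)
      session.setFlushInterval(100);       // flush at least every 100 ms (illustrative)
      // ... session.apply(...) your writes here, then session.flush() ...
      session.close();
    } finally {
      client.close();
    }
  }
}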

If you are really spending most of your time contending on row locks then
you will likely run into more fundamental performance issues trying to
scale your writes, since Kudu's MVCC implementation effectively stores a
linked list of updates to a given cell until compaction occurs. See
https://github.com/apache/kudu/blob/master/docs/design-docs/tablet.md#historical-mvcc-in-diskrowsets
for more information about the on-disk design.

If you accumulate too many uncompacted mutations against a given row,
reading the latest value for that row at scan time will be slow because it
has to do a lot of work at read time.

Mike

On Tue, Sep 18, 2018 at 8:48 AM Xiaokai Wang  wrote:

> Moved here from JIRA.
>
> Hi guys, I met a problem about the keys locks that almost impacts the
> service normal writing.
>
>
> As we all know, a transaction which gets all row-key locks will go on to the
> next step in Kudu. Everything looks good if keys are not concurrently updated.
> But when keys are updated by more than one client at the same time,
> transactions wait a long time to acquire the locks. This happens often in my
> production environment. Has anybody else hit this problem? Does anyone have a
> good idea for it?
>
>
> My idea is to try to abandon the key locks and instead use the _pool_token_
> 'SERIAL' mode, which keeps transactions for the same key serial and ordered.
> Does this work?
>
>
> Hope to get your advice. Thanks.
>
>
> -
> Regards,
> Xiaokai
>


Re: poor performance on insert into range partitions and scaling

2018-07-31 Thread Mike Percy
Can you post a query profile from Impala for one of the slow insert jobs?

Mike

On Tue, Jul 31, 2018 at 12:56 PM Tomas Farkas  wrote:

> Hi,
> wanted share with you the preliminary results of my Kudu testing on AWS
> Created a set of performance tests for evaluation of different instance
> types in AWS and different configurations (Kudu separated from Impala, Kudu
> and Impala on the same nodes); different drive (st1 and gp2) settings and
> here my results:
>
> I was quite dissapointed by the inserts in Step3 see attached sqls,
>
> Any hints, ideas, why this does not scale?
> Thanks
>
>
>


Re: Growing the Kudu community

2018-07-23 Thread Mike Percy
On Mon, Jul 23, 2018 at 10:46 AM Sailesh Mukil 
wrote:

> On Tue, Jul 17, 2018 at 7:37 PM, Mike Percy  wrote:
> > On Tue, Jul 17, 2018 at 2:59 PM Sailesh Mukil 
> wrote:
> >
> > > A suggestion to add on to the easily downloadable pre-built packages,
> is to
> > > have easily accessible/downloadable example test-data that's fairly
> > > representative of real world datasets (but it doesn't have to be too
> > > large). Additionally, we can write tutorials in kudu/examples/ that use
> > > this test data, to give new users a better feel for the system.
> >
> > That sounds useful. Any ideas on where we could find such a data set?
>
> Starting with a small scale factor of TPC-H and TPC-DS might not be a bad
> idea.
>

Once backup and restore has stabilized we could push some example data sets
to S3 and allow people to restore locally from the bucket. That could make
a nice basis for a quickstart tutorial.

Mike


Re: Growing the Kudu community

2018-07-18 Thread Mike Percy
On Wed, Jul 18, 2018 at 8:52 AM Tim Robertson 
wrote:

> Perhaps we should continue this on the dev@ list discussion I started a
> few weeks back [2]?



[2]
> https://lists.apache.org/thread.html/ee697a022b72bbca2761b1af0581773d8fb708f701fc969bc259fc2d@%3Cdev.kudu.apache.org%3E
>


Sure, let's continue the conversation on that thread.

Mike


Growing the Kudu community

2018-07-17 Thread Mike Percy
Hi Apache Kudu community,

Apologies for cross-posting, we just wanted to reach a broad audience for
this topic.

Grant and I have been brainstorming about what we can do to grow the
community of Kudu developers and users. We think Kudu has a lot going for
it, but not everybody knows what it is and what it’s capable of. Focusing
and combining our collective efforts to increase awareness (marketing) and
to reduce barriers to contribution and adoption could be a good way to
achieve organic growth.

We’d like to hear your ideas about what barriers and pain points exist and
any ideas you may have to fix some of those things -- especially ideas
requiring minimal effort and maximum impact.

To kick this off, here are some ideas Grant and I have come up with so far,
in sort of a rough priority order:

Ideas for general improvements

   1. Java MiniCluster support out of the box (KUDU-2411)
   1. This will enable integration with other projects in a way that allows
  them to test against a running Kudu cluster and ensure quality without
  having to build it themselves.
  2. Create a dedicated Maven-consumable java module for a Kudu
  MiniCluster
  3. Pre-built binary artifacts (for testing use only) downloadable
  with MiniCluster (Linux / MacOS)
  4. Ship all dependencies (even security deps, which will not be fixed
  if CVEs found)
  5. Make the binaries Linux distro-independent by building on an old
  distro (EL6)
   2. Upgrade Gerrit to fix the “New UI” GitHub Login Bug (KUDU-2402)
  1. Remove barrier to submitting a patch
  2. Latest version of Gerrit has a fix for the bad GitHub login
  redirect
   3. Upstream pre-built packages for production use (Start rhel7, maybe
   ubuntu)
   1. This is potentially a pretty large effort, depending on the number of
  platforms we want to support
  2. Tarballs -- per-OS / per-distro
  3. Yum install, apt get: per-OS / per-distro
  4. Homebrew?
   4. CLI based tools with zero dependencies for quick experiments/demos
   1. Create, describe, alter tables
  2. Cat data out, pipe data in.
  3. Or simple Python examples to do similar
   5. Create developer oriented docs and faqs (wiki style?)
   6. CONTRIBUTING.adoc in repo
   1. Simplified
  2. Quick “assume nothing tutorial”
  3. Video Guide?

Ongoing marketing and engagement

   1. Quarterly email to the dev / users list
   1. Recognize new contributors
  2. Call out beginner jiras
  3. Summarize ongoing projects
   2. Consistently use the beginner / newbie tag in JIRA
   1. Doc how to find beginner jiras in the contributing docs
   3. Regular blog posts
   1. Developer and community contributors
  2. Invite people from other projects that integrate w/ Kudu to post
  on our Blog
  3. Document how to contribute a blog post
  4. Topics: Compile and maintain a list of blog post ideas in case
  people want inspiration -- Grant has been gathering ideas for this
   4. Archive Slack to a mailing list to be indexed by search engines
   (SlackArchive.io has shut down)

Please offer your suggestions for where we can get a good bang for our
collective buck, and if there is anything you would like to work on by all
means please either speak up or feel free to reach out directly.

Thanks,

Grant and Mike


Re: WAL directory is full

2018-05-14 Thread Mike Percy
Hi Saeid,
What version of Kudu are you running? Do you see any errors when you run
"sudo -u kudu kudu cluster ksck" on the cluster?

Mike

On Fri, May 11, 2018 at 5:12 AM, Saeid Sattari 
wrote:

> Hi all,
>
> I assigned a 100GB SSD disk to the WAL on each node in my cluster. Recently,
> I realized that some nodes replicated their cfiles to other nodes due to
> insufficient space. I found the flag --log_max_segments_to_retain, which
> controls the number of past log segments to retain, but it is currently
> marked unsupported. Do you have any idea or experience with solving this
> problem?
> Thank you in advance.
>
> Regards,
> Saeid
>


Re: Spark Streaming + Kudu

2018-03-06 Thread Mike Percy
Hmm, could you try in Spark local mode? i.e.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-local.html
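
If it helps, local mode can also be forced programmatically rather than via
spark-submit. A minimal Java sketch (the class and app names are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalModeRepro {
  public static void main(String[] args) {
    // Run everything in a single JVM so jps/jstack and a debugger can see
    // all of the Kudu client and Netty threads in one process.
    SparkConf conf = new SparkConf()
        .setAppName("kudu-streaming-repro")
        .setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    try {
      // ... run the same streaming/upsert logic here ...
    } finally {
      sc.stop();
    }
  }
}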

Mike

On Tue, Mar 6, 2018 at 7:14 PM, Ravi Kanth <ravikanth@gmail.com> wrote:

> Mike,
>
> Can you clarify a bit how to grab the jstack for the process? I launched
> my Spark application and tried to get the pid, with which I thought I could
> grab a jstack trace during the hang. Unfortunately, I am not able to figure
> out how to grab the pid for the Spark application.
>
> Thanks,
> Ravi
>
> On 6 March 2018 at 18:36, Mike Percy <mpe...@apache.org> wrote:
>
>> Thanks Ravi. Would you mind attaching the output of jstack on the process
>> during this hang? That would show what the Kudu client threads are doing,
>> as what we are seeing here is just the netty boss thread.
>>
>> Mike
>>
>> On Tue, Mar 6, 2018 at 8:52 AM, Ravi Kanth <ravikanth@gmail.com>
>> wrote:
>>
>>>
>>> Yes, I have debugged to find the root cause. Every log statement before
>>> "table = client.openTable(tableName);" executes fine, and exactly at the
>>> point of opening the table it throws the exception below; nothing is
>>> executed after that. The Spark batches are still being processed, but
>>> opening the table keeps failing. I tried catching the exception with no
>>> luck. Please find it below.
>>>
>>> 8/02/23 00:16:30 ERROR client.TabletClient: [Peer
>>> bd91f34d456a4eccaae50003c90f0fb2] Unexpected exception from downstream
>>> on [id: 0x6e13b01f]
>>> java.net.ConnectException: Connection refused:
>>> kudu102.dev.sac.int.threatmetrix.com/10.112.3.12:7050
>>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> Thanks,
>>> Ravi
>>>
>>> On 5 March 2018 at 23:52, Mike Percy <mpe...@apache.org> wrote:
>>>
>>>> Have you considered checking your session error count or pending errors
>>>> in your while loop every so often? Can you identify where your code is
>>>> hanging when the connection is lost (what line)?
>>>>
>>>> Mike
>>>>
>>>> On Mon, Mar 5, 2018 at 9:08 PM, Ravi Kanth <ravikanth@gmail.com>
>>>> wrote:
>>>>
>>>>> In addition to my previous comment, I raised a support ticket for this
>>>>> issue with Cloudera, and one of the support people mentioned the following,
>>>>>
>>>>> *"Thank you for clarifying, The exceptions are logged but not
>>>>> re-thrown to an upper layer, so that explains why the Spark application is
>>>>> not aware of the underlying error."*
>>>>>
>>>>> On 5 March 2018 at 21:02, Ravi Kanth <ravikanth@gmail.com> wrote:
>>>>>
>>>>>> Mike,
>>>>>>
>>>>>> Thanks for the information. But, once the connection to any of the
>>>>>> Kudu servers is lost then there is no way I can have a control on the
>>>>>> KuduSession object and so with getPendingErrors(). The KuduClient in this
>>>>>> case is becoming a zombie and never returned back till the connection is
>>>>>> properly establis

Re: Spark Streaming + Kudu

2018-03-05 Thread Mike Percy
KuduConnection {
>>   private static Logger logger = LoggerFactory.getLogger(KuduConnection.class);
>>   private static Map<String, AsyncKuduClient> asyncCache = new HashMap<>();
>>   private static int ShutdownHookPriority = 100;
>>
>>   static AsyncKuduClient getAsyncClient(String kuduMaster) {
>>     if (!asyncCache.containsKey(kuduMaster)) {
>>       AsyncKuduClient asyncClient =
>>           new AsyncKuduClient.AsyncKuduClientBuilder(kuduMaster).build();
>>       ShutdownHookManager.get().addShutdownHook(new Runnable() {
>>         @Override
>>         public void run() {
>>           try {
>>             asyncClient.close();
>>           } catch (Exception e) {
>>             logger.error("Exception closing async client", e);
>>           }
>>         }
>>       }, ShutdownHookPriority);
>>       asyncCache.put(kuduMaster, asyncClient);
>>     }
>>     return asyncCache.get(kuduMaster);
>>   }
>> }
>>
>>
>>
>> Thanks,
>> Ravi
>>
>> On 5 March 2018 at 16:20, Mike Percy <mpe...@apache.org> wrote:
>>
>>> Hi Ravi, it would be helpful if you could attach what you are getting
>>> back from getPendingErrors() -- perhaps from dumping RowError.toString()
>>> from items in the returned array -- and indicate what you were hoping to
>>> get back. Note that a RowError can also return to you the Operation
>>> <https://kudu.apache.org/releases/1.6.0/apidocs/org/apache/kudu/client/RowError.html#getOperation-->
>>> that you used to generate the write. From the Operation, you can get the
>>> original PartialRow
>>> <https://kudu.apache.org/releases/1.6.0/apidocs/org/apache/kudu/client/PartialRow.html>
>>> object, which should be able to identify the affected row that the write
>>> failed for. Does that help?
>>>
>>> Since you are using the Kudu client directly, Spark is not involved from
>>> the Kudu perspective, so you will need to deal with Spark on your own in
>>> that case.
>>>
>>> Mike
>>>
>>> On Mon, Mar 5, 2018 at 1:59 PM, Ravi Kanth <ravikanth@gmail.com>
>>> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> Thanks for the reply. Yes, I am using AUTO_FLUSH_BACKGROUND.
>>>>
>>>> So, I am trying to use Kudu Client API to perform UPSERT into Kudu and
>>>> I integrated this with Spark. I am trying to test a case where in if any of
>>>> Kudu server fails. So, in this case, if there is any problem in writing,
>>>> getPendingErrors() should give me a way to handle these errors so that I
>>>> can successfully terminate my Spark Job. This is what I am trying to do.
>>>>
>>>> But, I am not able to get a hold of the exceptions being thrown from
>>>> with in the KuduClient when retrying to connect to Tablet Server. My
>>>> getPendingErrors is not getting ahold of these exceptions.
>>>>
>>>> Let me know if you need more clarification. I can post some Snippets.
>>>>
>>>> Thanks,
>>>> Ravi
>>>>
>>>> On 5 March 2018 at 13:18, Mike Percy <mpe...@apache.org> wrote:
>>>>
>>>>> Hi Ravi, are you using AUTO_FLUSH_BACKGROUND
>>>>> <https://kudu.apache.org/releases/1.6.0/apidocs/org/apache/kudu/client/SessionConfiguration.FlushMode.html>?
>>>>> You mention that you are trying to use getPendingErrors()
>>>>> <https://kudu.apache.org/releases/1.6.0/apidocs/org/apache/kudu/client/KuduSession.html#getPendingErrors-->
>>>>>  but
>>>>> it sounds like it's not working for you -- can you be more specific about
>>>>> what you expect and what you are observing?
>>>>>
>>>>> Thanks,
>>>>> Mike
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 26, 2018 at 8:04 PM, Ravi Kanth <ravikanth@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thank Clifford. We are running Kudu 1.4 version. Till date we didn't
>>>>>> see any issues in production and we are not losing tablet servers. But, 
>>>>>> as
>>>>>> part of testing I have to generate few unforeseen cases to analyse the
>>>>>> application performance. One among that is bringing down the tablet 
>>>>>> server
>>>>>> or master server intentionally during which I observed the loss of 
>>>>>> records.
>>>>>> Just wanted to test cases out of the happy path here. Once again thanks 

Re: Spark Streaming + Kudu

2018-03-05 Thread Mike Percy
Hi Ravi, it would be helpful if you could attach what you are getting back
from getPendingErrors() -- perhaps from dumping RowError.toString() from
items in the returned array -- and indicate what you were hoping to get
back. Note that a RowError can also return to you the Operation
<https://kudu.apache.org/releases/1.6.0/apidocs/org/apache/kudu/client/RowError.html#getOperation-->
that you used to generate the write. From the Operation, you can get the
original PartialRow
<https://kudu.apache.org/releases/1.6.0/apidocs/org/apache/kudu/client/PartialRow.html>
object, which should be able to identify the affected row that the write
failed for. Does that help?
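
For reference, here is a minimal sketch of draining the pending errors after
your writes. It assumes AUTO_FLUSH_BACKGROUND and a KuduSession you already
hold; the class and method names are illustrative, not part of the Kudu API:

import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.Operation;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.RowError;
import org.apache.kudu.client.RowErrorsAndOverflowStatus;

public class PendingErrorCheck {
  // Illustrative helper: log every buffered row error and the row it came from.
  static void logPendingErrors(KuduSession session) {
    RowErrorsAndOverflowStatus pending = session.getPendingErrors();
    if (pending.isOverflowed()) {
      // The error buffer filled up, so some earlier errors were dropped.
      System.err.println("Row error buffer overflowed; some errors were lost");
    }
    for (RowError error : pending.getRowErrors()) {
      Operation op = error.getOperation();  // the write that failed
      PartialRow row = op.getRow();         // identifies the affected row
      System.err.println("Failed write: " + error + " row: " + row);
    }
  }
}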

Since you are using the Kudu client directly, Spark is not involved from
the Kudu perspective, so you will need to deal with Spark on your own in
that case.

Mike

On Mon, Mar 5, 2018 at 1:59 PM, Ravi Kanth <ravikanth@gmail.com> wrote:

> Hi Mike,
>
> Thanks for the reply. Yes, I am using AUTO_FLUSH_BACKGROUND.
>
> So, I am trying to use Kudu Client API to perform UPSERT into Kudu and I
> integrated this with Spark. I am trying to test a case where in if any of
> Kudu server fails. So, in this case, if there is any problem in writing,
> getPendingErrors() should give me a way to handle these errors so that I
> can successfully terminate my Spark Job. This is what I am trying to do.
>
> But, I am not able to get a hold of the exceptions being thrown from with
> in the KuduClient when retrying to connect to Tablet Server. My
> getPendingErrors is not getting ahold of these exceptions.
>
> Let me know if you need more clarification. I can post some Snippets.
>
> Thanks,
> Ravi
>
> On 5 March 2018 at 13:18, Mike Percy <mpe...@apache.org> wrote:
>
>> Hi Ravi, are you using AUTO_FLUSH_BACKGROUND
>> <https://kudu.apache.org/releases/1.6.0/apidocs/org/apache/kudu/client/SessionConfiguration.FlushMode.html>?
>> You mention that you are trying to use getPendingErrors()
>> <https://kudu.apache.org/releases/1.6.0/apidocs/org/apache/kudu/client/KuduSession.html#getPendingErrors-->
>>  but
>> it sounds like it's not working for you -- can you be more specific about
>> what you expect and what you are observing?
>>
>> Thanks,
>> Mike
>>
>>
>>
>> On Mon, Feb 26, 2018 at 8:04 PM, Ravi Kanth <ravikanth@gmail.com>
>> wrote:
>>
>>> Thank Clifford. We are running Kudu 1.4 version. Till date we didn't see
>>> any issues in production and we are not losing tablet servers. But, as part
>>> of testing I have to generate few unforeseen cases to analyse the
>>> application performance. One among that is bringing down the tablet server
>>> or master server intentionally during which I observed the loss of records.
>>> Just wanted to test cases out of the happy path here. Once again thanks for
>>> taking time to respond to me.
>>>
>>> - Ravi
>>>
>>> On 26 February 2018 at 19:58, Clifford Resnick <cresn...@mediamath.com>
>>> wrote:
>>>
>>>> I'll have to get back to you on the code bits, but I'm pretty sure
>>>> we're doing simple sync batching. We're not in production yet, but after
>>>> some months of development I haven't seen any failures, even when pushing
>>>> load doing multiple years' backfill. I think the real question is why are
>>>> you losing tablet servers? The only instability we ever had with Kudu was
>>>> when it had that weird ntp sync issue that was fixed I think for 1.6. What
>>>> version are you running?
>>>>
>>>> Anyway I would think that infinite loop should be catchable somewhere.
>>>> Our pipeline is set to fail/retry with Flink snapshots. I imagine there is
>>>> similar with Spark. Sorry I cant be of more help!
>>>>
>>>>
>>>>
>>>> On Feb 26, 2018 9:10 PM, Ravi Kanth <ravikanth@gmail.com> wrote:
>>>>
>>>> Cliff,
>>>>
>>>> Thanks for the response. Well, I do agree that it's simple and seamless.
>>>> In my case, I am able to upsert ~25000 events/sec into Kudu. But, I am
>>>> facing the problem when any of the Kudu Tablet or master server is down. I
>>>> am not able to get a hold of the exception from client. The client is going
>>>> into an infinite loop trying to connect to Kudu. Meanwhile, I am losing my
>>>> records. I tried handling the errors through getPendingErrors() but still
>>>> it is helpless. I am using AsyncKuduClient to establish the connection and
>>>> retrieving the syncClient from the 

Re: Spark Streaming + Kudu

2018-03-05 Thread Mike Percy
Hi Ravi, are you using AUTO_FLUSH_BACKGROUND? You mention that you are
trying to use getPendingErrors() but it sounds like it's not working for
you -- can you be more specific about what you expect and what you are
observing?

Thanks,
Mike



On Mon, Feb 26, 2018 at 8:04 PM, Ravi Kanth  wrote:

> Thank Clifford. We are running Kudu 1.4 version. Till date we didn't see
> any issues in production and we are not losing tablet servers. But, as part
> of testing I have to generate few unforeseen cases to analyse the
> application performance. One among that is bringing down the tablet server
> or master server intentionally during which I observed the loss of records.
> Just wanted to test cases out of the happy path here. Once again thanks for
> taking time to respond to me.
>
> - Ravi
>
> On 26 February 2018 at 19:58, Clifford Resnick 
> wrote:
>
>> I'll have to get back to you on the code bits, but I'm pretty sure we're
>> doing simple sync batching. We're not in production yet, but after some
>> months of development I haven't seen any failures, even when pushing load
>> doing multiple years' backfill. I think the real question is why are you
>> losing tablet servers? The only instability we ever had with Kudu was when
>> it had that weird ntp sync issue that was fixed I think for 1.6. What
>> version are you running?
>>
>> Anyway I would think that infinite loop should be catchable somewhere.
>> Our pipeline is set to fail/retry with Flink snapshots. I imagine there is
>> similar with Spark. Sorry I cant be of more help!
>>
>>
>>
>> On Feb 26, 2018 9:10 PM, Ravi Kanth  wrote:
>>
>> Cliff,
>>
>> Thanks for the response. Well, I do agree that it's simple and seamless.
>> In my case, I am able to upsert ~25000 events/sec into Kudu. But, I am
>> facing the problem when any of the Kudu Tablet or master server is down. I
>> am not able to get a hold of the exception from client. The client is going
>> into an infinite loop trying to connect to Kudu. Meanwhile, I am losing my
>> records. I tried handling the errors through getPendingErrors() but still
>> it is helpless. I am using AsyncKuduClient to establish the connection and
>> retrieving the syncClient from the Async to open the session and table. Any
>> help?
>>
>> Thanks,
>> Ravi
>>
>> On 26 February 2018 at 18:00, Cliff Resnick  wrote:
>>
>> While I can't speak for Spark, we do use the client API from Flink
>> streaming and it's simple and seamless. It's especially nice if you require
>> an Upsert semantic.
>>
>> On Feb 26, 2018 7:51 PM, "Ravi Kanth"  wrote:
>>
>> Hi,
>>
>> Anyone using Spark Streaming to ingest data into Kudu and using Kudu
>> Client API to do so rather than the traditional KuduContext API? I am stuck
>> at a point and couldn't find a solution.
>>
>> Thanks,
>> Ravi
>>
>>
>>
>>
>


Re: swap data in Kudu table

2018-02-23 Thread Mike Percy
Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk
load capabilities or staging abilities. Theoretically renaming a partition
atomically shouldn't be that hard to implement, since it's just a master
metadata operation which can be done atomically, but it's not yet
implemented.
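
In the meantime, the drop-and-rename workaround you describe below can at
least be scripted against the Java client. A minimal sketch (the table names
and master address are placeholders, and note the caveats you already
mentioned: this is not atomic, and stats/permissions tied to the old table
name are lost):

import org.apache.kudu.client.AlterTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;

public class SwapTables {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("master1:7051").build();
    try {
      // Drop the current production table, then rename the fully loaded
      // staging table into its place.
      client.deleteTable("prod_table");
      client.alterTable("staging_table",
          new AlterTableOptions().renameTable("prod_table"));
    } finally {
      client.close();
    }
  }
}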

There is a JIRA to track a generic bulk load API here:
https://issues.apache.org/jira/browse/KUDU-1370

Since I couldn't find anything to track the specific features you
mentioned, I just filed the following improvement JIRAs so we can track it:

   - KUDU-2326: Support atomic bulk load operation
     (https://issues.apache.org/jira/browse/KUDU-2326)
   - KUDU-2327: Support atomic swap of tables or partitions
     (https://issues.apache.org/jira/browse/KUDU-2327)

Mike

On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin  wrote:

> Hello,
>
> I am trying to figure out the best and safest way to swap data in a
> production Kudu table with data from a staging table.
>
> Basically, once in a while we need to perform a full reload of some tables
> (once in a few months). These tables are pretty large with billions of rows
> and we want to minimize the risk and downtime for users if something bad
> happens in the middle of that process.
>
> With Hive and Impala on HDFS, we can use a very cool handy command LOAD
> DATA INPATH. We can prepare data for reload in a staging table upfront and
> this process might take many hours. Once staging table is ready, we can
> issue LOAD DATA INPATH command which will move underlying HDFS files to a
> production table - this operation is almost instant and the very last step
> in our pipeline.
>
> Alternatively, we can swap partitions using ALTER TABLE EXCHANGE PARTITION
> command.
>
> Now with Kudu, I cannot seem to find a good strategy. The only thing that
> came to my mind is to drop the production table and rename the staging table
> to the production table as the last step of the job, but in this case we are
> going to lose statistics and security permissions.
>
> Any other ideas?
>
> Thanks!
> Boris
>


Re: [ANNOUNCE] New committers over past several months

2017-12-18 Thread Mike Percy
Well deserved for all! Congratulations belated and otherwise to Andrew, Grant, 
and Hao!

Mike

> On Dec 18, 2017, at 9:00 PM, Todd Lipcon  wrote:
> 
> Hi Kudu community,
> 
> I'm pleased to announce that the Kudu PMC has voted to add Andrew Wong,
> Grant Henke, and Hao Hao as Kudu committers and PMC members. This
> announcement is a bit delayed, but I figured it's better late than never!
> 
> Andrew has contributed to Kudu in a bunch of areas. Most notably, he
> authored a bunch of optimizations for predicate evaluation on the read
> path, and recently has led the effort to introduce better tolerance of disk
> failures within the tablet server. In addition to code, Andrew has been a
> big help with questions on the user mailing list, Slack, and elsewhere.
> 
> Grant's contributions have spanned several areas. Notably, he made a bunch
> of improvements to our Java and Scala builds -- an area where others might
> be shy. He also implemented checksum verification for data blocks and has
> begun working on a design for DECIMAL, one of the most highly-requested
> features.
> 
> Hao has also been contributing to Kudu for quite some time. Her notable
> contributions include improved fault tolerance for the Java client, fixes
> and optimizations on the Spark integration, and some important refactorings
> and performance optimizations in the block layer. Hao has also represented
> the community by giving talks about Kudu at a conference in China.
> 
> Please join me in congratulating the new committers and PMC members!
> 
> -Todd



[ANNOUNCE] Apache Kudu 1.6.0 released

2017-12-07 Thread Mike Percy
The Apache Kudu team is happy to announce the release of Kudu 1.6.0.

Kudu is an open source storage engine for structured data that supports
low-latency random access together with efficient analytical access
patterns.
It is designed within the context of the Apache Hadoop ecosystem and
supports
many integrations with other data analytics projects both inside and
outside of
the Apache Software Foundation.

Apache Kudu 1.6.0 is a minor release that offers several new features,
improvements, optimizations, and bug fixes. Please see the release notes for
details.

Download it here: https://kudu.apache.org/releases/1.6.0/
Full release notes:
https://kudu.apache.org/releases/1.6.0/docs/release_notes.html

Regards,
The Apache Kudu team


Re: Confused where to post user type questions

2017-11-29 Thread Mike Percy
Hi Boris,
Thanks again for asking about this and I'm happy that you enjoyed listening
to me blab about Kudu! Mark Rittman and the Roaring Elephant guys were very
kind and fun to talk to. I'll note that I think the more recent one (with
Roaring Elephant) had the better audio quality of the two recordings as a
result of the equipment I used.

Mike

On Wed, Nov 29, 2017 at 5:24 PM, Boris Tyukin <bo...@boristyukin.com> wrote:

> totally makes sense to me. thanks Mike and Andrew.
>
> Mike, on a side note, I was just listening to the Drill to Detail and
> Roaring Elephant podcast episodes featuring you :) You did a really great job
> explaining Kudu's role in the Big Data ecosystem. I enjoyed both episodes, and
> they were about one year apart I think, so it was interesting to see how Kudu
> had been evolving over the past year.
>
> Thanks,
> Boris
>
> On Wed, Nov 29, 2017 at 5:50 PM, Mike Percy <mpe...@apache.org> wrote:
>
>> Hi Boris,
>> Here's my 2 cents. To some extent, chat vs email is a matter of personal
>> preference and we try to support both.
>>
>> Personally I think Slack is nice for instant feedback when you can get
>> it, but email lists are better for questions. Chat channels are a kind of
>> stream-of-conversations and I often find that it's easy to miss someone's
>> comment or question while I'm in the middle of a discussion or when it's
>> been a busy day and there was a lot of activity while I was away.
>>
>> Email threads have subject lines that make them hard to miss, plus they
>> are indexed by Google, which is helpful for others who have the same
>> question in the future. My recommendation would be to use this email list
>> as much as you're comfortable with, and I hope we can encourage more people
>> to use it because of the previously-stated benefits as well as the ability
>> to communicate with people who are not in your local time zone.
>>
>> Regarding the Cloudera forums, it's not something I'd recommend in an
>> Apache context because we can't rely on it for Apache releases. Only
>> Cloudera's software releases are supported there. We need to provide an
>> avenue to support Apache software releases, so this email list (
>> user@kudu.apache.org) and Slack provide the basis for that.
>>
>> Hope that helps. Thank you for asking this question and please continue
>> to raise any concerns with us when you're unable to get the help you need.
>>
>> Mike
>>
>> On Wed, Nov 29, 2017 at 9:19 AM, Andrew Wong <aw...@cloudera.com> wrote:
>>
>>> Hi Boris,
>>>
>>> Thanks for reaching out! Yeah, currently the most active place to ask
>>> questions is the Kudu Slack #kudu-general channel. Sometimes we talk about
>>> dev stuff, but it is also a place for user questions. Given its activity,
>>> sometimes user questions fall through the cracks, although we try to avoid
>>> this as much as possible.
>>>
>>> You raise a good point though: for a new user, it might seem like the
>>> wrong place to ask questions if there are a bunch of dev conversations
>>> going on. There have been discussions in the past to migrate those
>>> discussions to a #kudu-dev or something similar. Would be interested in
>>> seeing whether others think it's time to bring this to fruition.
>>>
>>> I should also point out that the Cloudera Community forums are also a
>>> nice platform for Q&A. There's a board for Impala, where Kudu questions are
>>> often asked, so feel free to ask questions there too!
>>>
>>> On Wed, Nov 29, 2017 at 7:03 AM, Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> as a new user to Kudu, it is confusing what is the best venue to post
>>>> user type questions about Kudu which is important for any thriving open
>>>> source project. I have posted some questions on slack and got a feeling
>>>> they were not welcome there as discussions on slack seem to be focused on
>>>> development. I can see that slack group is very active though.
>>>>
>>>> the user group is not that active though with like 10 email threads
>>>> this month.
>>>>
>>>> Can someone clarify this for us newcomers?
>>>>
>>>> We also have official channel for paying CDH customers but there are
>>>> benefits to use informal ones :)
>>>>
>>>> Thanks for such an amazing product and everything you do!
>>>>
>>>> Boris
>>>>
>>>
>>>
>>>
>>> --
>>> Andrew Wong
>>>
>>
>>
>


Re: [DISCUSS] Move Slack discussions to ASF official slack?

2017-10-23 Thread Mike Percy
Users will likely be confused if they have to switch Slack instances. We 
switched over to ASF mailing lists over a year ago and we still get requests to 
join the old pre-ASF user mailing list sometimes.

Unfortunately the Slack-In inviter bot doesn’t allow you to invite people to a 
particular room without a paid account. It has to go to the default room for 
the whole instance. Maybe we could ask Slack if it’s possible to get an 
exception for the ASF.

That said, if it’s not strictly better than what we have then I don’t see a 
real benefit in switching.

Mike

> On Oct 24, 2017, at 8:22 AM, Todd Lipcon  wrote:
> 
>> On Mon, Oct 23, 2017 at 4:12 PM, Misty Stanley-Jones  
>> wrote:
>> 1.  I have no idea, but you could enable the @all at-mention in the existing 
>> #kudu-general and let people know that way. Also see my next answer.
>> 
> 
> Fair enough.
>  
>> 2.  It looks like if you have an apache.org email address you don't need an 
>> invite, but otherwise an existing member needs to invite you. If you can 
>> somehow get all the member email addresses, you can invite them all at once 
>> as a comma-separated list.
> 
> I'm not sure if that's doable but potentially.
> 
> I'm concerned though if we don't have auto-invite for arbitrary community 
> members who just come by a link from our website. A good portion of our 
> traffic is users, rather than developers, and by-and-large they don't have 
> apache.org addresses. If we closed the Slack off to them I think we'd lose a 
> lot of the benefit.
>  
>> 
>> 3.  I can't tell what access there is to integrations. I can try to find out 
>> who administers that on ASF infra and get back with you. I would not be 
>> surprised if integrations with the ASF JIRA were already enabled.
>> 
>> I pre-emptively grabbed #kudu on the ASF slack in case we decide to go 
>> forward with this. If we don't decide to go forward with it, it's a good 
>> idea to hold onto the channel and pin a message in there about how to get to 
>> the "official" Kudu slack.
>> 
>>> On Mon, Oct 23, 2017 at 3:00 PM, Todd Lipcon  wrote:
>>> A couple questions about this:
>>> 
>>> - is there any way we can email out to our existing Slack user base to 
>>> invite them to move over? We have 866 members on our current slack and 
>>> would be a shame if people got confused as to where to go for questions.
>>> 
>>> - does the ASF slack now have a functioning self-serve "auto-invite" 
>>> service?
>>> 
>>> - will we still be able to set up integrations like JIRA/github?
>>> 
>>> -Todd
>>> 
 On Mon, Oct 23, 2017 at 2:53 PM, Misty Stanley-Jones  
 wrote:
 When we first started using Slack, I don't think the ASF Slack instance
 existed. Using our own Slack instance means that we have limited access to
 message archives (unless we pay) and that people who work on multiple ASF
 projects need to add the Kudu slack in addition to any other Slack
 instances they may be on. I propose that we instead create one or more
 Kudu-related channels on the official ASF slack (http://the-asf.slack.com/)
 and migrate our discussions there. What does everyone think?
>>> 
>>> 
>>> 
>>> -- 
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>> 
> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera


Re: Change Data Capture (CDC) with Kudu

2017-09-22 Thread Mike Percy
Franco,
I just realized that I suggested something you mentioned in your initial
email. My mistake for not reading through to the end. It is probably the
least-worst approach right now and it's probably what I would do if I were
you.

Mike

On Fri, Sep 22, 2017 at 2:29 PM, Mike Percy <mpe...@apache.org> wrote:

> CDC is something that I would like to see in Kudu but we aren't there yet
> with the underlying support in the Raft Consensus implementation. Once we
> have higher availability re-replication support (KUDU-1097) we will be a
> bit closer for a solution involving traditional WAL streaming to an
> external consumer because we will have support for non-voting replicas. But
> there would still be plenty of work to do to support CDC after that, at
> least from an API perspective as well as a WAL management perspective (how
> long to keep old log files).
>
> That said, what you really are asking for is a streaming backup solution,
> which may or may not use the same mechanism (unfortunately it's not
> designed or implemented yet).
>
> As an alternative to Adar's suggestions, a reasonable option for you at
> this time may be an incremental backup. It takes a little schema design to
> do it, though. You could consider doing something like the following:
>
>1. Add a last_updated column to all your tables and update the column
>when you change the value. Ideally monotonic across the cluster but you
>could also go with local time and build in a "fudge factor" when reading in
>step 2
>2. Periodically scan the table for any changes newer than the previous
>scan in the last_updated column. This type of scan is more efficient to do
>in Kudu than in many other systems. With Impala you could run a query like:
>select * from table1 where last_updated > $prev_updated;
>3. Dump the results of this query to parquet
>4. Use distcp to copy the parquet files over to the other cluster
>periodically (maybe you can throttle this if needed to avoid saturating the
>pipe)
>5. Upsert the parquet data into Kudu on the remote end
>
> Hopefully some workaround like this would work for you until Kudu has a
> reliable streaming backup solution.
>
> Like Adar said, as an Apache project we are always open to contributions
> and it would be great to get some in this area. Please reach out if you're
> interested in collaborating on a design.
>
> Mike
>
> On Fri, Sep 22, 2017 at 10:43 AM, Adar Lieber-Dembo <a...@cloudera.com>
> wrote:
>
>> Franco,
>>
>> Thanks for the detailed description of your problem.
>>
>> I'm afraid there's no such mechanism in Kudu today. Mining the WALs seems
>> like a path fraught with land mines. Kudu GCs WAL segments aggressively so
>> I'd be worried about a listening mechanism missing out on some row
>> operations. Plus the WAL is Raft-specific as it includes both REPLICATE
>> messages (reflecting a Write RPC from a client) and COMMIT messages
>> (written out when a majority of replicas have written a REPLICATE); parsing
>> and making sense of this would be challenging. Perhaps you could build
>> something using Linux's inotify system for receiving file change
>> notifications, but again I'd be worried about missing certain updates.
>>
>> Another option is to replicate the data at the OS level. For example, you
>> could periodically rsync the entire cluster onto a standby cluster. There's
>> bound to be data loss in the event of a failover, but I don't think you'll
>> run into any corruption (though Kudu does take advantage of sparse files
>> and hole punching, so you should verify that any tool you use supports
>> that).
>>
>> Disaster Recovery is an oft-requested feature, but one that Kudu
>> developers have been unable to prioritize yet. Would you or your someone on
>> your team be interested in working on this?
>>
>> On Thu, Sep 21, 2017 at 7:12 PM Franco Venturi <fvent...@comcast.net>
>> wrote:
>>
>>> We are planning for a 50-100TB Kudu installation (about 200 tables or
>>> so).
>>>
>>> One of the requirements that we are working on is to have a secondary
>>> copy of our data in a Disaster Recovery data center in a different location.
>>>
>>>
>>> Since we are going to have inserts, updates, and deletes (for instance
>>> in the case the primary key is changed), we are trying to devise a process
>>> that will keep the secondary instance in sync with the primary one. The two
>>> instances do not have to be identical in real-time (i.e. we are not looking
>>> for synchronous writes to Kudu), but we would like to have some pret

Re: Change Data Capture (CDC) with Kudu

2017-09-22 Thread Mike Percy
CDC is something that I would like to see in Kudu but we aren't there yet
with the underlying support in the Raft Consensus implementation. Once we
have higher availability re-replication support (KUDU-1097) we will be a
bit closer for a solution involving traditional WAL streaming to an
external consumer because we will have support for non-voting replicas. But
there would still be plenty of work to do to support CDC after that, at
least from an API perspective as well as a WAL management perspective (how
long to keep old log files).

That said, what you really are asking for is a streaming backup solution,
which may or may not use the same mechanism (unfortunately it's not
designed or implemented yet).

As an alternative to Adar's suggestions, a reasonable option for you at
this time may be an incremental backup. It takes a little schema design to
do it, though. You could consider doing something like the following:

   1. Add a last_updated column to all your tables and update the column
   when you change the value. Ideally monotonic across the cluster but you
   could also go with local time and build in a "fudge factor" when reading in
   step 2
   2. Periodically scan the table for any changes newer than the previous
   scan in the last_updated column. This type of scan is more efficient to do
   in Kudu than in many other systems. With Impala you could run a query like:
   select * from table1 where last_updated > $prev_updated;
   3. Dump the results of this query to parquet
   4. Use distcp to copy the parquet files over to the other cluster
   periodically (maybe you can throttle this if needed to avoid saturating the
   pipe)
   5. Upsert the parquet data into Kudu on the remote end
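
If you'd rather drive step 2 from the Java client than from Impala, here is a
minimal sketch of the incremental scan. The table name, column name, and the
assumption that last_updated is stored as a 64-bit value are placeholders for
whatever your schema actually uses:

import org.apache.kudu.Schema;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduPredicate;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;
import org.apache.kudu.client.RowResultIterator;

public class IncrementalScan {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("master1:7051").build();
    try {
      KuduTable table = client.openTable("table1");
      Schema schema = table.getSchema();
      long prevUpdated = Long.parseLong(args[0]); // watermark from the last run
      // Only rows changed since the previous incremental pass.
      KuduPredicate newerThan = KuduPredicate.newComparisonPredicate(
          schema.getColumn("last_updated"),
          KuduPredicate.ComparisonOp.GREATER,
          prevUpdated);
      KuduScanner scanner = client.newScannerBuilder(table)
          .addPredicate(newerThan)
          .build();
      while (scanner.hasMoreRows()) {
        RowResultIterator batch = scanner.nextRows();
        for (RowResult row : batch) {
          // ... write the row out (e.g. to Parquet) for shipping to the DR cluster ...
        }
      }
      scanner.close();
    } finally {
      client.close();
    }
  }
}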

Hopefully some workaround like this would work for you until Kudu has a
reliable streaming backup solution.

Like Adar said, as an Apache project we are always open to contributions
and it would be great to get some in this area. Please reach out if you're
interested in collaborating on a design.

Mike

On Fri, Sep 22, 2017 at 10:43 AM, Adar Lieber-Dembo 
wrote:

> Franco,
>
> Thanks for the detailed description of your problem.
>
> I'm afraid there's no such mechanism in Kudu today. Mining the WALs seems
> like a path fraught with land mines. Kudu GCs WAL segments aggressively so
> I'd be worried about a listening mechanism missing out on some row
> operations. Plus the WAL is Raft-specific as it includes both REPLICATE
> messages (reflecting a Write RPC from a client) and COMMIT messages
> (written out when a majority of replicas have written a REPLICATE); parsing
> and making sense of this would be challenging. Perhaps you could build
> something using Linux's inotify system for receiving file change
> notifications, but again I'd be worried about missing certain updates.
>
> Another option is to replicate the data at the OS level. For example, you
> could periodically rsync the entire cluster onto a standby cluster. There's
> bound to be data loss in the event of a failover, but I don't think you'll
> run into any corruption (though Kudu does take advantage of sparse files
> and hole punching, so you should verify that any tool you use supports
> that).
>
> Disaster Recovery is an oft-requested feature, but one that Kudu
> developers have been unable to prioritize yet. Would you or your someone on
> your team be interested in working on this?
>
> On Thu, Sep 21, 2017 at 7:12 PM Franco Venturi 
> wrote:
>
>> We are planning for a 50-100TB Kudu installation (about 200 tables or so).
>>
>> One of the requirements that we are working on is to have a secondary
>> copy of our data in a Disaster Recovery data center in a different location.
>>
>>
>> Since we are going to have inserts, updates, and deletes (for instance in
>> the case the primary key is changed), we are trying to devise a process
>> that will keep the secondary instance in sync with the primary one. The two
>> instances do not have to be identical in real-time (i.e. we are not looking
>> for synchronous writes to Kudu), but we would like to have some pretty good
>> confidence that the secondary instance contains all the changes that the
>> primary has up to say an hour before (or something like that).
>>
>>
>> So far we considered a couple of options:
>> - refreshing the seconday instance with a full copy of the primary one
>> every so often, but that would mean having to transfer say 50TB of data
>> between the two locations every time, and our network bandwidth constraints
>> would prevent to do that even on a daily basis
>> - having a column that contains the most recent time a row was updated,
>> however this column couldn't be part of the primary key (because the
>> primary key in Kudu is immutable), and therefore finding which rows have
>> been changed every time would require a full scan of the table to be
>> sync'd. It would also rely on the "last update timestamp" column to be
>> always updated by the 

Re: Table size is not decreasing after large amount of rows deleted.

2017-04-24 Thread Mike Percy
Yep, that's right -- currently the only thing that reclaims space taken by
deleted rows is a RowSet merge compaction. We haven't added any logic to
trigger those based on the number of deleted rows in a RowSet; they are
currently only triggered by logic which tries to merge RowSets with
overlapping key ranges (see
https://github.com/apache/kudu/blob/master/docs/design-docs/compaction-policy.md#intuition-behind-compaction-selection-policy
and BudgetedCompactionPolicy::PickRowSets()).

The follow-up work to add a background task to permanently remove deleted
rows is being tracked in https://issues.apache.org/jira/browse/KUDU-1979
(which I just filed).

Mike

On Mon, Apr 24, 2017 at 12:37 PM, Todd Lipcon  wrote:

> Mike can correct me if wrong, but I think the background task in 1.3 is
> only responsible for removing old deltas, and doesn't do anything to try to
> trigger compactions on rowsets with a high percentage of deleted _rows_.
>
> That's a separate bit of work that hasn't been started yet.
>
> -Todd
>
> On Sat, Apr 22, 2017 at 7:36 PM, Jason Heo 
> wrote:
>
>> Hi David.
>>
>> Thank you for your reply.
>>
>> I'll try to upgrade to 1.3 this week.
>>
>> Regards,
>>
>> Jason
>>
>> 2017-04-23 2:06 GMT+09:00 :
>>
>>> Hi Jason
>>>
>>>   In Kudu 1.2 if there are compactions happening, they will reclaim
>>> space. Unfortunately the conditions for this to happen don't always
>>> occur (if the portion of the keyspace where the deletions occurred
>>> stopped receiving writes and was already fully compacted cleanup is
>>> more unlikely)
>>>   In Kudu 1.3 we added a background task to clean up old data even in
>>> the absence of compactions. Could you upgrade?
>>>
>>> Best
>>> David
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>


Re: Number of data files and opened file descriptors are not decreasing after DROP TABLE.

2017-04-24 Thread Mike Percy
HI Jason,
I would strongly recommend upgrading to Kudu 1.3.1 as 1.3.0 has a serious
data-loss bug related to re-replication. Please see
https://kudu.apache.org/releases/1.3.1/docs/release_notes.html (if you are using the Cloudera
version of 1.3.0, no need to worry because it includes the fix for that
bug).

In 1.3.0 and 1.3.1 you should be able to use the "kudu fs check" tool to
see if you have orphaned blocks. If you do, you could use the --repair
argument to that tool to repair it if you bring your tablet server offline.
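
For example, with the tablet server stopped, an invocation would look roughly
like the following (the filesystem paths are placeholders and must match the
tablet server's own --fs_wal_dir / --fs_data_dirs configuration):

  kudu fs check --fs_wal_dir=/data/kudu/wal --fs_data_dirs=/data/1/kudu,/data/2/kudu
  kudu fs check --fs_wal_dir=/data/kudu/wal --fs_data_dirs=/data/1/kudu,/data/2/kudu --repair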

That said, Kudu uses hole punching to delete data and the same container
files may remain open even after removing data. After dropping tables, you
should see disk usage at the file system level drop.

I'm not sure that I've answered all your questions. If you have specific
concerns, please let us know what you are worried about.

Mike

On Sun, Apr 23, 2017 at 11:43 PM, Jason Heo  wrote:

> Hi.
>
> Before dropping, there were about 30 tables, 27,000 files in tablet_data
>  directory.
> I dropped most tables and there is ONLY one table which has 400 tablets in
> my test Kudu cluster.
> After dropping, there are still 27,000 files in tablet_data directory,
> and the output of /sbin/lsof is the same as before dropping. (kudu tserver opens
> almost 50M files)
>
> I'm curious whether this can be resolved using "kudu fs check", which is
> available in Kudu 1.4.
>
> I used Kudu 1.2 when executing `DROP TABLE` and am currently using Kudu 1.3.0.
>
> Regards,
>
> Jason
>
>


Re: Kudu on top of Alluxio

2017-03-27 Thread Mike Percy
+1 thanks for adding that Todd.

Mike


On Mon, Mar 27, 2017 at 9:55 AM, Todd Lipcon <t...@cloudera.com> wrote:

> On Sat, Mar 25, 2017 at 2:54 PM, Mike Percy <mpe...@apache.org> wrote:
>
>> Kudu currently relies on local storage on a POSIX file system. Right now
>> there is no support for S3, which would be interesting but is non-trivial
>> in certain ways (particularly if we wanted to rely on S3's replication and
>> disable Kudu's app-level replication).
>>
>> I would suggest using only either EXT4 or XFS file systems for production
>> deployments as of Kudu 1.3, in a JBOD configuration, with one SSD per
>> machine for the WAL and with the data disks on either SATA or SSD drives
>> depending on the workload. Anything else is untested AFAIK.
>>
>
> I would amend this and say that SSD for the WAL is nice to have, but not a
> requirement. We do lots of testing on non-SSD test clusters and I'm aware
> of many production clusters which also do not have SSD.
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
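
To make the layout recommended above concrete, a tablet server with its WAL on
an SSD and JBOD data directories would be pointed at its disks with something
like the following (paths are placeholders and other required flags, such as
the master addresses, are omitted):

  kudu-tserver --fs_wal_dir=/ssd/kudu/wal \
               --fs_data_dirs=/data/1/kudu,/data/2/kudu,/data/3/kudu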


Re: Spark on Kudu Roadmap

2017-03-27 Thread Mike Percy
Hi Ben,
Is there anything in particular you are looking for?

Thanks,
Mike

On Mon, Mar 27, 2017 at 9:48 AM, Benjamin Kim  wrote:

> Hi,
>
> Are there any plans for deeper integration with Spark especially Spark
> SQL? Is there a roadmap to look at, so I can know what to expect in the
> future?
>
> Cheers,
> Ben


Re: Kudu on top of Alluxio

2017-03-25 Thread Mike Percy
Yeah. I think the reason HBase can pretty easily use something like Alluxio or
S3, while Kudu can't as easily do so, is that HBase already relies on external
storage (HDFS) for replication, so substituting another storage system with
similar properties doesn't really amount to an architectural change for them.

Mike

Sent from my iPhone

> On Mar 25, 2017, at 3:43 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> 
> Mike,
> 
> Thanks for the informative answer. I asked this question because I saw that 
> Alluxio can be used to handle storage for HBase. Plus, we could keep our 
> cluster size to a minimum and not need to add more nodes based on storage 
> capacity. We would only need to size our clusters based on load (cores, 
> memory, bandwidth) instead.
> 
> Cheers,
> Ben
> 
> 
>> On Mar 25, 2017, at 2:54 PM, Mike Percy <mpe...@apache.org> wrote:
>> 
>> Kudu currently relies on local storage on a POSIX file system. Right now 
>> there is no support for S3, which would be interesting but is non-trivial in 
>> certain ways (particularly if we wanted to rely on S3's replication and 
>> disable Kudu's app-level replication).
>> 
>> I would suggest using only either EXT4 or XFS file systems for production 
>> deployments as of Kudu 1.3, in a JBOD configuration, with one SSD per 
>> machine for the WAL and with the data disks on either SATA or SSD drives 
>> depending on the workload. Anything else is untested AFAIK.
>> 
>> As for Alluxio, I haven't heard of people using it for permanent storage and 
>> since Kudu has its own block cache I don't think it would really help with 
>> caching. Also I don't recall Tachyon providing POSIX semantics.
>> 
>> Mike
>> 
>> Sent from my iPhone
>> 
>>> On Mar 25, 2017, at 9:50 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> Does anyone know of a way to use AWS S3 or 
>> 
> 



Re: [ANNOUNCE] Two new Kudu committer/PMC members

2016-09-12 Thread Mike Percy
Congrats Alexey and Will! Great work.

Best,
Mike

On Mon, Sep 12, 2016 at 3:55 PM, Todd Lipcon  wrote:

> Hi Kudu community,
>
> It's my great pleasure to announce that the PMC has voted to add both
> Alexey Serbin and Will Berkeley as committers and PMC members.
>
> Alexey has been contributing for a few months, including developing some
> pretty meaty (and tricky) additions. Two of note are the addition of
> doxygen for our client APIs, as well as the implementation of
> AUTO_FLUSH_BACKGROUND in C++. He has also been quite active in reviews
> recently, having reviewed 40+ patches in the last couple months. He also
> contributed by testing and voting on the recent 0.10 release.
>
> Will has been a great contributor as well, spending a lot of time in areas
> that don't get as much love from others. In particular, he's made several
> fixes to the web UIs, has greatly improved the Flume integration, and has
> been burning down a lot of long-standing bugs recently. Will also spends a
> lot of his time outside of Kudu working with users and always has good
> input on what our user community will think of a feature. Like Alexey, Will
> also participated in the 0.10 release process.
>
> Both of these community members have already been "acting the part"
> through their contributions detailed above, and the PMC is excited to
> continue working with them in their expanded roles.
>
> Please join me in congratulating them!
>
> -Todd
>


Re: Where can we Use Apache Kudu?

2016-08-05 Thread Mike Percy
Hi Darshan,
You should be able to use Kudu as an additional store alongside HDFS and
Phoenix. Your data scientists should be able to do joins across HDFS,
HBase, and Kudu using Spark. You could also use Apache Impala (incubating)
to do those joins, however Impala does not support accessing Phoenix, as
far as I know.
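
As a rough sketch of such a cross-store join with Spark (assuming Spark 2.x
with the kudu-spark integration on the classpath; the master address, Kudu
table name, HDFS path, and join key below are all placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KuduHdfsJoin {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("kudu-hdfs-join").getOrCreate();

    // Kudu table read through the kudu-spark DataSource.
    Dataset<Row> kuduDf = spark.read()
        .format("org.apache.kudu.spark.kudu")
        .option("kudu.master", "kudu-master-1:7051")
        .option("kudu.table", "metrics")
        .load();

    // Existing data already sitting on HDFS, e.g. Parquet files.
    Dataset<Row> hdfsDf = spark.read().parquet("hdfs:///warehouse/dim_hosts");

    // Join across the two stores; the result can be handed to the data scientists.
    kuduDf.join(hdfsDf, "host_id").show();

    spark.stop();
  }
}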

You can also access Kudu from R if you go through rimpala:
http://blog.cloudera.com/blog/2013/12/how-to-do-statistical-analysis-with-impala-and-r/
... but I have never used R, myself.

Hope this helps!
Mike

On Wed, Aug 3, 2016 at 11:02 PM, Darshan Shah  wrote:

> Following is our current architecture...
>
>
>
> We have a huge amount of data residing in HDFS, which we do not want to change.
>
>
>
> With Impala select queries, we are taking that data and loading it in
> HBase, using Phoenix. Which is then used by data scientists to do analysis
> using R and Spark.
>
>
>
> Each data set creates new schemas and tables in HBase, so it's fast for
> data scientists to do analysis...
>
>
>
>
>
> We want to go for Kudu for obvious advantages in this space.
>
>
>
> Can you tell me where we can fit it in?
>
>
> Thanks,
>
> Darshan...
>