Re: Parent nodes multi-step transactions

2010-08-24 Thread Thomas Koch
Hi Gustavo,

I have a very strong feeling against more complex operations in the ZK server. 
These are things that should be provided by a ZK client helper library. The 
zkclient library from 101tec, for example, gives you exactly that.
If you're planning to write another layer on top of the ZK API please have a 
look at https://issues.apache.org/jira/browse/ZOOKEEPER-835
I'm planning to provide an alternative java client API for 3.4.0 and would 
then propose to deprecate the current one in the long run.
You can preview the new API at
http://github.com/thkoch2001/zookeeper/tree/operation_classes
However, we need to redo it on top of ZOOKEEPER-823 once that is applied to 
trunk.

Best regards,

Thomas Koch, http://www.koch.ro


Re: Parent nodes multi-step transactions

2010-08-24 Thread Thomas Koch
Gustavo Niemeyer:
 Hi Thomas,
 
  I have a very strong feeling against more complex operations in the ZK
  server.
 
 Can you please describe a little better what that feeling is about?
Every piece of functionality added to ZK will make it harder to maintain. The 
use case you're asking for is IMHO easily solvable in a client-side helper 
library, so there's no reason to have ZK itself solve it.

  These are things that should be provided by a ZK client helper library.
  The
 
 Which things should be provided by client helper libraries? 
 [...]
  zkclient library from 101tec for example gives you exactly that.
 
 It's not clear to me what exactly that is in this context.  I've
 looked for the code and couldn't find an answer/alternative to the
 issues discussed in this thread.
recursiveDelete, recursiveCreate: If you want to create /A/C/D-1, just use 
recursiveCreate and you will end up with /A/C/D-1, even if the full parent 
path did not exist before.

  If you're planning to write another layer on top of the ZK API please
  have a look at https://issues.apache.org/jira/browse/ZOOKEEPER-835
 
 Looked there as well.  Also can't find anything relative to this
 discussion.

  I'm planning to provide an alternative java client API for 3.4.0 and
  would then propose to deprecate the current one in the long run.
  You can preview the new API at
  http://github.com/thkoch2001/zookeeper/tree/operation_classes
 
 And this is a full branch of ZK.  Tried checking out the commit
 messages or something to get an idea of what you mean, but also am
 unable to find answers to these problems.
The idea is to provide operation classes that can be handed around. So you can 
create a list of create operations and hand the full list to a specific 
executor. If the executor ignores NodeExists exceptions, then you already have 
an implementation of recursiveCreate:

List<Create> creates = Arrays.asList(
    new Create("/A"), new Create("/A/C"), new Create("/A/C/D-1"));
myExecutor.execute(creates);
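To make the operation-list idea concrete, here is a minimal, self-contained toy 
sketch. The class and method names are my own illustration, not the actual API 
from the operation_classes branch, and an in-memory set stands in for a live 
ZooKeeper namespace:

```java
import java.util.*;

// Illustrative stand-in for a proposed operation class.
class Create {
    final String path;
    Create(String path) { this.path = path; }
}

// An executor that treats "node already exists" as a no-op; applied to a
// parent-first list of creates, this yields recursiveCreate behaviour.
class IgnoreExistsExecutor {
    // In-memory stand-in for the ZK namespace; a real executor would call
    // ZooKeeper.create() and swallow KeeperException.NodeExistsException.
    final Set<String> nodes = new TreeSet<>();

    void execute(List<Create> ops) {
        for (Create op : ops) {
            nodes.add(op.path); // adding an existing path simply does nothing
        }
    }
}

public class RecursiveCreateSketch {
    public static void main(String[] args) {
        IgnoreExistsExecutor ex = new IgnoreExistsExecutor();
        ex.execute(Arrays.asList(
            new Create("/A"), new Create("/A/C"), new Create("/A/C/D-1")));
        // Running it again must not fail: NodeExists is ignored.
        ex.execute(Arrays.asList(new Create("/A"), new Create("/A/C/D-1")));
        System.out.println(ex.nodes);
    }
}
```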

 If you actually have/know of solutions for the suggested problems
 which were not yet covered here, I'm very interested in knowing about
 them, but will need slightly more precise information.
An alternative would be a special znode in /A that signals that the full 
structure has been set up correctly.

Best regards,

Thomas Koch, http://www.koch.ro


Re: Non Hadoop scheduling frameworks

2010-08-24 Thread Thomas Koch
Todd Nine:
 [...]
 UC1: Synchronized Jobs
 1. A job is fired across all nodes
 2. The nodes wait until the barrier is entered by all participants
 3. The nodes process the data and leave
 4. On all nodes leaving the barrier, the Leader node marks the job as
 complete.
 
 UC2: Multiple Jobs per Node
 1. A Job is scheduled for a future time on a specific node (usually the
 same node that's creating the trigger)
 2. A Trigger can be overwritten and cancelled without the job firing
 3. In the event of a node failure, the Leader will take all pending jobs
 from the failed node, and partition them across the remaining nodes.

Hi Todd,

we've implemented UC2 for an internal project with ZK. I'd love to free the 
code, but I have to ask our product owner first. It's a small company, so this 
could go quickly, but I don't know how to convince them; they're afraid of 
giving away stuff.
The basic idea is that we have two folders in ZK, a work queue and a lock 
folder. The items (znodes) in the work queue are timestamp-prefixed. Every node 
consuming the queue tries to create an ephemeral znode in the lock folder 
before starting on a work item. Work items are actually URLs and we lock on 
the domain. Since we also use a lock pool on every worker that only releases 
on overflow or timeout, we can reuse locks and also get weak locality for 
URLs of the same domain. That's all the magic: six Java classes on top of 
our own ZK helper lib.
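The queue/lock scheme can be sketched as a self-contained in-memory model. All 
names here are my own; in the real system both folders are znodes and the locks 
are ephemeral, so a dead worker's locks vanish automatically with its session:

```java
import java.util.*;

// In-memory sketch of a timestamp-prefixed work queue plus per-domain locks.
public class DomainLockQueue {
    // Work queue: timestamp-prefixed keys, so the oldest item sorts first.
    final TreeMap<String, String> queue = new TreeMap<>(); // key -> url
    final Set<String> locks = new HashSet<>();             // locked domains

    void enqueue(long timestamp, String url) {
        queue.put(String.format("%013d-%s", timestamp, url), url);
    }

    static String domainOf(String url) {
        return url.replaceFirst("^https?://", "").split("/")[0];
    }

    // Claim the oldest item whose domain is not locked. The caller keeps the
    // domain lock until release(); a real worker's lock pool would release
    // only on overflow or timeout, giving weak per-domain locality.
    String claim() {
        for (Map.Entry<String, String> e : queue.entrySet()) {
            String domain = domainOf(e.getValue());
            if (locks.add(domain)) {      // ~ create the ephemeral lock znode
                queue.remove(e.getKey());
                return e.getValue();
            }
        }
        return null; // every queued domain is currently locked
    }

    void release(String url) { locks.remove(domainOf(url)); }
}
```

A worker loop would claim, fetch, then release; the timestamp prefix keeps 
ordering fair across domains.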

Best regards,

Thomas Koch, http://www.koch.ro


Re: ZkClient package

2010-07-14 Thread Thomas Koch
Jun Rao:
 Hi,
 
 ZkClient (http://github.com/sgroschupf/zkclient) provides a nice wrapper
 around the ZooKeeper client and handles things like retry during
 ConnectionLoss events, and auto reconnect. Does anyone (other than Katta)
 use it? Would people recommend using it? Thanks,
 
 Jun
Hi Jun,

I have some ideas for an alternative Zk Client design, but haven't had the 
time yet to hack it together:
http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-dev/201005.mbox/%3c201005261509.54236.tho...@koch.ro%3e

I don't like zkClient very much, but AFAIK it's the best thing available right 
now. Also have a look at this bug:
http://oss.101tec.com/jira/browse/KATTA-137

Best regards,

Thomas Koch, http://www.koch.ro


HBase has entered Debian (unstable)

2010-05-13 Thread Thomas Koch
Hi,

HBase 0.20.4 has entered Debian unstable, should slide into testing after the 
usual 14-day period, and will therefore most likely be included in the upcoming 
Debian Squeeze.

http://packages.debian.org/source/sid/hbase

Please note that this packaging effort is still very much work in progress 
and not yet suitable for production use. However, the aim is to have a 
rock-solid stable HBase in squeeze+1, respectively in Debian testing, in the 
coming months. Meanwhile, the HBase package in Debian can raise HBase's 
visibility and lower the entry barrier.

So if somebody wants to try out HBase (on Debian), it is as easy as:

aptitude install zookeeperd hbase-masterd

In other news: zookeeper is in Debian testing as of today.

Best regards,

Thomas Koch, http://www.koch.ro


Re: Using Zookeeper to distribute tasks

2010-04-27 Thread Thomas Koch
David Rouchy:
 Hi all,
 
 We are studying using ZooKeeper to manage configuration across multiple
 processes and servers. What would also be interesting, as ZooKeeper knows
 the list of processes running, would be to use it to distribute tasks.
 
 We have some long-running tasks, so we use multiple servers to process
 multiple tasks at the same time. My idea would be to use multiple ephemeral
 watchers connected to one znode. A process will send data to this node when
 a new task has to be launched. But here is my issue: I would like only one
 watcher to be triggered by this change (a random watcher in the list). Is
 there a way to do such a thing in ZooKeeper?
 
 Regards,
 
 David Rouchy

Hi David,

there's gearman[1], a client-server system to distribute tasks. I've mentioned 
it twice already on this list, so sorry for the repetition. Gearman servers 
and clients are available in different implementations that use the same 
protocol and are, AFAIK, interchangeable.
If somebody were to build a gearman server using ZooKeeper, ZK would get a 
whole ecosystem to conquer, for free!

[1] http://gearman.org

Regards,

Thomas Koch, http://www.koch.ro


feed queue fetcher with hadoop/zookeeper/gearman?

2010-04-12 Thread Thomas Koch
Hi,

I'd like to implement a feed loader with Hadoop and most likely HBase. I've 
got around 1 million feeds that should be loaded and checked for new entries. 
However, the feeds have different priorities, based on their average update 
frequency in the past and on their relevance.
The feeds (url, last_fetched timestamp, priority) are stored in HBase. How 
could I implement the fetch queue for the loaders?

- An hourly map-reduce job to produce new queues for each node and save them 
on the nodes?
  - But how do we know which feeds have been fetched in the last hour?
  - What to do if a fetch node dies?

- Store a fetch queue in ZooKeeper and add to the queue with map-reduce each 
hour?
  - Isn't that too much load for ZooKeeper? (I could make one znode for a 
bunch of URLs...?)

- Use gearman to store the fetch queue?
  - But the gearman job server still seems to be a SPOF.

[1] http://gearman.org

Thank you!

Thomas Koch, http://www.koch.ro


Re: feed queue fetcher with hadoop/zookeeper/gearman?

2010-04-12 Thread Thomas Koch
Mahadev Konar:
 Hi Thomas,
   There are a couple of projects inside Yahoo! that use ZooKeeper as an
 event manager for feed processing.
 
 I am little bit unclear on your example below. As I understand it-
 
 1. There are 1 million feeds that will be stored in Hbase.
 2. A map reduce job will be run on these feeds to find out which feeds need
 to be fetched.
 3. This will create queues in ZooKeeper to fetch the feeds
 4.  Workers will pull items from this queue and process feeds
 
 Did I understand it correctly? Also, if above is the case, how many queue
 items would you anticipate be accumulated every hour?
Yes, that's exactly what I'm thinking about. Currently one node processes 
around 2 feeds an hour and we have 5 feed-fetch-nodes, which would mean ~10 
queue items/hour. Each queue item should carry some meta information, most 
importantly the feed items already known to the system, so that only new 
items get processed.
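A queue item of that shape might look like the following self-contained sketch; 
the class and field names are my own invention, and in practice the item would 
be serialized into a timestamp-prefixed znode:

```java
import java.util.*;

// Sketch of a per-feed queue item: besides the feed URL it carries the entry
// ids already known to the system, so a fetch node can skip everything but
// genuinely new entries.
public class FeedQueueItem {
    final String feedUrl;
    final Set<String> knownEntryIds;

    FeedQueueItem(String feedUrl, Set<String> knownEntryIds) {
        this.feedUrl = feedUrl;
        this.knownEntryIds = knownEntryIds;
    }

    // Given the entry ids found in a fresh fetch, keep only the new ones.
    List<String> newEntries(List<String> fetchedIds) {
        List<String> fresh = new ArrayList<>();
        for (String id : fetchedIds) {
            if (!knownEntryIds.contains(id)) fresh.add(id);
        }
        return fresh;
    }
}
```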

Thomas Koch, http://www.koch.ro


[ANN] Eclipse GIT plugin beta version released

2010-03-31 Thread Thomas Koch
GIT is one of the most popular distributed version control systems. 
In the hope that more Java developers may want to explore the world of easy 
branching, merging and patch management, I'd like to inform you that a beta 
version of the upcoming Eclipse GIT plugin is available:

http://www.infoq.com/news/2010/03/egit-released
http://aniszczyk.org/2010/03/22/the-start-of-an-adventure-egitjgit-0-7-1/

Maybe, one day, some apache / hadoop projects will use GIT... :-)

(Yes, I know git.apache.org.)

Best regards,

Thomas Koch, http://www.koch.ro


zookeeper for gearman?

2010-02-19 Thread Thomas Koch
CC to zookeeper-user

Hi,

I haven't kept myself up to date on gearman development in the last weeks 
(months), since I've been occupied with other duties, mostly the introduction 
of hadoop[1] in our company.
One subproject of hadoop is zookeeper[2], a centralized service for 
maintaining configuration information, naming, providing distributed 
synchronization, and providing group services.
One of the documented use cases of zookeeper is a distributed queue[3]. 
(However, this document doesn't seem to be that well written.)
I wondered if anyone from the gearman project has already heard of zookeeper 
and perhaps considered a gearman implementation on top of it? It shouldn't be 
that hard, and it would get you replication for free.
Maybe I'll try once my current project is done. :-)

[1] http://hadoop.apache.org/
[2] http://hadoop.apache.org/zookeeper/
[3] http://hadoop.apache.org/zookeeper/docs/current/zookeeperTutorial.html

Best regards,

Thomas Koch, http://www.koch.ro


init.d scripts for zookeeper?

2010-01-14 Thread Thomas Koch
Hi,

does anybody have an init.d script for zookeeper lying around which I could 
adapt and include in the Debian package of zookeeper?

Thanks,

Thomas Koch, http://www.koch.ro


Packaging for Debian started

2010-01-08 Thread Thomas Koch
Hi,

I've started packaging zookeeper for Debian at:

http://git.debian.org/?p=pkg-java/zookeeper.git

The Debian RequestForPackaging bug can be found at
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=561947

Thank you for ZooKeeper!

Thomas Koch, http://www.koch.ro