Yarn AppMaster request for containers not working
Hello, I'm writing a YARN client for my distributed processing framework, and I'm not able to request containers for workers from the AppMaster via the addContainerRequest method. Please find a more detailed explanation here: http://stackoverflow.com/questions/29668132/yarn-appmaster-request-for-containers-not-working Let me know if more information is needed about configuration, server logs or client code. Many thanks, Best, Andrei
How to import custom Python module in MapReduce job?
(cross-posted from StackOverflow: http://stackoverflow.com/questions/18150208/how-to-import-custom-module-in-mapreduce-job?noredirect=1#comment26584564_18150208 ) I have a MapReduce job defined in file main.py, which imports module lib from file lib.py. I use Hadoop Streaming to submit this job to the Hadoop cluster as follows:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files lib.py,main.py -mapper ./main.py map -reducer ./main.py reduce -input input -output output

In my understanding, this should put both main.py and lib.py into the distributed cache folder on each computing machine and thus make module lib available to main. But it doesn't happen: from the log file I see that the files really are copied to the same directory, but main can't import lib, throwing ImportError. Adding the script's directory to the path didn't work:

import sys, os
sys.path.append(os.path.dirname(os.path.realpath(__file__)))
import lib  # ImportError

though loading the module manually did the trick:

import imp
lib = imp.load_source('lib', 'lib.py')

But that's not what I want. So why can the Python interpreter see other .py files in the same directory, but not import them? Note, I have already tried adding an empty __init__.py file to the same directory, without effect.
Re: How to import custom Python module in MapReduce job?
Hi Binglin, thanks for your explanation, now it makes sense. However, I'm not sure how to implement the suggested method. First of all, I found out that the `-cacheArchive` option is deprecated, so I had to use `-archives` instead. I put my `lib.py` into directory `lib` and then zipped it to `lib.zip`. After that I uploaded the archive to HDFS and linked it in the call to the Streaming API as follows:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files main.py -archives hdfs://hdfs-namenode/user/me/lib.zip -mapper ./main.py map -reducer ./main.py reduce -combiner ./main.py combine -input input -output output

But the script failed, and from the logs I see that lib.zip hasn't been unpacked. What am I missing?

On Mon, Aug 12, 2013 at 11:33 AM, Binglin Chang decst...@gmail.com wrote: Hi, The problem seems to be caused by symlinks. Hadoop uses a file cache, so every file is in fact a symlink:

lrwxrwxrwx 1 root root 65 Aug 12 15:22 lib.py -> /root/hadoop3/data/nodemanager/usercache/root/filecache/13/lib.py
lrwxrwxrwx 1 root root 66 Aug 12 15:23 main.py -> /root/hadoop3/data/nodemanager/usercache/root/filecache/12/main.py

[root@master01 tmp]# ./main.py
Traceback (most recent call last):
  File "./main.py", line 3, in ?
    import lib
ImportError: No module named lib

This should be a Python bug: when using import, it can't handle symlinks. You can try to use a directory containing lib.py and use -cacheArchive, so the symlink actually links to a directory; Python may handle this case well. Thanks, Binglin

On Mon, Aug 12, 2013 at 2:50 PM, Andrei faithlessfri...@gmail.com wrote: (cross-posted from StackOverflow: http://stackoverflow.com/questions/18150208/how-to-import-custom-module-in-mapreduce-job?noredirect=1#comment26584564_18150208 ) I have a MapReduce job defined in file main.py, which imports module lib from file lib.py. I use Hadoop Streaming to submit this job to the Hadoop cluster as follows:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files lib.py,main.py -mapper ./main.py map -reducer ./main.py reduce -input input -output output

In my understanding, this should put both main.py and lib.py into the distributed cache folder on each computing machine and thus make module lib available to main. But it doesn't happen: from the log file I see that the files really are copied to the same directory, but main can't import lib, throwing ImportError. Adding the script's directory to the path didn't work:

import sys, os
sys.path.append(os.path.dirname(os.path.realpath(__file__)))
import lib  # ImportError

though loading the module manually did the trick:

import imp
lib = imp.load_source('lib', 'lib.py')

But that's not what I want. So why can the Python interpreter see other .py files in the same directory, but not import them? Note, I have already tried adding an empty __init__.py file to the same directory, without effect.
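For reference, a minimal sketch of packing such an archive with Python's zipfile module, following the directory-behind-a-symlink approach discussed in this thread (the `#app` symlink syntax comes up later in the thread; the file and archive names are just the ones used here):

import zipfile

# Pack main.py and lib.py at the archive root. Shipping the archive with
#   -archives hdfs://hdfs-namenode/user/me/app.zip#app
# makes the NodeManager unpack it to ./app/, so ./app/lib.py sits in a
# real directory instead of behind a per-file symlink.
zf = zipfile.ZipFile('app.zip', 'w')
zf.write('main.py')
zf.write('lib.py')
zf.close()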
Re: How to import custom Python module in MapReduce job?
For some reason using the -archives option leads to "Error in configuring object" without any further information. However, I found out that the -files option works pretty well for this purpose. I was able to run my example as follows.

1. I put `main.py` and `lib.py` into the `app` directory.
2. In `main.py` I used `lib.py` directly, that is, the import string is just

import lib

3. Instead of uploading to HDFS and using the -archives option, I just pointed to the `app` directory in the -files option:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files app -mapper app/main.py map -reducer app/main.py reduce -input input -output output

It did the trick. Note that I tested with both CPython (2.6) and PyPy (1.9), so I think it's quite safe to assume this way is correct for Python scripts. Thanks for your help, Binglin, without it I wouldn't have been able to figure it out.

On Mon, Aug 12, 2013 at 1:12 PM, Binglin Chang decst...@gmail.com wrote: Maybe you didn't specify a symlink name in your command line, so the symlink name will be just lib.jar, so I am not sure how you import the lib module in your main.py file. Please try this: put main.py and lib.py in the same archive file, e.g. app.zip:

-archives hdfs://hdfs-namenode/user/me/app.zip#app -mapper app/main.py map -reducer app/main.py reduce

in main.py: import app.lib or: import .lib
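A minimal sketch of what the working layout looks like; `lib.process` is a hypothetical helper, and the entry point follows the commands above:

#!/usr/bin/env python
# app/main.py - streaming entry point, invoked as "app/main.py map" etc.
import sys

import lib  # resolves because app/ is shipped as a real directory


def run(mode):
    # Stream stdin through the (hypothetical) lib.process(mode, line).
    for line in sys.stdin:
        for out in lib.process(mode, line):
            sys.stdout.write(out + '\n')

if __name__ == '__main__':
    run(sys.argv[1])  # 'map', 'reduce' or 'combine'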
Re: Large-scale collection of logs from multiple Hadoop nodes
We have similar requirements and built our log collection system around RSyslog and Flume. It is not in production yet, but tests so far look pretty good. We rejected the idea of using AMQP since it introduces a large overhead for log events. You can probably use Flume interceptors to do real-time processing on your events, though I haven't tried anything like that before. Alternatively, you can use Twitter Storm to handle your logs.

Anyway, I wouldn't recommend using Hadoop MapReduce for real-time processing of logs, and there's at least one important reason for this. As you probably know, Flume sources obtain new events and put them into a channel, from which a sink then pulls them. If we are talking about the HDFS sink, it has a roll interval (normally time-based, but you can also roll on the total size of events in the channel). If this interval is large, you won't get real-time processing. And if it is small, Flume will produce a large number of small files in HDFS, say, of size 10-100KB each. HDFS cannot store multiple files in a single block, and with the default block size of 64M each of those 10-100KB log files occupies its own block entry (multiplied by # of replicas!), so the metadata the NameNode has to track quickly gets out of hand. Of course, you can use some ad-hoc solution like deleting small files from time to time or combining them into a larger file, but monitoring such a system becomes much harder and may lead to unexpected results. So, processing log events before they get to HDFS seems to be the better idea.

On Tue, Aug 6, 2013 at 7:54 AM, Inder Pall inder.p...@gmail.com wrote: We have been using a flume-like system for such use cases at significantly large scale and it has been working quite well. Would like to hear thoughts/challenges around using zeromq-alike systems at good enough scale. inder you are the average of 5 people you spend the most time with

On Aug 5, 2013 11:29 PM, Public Network Services publicnetworkservi...@gmail.com wrote: Hi... I am facing a large-scale usage scenario of log collection from a Hadoop cluster and examining how it should be implemented. More specifically, imagine a cluster that has hundreds of nodes, each of which constantly produces Syslog events that need to be gathered and analyzed at another point. The total amount of logs could be tens of gigabytes per day, if not more, and the reception rate in the order of thousands of events per second, if not more. One solution is to send those events over the network (e.g., using Flume) and collect them in one or more (less than 5) nodes in the cluster, or in another location, where the logs will be processed either by a constantly running MapReduce job or by non-Hadoop servers running some log processing application. Another approach could be to deposit all these events into a queuing system like ActiveMQ or RabbitMQ, or whatever. In all cases, the main objective is to be able to do real-time log analysis. What would be the best way of implementing the above scenario? Thanks! PNS
Re: ConnectionException in container, happens only sometimes
Here are the logs of the RM and the 2 NMs:

RM (master-host): http://pastebin.com/q4qJP8Ld
NM where the AM ran (slave-1-host): http://pastebin.com/vSsz7mjG
NM where the slave container ran (slave-2-host): http://pastebin.com/NMFi6gRp

The only related error I've found in them is the following (from the RM logs):

...
2013-07-11 07:46:06,225 ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AppAttemptId doesnt exist in cache appattempt_1373465780870_0005_01
2013-07-11 07:46:06,227 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call org.apache.hadoop.yarn.api.AMRMProtocolPB.allocate from 10.128.40.184:47101: output error
2013-07-11 07:46:06,228 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8030 caught an exception
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:456)
at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2140)
at org.apache.hadoop.ipc.Server.access$2000(Server.java:108)
at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:939)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1005)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1747)
2013-07-11 07:46:11,238 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved my_user to /default-rack
2013-07-11 07:46:11,283 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node my_user(cmPort: 59267 httpPort: 8042) registered with capability: 8192, assigned nodeId my_user:59267
...

Though from the stack trace it's hard to tell where this error came from. Let me know if you need any more information.

On Thu, Jul 11, 2013 at 1:00 AM, Andrei faithlessfri...@gmail.com wrote: Hi Omkar, I'm out of the office now, so I'll post it as soon as I get back there. Thanks

On Thu, Jul 11, 2013 at 12:39 AM, Omkar Joshi ojo...@hortonworks.com wrote: Can you post the RM/NM logs too? Thanks, Omkar Joshi Hortonworks Inc. http://www.hortonworks.com
ConnectionException in container, happens only sometimes
Hi, I'm running a CDH4.3 installation of Hadoop with the following simple setup:

master-host: runs NameNode, ResourceManager and JobHistoryServer
slave-1-host and slave-2-host: DataNodes and NodeManagers.

When I run a simple MapReduce job (both with the streaming API and with the Pi example from the distribution) on the client, I see that some tasks fail:

13/07/10 14:40:10 INFO mapreduce.Job: map 60% reduce 0%
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_03_0, Status : FAILED
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_05_0, Status : FAILED
...
13/07/10 14:40:23 INFO mapreduce.Job: map 60% reduce 20%
...

Every time a different set of tasks/attempts fails. In some cases the number of failed attempts becomes critical and the whole job fails; in other cases the job finishes successfully. I can't see any pattern, but I noticed the following.

Let's say the ApplicationMaster runs on _slave-1-host_. In this case on _slave-2-host_ there will be a corresponding syslog with the following contents:

...
2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
...
2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From slave-2-host/127.0.0.1 to slave-2-host:11812 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
at org.apache.hadoop.ipc.Client.call(Client.java:1229)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
at com.sun.proxy.$Proxy6.getTask(Unknown Source)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:528)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:492)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:499)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:593)
at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:241)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1278)
at org.apache.hadoop.ipc.Client.call(Client.java:1196)
... 3 more

Notice several things:
1. This exception always happens on a different host than the one the ApplicationMaster runs on.
2. It always tries to connect to localhost, not to another host in the cluster.
3. The port number (11812 in this case) is always different.

My questions are:
1. I assume it is the task (container) that tries to establish the connection, but what does it want to connect to?
2. Why does this error happen and how can I fix it?

Any suggestions are welcome. Thanks, Andrei
Re: ConnectionException in container, happens only sometimes
Hi Devaraj, thanks for your answer. Yes, I suspected it could be because of host mapping, so I have already checked (and have just re-checked) the settings in /etc/hosts on each machine, and they all are ok. I use both fully-qualified names (e.g. `master-host.company.com`) and their shortcuts (e.g. `master-host`), so it shouldn't depend on notation either. I have also checked the AM syslog. There's nothing about the network, but there are several messages like the following:

ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_1373460572360_0001_01_88

I understand the container just doesn't get registered in the AM (probably because of the same issue), is that correct? So I wonder, who sends the container complete event to the ApplicationMaster?

On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k devara...@huawei.com wrote:

1. I assume this is the task (container) that tries to establish connection, but what it wants to connect to?

It is trying to connect to the MRAppMaster for executing the actual task.

2. Why this error happens and how can I fix it?

It seems the Container is not getting the correct MRAppMaster address for some reason, or the AM is crashing before giving the task to the Container. Probably it is coming due to an invalid host mapping. Can you check that the host mapping is proper on both machines, and also check the AM log at that time for any clue.

Thanks
Devaraj k

From: Andrei [mailto:faithlessfri...@gmail.com]
Sent: 10 July 2013 17:32
To: user@hadoop.apache.org
Subject: ConnectionException in container, happens only sometimes

Hi,

I'm running a CDH4.3 installation of Hadoop with the following simple setup:

master-host: runs NameNode, ResourceManager and JobHistoryServer
slave-1-host and slave-2-host: DataNodes and NodeManagers.

When I run a simple MapReduce job (both with the streaming API and with the Pi example from the distribution) on the client, I see that some tasks fail:

13/07/10 14:40:10 INFO mapreduce.Job: map 60% reduce 0%
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_03_0, Status : FAILED
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_05_0, Status : FAILED
...
13/07/10 14:40:23 INFO mapreduce.Job: map 60% reduce 20%
...

Every time a different set of tasks/attempts fails. In some cases the number of failed attempts becomes critical and the whole job fails; in other cases the job finishes successfully. I can't see any pattern, but I noticed the following.

Let's say the ApplicationMaster runs on _slave-1-host_. In this case on _slave-2-host_ there will be a corresponding syslog with the following contents:

...
2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
...
2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From slave-2-host/127.0.0.1 to slave-2-host:11812 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
at org.apache.hadoop.ipc.Client.call(Client.java:1229)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
at com.sun.proxy.$Proxy6.getTask(Unknown Source)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
Caused by: java.net.ConnectException
Re: ConnectionException in container, happens only sometimes
If it helps, the full AM log can be found here: http://pastebin.com/zXTabyvv

On Wed, Jul 10, 2013 at 4:21 PM, Andrei faithlessfri...@gmail.com wrote: Hi Devaraj, thanks for your answer. Yes, I suspected it could be because of host mapping, so I have already checked (and have just re-checked) the settings in /etc/hosts on each machine, and they all are ok. I use both fully-qualified names (e.g. `master-host.company.com`) and their shortcuts (e.g. `master-host`), so it shouldn't depend on notation either. I have also checked the AM syslog. There's nothing about the network, but there are several messages like the following:

ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_1373460572360_0001_01_88

I understand the container just doesn't get registered in the AM (probably because of the same issue), is that correct? So I wonder, who sends the container complete event to the ApplicationMaster?

On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k devara...@huawei.com wrote:

1. I assume this is the task (container) that tries to establish connection, but what it wants to connect to?

It is trying to connect to the MRAppMaster for executing the actual task.

2. Why this error happens and how can I fix it?

It seems the Container is not getting the correct MRAppMaster address for some reason, or the AM is crashing before giving the task to the Container. Probably it is coming due to an invalid host mapping. Can you check that the host mapping is proper on both machines, and also check the AM log at that time for any clue.

Thanks
Devaraj k

From: Andrei [mailto:faithlessfri...@gmail.com]
Sent: 10 July 2013 17:32
To: user@hadoop.apache.org
Subject: ConnectionException in container, happens only sometimes

Hi,

I'm running a CDH4.3 installation of Hadoop with the following simple setup:

master-host: runs NameNode, ResourceManager and JobHistoryServer
slave-1-host and slave-2-host: DataNodes and NodeManagers.

When I run a simple MapReduce job (both with the streaming API and with the Pi example from the distribution) on the client, I see that some tasks fail:

13/07/10 14:40:10 INFO mapreduce.Job: map 60% reduce 0%
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_03_0, Status : FAILED
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_05_0, Status : FAILED
...
13/07/10 14:40:23 INFO mapreduce.Job: map 60% reduce 20%
...

Every time a different set of tasks/attempts fails. In some cases the number of failed attempts becomes critical and the whole job fails; in other cases the job finishes successfully. I can't see any pattern, but I noticed the following.

Let's say the ApplicationMaster runs on _slave-1-host_. In this case on _slave-2-host_ there will be a corresponding syslog with the following contents:

...
2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
...
2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From slave-2-host/127.0.0.1 to slave-2-host:11812 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
at org.apache.hadoop.ipc.Client.call(Client.java:1229)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225
hadoop mapreduce and contrib.lucene.index: ClassNotFoundException: org.apache.lucene.index.IndexDeletionPolicy
Hi, I tried to run the example org.apache.hadoop.contrib.index.main.UpdateIndex:

hadoop-0.21.0$ ./bin/hadoop jar hadoop-0.21.0-index.jar -inputPaths di/input1/ -outputPath di/output/ -indexPath di/ -numShards 1 -numMapTasks 2 -conf conf/index-config.xml

and got:

11/02/10 16:35:52 INFO mapreduce.Job: Task Id : attempt_201102080006_0067_m_01_2, Status : FAILED
Error: java.lang.ClassNotFoundException: org.apache.lucene.index.IndexDeletionPolicy

I double-checked that the hadoop-0.21.0/lib folder contains lucene-core-2.3.1.jar, and also tried to pass it as -libjars:

./bin/hadoop jar hadoop-0.21.0-index.jar -libjars lib/lucene-core-2.3.1.jar -inputPaths di/input1/ -outputPath di/output/ -indexPath di/ -numShards 1 -numMapTasks 2 -conf conf/index-config.xml

but the result is still the same. Thanks in advance, Andrey
Re: Question about rest interface
The latest version of the REST gateway, now available in trunk, works the way you want it to. I had the same problem you have while working on the code. There is also a simple start/stop script available (src/contrib/rest/rest.sh).

You should check out the trunk [1] or [2]. Run `ant jar` in the root folder and `ant tar` in src/contrib/rest. After running these steps you will find in build/contrib/rest/ a .tar.gz archive that contains everything you need to run a standalone REST gateway for ZooKeeper. The config file should be pretty much self-explanatory, but if you need more help let me know.

The version in the trunk is now session-aware and you can use it even to implement things like leader election (you can find some Python examples in src/contrib/rest/src/python). I'm planning to add more features to it, things like ACLs and session authentication, but unfortunately I haven't got the time. I should be able to do this in the near future.

[1] http://hadoop.apache.org/zookeeper/version_control.html
[2] http://github.com/apache/zookeeper

On Thu, Sep 30, 2010 at 7:01 PM, Patrick Hunt ph...@apache.org wrote: Hi Marc, you should check out the REST interface that's on the svn trunk; it includes new functionality and numerous fixes that might be interesting to you, and this will be part of 3.4.0. CCing Andrei, who worked on this as part of his GSoC project this summer. If you look at this file: src/contrib/rest/src/java/org/apache/zookeeper/server/jersey/RestMain.java you'll see how we start the server. Looks like we need an option to run as a process w/o assuming interactive use. It should be pretty easy for someone to patch this (if you do, please consider submitting a patch via our JIRA process, others would find it interesting). With the current code you might get away with something like running java with stdin redirected from /dev/null -- basically turn off stdin. Patrick

On Wed, Sep 29, 2010 at 3:09 PM, marc slayton gangofn...@yahoo.com wrote: Hey all -- Having a great time with Zookeeper and recently started testing the RESTful interface in src/contrib. 'ant runrestserver' creates a test instance attached to stdin, which works well, but any input kills it. How does one configure Jersey to run for real, i.e. not attached to my terminal's stdin? I've tried altering log4j settings without much luck. If there are example setup docs for Linux, could somebody point me there? FWIW, I'm running zookeeper-3.3.1 with openjdk-1.6. Cheers, and thanks in advance --

-- Andrei Savu -- http://www.andreisavu.ro/
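A minimal sketch of talking to the gateway from Python; the port (9998) and the /znodes/v1 resource layout are assumptions based on the contrib/rest docs and may differ in your build:

import json
import urllib2

def get_znode(path, base='http://localhost:9998/znodes/v1'):
    """Fetch a znode's metadata (and data) through the REST gateway."""
    req = urllib2.Request(base + path,
                          headers={'Accept': 'application/json'})
    return json.load(urllib2.urlopen(req))

print get_znode('/')  # the JSON document describing the root znode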
Re: ZK monitoring
It's not possible from a single node. You need to query all the servers in order to know who the current leader is. It should be pretty simple to implement this by parsing the output of the 'stat' 4-letter command.

On Tue, Aug 17, 2010 at 9:50 PM, Jun Rao jun...@gmail.com wrote: Hi, Is there a way to see the current leader and a list of followers from a single node in the ZK quorum? It seems that ZK monitoring (JMX, 4-letter commands) only provides info local to a node. Thanks, Jun

-- Andrei Savu
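A minimal sketch of that approach in Python; the hostnames are hypothetical, and the 'Mode:' line of the stat output is what distinguishes leader, follower and standalone servers:

import socket

def zk_mode(host, port=2181, timeout=5.0):
    """Send the 'stat' four-letter command and return the server's mode."""
    sock = socket.create_connection((host, port), timeout)
    try:
        sock.sendall('stat')
        data = ''
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    finally:
        sock.close()
    for line in data.splitlines():
        if line.startswith('Mode:'):
            return line.split(':', 1)[1].strip()  # 'leader', 'follower', ...
    return None

# Query every server in the quorum to find the current leader:
servers = ['zk1.example.com', 'zk2.example.com', 'zk3.example.com']
leader = [h for h in servers if zk_mode(h) == 'leader']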
Re: ZK monitoring
You should also take a look at ZOOKEEPER-744 [1] and ZOOKEEPER-799 [2]. The archive from 799 contains ready-to-use scripts for monitoring ZooKeeper using Ganglia, Nagios and Cacti. Let me know if you need more help.

[1] https://issues.apache.org/jira/browse/ZOOKEEPER-744
[2] https://issues.apache.org/jira/browse/ZOOKEEPER-799

On Tue, Aug 17, 2010 at 9:50 PM, Jun Rao jun...@gmail.com wrote: Hi, Is there a way to see the current leader and a list of followers from a single node in the ZK quorum? It seems that ZK monitoring (JMX, 4-letter commands) only provides info local to a node. Thanks, Jun

-- Andrei Savu
Re: building client tools
Hi, In this case I think you have to install the cppunit development package (on Debian this should work with apt-get; the package is probably libcppunit-dev). I believe that should be enough, but I don't really remember what else I installed the first time I compiled the C client. Let me know what else was needed; I would like to submit a patch to update the README file in order to avoid this problem in the future. Thanks.

On Tue, Jul 13, 2010 at 8:09 PM, Martin Waite waite@gmail.com wrote: Hi, I am trying to build the C client on debian lenny for zookeeper 3.3.1:

autoreconf -if
configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
If this token and others are legitimate, please use m4_pattern_allow.
See the Autoconf documentation.
autoreconf: /usr/bin/autoconf failed with exit status: 1

I probably need to install some required tools. Is there a list of what tools are needed to build this, please? regards, Martin

-- Andrei Savu - http://andreisavu.ro/
Re: Starting zookeeper in replicated mode
As Luka Stojanovic suggested, you need to add a file called /var/zookeeper/myid on each node, containing that node's id (1, 2 ... 6):

$ echo N > /var/zookeeper/myid

(where N is the id of that particular server, matching its server.N entry in the config).

I want to make a few more comments related to your setup and to your questions:

- there is no configured master node in a ZooKeeper cluster; the leader is automatically elected at runtime
- you can write to and read from any node at any time

Am I supposed to have an instance of ZooKeeper on each node started before running in replication mode?
- you start the cluster by starting one node at a time

Should I have each node that will be running ZK listed in the config file?
- yes, you need to have all nodes running ZK listed in the config file

Should I be using an IP address to point to a server instead of a hostname?
- it doesn't really make a difference whether you use hostnames or IP addresses

I hope this will help you. Andrei

On Mon, Jun 21, 2010 at 10:04 PM, Erik Test erik.shi...@gmail.com wrote: Hi All, I'm having a problem installing zookeeper on a cluster with 6 nodes in replicated mode. I was able to install and run zookeeper in standalone mode, but I'm unable to run zookeeper in replicated mode. I've added a list of servers in zoo.cfg as suggested by the ZooKeeper Getting Started Guide, but I get these logs displayed to screen:

[r...@master1 bin]# ./zkServer.sh start
JMX enabled by default
Using config: /root/zookeeper-3.2.2/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[r...@master1 bin]# 2010-06-21 12:25:23,738 - INFO [main:QuorumPeerConfig@80] - Reading configuration from: /root/zookeeper-3.2.2/bin/../conf/zoo.cfg
2010-06-21 12:25:23,743 - INFO [main:QuorumPeerConfig@232] - Defaulting to majority quorums
2010-06-21 12:25:23,745 - FATAL [main:QuorumPeerMain@82] - Invalid config, exiting abnormally
org.apache.zookeeper.server.quorum.QuorumPeerConfig$ConfigException: Error processing /root/zookeeper-3.2.2/bin/../conf/zoo.cfg
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:100)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:98)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:75)
Caused by: java.lang.IllegalArgumentException: /var/zookeeper/myid file is missing
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parseProperties(QuorumPeerConfig.java:238)
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:96)
... 2 more
Invalid config, exiting abnormally

And here is my config file:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=5
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=2
# the directory where the snapshot is stored.
dataDir=/var/zookeeper
# the port at which the clients will connect
clientPort=2181
server.1=master1:2888:3888
server.2=slave2:2888:3888
server.3=slave3:2888:3888

I'm a little confused as to why this doesn't work, and I haven't had any luck finding answers to some questions I have. Am I supposed to have an instance of ZooKeeper on each node started before running in replication mode? Should I have each node that will be running ZK listed in the config file? Should I be using an IP address to point to a server instead of a hostname? Thanks for your time. Erik

-- Andrei Savu http://www.andreisavu.ro/
GSoC 2010: ZooKeeper Monitoring Recipes and Web-based Administrative Interface
Hi all, My name is Andrei Savu and I am one of the GSoC 2010 accepted students. My mentor is Patrick Hunt. My objective in the next 4 months is to write tools and recipes for monitoring ZooKeeper and to implement a web-based administrative interface. I have created a wiki page for this project: - http://wiki.apache.org/hadoop/ZooKeeper/GSoCMonitoringAndWebInterface Are there any HBase / Hadoop specific ZooKeeper monitoring requirements? Regards -- Savu Andrei Website: http://www.andreisavu.ro/
Sample Application: Feed Aggregator
Hi, I have just finished the first version of a small python / thrift demo application: a basic feed aggregator. I want to share this with you because I believe it could be useful for a beginner (I have detailed install instructions). Someone new to HBase should be able to understand how to build an index table. You can find the source code on github: http://github.com/andreisavu/feedaggregator Thank you for your attention. I would highly appreciate your feedback. -- Savu Andrei Website: http://www.andreisavu.ro/
unable to start hbase 0.20. zookeeper server not found.
Hi, I have downloaded the release candidate from here: http://su.pr/1NHIlM and I am unable to make it start standalone. It seems like the zookeeper server does not start:

2009-08-28 10:43:49,872 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, host=localhost:2181 sessionTimeout=6 watcher=Thread[Thread-0,5,main]
2009-08-28 10:43:49,876 INFO org.apache.zookeeper.ClientCnxn: zookeeper.disableAutoWatchReset is false
2009-08-28 10:43:49,911 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server localhost/127.0.0.1:2181
2009-08-28 10:43:49,926 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x0 to sun.nio.ch.SelectionKeyImpl@7d2452e8
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:885)
2009-08-28 10:43:49,933 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown input

Should the zookeeper server be installed as a standalone application? I'm running bin/start-hbase.sh. On the same machine hbase 0.19.3 works fine. Sorry if this is a silly question :)

-- Savu Andrei Website: http://www.andreisavu.ro/
Re: unable to start hbase 0.20. zookeeper server not found.
While trying to write a response I found the solution :) It seems the OS environment is not what I expected it to be when running a command over ssh. This tutorial helped me understand why JAVA_HOME is not set and how to fix it: http://www.netexpertise.eu/en/ssh/environment-variables-and-ssh.html Thanks for your time.

On Fri, Aug 28, 2009 at 5:06 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: What's in the Zookeeper log? It's kept with the other HBase logs. J-D

On Fri, Aug 28, 2009 at 3:59 AM, Andrei Savu savu.and...@gmail.com wrote: Hi, I have downloaded the release candidate from here: http://su.pr/1NHIlM and I am unable to make it start standalone. It seems like the zookeeper server does not start:

2009-08-28 10:43:49,872 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, host=localhost:2181 sessionTimeout=6 watcher=Thread[Thread-0,5,main]
2009-08-28 10:43:49,876 INFO org.apache.zookeeper.ClientCnxn: zookeeper.disableAutoWatchReset is false
2009-08-28 10:43:49,911 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server localhost/127.0.0.1:2181
2009-08-28 10:43:49,926 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x0 to sun.nio.ch.SelectionKeyImpl@7d2452e8
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:885)
2009-08-28 10:43:49,933 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown input

Should the zookeeper server be installed as a standalone application? I'm running bin/start-hbase.sh. On the same machine hbase 0.19.3 works fine. Sorry if this is a silly question :) -- Savu Andrei Website: http://www.andreisavu.ro/

-- Savu Andrei Website: http://www.andreisavu.ro/
Re: hbase/jython outdated
See comments below.

On Thu, Aug 27, 2009 at 7:58 PM, stack st...@duboce.net wrote: On Wed, Aug 26, 2009 at 3:29 AM, Andrei Savu savu.and...@gmail.com wrote: I have fixed the code samples and opened a feature request on JIRA for the jython command. https://issues.apache.org/jira/browse/HBASE-1796

Thanks. Patch looks good. Will commit soon. Did you update the jython wiki page? It seems to be using the old API still.

I have updated the Jython wiki page to use the latest API. After the commit I will also update the instructions for running the sample code.

Is there any python library for the REST interface? How stable is the REST interface?

Not that I know of (a ruby one, yes IIRC). Write against Stargate if you are going to do one, since o.a.h.h.rest is deprecated in 0.20.0.

I am going to give it a try and post the results back here. What about thrift? Is it going to be deprecated?

St.Ack

-- Savu Andrei Website: http://www.andreisavu.ro/
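On the REST question above, a minimal sketch of reading a cell from Python through Stargate; the gateway address, the URL layout and the JSON cell format (base64-encoded values under '$') are assumptions based on the Stargate docs, so treat the details as illustrative:

import base64
import json
import urllib2

def get_cell(table, row, column, base='http://localhost:8080'):
    """Fetch a single cell via Stargate and decode its base64 value."""
    url = '%s/%s/%s/%s' % (base, table, row, column)
    req = urllib2.Request(url, headers={'Accept': 'application/json'})
    doc = json.load(urllib2.urlopen(req))
    cell = doc['Row'][0]['Cell'][0]
    return base64.b64decode(cell['$'])  # '$' carries the cell value

# e.g. get_cell('test', 'row_x', 'content:')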
Re: hbase/jython outdated
I have fixed the code samples and opened a feature request on JIRA for the jython command: https://issues.apache.org/jira/browse/HBASE-1796 Until recently I used the Python Thrift interface, but it has some serious issues with Unicode. Currently I am searching for alternatives. Is there any python library for the REST interface? How stable is the REST interface?

On Tue, Aug 25, 2009 at 4:18 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: I can edit this page just fine, but you have to be logged in to do that; anyone can sign in. Thx! J-D

On Tue, Aug 25, 2009 at 7:02 AM, Andrei Savu savu.and...@gmail.com wrote: Hi, The Hbase/Jython ( http://wiki.apache.org/hadoop/Hbase/Jython ) wiki page is outdated. I want to edit it but the page is marked as immutable. I have attached a working sample and a patched version of bin/hbase with the jython command added. -- Savu Andrei Website: http://www.andreisavu.ro/

-- Savu Andrei Website: http://www.andreisavu.ro/
hbase/jython outdated
Hi, The Hbase/Jython ( http://wiki.apache.org/hadoop/Hbase/Jython ) wiki page is outdated. I want to edit it but the page is marked as immutable. I have attached a working sample and a patched version of bin/hbase with the jython command added. -- Savu Andrei Website: http://www.andreisavu.ro/

import java.lang
from org.apache.hadoop.hbase import HBaseConfiguration, HTableDescriptor, HColumnDescriptor, HConstants
from org.apache.hadoop.hbase.client import HBaseAdmin, HTable
from org.apache.hadoop.hbase.io import BatchUpdate, Cell, RowResult

# First get a conf object. This will read in the configuration
# that is out in your hbase-*.xml files such as location of the
# hbase master node.
conf = HBaseConfiguration()

# Create a table named 'test' that has two column families,
# one named 'content', and the other 'anchor'. The colons
# are required for column family names.
tablename = "test"
desc = HTableDescriptor(tablename)
desc.addFamily(HColumnDescriptor("content:"))
desc.addFamily(HColumnDescriptor("anchor:"))
admin = HBaseAdmin(conf)

# Drop and recreate if it exists
if admin.tableExists(tablename):
    admin.disableTable(tablename)
    admin.deleteTable(tablename)
admin.createTable(desc)

tables = admin.listTables()
table = HTable(conf, tablename)

# Add content to 'content:' on a row named 'row_x'
row = 'row_x'
update = BatchUpdate(row)
update.put('content:', 'some content')
table.commit(update)

# Now fetch the content just added; returns a byte[]
data_row = table.get(row, "content:")
data = java.lang.String(data_row.value, "UTF8")

print "The fetched row contains the value '%s'" % data

# Delete the table.
admin.disableTable(desc.getName())
admin.deleteTable(desc.getName())
Re: Feed Aggregator Schema
Thanks for your answer, Peter. I will give it a try using this approach and I will let you know how it works.

On Mon, Aug 17, 2009 at 10:26 AM, Peter Rietzler peter.rietz...@smarter-ecommerce.com wrote: Hi, In our project we are handling event lists where we have similar requirements. We do the ordering by choosing our row keys wisely. We use the following key for our events (they should be ordered by time in ascending order):

eventListName/yyyyMMddHHmmssSSS-000[-111]

where eventListName is the name of the event list, 000 is a three-digit instance id to disambiguate between different running instances of the application, and -111 is optional to disambiguate events that occurred in the same millisecond on one instance. We additionally insert an artificial row for each day with the id

eventListName/yyyyMMddHHmmssSSS

This allows us to start scanning at the beginning of each day without searching through the event list. You need to be aware of the fact that if you have a very high load of inserts, then one HBase region server is always busy inserting while the others are idle ... if that's a problem for you, you have to find different keys for your purpose. You could also use an HBase index table, but I have no experience with it and I remember an email on the mailing list saying that this would double all requests because the API would first look up the index table and then the original table ??? (please correct me if this is not right ...) Kind regards, Peter

Andrei Savu wrote: Hello, I am working on a project involving monitoring a large number of rss/atom feeds. I want to use hbase for data storage and I have some problems designing the schema. For the first iteration I want to be able to generate an aggregated feed (last 100 posts from all feeds in reverse chronological order). Currently I am using two tables:

Feeds: column families Content and Meta; the raw feed is stored in Content:raw
Urls: column families Content and Meta; the raw post version is stored in Content:raw and the rest of the data found in the RSS is stored in Meta

I need some sort of index table for the aggregated feed. How should I build that? Is hbase a good choice for this kind of application? In other words: is it possible (in hbase) to design a schema that could efficiently answer queries like the one listed below?

SELECT data FROM Urls ORDER BY date DESC LIMIT 100

Thanks. -- Savu Andrei Website: http://www.andreisavu.ro/

-- View this message in context: http://www.nabble.com/Feed-Aggregator-Schema-tp24974071p25002264.html Sent from the HBase User mailing list archive at Nabble.com.

-- Savu Andrei Website: http://www.andreisavu.ro/
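A minimal sketch of building such keys in Python, following Peter's scheme (the names are hypothetical):

from datetime import datetime

def event_key(event_list, instance_id, seq=None):
    """Row key of the form eventListName/yyyyMMddHHmmssSSS-iii[-sss]."""
    now = datetime.utcnow()
    millis = '%03d' % (now.microsecond // 1000)
    key = '%s/%s%s-%03d' % (event_list, now.strftime('%Y%m%d%H%M%S'),
                            millis, instance_id)
    if seq is not None:
        key += '-%03d' % seq  # disambiguate events in the same millisecond
    return key

# e.g. event_key('myfeed', 7) -> 'myfeed/20090817102600123-007'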
Feed Aggregator Schema
Hello, I am working on a project involving monitoring a large number of rss/atom feeds. I want to use hbase for data storage and I have some problems designing the schema. For the first iteration I want to be able to generate an aggregated feed (last 100 posts from all feeds in reverse chronological order). Currently I am using two tables:

Feeds: column families Content and Meta; the raw feed is stored in Content:raw
Urls: column families Content and Meta; the raw post version is stored in Content:raw and the rest of the data found in the RSS is stored in Meta

I need some sort of index table for the aggregated feed. How should I build that? Is hbase a good choice for this kind of application? In other words: is it possible (in hbase) to design a schema that could efficiently answer queries like the one listed below?

SELECT data FROM Urls ORDER BY date DESC LIMIT 100

Thanks. -- Savu Andrei Website: http://www.andreisavu.ro/
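One common HBase idiom for exactly this newest-first query (not from this thread, just a sketch): store posts in the index table under a reverse-timestamp row key, so a plain scan from the start of the table returns the most recent rows first and can stop after 100:

import time

LONG_MAX = 2 ** 63 - 1  # java.lang.Long.MAX_VALUE

def reverse_ts_key(ts_millis=None):
    """Row key that sorts newest-first under HBase's lexicographic order."""
    if ts_millis is None:
        ts_millis = int(time.time() * 1000)
    return '%020d' % (LONG_MAX - ts_millis)

# Rows written with reverse_ts_key() as the key prefix come back in
# reverse chronological order from a scanner opened at the table start.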