Re: Passing Command-line Parameters to the Job Submit Command

2012-09-25 Thread Mohit Anchlia
You could always write your own properties file and read it as a resource. On Tue, Sep 25, 2012 at 12:10 AM, Hemanth Yamijala yhema...@gmail.com wrote: By java environment variables, do you mean the ones passed as -Dkey=value ? That's one way of passing them. I suppose another way is to have a
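
A minimal sketch of the -Dkey=value route, assuming the driver uses the standard Tool/ToolRunner pattern (the class and property names here are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Invoked as: hadoop jar my.jar MyDriver -Dmy.param=foo input output
    public class MyDriver extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        Configuration conf = getConf();         // -D properties are already merged in by ToolRunner
        String myParam = conf.get("my.param");  // read the value passed on the command line
        // ... configure and submit the job using conf ...
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyDriver(), args));
      }
    }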

Re: Number of Maps running more than expected

2012-08-16 Thread Mohit Anchlia
It would be helpful to see some statistics out of both the jobs, like bytes read/written, number of errors, etc. On Thu, Aug 16, 2012 at 8:02 PM, Raj Vishwanathan rajv...@yahoo.com wrote: You probably have speculative execution on. Extra maps and reduce tasks are run in case some of them fail

Re: Basic Question

2012-08-07 Thread Mohit Anchlia
creation. Thanks! On Tue, Aug 7, 2012 at 11:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote: In Mapper I often use a Global Text object and throughout the map processing I just call set on it. My question is, what happens if the collector receives a similar byte array value. Does the last one

Setting Configuration for local file:///

2012-08-07 Thread Mohit Anchlia
I am trying to write a test on the local file system but this test keeps picking up the xml files in the path even though I am setting a different Configuration object. Is there a way for me to override it? I thought the way I am doing it overwrites the configuration but it doesn't seem to be working: @Test
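
A minimal sketch of forcing local mode in a test, assuming the job under test is built from this Configuration object rather than constructing its own (class name illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapred.JobConf;

    public class LocalModeTestSetup {
      public static void main(String[] args) {
        // new Configuration(false) would skip loading *-site.xml files found on the classpath;
        // otherwise the two overrides below still force local mode for this conf instance.
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "file:///");   // local filesystem instead of HDFS
        conf.set("mapred.job.tracker", "local");   // LocalJobRunner, no JobTracker/port needed
        JobConf jobConf = new JobConf(conf);       // hand this conf to the job under test
      }
    }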

Re: Setting Configuration for local file:///

2012-08-07 Thread Mohit Anchlia
conf = new JobConf(getConf()) and I don't pass in any configuration, then is the data from the xml files in the path used? I want this to work for all the scenarios. On Wed, Aug 8, 2012 at 1:10 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to write a test on local file system

Local jobtracker in test env?

2012-08-07 Thread Mohit Anchlia
I just wrote a test where fs.default.name is file:/// and mapred.job.tracker is set to local. The test ran fine, and I also see the mapper and reducer were invoked, but what I am trying to understand is how this ran without specifying the job tracker port and which port the task tracker connected with

Re: Avro

2012-08-05 Thread Mohit Anchlia
should be able to read older data as well. Try it out. It is very straightforward. Hope this helps! Thanks! I am new to Avro; what's the best place to see some examples of how Avro deals with schema changes? I am trying to find some examples. On Sun, Aug 5, 2012 at 12:01 AM, Mohit Anchlia

Compression and Decompression

2012-07-05 Thread Mohit Anchlia
Is the compression done on the client side or on the server side? If I run hadoop fs -text then is this client decompressing the file for me?
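
A sketch of client-side decompression with the public codec API, which is roughly what a client has to do to read a compressed file (the path argument is illustrative):

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class ClientSideDecompress {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);  // e.g. a .gz file in HDFS
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path); // chosen by file extension
        InputStream in = (codec == null) ? fs.open(path) : codec.createInputStream(fs.open(path));
        IOUtils.copyBytes(in, System.out, conf, true);  // bytes are decompressed here, in the client JVM
      }
    }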

Dealing with changing file format

2012-07-02 Thread Mohit Anchlia
I am wondering what's the right way to go about designing reading input and output where the file format may change over a period of time. For instance we might start with field1,field2,field3 but at some point we add a new field4 in the input. What's the best way to deal with such scenarios? Keep a catalog of

Re: Sync and Data Replication

2012-06-10 Thread Mohit Anchlia
On Sun, Jun 10, 2012 at 9:39 AM, Harsh J ha...@cloudera.com wrote: Mohit, On Sat, Jun 9, 2012 at 11:11 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks Harsh for the detailed info. It clears things up. The only concerning thing from those pages is what happens when the client crashes

Re: Sync and Data Replication

2012-06-09 Thread Mohit Anchlia
(), HBase can survive potential failures caused by major power failure cases (among others). Let us know if this clears it up for you! On Sat, Jun 9, 2012 at 4:58 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am wondering the role of sync in replication of data to other nodes. Say client

Sync and Data Replication

2012-06-08 Thread Mohit Anchlia
I am wondering the role of sync in replication of data to other nodes. Say client writes a line to a file in Hadoop, at this point file handle is open and sync has not been called. In this scenario is data also replicated as defined by the replication factor to other nodes as well? I am wondering

Ideal file size

2012-06-06 Thread Mohit Anchlia
We have a continuous flow of data into the sequence file. I am wondering what would be the ideal file size before the file gets rolled over. I know too many small files are not good, but could someone tell me what would be the ideal size such that it doesn't overload the NameNode.

Re: Ideal file size

2012-06-06 Thread Mohit Anchlia
issues with the NameNode but rather increase in processing times if there are too many small files. Looks like I need to find that balance. It would also be interesting to see how others solve this problem when not using Flume. On Wed, Jun 6, 2012 at 7:00 AM, Mohit Anchlia mohitanch...@gmail.com

Re: Writing click stream data to hadoop

2012-05-30 Thread Mohit Anchlia
seek. Thanks Harsh, does flume also provide an API on top? I am getting this data as http calls; how would I go about using flume with http calls? On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We get click data through API calls. I now need to send this data to our

Re: Bad connect ack with firstBadLink

2012-05-04 Thread Mohit Anchlia
Please see: http://hbase.apache.org/book.html#dfs.datanode.max.xcievers On Fri, May 4, 2012 at 5:46 AM, madhu phatak phatak@gmail.com wrote: Hi, We are running a three node cluster. For the last two days, whenever we copy a file to hdfs, it is throwing java.io.IOException: Bad connect ack with

Compressing map only output

2012-04-30 Thread Mohit Anchlia
Is there a way to compress map only jobs to compress map output that gets stored on hdfs as part-m-* files? In pig I used: Would these work for plain map reduce jobs as well? set output.compression.enabled true; set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
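
For a plain (old-API) map-only job, a sketch of the equivalent settings; using the Snappy codec assumes the native library is available on the cluster:

    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOnlyCompression {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setNumReduceTasks(0);                       // map-only: part-m-* files are the job output
        FileOutputFormat.setCompressOutput(conf, true);  // same effect as mapred.output.compress=true
        FileOutputFormat.setOutputCompressorClass(conf, SnappyCodec.class);
        // equivalent to mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
      }
    }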

Re: Compressing map only output

2012-04-30 Thread Mohit Anchlia
those properties in your job conf. On Mon, Apr 30, 2012 at 5:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Is there a way to compress map only jobs to compress map output that gets stored on hdfs as part-m-* files? In pig I used : Would these work for plain map reduce jobs as well

Re: Compressing map only output

2012-04-30 Thread Mohit Anchlia
are also available at: http://hadoop.apache.org/common/docs/current/mapred-default.html (core-default.html, hdfs-default.html) On Tue, May 1, 2012 at 6:36 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks! When I tried to search for this property I couldn't find it. Is there a page that has

Re: DFSClient error

2012-04-29 Thread Mohit Anchlia
: <property> <name>dfs.datanode.max.xcievers</name> <value>4096</value> </property> To your DNs' config/hdfs-site.xml and restart the DNs. On Mon, Apr 30, 2012 at 1:35 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I even tried to lower number of parallel jobs even further but I still get

Re: DFSClient error

2012-04-27 Thread Mohit Anchlia
or get) command? If yes, how about a wordcount example? 'path/hadoop jar pathhadoop-*examples*.jar wordcount input output' -Original Message- From: Mohit Anchlia mohitanch...@gmail.com Reply-To: common-user@hadoop.apache.org common-user@hadoop.apache.org Date: Fri, 27 Apr 2012 14:36:49

Re: Design question

2012-04-26 Thread Mohit Anchlia
Any suggestions or pointers would be helpful. Are there any best practices? On Mon, Apr 23, 2012 at 3:27 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I just wanted to check how do people design their storage directories for data that is sent to the system continuously. For eg: for a given

DFSClient error

2012-04-26 Thread Mohit Anchlia
I had 20 mappers in parallel reading 20 gz files and each file around 30-40MB data over 5 hadoop nodes and then writing to the analytics database. Almost midway it started to get this error: 2012-04-26 16:13:53,723 [Thread-8] INFO org.apache.hadoop.hdfs.DFSClient - Exception in

Design question

2012-04-23 Thread Mohit Anchlia
I just wanted to check how people design their storage directories for data that is sent to the system continuously. For eg: for a given functionality we get a data feed continuously written to a sequencefile, that is then converted to a more structured format using map reduce and stored in tab

Re: Get Current Block or Split ID, and using it, the Block Path

2012-04-08 Thread Mohit Anchlia
I think if you called getInputFormat on JobConf and then called getSplits you would at least get the locations. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/InputSplit.html On Sun, Apr 8, 2012 at 9:16 AM, Deepak Nettem deepaknet...@gmail.com wrote: Hi, Is it
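
A sketch of that suggestion with the old mapred API (the input path and split-count hint are illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;

    public class SplitLocations {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        InputFormat inputFormat = conf.getInputFormat();       // TextInputFormat unless configured otherwise
        InputSplit[] splits = inputFormat.getSplits(conf, 1);  // "1" is only a hint
        for (InputSplit split : splits) {
          for (String host : split.getLocations()) {           // hosts holding this split's data
            System.out.println(split + " -> " + host);
          }
        }
      }
    }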

Re: Doubt from the book Definitive Guide

2012-04-05 Thread Mohit Anchlia
to understand the rationale behind using local disk for final output. Prashant On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote: On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote: Hi Mohit, On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch

Doubt from the book Definitive Guide

2012-04-04 Thread Mohit Anchlia
I am going through the chapter How MapReduce Works and have some confusion: 1) The description below of the Mapper says that reducers get the output file using an HTTP call. But the description under The Reduce Side doesn't specifically say if it's copied using HTTP. So, first confusion: is the output copied

Re: Doubt from the book Definitive Guide

2012-04-04 Thread Mohit Anchlia
On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote: Hi Mohit, On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am going through the chapter How mapreduce works and have some confusion: 1) Below description of Mapper says that reducers get

Re: setNumTasks

2012-03-22 Thread Mohit Anchlia
Could someone please help me answer this question? On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia mohitanch...@gmail.com wrote: What is the corresponding system property for setNumTasks? Can it be used explicitly as a system property like mapred.tasks.?

Re: setNumTasks

2012-03-22 Thread Mohit Anchlia
as setNumTasks. There is, however, setNumReduceTasks, which sets mapred.reduce.tasks. Does this answer your question? On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Could someone please help me answer this question? On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia
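
A minimal sketch of the two equivalent ways to set the reducer count on the old API (the underlying property is mapred.reduce.tasks; the count of 10 is illustrative):

    import org.apache.hadoop.mapred.JobConf;

    public class ReduceCountExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setNumReduceTasks(10);                // typed setter
        conf.setInt("mapred.reduce.tasks", 10);    // raw-property equivalent
        System.out.println(conf.getNumReduceTasks());
      }
    }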

Re: SequenceFile split question

2012-03-15 Thread Mohit Anchlia
in case of MR tasks as well. Regards Bejoy.K.S On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I have a client program that creates sequencefile, which essentially merges small files into a big file. I was wondering how is sequence file splitting the data

Re: EOFException

2012-03-15 Thread Mohit Anchlia
This is actually just hadoop job over HDFS. I am assuming you also know why this is erroring out? On Thu, Mar 15, 2012 at 1:02 PM, Gopal absoft...@gmail.com wrote: On 03/15/2012 03:06 PM, Mohit Anchlia wrote: When I start a job to read data from HDFS I start getting these errors. Does

SequenceFile split question

2012-03-14 Thread Mohit Anchlia
I have a client program that creates a sequencefile, which essentially merges small files into a big file. I was wondering how the sequence file splits the data across nodes. When I start, the sequence file is empty. Does it get split when it reaches the dfs.block.size? If so then does it mean

Re: mapred.tasktracker.map.tasks.maximum not working

2012-03-10 Thread Mohit Anchlia
, Mar 9, 2012 at 7:32 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I have mapred.tasktracker.map.tasks.maximum set to 2 in my job and I have 5 nodes. I was expecting this to have only 10 concurrent jobs. But I have 30 mappers running. Does hadoop ignore this setting when supplied from

mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum

2012-03-09 Thread Mohit Anchlia
What's the difference between mapred.tasktracker.reduce.tasks.maximum and mapred.map.tasks? I want my data to be split across only 10 mappers in the entire cluster. Can I do that using one of the above parameters?

Re: mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum

2012-03-09 Thread Mohit Anchlia
default number of reduce (map) tasks your job will have. To set the number of mappers in your application, you can write like this: *configuration.setNumMapTasks(the number you want);* Chen Actually, you can just use configuration.set() On Fri, Mar 9, 2012 at 6:42 PM, Mohit Anchlia mohitanch
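
A sketch of that per-job map-count hint (note it is only a hint; the InputFormat's splits ultimately decide how many map tasks run):

    import org.apache.hadoop.mapred.JobConf;

    public class MapCountHint {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setNumMapTasks(10);              // sets the per-job hint mapred.map.tasks
        conf.setInt("mapred.map.tasks", 10);  // the raw-property equivalent
        // mapred.tasktracker.map.tasks.maximum, by contrast, is a per-tasktracker
        // setting read from the daemon's mapred-site.xml, not from the job conf.
      }
    }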

mapred.tasktracker.map.tasks.maximum not working

2012-03-09 Thread Mohit Anchlia
I have mapred.tasktracker.map.tasks.maximum set to 2 in my job and I have 5 nodes. I was expecting this to have only 10 concurrent jobs. But I have 30 mappers running. Does hadoop ignore this setting when supplied from the job?

Re: mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum

2012-03-09 Thread Mohit Anchlia
file. On Fri, Mar 9, 2012 at 7:19 PM, Mohit Anchlia mohitanch...@gmail.com wrote: What's the difference between setNumMapTasks and mapred.map.tasks? On Fri, Mar 9, 2012 at 5:00 PM, Chen He airb...@gmail.com wrote: Hi Mohit mapred.tasktracker.reduce(map).tasks.maximum means how

Re: Profiling Hadoop Job

2012-03-08 Thread Mohit Anchlia
Can you check which user you are running this process as and compare it with the ownership on the directory? On Thu, Mar 8, 2012 at 3:13 PM, Leonardo Urbina lurb...@mit.edu wrote: Does anyone have any idea how to solve this problem? Regardless of whether I'm using plain HPROF or profiling

Re: Java Heap space error

2012-03-06 Thread Mohit Anchlia
I am still trying to see how to narrow this down. Is it possible to set the -XX:+HeapDumpOnOutOfMemoryError option on these individual tasks? On Mon, Mar 5, 2012 at 5:49 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Sorry for multiple emails. I did find: 2012-03-05 17:26:35,636 INFO
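
A hedged sketch of passing that JVM flag to the task child processes via the job conf (the heap size and dump path are illustrative):

    import org.apache.hadoop.mapred.JobConf;

    public class TaskHeapDump {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Each map/reduce attempt runs in a child JVM launched with these options.
        conf.set("mapred.child.java.opts",
            "-Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/task_dumps");
      }
    }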

Re: AWS MapReduce

2012-03-05 Thread Mohit Anchlia
, vs a flexible infrastructure that could use a local cluster or a cluster on a different cloud provider. Thanks for your input. I am assuming HDFS is created on ephemeral disks and not EBS. Also, is it possible to share some of your findings? On Sun, Mar 4, 2012 at 8:51 AM, Mohit Anchlia

Re: Java Heap space error

2012-03-05 Thread Mohit Anchlia
, 2012 at 5:03 PM, Mohit Anchlia mohitanch...@gmail.comwrote: I currently have java.opts.mapred set to 512MB and I am getting heap space errors. How should I go about debugging heap space issues?

Re: Java Heap space error

2012-03-05 Thread Mohit Anchlia
, Mohit Anchlia mohitanch...@gmail.comwrote: All I see in the logs is: 2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task: attempt_201203051722_0001_m_30_1 - Killed : Java heap space Looks like task tracker is killing the tasks. Not sure why. I increased heap from

Re: AWS MapReduce

2012-03-04 Thread Mohit Anchlia
slow. The setup is done pretty fast and there are some configuration parameters you can bypass - for example blocksizes etc. - but in the end imho setting up ec2 instances by copying images is the better alternative. Kind Regards Hannes On Sun, Mar 4, 2012 at 2:31 AM, Mohit Anchlia mohitanch

Re: AWS MapReduce

2012-03-03 Thread Mohit Anchlia
I think I found the answer to this question. However, it's still not clear if HDFS is on local disk or EBS volumes. Does anyone know? On Sat, Mar 3, 2012 at 3:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Just want to check how many are using AWS mapreduce and understand the pros and cons

Re: Hadoop pain points?

2012-03-02 Thread Mohit Anchlia
+1 On Fri, Mar 2, 2012 at 4:09 PM, Harsh J ha...@cloudera.com wrote: Since you ask about anything in general, when I forayed into using Hadoop, my biggest pain was lack of documentation clarity and completeness over the MR and DFS user APIs (and other little points). It would be nice to

kill -QUIT

2012-03-01 Thread Mohit Anchlia
When I try kill -QUIT for a job it doesn't send the stacktrace to the log files. Does anyone know why or if I am doing something wrong? I find the job using ps -ef|grep attempt. I then go to logs/userLogs/jobid/attemptid/

Adding nodes

2012-03-01 Thread Mohit Anchlia
Is this the right procedure to add nodes? I took some from the hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slaves 2. On the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshNodes?

Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run

Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
/jobtracker know there is a new node in the cluster. Is it initiated by namenode when slave file is edited? Or is it initiated by tasktracker when tasktracker is started? Sent from my iPhone On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote: On Thu, Mar 1, 2012 at 4:46 PM, Joey

Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
) then you would need to edit these files and issue the refresh command. On Mar 1, 2012, at 5:35 PM, Mohit Anchlia wrote: On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria j...@cloudera.com wrote: Not quite. Datanodes get the namenode host from fs.defalt.name in core-site.xml. Task trackers

Re: 100x slower mapreduce compared to pig

2012-02-29 Thread Mohit Anchlia
can take a look at what you are doing in the UDF vs the Mapper. 100x slow does not make sense for the same job/logic, its either the Mapper code or may be the cluster was busy at the time you scheduled MapReduce job? Thanks, Prashant On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch

Re: 100x slower mapreduce compared to pig

2012-02-29 Thread Mohit Anchlia
and couldn't find one. Does anyone know where stacktraces are generally sent? On Wed, Feb 29, 2012 at 1:08 PM, Mohit Anchlia mohitanch...@gmail.comwrote: I can't seem to find what's causing this slowness. Nothing in the logs. It's just painfuly slow. However, pig job is awesome in performance that has

Re: Invocation exception

2012-02-29 Thread Mohit Anchlia
Guide, 2nd ed.). On Wed, Feb 29, 2012 at 1:40 AM, Mohit Anchlia mohitanch...@gmail.com wrote: It looks like adding this line causes invocation exception. I looked in hdfs and I see that file in that path DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar), conf); I have

Re: Invocation exception

2012-02-28 Thread Mohit Anchlia
I commented reducer and combiner both and still I see the same exception. Could it be because I have 2 jars being added? On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com wrote: On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com wrote: For some reason I

Re: Invocation exception

2012-02-28 Thread Mohit Anchlia
); but this works just fine. On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia mohitanch...@gmail.comwrote: I commented reducer and combiner both and still I see the same exception. Could it be because I have 2 jars being added? On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku

100x slower mapreduce compared to pig

2012-02-28 Thread Mohit Anchlia
I am comparing the runtime of similar logic. The entire logic is exactly the same, but surprisingly the map reduce job that I submit is 100x slower. For pig I use a udf and for hadoop I use a mapper only, and the logic is the same as in pig. Even the splits on the admin page are the same. Not sure why it's so slow. I am submitting

Re: dfs.block.size

2012-02-27 Thread Mohit Anchlia
Can someone please suggest if parameters like dfs.block.size, mapred.tasktracker.map.tasks.maximum are only cluster wide settings or can these be set per client job configuration? On Sat, Feb 25, 2012 at 5:43 PM, Mohit Anchlia mohitanch...@gmail.comwrote: If I want to change the block size

Task Killed but no errors

2012-02-27 Thread Mohit Anchlia
I submitted a map reduce job that had 9 tasks killed out of 139. But I don't see any errors in the admin page. The entire job however has SUCCEEDED. How can I track down the reason? Also, how do I determine if this is something to worry about?

Re: dfs.block.size

2012-02-27 Thread Mohit Anchlia
How do I verify the block size of a given file? Is there a command? On Mon, Feb 27, 2012 at 7:59 AM, Joey Echeverria j...@cloudera.com wrote: dfs.block.size can be set per job. mapred.tasktracker.map.tasks.maximum is per tasktracker. -Joey On Mon, Feb 27, 2012 at 10:19 AM, Mohit Anchlia
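
One programmatic way to check, a sketch with the FileSystem API (hadoop fsck with -files -blocks also lists blocks per file):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeCheck {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));  // file to inspect
        System.out.println("block size (bytes): " + status.getBlockSize());
      }
    }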

Handling bad records

2012-02-27 Thread Mohit Anchlia
What's the best way to write records to a different file? I am doing xml processing and during processing I might come across invalid xml format. Currently I have it under a try catch block and write to log4j. But I think it would be better to just write it to an output file that just contains

Re: Invocation exception

2012-02-27 Thread Mohit Anchlia
Does it matter if reducer is set even if the no of reducers is 0? Is there a way to get more clear reason? On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com wrote: On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com wrote: For some reason I am getting

Re: Invocation exception

2012-02-27 Thread Mohit Anchlia
to the topic in that book where I'll find this information? Sent from my iPhone On Feb 27, 2012, at 8:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Does it matter if reducer is set even if the no of reducers is 0? Is there a way to get more clear reason? On Mon, Feb 27, 2012 at 8:23

Re: Handling bad records

2012-02-27 Thread Mohit Anchlia
: Mohit, Use the MultipleOutputs API: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html to have a named output of bad records. There is an example of use detailed on the link. On Tue, Feb 28, 2012 at 3:48 AM, Mohit Anchlia mohitanch
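
A sketch of the old-API MultipleOutputs usage referenced in that link; the named output "badrecords" and the parsing step are placeholders:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    public class BadRecordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private MultipleOutputs mos;

      public void configure(JobConf conf) {
        // The driver must declare the named output first, e.g.:
        // MultipleOutputs.addNamedOutput(conf, "badrecords", TextOutputFormat.class, Text.class, Text.class);
        mos = new MultipleOutputs(conf);
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        try {
          // ... parse the xml in 'value' and collect the good result via 'out' ...
        } catch (Exception e) {
          mos.getCollector("badrecords", reporter).collect(new Text("bad"), value); // bad record output
        }
      }

      public void close() throws IOException {
        mos.close();
      }
    }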

Re: LZO with sequenceFile

2012-02-26 Thread Mohit Anchlia
Eugen Stan stan.ieu...@gmail.com wrote: 2012/2/26 Mohit Anchlia mohitanch...@gmail.com: Thanks. Does it mean LZO is not installed by default? How can I install LZO? The LZO library is released under GPL and I believe it can't be included in most distributions of Hadoop because

dfs.block.size

2012-02-25 Thread Mohit Anchlia
If I want to change the block size then can I use Configuration in mapreduce job and set it when writing to the sequence file or does it need to be cluster wide setting in .xml files? Also, is there a way to check the block of a given file?
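
A sketch of setting the block size on the client-side conf used to create the file (a per-file, client-side value, not a cluster-wide change; 128 MB is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class BlockSizeOnWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);  // used when the client creates the file
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, new Path(args[0]), Text.class, Text.class);
        writer.append(new Text("key"), new Text("value"));
        writer.close();
      }
    }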

Re: LZO with sequenceFile

2012-02-25 Thread Mohit Anchlia
Thanks. Does it mean LZO is not installed by default? How can I install LZO? On Sat, Feb 25, 2012 at 6:27 PM, Shi Yu sh...@uchicago.edu wrote: Yes, it is supported by Hadoop sequence file. It is splittable by default. If you have installed and specified LZO correctly, use these:

MapReduce tunning

2012-02-24 Thread Mohit Anchlia
I am looking at some hadoop tuning parameters like io.sort.mb, mapred.child.java.opts etc. - My question was where to look for the current settings - Are these settings configured cluster wide or per job? - What's the best way to look at reasons for slow performance?

Re: Splitting files on new line using hadoop fs

2012-02-22 Thread Mohit Anchlia
'/examples/testfile5.txt using org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray); dump raw; --Original Message-- From: Mohit Anchlia To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: Splitting files on new line using hadoop fs

Re: Splitting files on new line using hadoop fs

2012-02-22 Thread Mohit Anchlia
. -- *From: *Mohit Anchlia mohitanch...@gmail.com *Date: *Wed, 22 Feb 2012 12:29:26 -0800 *To: *common-user@hadoop.apache.org; bejoy.had...@gmail.com *Subject: *Re: Splitting files on new line using hadoop fs On Wed, Feb 22, 2012 at 12:23 PM, bejoy.had...@gmail.com wrote

Streaming job hanging

2012-02-22 Thread Mohit Anchlia
Streaming job just seems to be hanging 12/02/22 17:35:50 INFO streaming.StreamJob: map 0% reduce 0% - On the admin page I see that it created 551 input splits. Could someone suggest a way to find out what might be causing it to hang? I increased io.sort.mb to 200 MB. I am using 5 data

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work with sequence file? This text file that I was referring to would be in hdfs itself. Is it still different than using sequence file? Regards Bejoy.K.S On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
, 2012 at 10:45 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We have small xml files. Currently I am planning to append these small files to one file in hdfs so that I can take advantage of splits, larger blocks and sequential IO. What I am unsure is if it's ok to append one

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
of them with different input paths. Hope this helps! Cheers Arko On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to look for examples that demonstrates using sequence files including writing to it and then running mapred on it, but unable to find

Re: Writing to SequenceFile fails

2012-02-21 Thread Mohit Anchlia
I am past this error. Looks like I needed to use CDH libraries. I changed my maven repo. Now I am stuck at *org.apache.hadoop.security.AccessControlException *since I am not writing as user that owns the file. Looking online for solutions On Tue, Feb 21, 2012 at 12:48 PM, Mohit Anchlia

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
, Mohit Anchlia mohitanch...@gmail.com wrote: Sorry may be it's something obvious but I was wondering when map or reduce gets called what would be the class used for key and value? If I used org.apache.hadoop.io.Text value = *new* org.apache.hadoop.io.Text(); would the map be called

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
It looks like in mapper values are coming as binary instead of Text. Is this expected from sequence file? I initially wrote SequenceFile with Text values. On Tue, Feb 21, 2012 at 4:13 PM, Mohit Anchlia mohitanch...@gmail.comwrote: Need some more help. I wrote sequence file using below code

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
Finally figured it out. I needed to use SequenceFileAsTextInputFormat. There is just a lack of examples that makes it difficult when you start. On Tue, Feb 21, 2012 at 4:50 PM, Mohit Anchlia mohitanch...@gmail.com wrote: It looks like in mapper values are coming as binary instead of Text
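
For completeness, a sketch of the input-format wiring that resolved this (old API; class name illustrative):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileAsTextInputFormat;

    public class SeqFileAsTextSetup {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // With SequenceFileAsTextInputFormat the mapper receives both key and value as Text,
        // instead of the raw writable types stored in the sequence file.
        conf.setInputFormat(SequenceFileAsTextInputFormat.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
      }
    }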

Re: Processing small xml files

2012-02-18 Thread Mohit Anchlia
can't seem to find examples of how to do xml processing in Pig. Can you please send me some pointers? Basically I need to convert my xml to more structured format using hadoop to write it to database. On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia mohitanch...@gmail.com wrote: On Tue, Feb 14

Re: Hadoop install

2012-02-18 Thread Mohit Anchlia
, as always, are well worth reading. Tom Deutsch Program Director Information Management Big Data Technologies IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Mohit Anchlia mohitanch...@gmail.com 02/18/2012 06:24 AM

Re: Processing small xml files

2012-02-17 Thread Mohit Anchlia
On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com wrote: I'm not sure what you mean by flat format here. In my scenario, I have a file input.xml that looks like this: <myfile> <section> <value>1</value> </section> <section> <value>2</value> </section> </myfile>

Re: Processing small xml files

2012-02-12 Thread Mohit Anchlia
On Sun, Feb 12, 2012 at 9:24 AM, W.P. McNeill bill...@gmail.com wrote: I've used the Mahout XMLInputFormat. It is the right tool if you have an XML file with one type of section repeated over and over again and want to turn that into Sequence file where each repeated section is a value. I've

Developing MapReduce

2011-10-10 Thread Mohit Anchlia
I use eclipse. Is this http://wiki.apache.org/hadoop/EclipsePlugIn still the best way to develop mapreduce programs in hadoop? Just want to make sure before I go down this path. Or should I just add hadoop jars in my classpath of eclipse and create my own MapReduce programs. Thanks

Re: incremental loads into hadoop

2011-10-03 Thread Mohit Anchlia
This process of managing looks like more pain long term. Would it be easier to store in Hbase which has smaller block size? What's the avg. file size? On Sun, Oct 2, 2011 at 7:34 PM, Vitthal Suhas Gogate gog...@hortonworks.com wrote: Agree with Bejoy, although to minimize the processing latency

Re: Binary content

2011-09-01 Thread Mohit Anchlia
On Thu, Sep 1, 2011 at 1:25 AM, Dieter Plaetinck dieter.plaeti...@intec.ugent.be wrote: On Wed, 31 Aug 2011 08:44:42 -0700 Mohit Anchlia mohitanch...@gmail.com wrote: Does map-reduce work well with binary contents in the file? This binary content is basically some CAD files and map reduce

Binary content

2011-08-31 Thread Mohit Anchlia
Does map-reduce work well with binary contents in the file? This binary content is basically some CAD files, and the map reduce program needs to read these files using some proprietary tool, extract values and do some processing. Wondering if there are others doing similar type of processing. Best

Re: Question about RAID controllers and hadoop

2011-08-11 Thread Mohit Anchlia
On Thu, Aug 11, 2011 at 3:26 PM, Charles Wimmer cwim...@yahoo-inc.com wrote: We currently use P410s in 12 disk system. Each disk is set up as a RAID0 volume. Performance is at least as good as a bare disk. Can you please share what throughput you see with P410s? Are these SATA or SAS? On

Re: maprd vs mapreduce api

2011-08-05 Thread Mohit Anchlia
On Fri, Aug 5, 2011 at 3:42 PM, Stevens, Keith D. steven...@llnl.gov wrote: The Mapper and Reducer class in org.apache.hadoop.mapreduce implement the identity function. So you should be able to just do conf.setMapperClass(org.apache.hadoop.mapreduce.Mapper.class);

Re: Hadoop cluster network requirement

2011-08-01 Thread Mohit Anchlia
Assuming everything is up this solution still will not scale given the latency, tcpip buffers, sliding window etc. See BDP Sent from my iPad On Aug 1, 2011, at 4:57 PM, Michael Segel michael_se...@hotmail.com wrote: Yeah what he said. Its never a good idea. Forget about losing a NN or a

Re: Moving Files to Distributed Cache in MapReduce

2011-07-29 Thread Mohit Anchlia
Is this what you are looking for? http://hadoop.apache.org/common/docs/current/mapred_tutorial.html search for jobConf On Fri, Jul 29, 2011 at 1:51 PM, Roger Chen rogc...@ucdavis.edu wrote: Thanks for the response! However, I'm having an issue with this line Path[] cacheFiles =
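
A sketch of the DistributedCache round trip discussed in that tutorial section (the URI is illustrative):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheFileExample {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // Driver side: register an HDFS file before submitting the job.
        DistributedCache.addCacheFile(new URI("/cache/lookup.txt"), conf);

        // Task side (e.g. in Mapper.configure): resolve the local copies.
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
      }
    }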

Re: Replication and failure

2011-07-28 Thread Mohit Anchlia
operation? I am assuming there will be some errors in this case. On Thu, Jul 28, 2011 at 5:08 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Just trying to understand what happens if there are 3 nodes with replication set to 3 and one node fails. Does it fail the writes too? If there is a link

Replication and failure

2011-07-27 Thread Mohit Anchlia
Just trying to understand what happens if there are 3 nodes with replication set to 3 and one node fails. Does it fail the writes too? If there is a link that I can look at will be great. I tried searching but didn't see any definitive answer. Thanks, Mohit

Re: No. of Map and reduce tasks

2011-05-31 Thread Mohit Anchlia
fire up some nix commands and pack together that file onto itself a bunch of times and then put it back into hdfs and let 'er rip Sent from my mobile. Please excuse the typos. On 2011-05-26, at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I think I understand that by last 2 replies

Using own InputSplit

2011-05-27 Thread Mohit Anchlia
I am new to hadoop and from what I understand by default hadoop splits the input into blocks. Now this might result in splitting a line of record into 2 pieces and getting spread accross 2 maps. For eg: Line abcd might get split into ab and cd. How can one prevent this in hadoop and pig? I am

Re: Using own InputSplit

2011-05-27 Thread Mohit Anchlia
, does not happen; and the way the splitting is done for Text files is explained in good detail here: http://wiki.apache.org/hadoop/HadoopMapReduce Hope this solves your doubt :) On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am new to hadoop and from what I

Re: Using own InputSplit

2011-05-27 Thread Mohit Anchlia
announcing something. What you describe, does not happen; and the way the splitting is done for Text files is explained in good detail here: http://wiki.apache.org/hadoop/HadoopMapReduce Hope this solves your doubt :) On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote

How to copy over using dfs

2011-05-27 Thread Mohit Anchlia
If I have to overwrite a file I generally use: hadoop dfs -rm file, then hadoop dfs -copyFromLocal (or -put) file. Is there a command to overwrite/replace the file instead of doing rm first?
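
A programmatic alternative, a sketch using the FileSystem API's overwrite flag (later fs shell releases also added an -f overwrite flag to -put/-copyFromLocal, depending on version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OverwritePut {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // copyFromLocalFile(delSrc, overwrite, src, dst): overwrite=true replaces an
        // existing HDFS file without a separate -rm step.
        fs.copyFromLocalFile(false, true, new Path(args[0]), new Path(args[1]));
      }
    }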

Help with pigsetup

2011-05-26 Thread Mohit Anchlia
I sent this to pig apache user mailing list but have got no response. Not sure if that list is still active. thought I will post here if someone is able to help me. I am in process of installing and learning pig. I have a hadoop cluster and when I try to run pig in mapreduce mode it errors out:

Re: Help with pigsetup

2011-05-26 Thread Mohit Anchlia
basically an issue of hadoop version differences between the one Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which is newer. On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I sent this to pig apache user mailing list but have got no response

Re: Help with pigsetup

2011-05-26 Thread Mohit Anchlia
/antlr-runtime-3.2.jar; Is this a windows command? Sorry, have not used this before. 2011/5/26 Mohit Anchlia mohitanch...@gmail.com For some reason I don't see that reply from Jonathan in my Inbox. I'll try to google it. What should be my next step in that case? I can't use pig then? On Thu
