Re: Delivery Status Notification (Failure)

2013-02-12 Thread Raj Vishwanathan
Cos,

I understand that there are rules. What are these rules? Is it Hive vs. Hadoop
(this I understand) or Apache Hadoop vs. a specific distribution (this I am
not clear about)?


Sent from my iPad
Please excuse the typos. 

On Feb 12, 2013, at 8:56 PM, Konstantin Boudnik  wrote:

> With all due respect, sir, these mailing lists have certain rules that evidently
> don't coincide with your philosophy. 
> 
> Cos
> 
> On Tue, Feb 12, 2013 at 08:45PM, Raj Vishwanathan wrote:
>> Arun
>> 
>> I don't understand your reply! Had you redirected this person to the Hive
>> mailing list I would have understood.
>> 
>> My philosophy on any mailing list has always been: if I know the answer to a
>> question, I reply; else I humbly walk away.
>> 
>> I got a lot of help from this group for my (mostly stupid) questions - and
>> people helped me. I would like to return the favor when ( and if) I can.
>> 
>> My humble $0.01.:-)
>> 
>> And for the record- I don't know the answer to the question on microstrategy 
>> :-)
>> 
>> 
>> Raj
>> 
>> 
>> 
>> 
>> 
>>> 
>>> From: Arun C Murthy 
>>> To: user@hadoop.apache.org 
>>> Sent: Tuesday, February 12, 2013 6:42 PM
>>> Subject: Re: Delivery Status Notification (Failure)
>>> 
>>> 
>>> Pls don't cross-post, this belongs only on the cdh lists.
>>> 
>>> 
>>> On Feb 12, 2013, at 12:55 AM, samir das mohapatra wrote:
>>> 
>>> 
 
 
 Hi All,
    I wanted to know how to connect Hive (hadoop-cdh4 distribution) with
 MicroStrategy.
    Any help is very helpful.
 
   Waiting for your response.
 
 Note: It is a little bit urgent; does anyone have experience with that?
 Thanks,
 samir
>>> 
>>> --
>>> Arun C. Murthy
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>> 
>>> 
>>> 
>>> 
>>> 


Re: Anyway to load certain Key/Value pair fast?

2013-02-12 Thread William Kang
Hi Harsh,
Thanks a lot for your reply and great suggestions.

In the practical cases, the values usually do not reside in the same
data node. Instead, they are mostly distributed by the key range
itself. So, it does require 20G of memory, but distributed in
different nodes.

The MapFile solution is very intriguing. I am not very familiar with
it though. I assume it kind of resembles the basic idea of HBase? I will
certainly try it out and follow up if there are questions.

I agree that using HBase would be much easier. But the value size
makes me worry that it is going to push it to the edge. If I do this more
often, I will definitely consider using HBase.

Many thanks for the great reply.


William


On Wed, Feb 13, 2013 at 12:38 AM, Harsh J  wrote:
> My reply to your questions is inline.
>
> On Wed, Feb 13, 2013 at 10:59 AM, Harsh J  wrote:
>> Please do not use the general@ lists for any user-oriented questions.
>> Please redirect them to user@hadoop.apache.org lists, which is where
>> the user community and questions lie.
>>
>> I've moved your post there and have added you on CC in case you
>> haven't subscribed there. Please reply back only to the user@
>> addresses. The general@ list is for Apache Hadoop project-level
>> management and release oriented discussions alone.
>>
>> On Wed, Feb 13, 2013 at 10:54 AM, William Kang  
>> wrote:
>>> Hi All,
>>> I am trying to figure out a good solution for such a scenario as following.
>>>
>>> 1. I have a 2T file (let's call it A), filled by key/value pairs,
>>> which is stored in the HDFS with the default 64M block size. In A,
>>> each key is less than 1K and each value is about 20M.
>>>
>>> 2. Occasionally, I will run analysis by using a different type of data
>>> (usually less than 10G, and let's call it B) and do look-up table
>>> alike operations by using the values in A. B resides in HDFS as well.
>>>
>>> 3. This analysis would require loading only a small number of values
>>> from A (usually less than 1000 of them) into the memory for fast
>>> look-up against the data in B. The way B finds the few values in A is
>>> by looking up for the key in A.
>
> About 1000 such rows would equal a memory expense of near 20 GB, given
> the value size of A you've noted above. The solution may need to be
> considered with this in mind, if the whole lookup table is to be
> eventually generated into the memory and never discarded until the end
> of processing.
>
>>> Is there an efficient way to do this?
>
> Since HBase may be too much for your simple needs, have you instead
> considered using MapFiles, which allow fast key lookups at a file
> level over HDFS/MR? You can have these files either highly replicated
> (if their size is large), or distributed via the distributed cache in
> the lookup jobs (if they are infrequently used and small sized), and
> be able to use the  MapFile reader API to perform lookups of keys and
> read values only when you want them.
>
>>> I was thinking if I could identify the locality of the block that
>>> contains the few values, I might be able to push the B into the few
>>> nodes that contains the few values in A?  Since I only need to do this
>>> occasionally, maintaining a distributed database such as HBase cant be
>>> justified.
>
> I agree that HBase may not be wholly suited to be run just for this
> purpose (unless A's also gonna be scaling over time).
>
> Maintaining value -> locality mapping would need to be done by you. FS
> APIs provide locality info calls, and your files may be
> key-partitioned enough to identify each one's range, and you can
> combine the knowledge of these two to do something along these lines.
>
> Using HBase may also turn out to be "easier", but thats upto you. You
> can also choose to tear it down (i.e. the services) when not needed,
> btw.
>
>>> Many thanks.
>>>
>>>
>>> Cao
>>
>>
>>
>> --
>> Harsh J
>
>
>
> --
> Harsh J


Re: Anyway to load certain Key/Value pair fast?

2013-02-12 Thread William Kang
Hi Harsh,
Thanks for moving the post to the correct list.


William

On Wed, Feb 13, 2013 at 12:29 AM, Harsh J  wrote:
> Please do not use the general@ lists for any user-oriented questions.
> Please redirect them to user@hadoop.apache.org lists, which is where
> the user community and questions lie.
>
> I've moved your post there and have added you on CC in case you
> haven't subscribed there. Please reply back only to the user@
> addresses. The general@ list is for Apache Hadoop project-level
> management and release oriented discussions alone.
>
> On Wed, Feb 13, 2013 at 10:54 AM, William Kang  wrote:
>> Hi All,
>> I am trying to figure out a good solution for such a scenario as following.
>>
>> 1. I have a 2T file (let's call it A), filled by key/value pairs,
>> which is stored in the HDFS with the default 64M block size. In A,
>> each key is less than 1K and each value is about 20M.
>>
>> 2. Occasionally, I will run analysis by using a different type of data
>> (usually less than 10G, and let's call it B) and do look-up table
>> alike operations by using the values in A. B resides in HDFS as well.
>>
>> 3. This analysis would require loading only a small number of values
>> from A (usually less than 1000 of them) into the memory for fast
>> look-up against the data in B. The way B finds the few values in A is
>> by looking up for the key in A.
>>
>> Is there an efficient way to do this?
>>
>> I was thinking if I could identify the locality of the block that
>> contains the few values, I might be able to push the B into the few
>> nodes that contains the few values in A?  Since I only need to do this
>> occasionally, maintaining a distributed database such as HBase cant be
>> justified.
>>
>> Many thanks.
>>
>>
>> Cao
>
>
>
> --
> Harsh J


Re: Anyway to load certain Key/Value pair fast?

2013-02-12 Thread Harsh J
My reply to your questions is inline.

On Wed, Feb 13, 2013 at 10:59 AM, Harsh J  wrote:
> Please do not use the general@ lists for any user-oriented questions.
> Please redirect them to user@hadoop.apache.org lists, which is where
> the user community and questions lie.
>
> I've moved your post there and have added you on CC in case you
> haven't subscribed there. Please reply back only to the user@
> addresses. The general@ list is for Apache Hadoop project-level
> management and release oriented discussions alone.
>
> On Wed, Feb 13, 2013 at 10:54 AM, William Kang  wrote:
>> Hi All,
>> I am trying to figure out a good solution for such a scenario as following.
>>
>> 1. I have a 2T file (let's call it A), filled by key/value pairs,
>> which is stored in the HDFS with the default 64M block size. In A,
>> each key is less than 1K and each value is about 20M.
>>
>> 2. Occasionally, I will run analysis by using a different type of data
>> (usually less than 10G, and let's call it B) and do look-up table
>> alike operations by using the values in A. B resides in HDFS as well.
>>
>> 3. This analysis would require loading only a small number of values
>> from A (usually less than 1000 of them) into the memory for fast
>> look-up against the data in B. The way B finds the few values in A is
>> by looking up for the key in A.

About 1000 such rows would equal a memory expense of near 20 GB, given
the value size of A you've noted above. The solution may need to be
considered with this in mind, if the whole lookup table is to be
eventually generated into the memory and never discarded until the end
of processing.

>> Is there an efficient way to do this?

Since HBase may be too much for your simple needs, have you instead
considered using MapFiles, which allow fast key lookups at a file
level over HDFS/MR? You can have these files either highly replicated
(if their size is large), or distributed via the distributed cache in
the lookup jobs (if they are infrequently used and small sized), and
be able to use the  MapFile reader API to perform lookups of keys and
read values only when you want them.
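
A minimal sketch of such a lookup, assuming the data has already been written out as a
MapFile (sorted by key) at a hypothetical path /data/A.map, with Text keys and
BytesWritable values (those types are my assumption, not something stated in this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // A MapFile is a directory holding a sorted "data" file plus an "index" file.
        MapFile.Reader reader = new MapFile.Reader(fs, "/data/A.map", conf);
        try {
            Text key = new Text("some-key");
            BytesWritable value = new BytesWritable();
            // get() uses the index to seek near the key and reads only the matching record.
            if (reader.get(key, value) != null) {
                System.out.println("found " + key + ": " + value.getLength() + " bytes");
            }
        } finally {
            reader.close();
        }
    }
}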

>> I was thinking if I could identify the locality of the block that
>> contains the few values, I might be able to push the B into the few
>> nodes that contains the few values in A?  Since I only need to do this
>> occasionally, maintaining a distributed database such as HBase cant be
>> justified.

I agree that HBase may not be wholly suited to be run just for this
purpose (unless A's also gonna be scaling over time).

Maintaining value -> locality mapping would need to be done by you. FS
APIs provide locality info calls, and your files may be
key-partitioned enough to identify each one's range, and you can
combine the knowledge of these two to do something along these lines.
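
For reference, a small sketch of the locality call mentioned above
(FileSystem#getFileBlockLocations); the path below is only a placeholder:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path part = new Path("/data/A/part-00000");
        FileStatus status = fs.getFileStatus(part);

        // One BlockLocation per block; each lists the hosts holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset() + " length " + block.getLength()
                + " hosts " + Arrays.toString(block.getHosts()));
        }
    }
}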

Using HBase may also turn out to be "easier", but that's up to you. You
can also choose to tear it down (i.e. the services) when not needed,
by the way.

>> Many thanks.
>>
>>
>> Cao
>
>
>
> --
> Harsh J



--
Harsh J


Re: Anyway to load certain Key/Value pair fast?

2013-02-12 Thread Harsh J
Please do not use the general@ lists for any user-oriented questions.
Please redirect them to user@hadoop.apache.org lists, which is where
the user community and questions lie.

I've moved your post there and have added you on CC in case you
haven't subscribed there. Please reply back only to the user@
addresses. The general@ list is for Apache Hadoop project-level
management and release oriented discussions alone.

On Wed, Feb 13, 2013 at 10:54 AM, William Kang  wrote:
> Hi All,
> I am trying to figure out a good solution for a scenario like the following.
>
> 1. I have a 2T file (let's call it A), filled with key/value pairs,
> which is stored in HDFS with the default 64M block size. In A,
> each key is less than 1K and each value is about 20M.
>
> 2. Occasionally, I will run analysis using a different type of data
> (usually less than 10G; let's call it B) and do look-up-table-like
> operations using the values in A. B resides in HDFS as well.
>
> 3. This analysis would require loading only a small number of values
> from A (usually fewer than 1000 of them) into memory for fast
> look-up against the data in B. The way B finds the few values in A is
> by looking up the key in A.
>
> Is there an efficient way to do this?
>
> I was thinking that if I could identify the locality of the blocks that
> contain the few values, I might be able to push B to the few
> nodes that contain the few values in A. Since I only need to do this
> occasionally, maintaining a distributed database such as HBase can't be
> justified.
>
> Many thanks.
>
>
> Cao



--
Harsh J


Re: Delivery Status Notification (Failure)

2013-02-12 Thread Konstantin Boudnik
With all due respect, sir, these mailing lists have certain rules that evidently
don't coincide with your philosophy. 

Cos

On Tue, Feb 12, 2013 at 08:45PM, Raj Vishwanathan wrote:
> Arun
> 
> I don't understand your reply! Had you redirected this person to the Hive
> mailing list I would have understood.
> 
> My philosophy on any mailing list has always been: if I know the answer to a
> question, I reply; else I humbly walk away.
> 
> I got a lot of help from this group for my (mostly stupid) questions - and
> people helped me. I would like to return the favor when ( and if) I can.
> 
> My humble $0.01.:-)
> 
> And for the record- I don't know the answer to the question on microstrategy 
> :-)
> 
> 
> Raj
> 
> 
> 
> 
> 
> >
> > From: Arun C Murthy 
> >To: user@hadoop.apache.org 
> >Sent: Tuesday, February 12, 2013 6:42 PM
> >Subject: Re: Delivery Status Notification (Failure)
> > 
> >
> >Pls don't cross-post, this belongs only on the cdh lists.
> >
> >
> >On Feb 12, 2013, at 12:55 AM, samir das mohapatra wrote:
> >
> >
> >>
> >>
> >>Hi All,
> >>   I wanted to know how to connect Hive (hadoop-cdh4 distribution) with
> >>MicroStrategy.
> >>   Any help is very helpful.
> >>
> >>  Waiting for your response.
> >>
> >>Note: It is a little bit urgent; does anyone have experience with that?
> >>Thanks,
> >>samir
> >>
> >>
> >
> >--
> >Arun C. Murthy
> >Hortonworks Inc.
> >http://hortonworks.com/
> >
> > 
> >
> >
> >




Re: Delivery Status Notification (Failure)

2013-02-12 Thread Raj Vishwanathan
Arun

I don't understand your reply! Had you redirected this person to the Hive
mailing list I would have understood.

My philosophy on any mailing list has always been: if I know the answer to a
question, I reply; else I humbly walk away.


I got a lot of help from this group for my (mostly stupid) questions - and 
people helped me. I would like to return the favor when ( and if) I can.

My humble $0.01. :-)

And for the record - I don't know the answer to the question on MicroStrategy. :-)


Raj





>
> From: Arun C Murthy 
>To: user@hadoop.apache.org 
>Sent: Tuesday, February 12, 2013 6:42 PM
>Subject: Re: Delivery Status Notification (Failure)
> 
>
>Pls don't cross-post, this belongs only on the cdh lists.
>
>
>On Feb 12, 2013, at 12:55 AM, samir das mohapatra wrote:
>
>
>>
>>
>>Hi All,
>>   I wanted to know how to connect Hive (hadoop-cdh4 distribution) with
>>MicroStrategy.
>>   Any help is very helpful.
>>
>>  Waiting for your response.
>>
>>Note: It is a little bit urgent; does anyone have experience with that?
>>Thanks,
>>samir
>>
>>
>
>--
>Arun C. Murthy
>Hortonworks Inc.
>http://hortonworks.com/
>
> 
>
>
>

Re: Delivery Status Notification (Failure)

2013-02-12 Thread Arun C Murthy
Pls don't cross-post, this belongs only on the cdh lists.

On Feb 12, 2013, at 12:55 AM, samir das mohapatra wrote:

> 
> 
> Hi All,
>    I wanted to know how to connect Hive (hadoop-cdh4 distribution) with
> MicroStrategy.
>    Any help is very helpful.
> 
>   Waiting for your response.
> 
> Note: It is a little bit urgent; does anyone have experience with that?
> Thanks,
> samir
> 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: Java submit job to remote server

2013-02-12 Thread Hemanth Yamijala
Can you please include the complete stack trace and not just the root cause?
Also, have you set fs.default.name to an HDFS location like
hdfs://localhost:9000?
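
For reference, a minimal sketch of what the client-side configuration for submitting to a
remote (MRv1) cluster might look like; the host names and ports below are placeholders,
not values from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RemoteSubmitSketch {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // NameNode (HDFS) address; otherwise the client may resolve paths against the local filesystem.
        conf.set("fs.default.name", "hdfs://remote-host:9000");
        // JobTracker address; without it the job runs in-process via the LocalJobRunner.
        conf.set("mapred.job.tracker", "remote-host:9001");
        return new Job(conf);
    }
}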

Thanks
Hemanth

On Wednesday, February 13, 2013, Alex Thieme wrote:

> Thanks for the prompt reply and I'm sorry I forgot to include the
> exception. My bad. I've included it below. There certainly appears to be a
> server running on localhost:9001. At least, I was able to telnet to that
> address. While in development, I'm treating the server on localhost as the
> remote server. Moving to production, there'd obviously be a different
> remote server address configured.
>
> Root Exception stack trace:
> java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:375)
> at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> + 3 more (set debug level logging or '-Dmule.verbose.exceptions=true'
> for everything)
>
> 
>
> On Feb 12, 2013, at 4:22 PM, Nitin Pawar  wrote:
>
> conf.set("mapred.job.tracker", "localhost:9001");
>
> this means that your jobtracker is on port 9001 on localhost
>
> if you change it to the remote host and thats the port its running on then
> it should work as expected
>
> whats the exception you are getting?
>
>
> On Wed, Feb 13, 2013 at 2:41 AM, Alex Thieme  wrote:
>
> I apologize for asking what seems to be such a basic question, but I would
> use some help with submitting a job to a remote server.
>
> I have downloaded and installed hadoop locally in pseudo-distributed mode.
> I have written some Java code to submit a job.
>
> Here's the org.apache.hadoop.util.Tool
> and org.apache.hadoop.mapreduce.Mapper I've written.
>
> If I enable the conf.set("mapred.job.tracker", "localhost:9001") line,
> then I get the exception included below.
>
> If that line is disabled, then the job is completed. However, in reviewing
> the hadoop server administration page (
> http://localhost:50030/jobtracker.jsp) I don't see the job as processed
> by the server. Instead, I wonder if my Java code is simply running the
> necessary mapper Java code, bypassing the locally installed server.
>
> Thanks in advance.
>
> Alex
>
> public class OfflineDataTool extends Configured implements Tool {
>
> public int run(final String[] args) throws Exception {
> final Configuration conf = getConf();
> //conf.set("mapred.job.tracker", "localhost:9001");
>
> final Job job = new Job(conf);
> job.setJarByClass(getClass());
> job.setJobName(getClass().getName());
>
> job.setMapperClass(OfflineDataMapper.class);
>
> job.setInputFormatClass(TextInputFormat.class);
>
> job.setMapOutputKeyClass(Text.class);
> job.setMapOutputValueClass(Text.class);
>
> job.setOutputKeyClass(Text.class);
> job.setOutputValueClass(Text.class);
>
> FileInputFormat.addInputPath(job, new
> org.apache.hadoop.fs.Path(args[0]));
>
> final org.apache.hadoop.fs.Path output = new org.a
>
>


Re: Question related to Decompressor interface

2013-02-12 Thread George Datskos

Hello

Can someone share some idea what the Hadoop source code of class 
org.apache.hadoop.io.compress.BlockDecompressorStream, method 
rawReadInt() is trying to do here?


The BlockDecompressorStream class is used for block-based decompression 
(e.g. snappy).  Each chunk has a header indicating how many bytes it is. 
That header is obtained by the rawReadInt method so it is expected to 
return a non-negative value (since you can't have a negative length).



George


No standup today - I, Nicholas and Brandon are out

2013-02-12 Thread Suresh Srinivas
-- 
http://hortonworks.com/download/


Re: Java submit job to remote server

2013-02-12 Thread Alex Thieme
Thanks for the prompt reply and I'm sorry I forgot to include the exception. My 
bad. I've included it below. There certainly appears to be a server running on 
localhost:9001. At least, I was able to telnet to that address. While in 
development, I'm treating the server on localhost as the remote server. Moving 
to production, there'd obviously be a different remote server address 
configured.

Root Exception stack trace:
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at 
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
+ 3 more (set debug level logging or '-Dmule.verbose.exceptions=true' for 
everything)


On Feb 12, 2013, at 4:22 PM, Nitin Pawar  wrote:

> conf.set("mapred.job.tracker", "localhost:9001");
> 
> this means that your jobtracker is on port 9001 on localhost 
> 
> if you change it to the remote host and thats the port its running on then it 
> should work as expected 
> 
> whats the exception you are getting? 
> 
> 
> On Wed, Feb 13, 2013 at 2:41 AM, Alex Thieme  wrote:
> I apologize for asking what seems to be such a basic question, but I would 
> use some help with submitting a job to a remote server.
> 
> I have downloaded and installed hadoop locally in pseudo-distributed mode. I 
> have written some Java code to submit a job. 
> 
> Here's the org.apache.hadoop.util.Tool and org.apache.hadoop.mapreduce.Mapper 
> I've written.
> 
> If I enable the conf.set("mapred.job.tracker", "localhost:9001") line, then I 
> get the exception included below.
> 
> If that line is disabled, then the job is completed. However, in reviewing 
> the hadoop server administration page (http://localhost:50030/jobtracker.jsp) 
> I don't see the job as processed by the server. Instead, I wonder if my Java 
> code is simply running the necessary mapper Java code, bypassing the locally 
> installed server.
> 
> Thanks in advance.
> 
> Alex
> 
> public class OfflineDataTool extends Configured implements Tool {
> 
> public int run(final String[] args) throws Exception {
> final Configuration conf = getConf();
> //conf.set("mapred.job.tracker", "localhost:9001");
> 
> final Job job = new Job(conf);
> job.setJarByClass(getClass());
> job.setJobName(getClass().getName());
> 
> job.setMapperClass(OfflineDataMapper.class);
> 
> job.setInputFormatClass(TextInputFormat.class);
> 
> job.setMapOutputKeyClass(Text.class);
> job.setMapOutputValueClass(Text.class);
> 
> job.setOutputKeyClass(Text.class);
> job.setOutputValueClass(Text.class);
> 
> FileInputFormat.addInputPath(job, new 
> org.apache.hadoop.fs.Path(args[0]));
> 
> final org.apache.hadoop.fs.Path output = new 
> org.apache.hadoop.fs.Path(args[1]);
> FileSystem.get(conf).delete(output, true);
> FileOutputFormat.setOutputPath(job, output);
> 
> return job.waitForCompletion(true) ? 0 : 1;
> }
> 
> public static void main(final String[] args) {
> try {
> int result = ToolRunner.run(new Configuration(), new 
> OfflineDataTool(), new String[]{"offline/input", "offline/output"});
> log.error("result = {}", result);
> } catch (final Exception e) {
> throw new RuntimeException(e);
> }
> }
> }
> 
> public class OfflineDataMapper extends Mapper 
> {
> 
> public OfflineDataMapper() {
> super();
> }
> 
> @Override
> protected void map(final LongWritable key, final Text value, final 
> Context context) throws IOException, InterruptedException {
> final String inputString = value.toString();
> OfflineDataMapper.log.error("inputString = {}", inputString);
> }
> }
> 
> 
> 
> 
> -- 
> Nitin Pawar



Re: Java submit job to remote server

2013-02-12 Thread Nitin Pawar
conf.set("mapred.job.tracker", "localhost:9001");

this means that your jobtracker is on port 9001 on localhost

if you change it to the remote host and thats the port its running on then
it should work as expected

whats the exception you are getting?


On Wed, Feb 13, 2013 at 2:41 AM, Alex Thieme  wrote:

> I apologize for asking what seems to be such a basic question, but I would
> use some help with submitting a job to a remote server.
>
> I have downloaded and installed hadoop locally in pseudo-distributed mode.
> I have written some Java code to submit a job.
>
> Here's the org.apache.hadoop.util.Tool
> and org.apache.hadoop.mapreduce.Mapper I've written.
>
> If I enable the conf.set("mapred.job.tracker", "localhost:9001") line,
> then I get the exception included below.
>
> If that line is disabled, then the job is completed. However, in reviewing
> the hadoop server administration page (
> http://localhost:50030/jobtracker.jsp) I don't see the job as processed
> by the server. Instead, I wonder if my Java code is simply running the
> necessary mapper Java code, bypassing the locally installed server.
>
> Thanks in advance.
>
> Alex
>
> public class OfflineDataTool extends Configured implements Tool {
>
> public int run(final String[] args) throws Exception {
> final Configuration conf = getConf();
> //conf.set("mapred.job.tracker", "localhost:9001");
>
> final Job job = new Job(conf);
> job.setJarByClass(getClass());
> job.setJobName(getClass().getName());
>
> job.setMapperClass(OfflineDataMapper.class);
>
> job.setInputFormatClass(TextInputFormat.class);
>
> job.setMapOutputKeyClass(Text.class);
> job.setMapOutputValueClass(Text.class);
>
> job.setOutputKeyClass(Text.class);
> job.setOutputValueClass(Text.class);
>
> FileInputFormat.addInputPath(job, new
> org.apache.hadoop.fs.Path(args[0]));
>
> final org.apache.hadoop.fs.Path output = new
> org.apache.hadoop.fs.Path(args[1]);
> FileSystem.get(conf).delete(output, true);
> FileOutputFormat.setOutputPath(job, output);
>
> return job.waitForCompletion(true) ? 0 : 1;
> }
>
> public static void main(final String[] args) {
> try {
> int result = ToolRunner.run(new Configuration(), new
> OfflineDataTool(), new String[]{"offline/input", "offline/output"});
> log.error("result = {}", result);
> } catch (final Exception e) {
> throw new RuntimeException(e);
> }
> }
> }
>
> public class OfflineDataMapper extends Mapper Text> {
>
> public OfflineDataMapper() {
> super();
> }
>
> @Override
> protected void map(final LongWritable key, final Text value, final
> Context context) throws IOException, InterruptedException {
> final String inputString = value.toString();
> OfflineDataMapper.log.error("inputString = {}", inputString);
> }
> }
>
>


-- 
Nitin Pawar


Java submit job to remote server

2013-02-12 Thread Alex Thieme
I apologize for asking what seems to be such a basic question, but I could use 
some help with submitting a job to a remote server.

I have downloaded and installed hadoop locally in pseudo-distributed mode. I 
have written some Java code to submit a job. 

Here's the org.apache.hadoop.util.Tool and org.apache.hadoop.mapreduce.Mapper 
I've written.

If I enable the conf.set("mapred.job.tracker", "localhost:9001") line, then I 
get the exception included below.

If that line is disabled, then the job is completed. However, in reviewing the 
hadoop server administration page (http://localhost:50030/jobtracker.jsp) I 
don't see the job as processed by the server. Instead, I wonder if my Java code 
is simply running the necessary mapper Java code, bypassing the locally 
installed server.

Thanks in advance.

Alex

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OfflineDataTool extends Configured implements Tool {

    // Logger assumed (SLF4J-style, matching the {} placeholders below);
    // the original post references "log" without showing its declaration.
    private static final Logger log = LoggerFactory.getLogger(OfflineDataTool.class);

    public int run(final String[] args) throws Exception {
        final Configuration conf = getConf();
        //conf.set("mapred.job.tracker", "localhost:9001");

        final Job job = new Job(conf);
        job.setJarByClass(getClass());
        job.setJobName(getClass().getName());

        job.setMapperClass(OfflineDataMapper.class);

        job.setInputFormatClass(TextInputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new org.apache.hadoop.fs.Path(args[0]));

        final org.apache.hadoop.fs.Path output = new org.apache.hadoop.fs.Path(args[1]);
        FileSystem.get(conf).delete(output, true);
        FileOutputFormat.setOutputPath(job, output);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(final String[] args) {
        try {
            int result = ToolRunner.run(new Configuration(), new OfflineDataTool(),
                    new String[]{"offline/input", "offline/output"});
            log.error("result = {}", result);
        } catch (final Exception e) {
            throw new RuntimeException(e);
        }
    }
}

public class OfflineDataMapper extends Mapper<LongWritable, Text, Text, Text> {

    // The generic parameters <LongWritable, Text, Text, Text> and the logger are assumed;
    // the original post does not show them.
    private static final Logger log = LoggerFactory.getLogger(OfflineDataMapper.class);

    public OfflineDataMapper() {
        super();
    }

    @Override
    protected void map(final LongWritable key, final Text value, final Context context)
            throws IOException, InterruptedException {
        final String inputString = value.toString();
        OfflineDataMapper.log.error("inputString = {}", inputString);
    }
}



RE: Question related to Decompressor interface

2013-02-12 Thread java8964 java8964

Can someone share some idea what the Hadoop source code of class 
org.apache.hadoop.io.compress.BlockDecompressorStream, method rawReadInt() is 
trying to do here?
There is a comment in the code that this method shouldn't return a negative
number, but my testing file contains the following bytes from the
inputStream: 248, 19, 20, 116, which correspond to b1, b2, b3, b4.
After the 4 bytes are read from the input stream, the return result will
be a negative number here, as
(b1 << 24) = -134217728, (b2 << 16) = 1245184, (b3 << 8) = 5120, (b4 << 0) = 116.
I am not sure what the logic of this method is trying to do here; can anyone share
some idea about it?
Thanks








  private int rawReadInt() throws IOException {
int b1 = in.read();
int b2 = in.read();
int b3 = in.read();
int b4 = in.read();
if ((b1 | b2 | b3 | b4) < 0)
  throw new EOFException();
return ((b1 << 24) + (b2 << 16) + (b3 << 8) + (b4 << 0));
  }
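
For what it's worth, plugging your four bytes into that expression reproduces exactly the
negative "length" from your stack trace, which suggests the bytes at that position are
ordinary data rather than the 4-byte big-endian chunk-length header the block stream
expects (the EOF check only catches read() returning -1, not a bogus header):

int b1 = 248, b2 = 19, b3 = 20, b4 = 116;
// (b1 | b2 | b3 | b4) < 0 only detects end-of-stream, so this passes the EOF check.
int len = (b1 << 24) + (b2 << 16) + (b3 << 8) + (b4 << 0);
// 248 << 24 overflows the sign bit: -134217728 + 1245184 + 5120 + 116 = -132967308
System.out.println(len);   // prints -132967308, the length seen in the exception
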
From: java8...@hotmail.com
To: user@hadoop.apache.org
Subject: Question related to Decompressor interface
Date: Sat, 9 Feb 2013 15:49:31 -0500





HI, 
Currently I am researching about options of encrypting the data in the 
MapReduce, as we plan to use the Amazon EMR or EC2 services for our data.
I am thinking that the compression codec is good place to integrate with the 
encryption logic, and I found out there are some people having the same idea as 
mine.
I google around and found out this code:
https://github.com/geisbruch/HadoopCryptoCompressor/
It doesn't seem maintained any more, but it gave me a starting point. I 
download the source code, and try to do some tests with it.
It doesn't work out of box. There are some bugs I have to fix to make it work. 
I believe it contains 'AES' as an example algorithm.
But right now, I faced a problem when I tried to use it in my testing MapReduer 
program. Here is the stack trace I got:
2013-02-08 23:16:47,038 INFO org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor: buf length = 512, and offset = 0, length = -132967308
java.lang.IndexOutOfBoundsException
at java.nio.ByteBuffer.wrap(ByteBuffer.java:352)
at org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor.setInput(CryptoBasicDecompressor.java:100)
at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:97)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:83)
at java.io.InputStream.read(InputStream.java:82)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:458)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:645)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
I know the error is thrown out of this custom CryptoBasicDecompressor class,
but I really have questions related to the interface it implements:
Decompressor.
There is limited documentation about this interface, for example when and how the
method setInput() will be invoked. If I want to write my own Decompressor, what
do these methods mean in the interface? In the above case, I enabled some debug
information; you can see that the byte[] array passed to the setInput
method only has 512 as its length, but the third parameter (length) passed in
is a negative number: -132967308. That caused the IndexOutOfBoundsException. If
I check this method in Hadoop's GzipDecompressor class, the code would
also throw an IndexOutOfBoundsException in this case, so this is a
RuntimeException case. Why did it happen in my test case?
Here is my test case:
I have a simple log text file of about 700k. I encrypted it with the above code using
'AES'. I can encrypt and decrypt it and get my original content back. The file name
is foo.log.crypto, and this file extension is registered to invoke this
CryptoBasicDecompressor in my test Hadoop setup using the CDH4.1.2 release (hadoop
2.0). Everything works as I expected. The CryptoBasicDecompressor is invoked
when the input file is foo.log.crypto, as you can see in the above stack trace.
But I don't know why the 3rd parameter (length) in setInput() is a negative
number at runtime.

Re: Loader for small files

2013-02-12 Thread Something Something
No, Yong, I believe you misunderstood. David's explanation makes sense.  As
pointed out in my original email, everything is going to 1 Mapper.  It's
not creating multiple mappers.

BTW, the code given in my original email indeed works as expected. It
does trigger multiple mappers, but it doesn't really improve the
performance.

We believe the problem is that there's data skew. We are looking into
creating a Partitioner to solve it. Thanks.
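
In case it helps anyone following along, a bare-bones custom Partitioner looks like the
sketch below; the "salting" of hot keys is just one illustrative way to spread skewed
keys across reducers, not something prescribed in this thread:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Spreads records of known "hot" keys across reducers: the map side appends "#<n>"
// to hot keys (e.g. "hotkey#3"); everything else hashes normally.
public class SkewAwarePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String k = key.toString();
        int hashPos = k.indexOf('#');
        String base = hashPos >= 0 ? k.substring(0, hashPos) : k;
        // Salt value after '#'; assumed numeric in this sketch.
        int spread = hashPos >= 0 ? Integer.parseInt(k.substring(hashPos + 1)) : 0;
        int hash = base.hashCode() & Integer.MAX_VALUE;   // keep it non-negative
        return ((hash % numPartitions) + spread) % numPartitions;
    }
}
// Registered on the job with: job.setPartitionerClass(SkewAwarePartitioner.class);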


On Tue, Feb 12, 2013 at 7:15 AM, java8964 java8964 wrote:

>   Hi, Davie:
>
> I am not sure I understand this suggestion. Why smaller block size will
> help this performance issue?
>
> From what the original question about, it looks like the performance
> problem is due to that there are a lot of small files, and each file will
> run in its own mapper.
>
> As hadoop needs to start a lot of mappers (I think creating a mapper also
> takes time and resource), but each mapper only take small amount of data
> (maybe hundreds K or several M of data, much less than the block size),
> most of the time is wasting on creating task instance for mapper, but each
> mapper finishes very quickly.
>
> This is the reason of performance problem, right? Do I understand the
> problem wrong?
>
> If so, reducing the block size won't help in this case, right? To fix it,
> we need to merge multi-files into one mapper, so let one mapper has enough
> data to process.
>
> Unless my understanding is total wrong, I don't know how reducing block
> size will help in this case.
>
> Thanks
>
> Yong
>
> > Subject: Re: Loader for small files
> > From: davidlabarb...@localresponse.com
> > Date: Mon, 11 Feb 2013 15:38:54 -0500
> > CC: user@hadoop.apache.org
> > To: u...@pig.apache.org
>
> >
> > What process creates the data in HDFS? You should be able to set the
> block size there and avoid the copy.
> >
> > I would test the dfs.block.size on the copy and see if you get the
> mapper split you want before worrying about optimizing.
> >
> > David
> >
> > On Feb 11, 2013, at 2:10 PM, Something Something <
> mailinglist...@gmail.com> wrote:
> >
> > > David: Your suggestion would add an additional step of copying data
> from
> > > one place to another. Not bad, but not ideal. Is there no way to avoid
> > > copying of data?
> > >
> > > BTW, we have tried changing the following options to no avail :(
> > >
> > > set pig.splitCombination false;
> > >
> > > & a few other 'dfs' options given below:
> > >
> > > mapreduce.min.split.size
> > > mapreduce.max.split.size
> > >
> > > Thanks.
> > >
> > > On Mon, Feb 11, 2013 at 10:29 AM, David LaBarbera <
> > > davidlabarb...@localresponse.com> wrote:
> > >
> > >> You could store your data in smaller block sizes. Do something like
> > >> hadoop fs HADOOP_OPTS="-Ddfs.block.size=1048576
> > >> -Dfs.local.block.size=1048576" -cp /org-input /small-block-input
> > >> You might only need one of those parameters. You can verify the block
> size
> > >> with
> > >> hadoop fsck /small-block-input
> > >>
> > >> In your pig script, you'll probably need to set
> > >> pig.maxCombinedSplitSize
> > >> to something around the block size
> > >>
> > >> David
> > >>
> > >> On Feb 11, 2013, at 1:24 PM, Something Something <
> mailinglist...@gmail.com>
> > >> wrote:
> > >>
> > >>> Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not
> related to
> > >>> HBase. Adding 'hadoop' user group.
> > >>>
> > >>> On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
> > >>> mailinglist...@gmail.com> wrote:
> > >>>
> >  Hello,
> > 
> >  We are running into performance issues with Pig/Hadoop because our
> input
> >  files are small. Everything goes to only 1 Mapper. To get around
> > >> this, we
> >  are trying to use our own Loader like this:
> > 
> >  1) Extend PigStorage:
> > 
> >  public class SmallFileStorage extends PigStorage {
> > 
> >  public SmallFileStorage(String delimiter) {
> >  super(delimiter);
> >  }
> > 
> >  @Override
> >  public InputFormat getInputFormat() {
> >  return new NLineInputFormat();
> >  }
> >  }
> > 
> > 
> > 
> >  2) Add command line argument to the Pig command as follows:
> > 
> >  -Dmapreduce.input.lineinputformat.linespermap=50
> > 
> > 
> > 
> >  3) Use SmallFileStorage in the Pig script as follows:
> > 
> >  USING com.xxx.yyy.SmallFileStorage ('\t')
> > 
> > 
> >  But this doesn't seem to work. We still see that everything is
> going to
> >  one mapper. Before we spend any more time on this, I am wondering if
> > >> this
> >  is a good approach – OR – if there's a better approach? Please let
> me
> >  know. Thanks.
> > 
> > 
> > 
> > >>
> > >>
> >
>


Re: Decommissioning Nodes in Production Cluster.

2013-02-12 Thread shashwat shriparv
On Tue, Feb 12, 2013 at 11:43 PM, Robert Molina wrote:

> to do it, there should be some information he


This is the best way to remove a data node from a cluster. You have done the
right thing.



∞
Shashwat Shriparv


Re: Decommissioning Nodes in Production Cluster.

2013-02-12 Thread sudhakara st
The decommissioning process is controlled by an exclude file, which for
HDFS is set by the *dfs.hosts.exclude* property, and for MapReduce by
the *mapred.hosts.exclude* property. In most cases there is one shared file,
referred to as the exclude file. This exclude file name should be specified as the
configuration parameter *dfs.hosts.exclude* at namenode start-up (a minimal
configuration sketch follows the steps below).


To remove nodes from the cluster:
1. Add the network addresses of the nodes to be decommissioned to the 
exclude file.

2. Restart the MapReduce cluster to stop the tasktrackers on the nodes being
decommissioned.
3. Update the namenode with the new set of permitted datanodes, with this
command:
% hadoop dfsadmin -refreshNodes
4. Go to the web UI and check whether the admin state has changed to 
“Decommission
In Progress” for the datanodes being decommissioned. They will start copying
their blocks to other datanodes in the cluster.

5. When all the datanodes report their state as “Decommissioned,” then all 
the blocks
have been replicated. Shut down the decommissioned nodes.
6. Remove the nodes from the include file, and run:
% hadoop dfsadmin -refreshNodes
7. Remove the nodes from the slaves file.
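
A minimal sketch of that wiring, assuming the shared exclude file lives at
/etc/hadoop/conf/excludes (the path and hostname are just examples):

<!-- hdfs-site.xml -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/excludes</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.hosts.exclude</name>
  <value>/etc/hadoop/conf/excludes</value>
</property>

The exclude file itself is plain text, one host per line (e.g. datanode07.example.com),
and changes are picked up by running: % hadoop dfsadmin -refreshNodes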

 Decommissioning data nodes in small percentages (less than 2%) at a time doesn't
have any noticeable effect on the cluster. But it is better to pause MR jobs before
triggering decommissioning, to ensure no tasks are running on the nodes being
decommissioned.
 If only a very small percentage of tasks is running on a decommissioning node, they
can be submitted to another task tracker, but if the percentage of queued jobs is
larger than the threshold, there is a chance of job failure. Once the 'hadoop
dfsadmin -refreshNodes' command has been triggered and decommissioning has started,
you can resume the MR jobs.

*Source : The Definitive Guide [Tom White]*



On Tuesday, February 12, 2013 5:20:07 PM UTC+5:30, Dhanasekaran Anbalagan 
wrote:
>
> Hi Guys,
>
> It's recommenced do with removing one the datanode in production cluster.
> via Decommission the particular datanode. please guide me.
>  
> -Dhanasekaran,
>
> Did I learn something today? If not, I wasted it.
>  


Re: Decommissioning Nodes in Production Cluster.

2013-02-12 Thread Benjamin Kim
Hi,

I would like to add another scenario. What are the steps for removing a 
dead node when the server had a hard failure that is unrecoverable.

Thanks,
Ben

On Tuesday, February 12, 2013 7:30:57 AM UTC-8, sudhakara st wrote:
>
> The decommissioning process is controlled by an exclude file, which for 
> HDFS is set by the* dfs.hosts.exclude* property, and for MapReduce by 
> the*mapred.hosts.exclude
> * property. In most cases, there is one shared file,referred to as the 
> exclude file.This  exclude file name should be specified as a configuration 
> parameter *dfs.hosts.exclude *in the name node start up.
>
>
> To remove nodes from the cluster:
> 1. Add the network addresses of the nodes to be decommissioned to the 
> exclude file.
>
> 2. Restart the MapReduce cluster to stop the tasktrackers on the nodes 
> being
> decommissioned.
> 3. Update the namenode with the new set of permitted datanodes, with this
> command:
> % hadoop dfsadmin -refreshNodes
> 4. Go to the web UI and check whether the admin state has changed to 
> “Decommission
> In Progress” for the datanodes being decommissioned. They will start 
> copying
> their blocks to other datanodes in the cluster.
>
> 5. When all the datanodes report their state as “Decommissioned,” then all 
> the blocks
> have been replicated. Shut down the decommissioned nodes.
> 6. Remove the nodes from the include file, and run:
> % hadoop dfsadmin -refreshNodes
> 7. Remove the nodes from the slaves file.
>
>  Decommission data nodes in small percentage(less than 2%) at time don't 
> cause any effect on cluster. But it better to pause MR-Jobs before you 
> triggering Decommission to ensure  no task running in decommissioning 
> subjected nodes.
>  If very small percentage of task running in the decommissioning node it 
> can submit to other task tracker, but percentage queued jobs  larger then 
> threshold  then there is chance of job failure. Once triggering the 'hadoop 
> dfsadmin -refreshNodes' command and decommission started, you can resume 
> the MR jobs.
>
> *Source : The Definitive Guide [Tom White]*
>
>
>
> On Tuesday, February 12, 2013 5:20:07 PM UTC+5:30, Dhanasekaran Anbalagan 
> wrote:
>>
>> Hi Guys,
>>
>> It's recommenced do with removing one the datanode in production cluster.
>> via Decommission the particular datanode. please guide me.
>>  
>> -Dhanasekaran,
>>
>> Did I learn something today? If not, I wasted it.
>>  
>

Re: Decommissioning Nodes in Production Cluster.

2013-02-12 Thread Robert Molina
Hi Dhanasekaran,
I believe you are trying to ask whether it is recommended to use the
decommissioning feature to remove datanodes from your cluster; the answer
is yes.  As far as how to do it, there is some information
here http://wiki.apache.org/hadoop/FAQ that should help.

Regards,
Robert

On Tue, Feb 12, 2013 at 3:50 AM, Dhanasekaran Anbalagan
wrote:

> Hi Guys,
>
> Is it recommended to remove one of the datanodes in a production cluster
> by decommissioning that particular datanode? Please guide me.
>
> -Dhanasekaran,
>
> Did I learn something today? If not, I wasted it.
>


Re: NullPointerException in Spring Data Hadoop with CDH4

2013-02-12 Thread Christian Schneider
With the help of Costin I got a running Maven configuration.

Thank you :).

This is a pom.xml for Spring Data Hadoop and CDH4:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>com.example.main</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <properties>
        <java-version>1.7</java-version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <spring.version>3.2.0.RELEASE</spring.version>
        <spring.hadoop.version>1.0.0.BUILD-SNAPSHOT</spring.hadoop.version>
        <hadoop.version.generic>2.0.0-cdh4.1.3</hadoop.version.generic>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-core</artifactId>
            <version>${spring.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>commons-logging</groupId>
                    <artifactId>commons-logging</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-context</artifactId>
            <version>${spring.version}</version>
        </dependency>

        <dependency>
            <groupId>org.springframework.data</groupId>
            <artifactId>spring-data-hadoop</artifactId>
            <version>${spring.hadoop.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>hadoop-streaming</artifactId>
                    <groupId>org.apache.hadoop</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>hadoop-tools</artifactId>
                    <groupId>org.apache.hadoop</groupId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version.generic}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version.generic}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-tools</artifactId>
            <version>2.0.0-mr1-cdh4.1.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-streaming</artifactId>
            <version>2.0.0-mr1-cdh4.1.3</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>${java-version}</source>
                    <target>${java-version}</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <repositories>
        <repository>
            <id>spring-milestones</id>
            <url>http://repo.springsource.org/libs-milestone</url>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
        <repository>
            <id>spring-snapshot</id>
            <name>Spring Maven SNAPSHOT Repository</name>
            <url>http://repo.springframework.org/snapshot</url>
        </repository>
    </repositories>
</project>

Best Regards,
Christian.

 Original message 
> Date: Tue, 12 Feb 2013 16:56:50 +0100
> From: Christian Schneider 
> To: ironh...@gmx.com
> Subject: Fwd: NullPointerException in Spring Data Hadoop with CDH4

> -- Forwarded message --
> From: Costin Leau 
> Date: 2013/2/12
> Subject: Re: NullPointerException in Spring Data Hadoop with CDH4
> To: user@hadoop.apache.org
> 
> 
> Hi,
> 
> For Spring Data Hadoop problems, it's best to use the designated forum
> [1].
> These being said I've tried to reproduce your error but I can't - I've
> upgraded the build to CDH 4.1.3 which runs fine against the VM on the CI
> (4.1.1).
> Maybe you have some other libraries on the client classpath?
> 
> From the stacktrace, it looks like the org.apache.hadoop.mapreduce.Job
> class has no 'state' or 'info' fields...
> 
> Anyway, let's continue the discussion on the forum.
> 
> Cheers,
> [1]
> http://forum.springsource.org/forumdisplay.php?87-Hadoop
> 
> 
> On 02/12/13 2:51 PM, Christian Schneider wrote:
> 
> > Hi,
> > I try to use Spring Data Hadoop with CDH4 to write a Map Reduce Job.
> >
> > On startup, I get the following exception:
> >
> > Exception in thread "SimpleAsyncTaskExecutor-1" java.lang.ExceptionInInitializerError
> > at org.springframework.data.hadoop.mapreduce.JobExecutor$2.run(JobExecutor.ja

RE: Loader for small files

2013-02-12 Thread java8964 java8964

 Hi, David:
I am not sure I understand this suggestion. Why would a smaller block size help
with this performance issue?
From what the original question describes, it looks like the performance problem
is due to there being a lot of small files, and each file will run in its
own mapper.
Hadoop needs to start a lot of mappers (creating a mapper also takes
time and resources), but each mapper only takes a small amount of data (maybe
hundreds of KB or several MB of data, much less than the block size), so most of the
time is wasted on creating task instances for mappers, while each mapper finishes
very quickly.
This is the reason of performance problem, right? Do I understand the problem 
wrong?
If so, reducing the block size won't help in this case, right? To fix it, we
need to merge multiple files into one mapper, so that one mapper has enough data to
process.
Unless my understanding is totally wrong, I don't know how reducing the block size
will help in this case.
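
A small sketch of one way to do that merging on the MapReduce side, assuming a Hadoop
version that ships CombineTextInputFormat (in Pig the analogous knobs are
pig.splitCombination and pig.maxCombinedSplitSize); the input path is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallFilesJobSketch {
    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf);
        // Pack many small files into each split so one mapper gets a reasonable amount of data.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at roughly one block (128 MB here).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path("/small-files-input"));
        return job;
    }
}
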
Thanks
Yong

> Subject: Re: Loader for small files
> From: davidlabarb...@localresponse.com
> Date: Mon, 11 Feb 2013 15:38:54 -0500
> CC: user@hadoop.apache.org
> To: u...@pig.apache.org
> 
> What process creates the data in HDFS? You should be able to set the block 
> size there and avoid the copy.
> 
> I would test the dfs.block.size on the copy and see if you get the mapper 
> split you want before worrying about optimizing.
> 
> David
> 
> On Feb 11, 2013, at 2:10 PM, Something Something  
> wrote:
> 
> > David:  Your suggestion would add an additional step of copying data from
> > one place to another.  Not bad, but not ideal.  Is there no way to avoid
> > copying of data?
> > 
> > BTW, we have tried changing the following options to no avail :(
> > 
> > set pig.splitCombination false;
> > 
> > & a few other 'dfs' options given below:
> > 
> > mapreduce.min.split.size
> > mapreduce.max.split.size
> > 
> > Thanks.
> > 
> > On Mon, Feb 11, 2013 at 10:29 AM, David LaBarbera <
> > davidlabarb...@localresponse.com> wrote:
> > 
> >> You could store your data in smaller block sizes. Do something like
> >> hadoop fs HADOOP_OPTS="-Ddfs.block.size=1048576
> >> -Dfs.local.block.size=1048576" -cp /org-input /small-block-input
> >> You might only need one of those parameters. You can verify the block size
> >> with
> >> hadoop fsck /small-block-input
> >> 
> >> In your pig script, you'll probably need to set
> >> pig.maxCombinedSplitSize
> >> to something around the block size
> >> 
> >> David
> >> 
> >> On Feb 11, 2013, at 1:24 PM, Something Something 
> >> wrote:
> >> 
> >>> Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
> >>> HBase.  Adding 'hadoop' user group.
> >>> 
> >>> On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
> >>> mailinglist...@gmail.com> wrote:
> >>> 
>  Hello,
>  
>  We are running into performance issues with Pig/Hadoop because our input
>  files are small.  Everything goes to only 1 Mapper.  To get around
> >> this, we
>  are trying to use our own Loader like this:
>  
>  1)  Extend PigStorage:
>  
>  public class SmallFileStorage extends PigStorage {
>  
>    public SmallFileStorage(String delimiter) {
>    super(delimiter);
>    }
>  
>    @Override
>    public InputFormat getInputFormat() {
>    return new NLineInputFormat();
>    }
>  }
>  
>  
>  
>  2)  Add command line argument to the Pig command as follows:
>  
>  -Dmapreduce.input.lineinputformat.linespermap=50
>  
>  
>  
>  3)  Use SmallFileStorage in the Pig script as follows:
>  
>  USING com.xxx.yyy.SmallFileStorage ('\t')
>  
>  
>  But this doesn't seem to work.  We still see that everything is going to
>  one mapper.  Before we spend any more time on this, I am wondering if
> >> this
>  is a good approach – OR – if there's a better approach?  Please let me
>  know.  Thanks.
>  
>  
>  
> >> 
> >> 
> 
  

RE: number input files to mapreduce job

2013-02-12 Thread java8964 java8964

I don't think you can get a list of all the input files in the mapper, but what you
can get is the current file's information.
From the context object reference you can get the InputSplit, which should
give you all the information you want about the current input file.
http://hadoop.apache.org/docs/r2.0.2-alpha/api/org/apache/hadoop/mapred/FileSplit.html
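
A small sketch of that inside a mapper using the new (org.apache.hadoop.mapreduce) API;
the cast assumes the job uses a file-based input format:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InputFileAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The split of the current map task; for file-based input formats it is a FileSplit.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("processing " + split.getPath()
            + " offset " + split.getStart() + " length " + split.getLength());
    }
}
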
Date: Tue, 12 Feb 2013 12:35:16 +0530
Subject: number input files to mapreduce job
From: vikascjadha...@gmail.com
To: user@hadoop.apache.org

Hi all,
How to get the number of input files, and their paths, given to a particular MapReduce job
in a Java MapReduce program?
-- 



Thanx and Regards Vikas Jadhav

RE: Error for Pseudo-distributed Mode

2013-02-12 Thread Vijay Thakorlal
Hi,

 

Could you first try running the example:

$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar
grep input output 'dfs[a-z.]+'

 

Do you receive the same error?

 

Not sure if it's related to a lack of RAM, but the stack trace shows
errors with a "network" timeout (I realise that you're running in
pseudo-distributed mode):

 

Caused by: com.google.protobuf.ServiceException:
java.net.SocketTimeoutException: Call From localhost.localdomain/127.0.0.1
to localhost.localdomain:54113 failed on socket timeout exception:
java.net.SocketTimeoutException: 6 millis timeout while waiting for
channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/127.0.0.1:60976 remote=localhost.localdomain/127.0.0.1:54113]; For
more details see:  http://wiki.apache.org/hadoop/SocketTimeout


 

Your best bet is probably to start with checking the items mentioned in the
wiki page linked to above. While the default firewall rules (on CentOS)
usually allows pretty much all traffic on the lo interface it might be worth
temporarily turning off iptables (assuming it is on).

 

Vijay

 

 

 

From: yeyu1899 [mailto:yeyu1...@163.com] 
Sent: 12 February 2013 12:58
To: user@hadoop.apache.org
Subject: Error for Pseudo-distributed Mode

 

Hi all,

I installed a redhat_enterprise-linux-x86 in VMware Workstation, and set the
virtual machine 1G memory. 

 

Then I followed steps guided by "Installing CDH4 on a Single Linux Node in
Pseudo-distributed Mode" --
https://ccp.cloudera.com/display/CDH4DOC/Installing+CDH4+on+a+Single+Linux+N
ode+in+Pseudo-distributed+Mode.

 

When at last, I ran an example Hadoop job with the command "$ hadoop jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23
'dfs[a-z.]+'"

 

then the screen showed as follows, 

depending "AttemptID:attempt_1360528029309_0001_r_00_0 Timed out after
600 secs" and I wonder is that because my virtual machine's memory too
little~~??

 

[hadoop@localhost hadoop-mapreduce]$ hadoop jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23
'dfs[a-z]+'


13/02/11 04:30:44 WARN mapreduce.JobSubmitter: No job jar file set.  User
classes may not be found. See Job or Job#setJar(String).


13/02/11 04:30:44 INFO input.FileInputFormat: Total input paths to process :
4   

13/02/11 04:30:45 INFO mapreduce.JobSubmitter: number of splits:4


13/02/11 04:30:45 WARN conf.Configuration: mapred.output.value.class is
deprecated. Instead, use mapreduce.job.output.value.class


13/02/11 04:30:45 WARN conf.Configuration: mapreduce.combine.class is
deprecated. Instead, use mapreduce.job.combine.class


13/02/11 04:30:45 WARN conf.Configuration: mapreduce.map.class is
deprecated. Instead, use mapreduce.job.map.class


13/02/11 04:30:45 WARN conf.Configuration: mapred.job.name is deprecated.
Instead, use mapreduce.job.name

13/02/11 04:30:45 WARN conf.Configuration: mapreduce.reduce.class is
deprecated. Instead, use mapreduce.job.reduce.class


13/02/11 04:30:45 WARN conf.Configuration: mapred.input.dir is deprecated.
Instead, use mapreduce.input.fileinputformat.inputdir


13/02/11 04:30:45 WARN conf.Configuration: mapred.output.dir is deprecated.
Instead, use mapreduce.output.fileoutputformat.outputdir


13/02/11 04:30:45 WARN conf.Configuration: mapreduce.outputformat.class is
deprecated. Instead, use mapreduce.job.outputformat.class


13/02/11 04:30:45 WARN conf.Configuration: mapred.map.tasks is deprecated.
Instead, use mapreduce.job.maps   

13/02/11 04:30:45 WARN conf.Configuration: mapred.output.key.class is
deprecated. Instead, use mapreduce.job.output.key.class


13/02/11 04:30:45 WARN conf.Configuration: mapred.working.dir is deprecated.
Instead, use mapreduce.job.working.dir


13/02/11 04:30:46 INFO mapred.YARNRunner: Job jar is not present. Not adding
any jar to the list of resources.   

13/02/11 04:30:46 INFO mapred.ResourceMgrDelegate: Submitted application
application_1360528029309_0001 to ResourceManager at /0.0.0.0:8032


13/02/11 04:30:46 INFO mapreduce.Job: The url to track the job:
http://localhost.localdomain:8088/proxy/application_1360528029309_0001/


13/02/11 04:30:46 INFO mapreduce.Job: Running job: job_1360528029309_0001


13/02/11 04:31:01 INFO mapreduce.Job: Job job_1360528029309_0001 running in
uber mode : false

13/02/11 04:31:01 INFO mapreduce.Job:  map 0% reduce 0%


13/02/11 04:47:22 INFO mapreduce.Job: Task Id :
attempt_1360528029309_0001_r_00_0, Status : FAILED   

AttemptID:attempt_1360528029309_0001_r_00_0 Timed out after 600 secs


cleanup failed for container container_1360528029309_0001_01_06 :
java.lang.reflect.UndeclaredThrowableException


at
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAn
dThrowException(YarnRemoteExceptionPBImpl.java:135)


at
org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.stopC
onta

Re: NullPointerException in Spring Data Hadoop with CDH4

2013-02-12 Thread Costin Leau

Hi,

For Spring Data Hadoop problems, it's best to use the designated forum 
[1]. These being said I've tried to reproduce your error but I can't - 
I've upgraded the build to CDH 4.1.3 which runs fine against the VM on 
the CI (4.1.1).

Maybe you have some other libraries on the client classpath?

From the stacktrace, it looks like the org.apache.hadoop.mapreduce.Job 
class has no 'state' or 'info' fields...


Anyway, let's continue the discussion on the forum.

Cheers,
[1] http://forum.springsource.org/forumdisplay.php?87-Hadoop

On 02/12/13 2:51 PM, Christian Schneider wrote:

Hi,
I try to use Spring Data Hadoop with CDH4 to write a Map Reduce Job.

On startup, I get the following exception:

Exception in thread "SimpleAsyncTaskExecutor-1" 
java.lang.ExceptionInInitializerError
at 
org.springframework.data.hadoop.mapreduce.JobExecutor$2.run(JobExecutor.java:183)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NullPointerException
at 
org.springframework.util.ReflectionUtils.makeAccessible(ReflectionUtils.java:405)
at 
org.springframework.data.hadoop.mapreduce.JobUtils.<clinit>(JobUtils.java:123)
... 2 more

I guess there is a problem with my Hadoop-related dependencies. I couldn't find
any reference showing how to configure Spring Data together with CDH4, but
Costin showed that he is able to configure it:
https://build.springsource.org/browse/SPRINGDATAHADOOP-CDH4-JOB1


**Maven Setup**


<properties>
    <spring.hadoop.version>1.0.0.BUILD-SNAPSHOT</spring.hadoop.version>
    <hadoop.version>2.0.0-cdh4.1.3</hadoop.version>
</properties>

<dependencies>
    ...
    <dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-hadoop</artifactId>
        <version>${spring.hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-streaming</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-test</artifactId>
        <version>2.0.0-mr1-cdh4.1.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-tools</artifactId>
        <version>2.0.0-mr1-cdh4.1.3</version>
    </dependency>
    ...
</dependencies>
...
<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
    <repository>
        <id>spring-snapshot</id>
        <name>Spring Maven SNAPSHOT Repository</name>
        <url>http://repo.springframework.org/snapshot</url>
    </repository>
</repositories>



**Application Context**


<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:hdp="http://www.springframework.org/schema/hadoop"
    xmlns:context="http://www.springframework.org/schema/context"
    xmlns:hadoop="http://www.springframework.org/schema/hadoop"
    xsi:schemaLocation="
        http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/hadoop
        http://www.springframework.org/schema/hadoop/spring-hadoop.xsd
        http://www.springframework.org/schema/context/spring-context.xsd
        http://www.springframework.org/schema/integration
        http://www.springframework.org/schema/context
        http://www.springframework.org/schema/context/spring-context-3.1.xsd">

    <hdp:configuration>
        fs.defaultFS=${hd.fs}
    </hdp:configuration>

</beans>


**Cluster version**

Hadoop 2.0.0-cdh4.1.3


**Note:**

This small unit test runs fine with the current configuration:

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = { "classpath:/applicationContext.xml" })
public class Starter {

 @Autowired
 private Configuration configuration;

 @Test
 public void shellOps() {
 Assert.assertNotNull(this.configuration);
 FsShell fsShell = new FsShell(this.configuration);
 final Collection coll = fsShell.ls("/user");
 System.out.println(coll);
 }
}


It would be nice if someone could give me an example configuration.

Best Regards,
Christian.



--
Costin


Error for Pseudo-distributed Mode

2013-02-12 Thread yeyu1899
Hi all,
I installed Red Hat Enterprise Linux (x86) in VMware Workstation and gave the
virtual machine 1 GB of memory.


Then I followed the steps in "Installing CDH4 on a Single Linux Node in
Pseudo-distributed Mode":
https://ccp.cloudera.com/display/CDH4DOC/Installing+CDH4+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode.


Finally, I ran an example Hadoop job with the command "$ hadoop jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23
'dfs[a-z.]+'".


The screen then showed the output below, ending with
"AttemptID:attempt_1360528029309_0001_r_00_0 Timed out after 600 secs", and I
wonder whether that is because my virtual machine has too little memory?


[hadoop@localhost hadoop-mapreduce]$ hadoop jar 
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 
'dfs[a-z]+' 
  
13/02/11 04:30:44 WARN mapreduce.JobSubmitter: No job jar file set.  User 
classes may not be found. See Job or Job#setJar(String).

13/02/11 04:30:44 INFO input.FileInputFormat: Total input paths to process : 4  
 
13/02/11 04:30:45 INFO mapreduce.JobSubmitter: number of splits:4   
 
13/02/11 04:30:45 WARN conf.Configuration: mapred.output.value.class is 
deprecated. Instead, use mapreduce.job.output.value.class   
  
13/02/11 04:30:45 WARN conf.Configuration: mapreduce.combine.class is 
deprecated. Instead, use mapreduce.job.combine.class

13/02/11 04:30:45 WARN conf.Configuration: mapreduce.map.class is deprecated. 
Instead, use mapreduce.job.map.class

13/02/11 04:30:45 WARN conf.Configuration: mapred.job.name is deprecated. 
Instead, use mapreduce.job.name
13/02/11 04:30:45 WARN conf.Configuration: mapreduce.reduce.class is 
deprecated. Instead, use mapreduce.job.reduce.class 
 
13/02/11 04:30:45 WARN conf.Configuration: mapred.input.dir is deprecated. 
Instead, use mapreduce.input.fileinputformat.inputdir   
   
13/02/11 04:30:45 WARN conf.Configuration: mapred.output.dir is deprecated. 
Instead, use mapreduce.output.fileoutputformat.outputdir
  
13/02/11 04:30:45 WARN conf.Configuration: mapreduce.outputformat.class is 
deprecated. Instead, use mapreduce.job.outputformat.class   
   
13/02/11 04:30:45 WARN conf.Configuration: mapred.map.tasks is deprecated. 
Instead, use mapreduce.job.maps   
13/02/11 04:30:45 WARN conf.Configuration: mapred.output.key.class is 
deprecated. Instead, use mapreduce.job.output.key.class 

13/02/11 04:30:45 WARN conf.Configuration: mapred.working.dir is deprecated. 
Instead, use mapreduce.job.working.dir  
 
13/02/11 04:30:46 INFO mapred.YARNRunner: Job jar is not present. Not adding 
any jar to the list of resources.   
13/02/11 04:30:46 INFO mapred.ResourceMgrDelegate: Submitted application 
application_1360528029309_0001 to ResourceManager at /0.0.0.0:8032  
 
13/02/11 04:30:46 INFO mapreduce.Job: The url to track the job: 
http://localhost.localdomain:8088/proxy/application_1360528029309_0001/ 

  
13/02/11 04:30:46 INFO mapreduce.Job: Running job: job_1360528029309_0001   
 
13/02/11 04:31:01 INFO mapreduce.Job: Job job_1360528029309_0001 running in 
uber mode : false
13/02/11 04:31:01 INFO mapreduce.Job:  map 0% reduce 0% 
 
13/02/11 04:47:22 INFO mapreduce.Job: Task Id : 
attempt_1360528029309_0001_r_00_0, Status : FAILED   
AttemptID:attempt_1360528029309_0001_r_00_0 Timed out after 600 secs
 
cleanup failed for container container_1360528029309_0001_01_06 : 
java.lang.reflect.UndeclaredThrowableException  
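
For reference, the 600-second figure in that log corresponds to the default
mapreduce.task.timeout, and on a 1 GB VM the reduce task can also simply be
starved of memory. The snippet below is only an illustrative sketch of the
settings involved; the values are assumptions, not advice from this thread:

<!-- yarn-site.xml: memory YARN may hand out on this single node -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>256</value>
</property>

<!-- mapred-site.xml: per-task memory requests and the task timeout
     (600000 ms is the 600-second default seen above) -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.task.timeout</name>
  <value>600000</value>
</property>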


Re: number input files to mapreduce job

2013-02-12 Thread Mahesh Balija
Hi Vikas,

 You can get the FileSystem instance by calling
FileSystem.get(conf);
 once you have the FileSystem instance, you can use
fileSystem.listStatus(inputPath) to get the FileStatus instances.

Best,
Mahesh Balija,
Calsoft Labs.

On Tue, Feb 12, 2013 at 12:35 PM, Vikas Jadhav wrote:

> Hi all,
> How do I get the number of input files (and their paths) for a particular
> MapReduce job in a Java MapReduce program?
>
> --
> *
> *
> *
>
> Thanx and Regards*
> * Vikas Jadhav*
>


Fwd: Delivery Status Notification (Failure)

2013-02-12 Thread samir das mohapatra
Hi All,
   I wanted to know how to connect Hive (hadoop-cdh4 distribution) with
MicroStrategy.
   Any help would be appreciated.

  Waiting for your response.

Note: It is a little bit urgent; does anyone have experience with that?
Thanks,
samir