Building Mahout Issue

2014-06-02 Thread Botelho, Andrew
I am trying to build Mahout version 0.9 and make it compatible with Hadoop 
2.4.0.
I unpacked mahout-distribution-0.9-src.tar.gz and then ran the following 
command:

mvn -Phadoop-0.23 clean install -Dhadoop.version=2.4.0 -DskipTests

Then I get the following error:

[ERROR] Failed to execute goal on project mahout-integration: Could not resolve 
dependencies for project org.apache.mahout:mahout-integration:jar:0.9: Could 
not find artifact org.apache.hadoop:hadoop-core:jar:2.4.0 in central 
(http://repo.maven.apache.org/maven2) - [Help 1]

Any ideas what is causing this problem and how to fix it?  Any advice would be 
much appreciated.

Thanks,

Andrew Botelho
Intern
EMC Corporation
Education Services
Email: andrew.bote...@emc.com


Getting HBaseStorage() to work in Pig

2013-08-23 Thread Botelho, Andrew
I am trying to use the function HBaseStorage() in my Pig code in order to load 
an HBase table into Pig.

When I run my code, I get this error:

ERROR 2998: Unhandled internal error. 
org/apache/hadoop/hbase/filter/WritableByteArrayComparable


I believe the PIG_CLASSPATH needs to be extended to include the classpath for 
loading HBase, but I am not sure how to do this.  I've tried several export 
commands in the unix shell to change the PIG_CLASSPATH, but nothing seems to be 
working.

Any advice would be much appreciated.

Thanks,

Andrew Botelho



RE: Getting HBaseStorage() to work in Pig

2013-08-23 Thread Botelho, Andrew
Could you explain what is going on here:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop 
jar ${HBASE_HOME}/hbase-VERSION.jar

I'm not a Unix expert by any means.
How can I use this to enable HBaseStorage() in Pig?

Thanks,

Andrew

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, August 23, 2013 4:50 PM
To: common-u...@hadoop.apache.org
Subject: Re: Getting HBaseStorage() to work in Pig

Please look at the example in 15.1.1 under 
http://hbase.apache.org/book.html#tools

On Fri, Aug 23, 2013 at 1:41 PM, Botelho, Andrew 
andrew.bote...@emc.com wrote:
I am trying to use the function HBaseStorage() in my Pig code in order to load 
an HBase table into Pig.

When I run my code, I get this error:

ERROR 2998: Unhandled internal error. 
org/apache/hadoop/hbase/filter/WritableByteArrayComparable


I believe the PIG_CLASSPATH needs to be extended to include the classpath for 
loading HBase, but I am not sure how to do this.  I've tried several export 
commands in the unix shell to change the PIG_CLASSPATH, but nothing seems to be 
working.

Any advice would be much appreciated.

Thanks,

Andrew Botelho




RE: DistributedCache incompatibility issue between 1.0 and 2.0

2013-07-19 Thread Botelho, Andrew
I have been using Job.addCacheFile() to cache files in the distributed cache.  
It has been working for me on Hadoop 2.0.5:

public void addCacheFile(URI uri)
Add a file to be localized
Parameters:
uri - The uri of the cache to be localized
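
For reference, a minimal driver sketch of that usage (just a sketch, assuming Hadoop 2.x 
and the new mapreduce API; the class name and the HDFS path are made-up placeholders): 
the file is registered on the Job itself rather than through DistributedCache or the 
removed setLocalFiles/addLocalFiles methods.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AddCacheFileSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "distributed cache sketch");
    job.setJarByClass(AddCacheFileSketch.class);
    // Replaces DistributedCache.addCacheFile(uri, conf); tasks can later
    // retrieve the registered URIs via context.getCacheFiles().
    job.addCacheFile(new URI("hdfs:///user/edward/lookup.txt"));
    // ... configure mapper/reducer, input and output paths, then submit ...
  }
}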

-Original Message-
From: Edward J. Yoon [mailto:edwardy...@apache.org] 
Sent: Friday, July 19, 2013 8:03 AM
To: user@hadoop.apache.org
Subject: DistributedCache incompatibility issue between 1.0 and 2.0

Hi,

I wonder why the setLocalFiles and addLocalFiles methods have been removed. What 
should I use instead of them?

--
Best Regards, Edward J. Yoon
@eddieyoon



Make job output be a comma separated file

2013-07-18 Thread Botelho, Andrew
What is the best way to make the output of my Hadoop job comma-separated?  
Basically, how can I have the keys and values separated by a comma?
My keys are Text objects, and some of them have actual commas within the field. 
Will this matter?

Thanks,

Andrew


RE: Make job output be a comma separated file

2013-07-18 Thread Botelho, Andrew
I believe that mapred.textoutputformat.separator is from the old API, but now 
the field is mapreduce.output.textoutputformat.separator in the new API.
So I ran this code in my driver class, but it is making no difference:

Configuration conf = new Configuration();
conf.set("mapreduce.output.textoutputformat.separator", ",");

Am I setting the property correctly?

Thanks,
Andrew
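
For reference, a minimal driver sketch (just a sketch, assuming Hadoop 2.x and the 
new mapreduce API; the class name and the use of the default identity mapper/reducer 
are placeholders) that sets the separator before the Job is created. One common 
pitfall is setting the property on a Configuration object the Job never sees, since 
Job.getInstance(conf) takes its own copy of the configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CsvOutputDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Set the separator before creating the Job so the Job's copy of the
    // configuration picks it up (mapred.textoutputformat.separator is the
    // old-API name of the same setting).
    conf.set("mapreduce.output.textoutputformat.separator", ",");

    Job job = Job.getInstance(conf, "comma-separated output sketch");
    job.setJarByClass(CsvOutputDriver.class);
    // job.setMapperClass(...) and job.setReducerClass(...) would go here.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Equivalently, calling job.getConfiguration().set(...) after the Job has been created 
should have the same effect, since that mutates the Job's own copy of the configuration.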

From: Ravi Kiran [mailto:ravikiranmag...@gmail.com]
Sent: Thursday, July 18, 2013 1:45 PM
To: user@hadoop.apache.org
Subject: Re: Make job output be a comma separated file

Hi Andrew,

You can change the default keyValueSeparator of the output format from 
a \t to a , by
setting the property mapred.textoutputformat.separator on the 
Configuration of the job.

   You will face difficulties if this output is the input to another job, as you 
wouldn't know which part of the row data is the key and which is the value.

Regards
Ravi M.

On Thu, Jul 18, 2013 at 10:46 PM, Botelho, Andrew 
andrew.bote...@emc.com wrote:
What is the best way to make the output of my Hadoop job be comma separated?  
Basically, how can I have the keys and values be separated by a comma?
My keys are Text objects, and some of them have actual commas within the field. 
 Will this matter?

Thanks,

Andrew



RE: Make job output be a comma separated file

2013-07-18 Thread Botelho, Andrew
I am using the latest version of Hadoop - Hadoop 2.0.5.

From: Ravi Kiran [mailto:ravikiranmag...@gmail.com]
Sent: Thursday, July 18, 2013 2:16 PM
To: user@hadoop.apache.org
Subject: Re: Make job output be a comma separated file

Hi Andrew,

  Can you please tell me which version of Hadoop you are using? I noticed that in 
Hadoop 1.0.4, the class 
org.apache.hadoop.mapreduce.lib.output.TextOutputFormat looks for 
mapred.textoutputformat.separator.
Regards
Ravi M

On Thu, Jul 18, 2013 at 11:32 PM, Botelho, Andrew 
andrew.bote...@emc.com wrote:
I believe that mapred.textoutputformat.separator is from the old API, but now 
the field is mapreduce.output.textoutputformat.separator in the new API.
So I ran this code in my driver class, but it is making no difference:

Configuration conf = new Configuration();
conf.set("mapreduce.output.textoutputformat.separator", ",");

Am I changing the field right?

Thanks,
Andrew

From: Ravi Kiran 
[mailto:ravikiranmag...@gmail.com]
Sent: Thursday, July 18, 2013 1:45 PM
To: user@hadoop.apache.org
Subject: Re: Make job output be a comma separated file

Hi Andrew,

You can change the default keyValueSeparator of the output format from 
a \t to a , by
setting the property mapred.textoutputformat.separator on the 
Configuration of the job.

   You will face difficulties if this output is the input to another job, as you 
wouldn't know which part of the row data is the key and which is the value.

Regards
Ravi M.

On Thu, Jul 18, 2013 at 10:46 PM, Botelho, Andrew 
andrew.bote...@emc.com wrote:
What is the best way to make the output of my Hadoop job be comma separated?  
Basically, how can I have the keys and values be separated by a comma?
My keys are Text objects, and some of them have actual commas within the field. 
 Will this matter?

Thanks,

Andrew




RE: Make job output be a comma separated file

2013-07-18 Thread Botelho, Andrew
I am doing exactly what this website describes: 
http://cloudfront.blogspot.com/2012/06/how-to-change-default-key-value.html
But it isn't changing anything.

Andrew

From: Ravi Kiran [mailto:ravikiranmag...@gmail.com]
Sent: Thursday, July 18, 2013 2:16 PM
To: user@hadoop.apache.org
Subject: Re: Make job output be a comma separated file

Hi Andrew,

  Can you please tell me which version of Hadoop you are using? I noticed that in 
Hadoop 1.0.4, the class 
org.apache.hadoop.mapreduce.lib.output.TextOutputFormat looks for 
mapred.textoutputformat.separator.
Regards
Ravi M

On Thu, Jul 18, 2013 at 11:32 PM, Botelho, Andrew 
andrew.bote...@emc.com wrote:
I believe that mapred.textoutputformat.separator is from the old API, but now 
the field is mapreduce.output.textoutputformat.separator in the new API.
So I ran this code in my driver class, but it is making no difference:

Configuration conf = new Configuration();
conf.set("mapreduce.output.textoutputformat.separator", ",");

Am I changing the field right?

Thanks,
Andrew

From: Ravi Kiran 
[mailto:ravikiranmag...@gmail.com]
Sent: Thursday, July 18, 2013 1:45 PM
To: user@hadoop.apache.org
Subject: Re: Make job output be a comma separated file

Hi Andrew,

You can change the default keyValueSeparator of the output format from 
a \t to a , by
setting the property mapred.textoutputformat.separator on the 
Configuration of the job.

   You will face difficulties if this output is the input to another job, as you 
wouldn't know which part of the row data is the key and which is the value.

Regards
Ravi M.

On Thu, Jul 18, 2013 at 10:46 PM, Botelho, Andrew 
andrew.bote...@emc.com wrote:
What is the best way to make the output of my Hadoop job be comma separated?  
Basically, how can I have the keys and values be separated by a comma?
My keys are Text objects, and some of them have actual commas within the field. 
 Will this matter?

Thanks,

Andrew




RE: New Distributed Cache

2013-07-11 Thread Botelho, Andrew
So in my driver code, I try to store the file in the cache with this line of 
code:

job.addCacheFile(new URI("file location"));

Then in my Mapper code, I do this to try and access the cached file:

URI[] localPaths = context.getCacheFiles();
File f = new File(localPaths[0]);

However, I get a NullPointerException when I do that in the Mapper code.

Any suggestions?

Andrew
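
For what it's worth, here is a hedged sketch of one way this is commonly wired up 
(assuming Hadoop 2.x; the class name and the path hdfs:///user/andrew/lookup.txt are 
hypothetical). The mapper's setup() checks getCacheFiles() for null before using the 
result, and it opens the cached file through the HDFS FileSystem API rather than 
java.io.File, since the returned URI normally points into HDFS rather than at a local 
file:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheFileMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    URI[] cacheFiles = context.getCacheFiles();
    if (cacheFiles == null || cacheFiles.length == 0) {
      // Guards against the NullPointerException: nothing was registered on
      // the job, or the job was built from a different Configuration.
      throw new IOException("No cache files registered with this job");
    }
    // Open the cached file through the FileSystem API instead of java.io.File.
    Path cached = new Path(cacheFiles[0]);
    FileSystem fs = FileSystem.get(cacheFiles[0], context.getConfiguration());
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(cached)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // ... load the lookup data into memory here ...
      }
    }
  }

  // Driver-side registration, for completeness (hypothetical path):
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cache file sketch");
    job.setJarByClass(CacheFileMapper.class);
    job.addCacheFile(new URI("hdfs:///user/andrew/lookup.txt"));
    // ... set mapper, input/output formats and paths, then submit ...
  }
}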

From: Shahab Yunus [mailto:shahab.yu...@gmail.com]
Sent: Wednesday, July 10, 2013 9:43 PM
To: user@hadoop.apache.org
Subject: Re: New Distributed Cache

Also, once you have the array of URIs after calling getCacheFiles, you can 
iterate over them using the File class or Path 
(http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/Path.html#Path(java.net.URI))

Regards,
Shahab

On Wed, Jul 10, 2013 at 5:08 PM, Omkar Joshi 
ojo...@hortonworks.com wrote:
did you try JobContext.getCacheFiles() ?


Thanks,
Omkar Joshi
Hortonworks Inc. - http://www.hortonworks.com

On Wed, Jul 10, 2013 at 10:15 AM, Botelho, Andrew 
andrew.bote...@emc.com wrote:
Hi,

I am trying to store a file in the Distributed Cache during my Hadoop job.
In the driver class, I tell the job to store the file in the cache with this 
code:

Job job = Job.getInstance();
job.addCacheFile(new URI("file name"));

That all compiles fine.  In the Mapper code, I try accessing the cached file 
with this method:

Path[] localPaths = context.getLocalCacheFiles();

However, I am getting warnings that this method is deprecated.
Does anyone know the newest way to access cached files in the Mapper code? (I 
am using Hadoop 2.0.5)

Thanks in advance,

Andrew




RE: CompositeInputFormat

2013-07-11 Thread Botelho, Andrew
Sorry I should've specified that I need an example of CompositeInputFormat that 
uses the new API.
The example linked below uses old API objects like JobConf.

Any known examples of CompositeInputFormat using the new API?

Thanks in advance,

Andrew
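
For reference, here is a hedged sketch of what a new-API version might look like 
(assumptions: Hadoop 2.x, where a mapreduce-API CompositeInputFormat lives in 
org.apache.hadoop.mapreduce.lib.join; the paths /data/left, /data/right and 
/data/joined are made up; and both inputs must already be sorted by the join key and 
partitioned identically for the map-side join to work):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.lib.join.TupleWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapSideJoinSketch {

  // The mapper receives the shared key plus a TupleWritable holding the
  // joined value from each input.
  public static class JoinMapper extends Mapper<Text, TupleWritable, Text, Text> {
    @Override
    protected void map(Text key, TupleWritable values, Context context)
        throws IOException, InterruptedException {
      context.write(key, new Text(values.get(0) + "," + values.get(1)));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Build the inner-join expression over the two (made-up) input directories.
    conf.set(CompositeInputFormat.JOIN_EXPR,
        CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
            new Path("/data/left"), new Path("/data/right")));

    Job job = Job.getInstance(conf, "map-side join sketch");
    job.setJarByClass(MapSideJoinSketch.class);
    job.setInputFormatClass(CompositeInputFormat.class);
    job.setMapperClass(JoinMapper.class);
    job.setNumReduceTasks(0);           // the join happens on the map side
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path("/data/joined"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}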

From: Jay Vyas [mailto:jayunit...@gmail.com]
Sent: Thursday, July 11, 2013 5:10 PM
To: common-u...@hadoop.apache.org
Subject: Re: CompositeInputFormat

Map-side joins will use the CompositeInputFormat.  They will only really be 
worth doing if one data set is small and the other is large.
This is a good example: 
http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/
The trick is to Google for CompositeInputFormat.compose() :)

On Thu, Jul 11, 2013 at 5:02 PM, Botelho, Andrew 
andrew.bote...@emc.com wrote:
Hi,

I want to perform a JOIN on two sets of data with Hadoop.  I read that the 
class CompositeInputFormat can be used to perform joins on data, but I can't 
find any examples of how to do it.
Could someone help me out? It would be much appreciated. :)

Thanks in advance,

Andrew



--
Jay Vyas
http://jayunit100.blogspot.com


RE: Distributed Cache

2013-07-10 Thread Botelho, Andrew
Ok using job.addCacheFile() seems to compile correctly.
However, how do I then access the cached file in my Mapper code?  Is there a 
method that will look for any files in the cache?

Thanks,

Andrew

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, July 09, 2013 6:08 PM
To: user@hadoop.apache.org
Subject: Re: Distributed Cache

You should use Job#addCacheFile()

Cheers
On Tue, Jul 9, 2013 at 3:02 PM, Botelho, Andrew 
andrew.bote...@emc.com wrote:
Hi,

I was wondering if I can still use the DistributedCache class in the latest 
release of Hadoop (Version 2.0.5).
In my driver class, I use this code to try and add a file to the distributed 
cache:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("file path in HDFS"), conf);
Job job = Job.getInstance();
...

However, I keep getting warnings that the method addCacheFile() is deprecated.
Is there a more current way to add files to the distributed cache?

Thanks in advance,

Andrew



New Distributed Cache

2013-07-10 Thread Botelho, Andrew
Hi,

I am trying to store a file in the Distributed Cache during my Hadoop job.
In the driver class, I tell the job to store the file in the cache with this 
code:

Job job = Job.getInstance();
job.addCacheFile(new URI("file name"));

That all compiles fine.  In the Mapper code, I try accessing the cached file 
with this method:

Path[] localPaths = context.getLocalCacheFiles();

However, I am getting warnings that this method is deprecated.
Does anyone know the newest way to access cached files in the Mapper code? (I 
am using Hadoop 2.0.5)

Thanks in advance,

Andrew


RE: Distributed Cache

2013-07-10 Thread Botelho, Andrew
Ok, so JobContext.getCacheFiles() returns URI[].
Let's say I only stored one folder in the cache that has several .txt files 
within it.  How do I use that returned URI to read each line of those .txt 
files?

Basically, how do I read my cached file(s) after I call 
JobContext.getCacheFiles()?

Thanks,

Andrew
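
Here is one possible sketch (assumptions: Hadoop 2.x, the cached entry is a directory 
on HDFS, and the class and method names are made up): list the .txt files under the 
cached directory with the FileSystem API and read each one line by line, typically 
from a mapper's setup() using context.getCacheFiles()[0] and context.getConfiguration().

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CachedDirReader {

  // Reads every line of every .txt file under the cached directory URI.
  public static void readCachedTextFiles(URI cachedDir, Configuration conf)
      throws IOException {
    FileSystem fs = FileSystem.get(cachedDir, conf);
    for (FileStatus status : fs.listStatus(new Path(cachedDir))) {
      if (!status.getPath().getName().endsWith(".txt")) {
        continue; // skip anything in the directory that is not a .txt file
      }
      try (BufferedReader reader =
               new BufferedReader(new InputStreamReader(fs.open(status.getPath())))) {
        String line;
        while ((line = reader.readLine()) != null) {
          // ... process one line of the cached file here ...
        }
      }
    }
  }
}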

From: Omkar Joshi [mailto:ojo...@hortonworks.com]
Sent: Wednesday, July 10, 2013 5:15 PM
To: user@hadoop.apache.org
Subject: Re: Distributed Cache

try JobContext.getCacheFiles()

Thanks,
Omkar Joshi
Hortonworks Inc. - http://www.hortonworks.com

On Wed, Jul 10, 2013 at 6:31 AM, Botelho, Andrew 
andrew.bote...@emc.com wrote:
Ok using job.addCacheFile() seems to compile correctly.
However, how do I then access the cached file in my Mapper code?  Is there a 
method that will look for any files in the cache?

Thanks,

Andrew

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, July 09, 2013 6:08 PM
To: user@hadoop.apache.org
Subject: Re: Distributed Cache

You should use Job#addCacheFile()

Cheers
On Tue, Jul 9, 2013 at 3:02 PM, Botelho, Andrew 
andrew.bote...@emc.com wrote:
Hi,

I was wondering if I can still use the DistributedCache class in the latest 
release of Hadoop (Version 2.0.5).
In my driver class, I use this code to try and add a file to the distributed 
cache:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("file path in HDFS"), conf);
Job job = Job.getInstance();
...

However, I keep getting warnings that the method addCacheFile() is deprecated.
Is there a more current way to add files to the distributed cache?

Thanks in advance,

Andrew




Distributed Cache

2013-07-09 Thread Botelho, Andrew
Hi,

I was wondering if I can still use the DistributedCache class in the latest 
release of Hadoop (Version 2.0.5).
In my driver class, I use this code to try and add a file to the distributed 
cache:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("file path in HDFS"), conf);
Job job = Job.getInstance();
...

However, I keep getting warnings that the method addCacheFile() is deprecated.
Is there a more current way to add files to the distributed cache?

Thanks in advance,

Andrew