Building Mahout Issue
I am trying to build Mahout version 0.9 and make it compatible with Hadoop 2.4.0. I unpacked mahout-distribution-0.9-src.tar.gz and then ran the following command:

mvn -Phadoop-0.23 clean install -Dhadoop.version=2.4.0 -DskipTests

Then I get the following error:

[ERROR] Failed to execute goal on project mahout-integration: Could not resolve dependencies for project org.apache.mahout:mahout-integration:jar:0.9: Could not find artifact org.apache.hadoop:hadoop-core:jar:2.4.0 in central (http://repo.maven.apache.org/maven2) -> [Help 1]

Any ideas what is causing this problem and how to fix it? Any advice would be much appreciated.

Thanks,
Andrew Botelho
Intern, EMC Corporation Education Services
Email: andrew.bote...@emc.com
Getting HBaseStorage() to work in Pig
I am trying to use the function HBaseStorage() in my Pig code in order to load an HBase table into Pig. When I run my code, I get this error:

ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/WritableByteArrayComparable

I believe the PIG_CLASSPATH needs to be extended to include the classpath for loading HBase, but I am not sure how to do this. I've tried several export commands in the Unix shell to change the PIG_CLASSPATH, but nothing seems to be working. Any advice would be much appreciated.

Thanks,
Andrew Botelho
RE: Getting HBaseStorage() to work in Pig
Could you explain what is going on here:

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar

I'm not a Unix expert by any means. How can I use this to enable HBaseStorage() in Pig?

Thanks,
Andrew

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, August 23, 2013 4:50 PM
To: common-u...@hadoop.apache.org
Subject: Re: Getting HBaseStorage() to work in Pig

Please look at the example in 15.1.1 under http://hbase.apache.org/book.html#tools
RE: DistributedCache incompatibility issue between 1.0 and 2.0
I have been using Job.addCacheFile() to cache files in the distributed cache. It has been working for me on Hadoop 2.0.5:

public void addCacheFile(URI uri)
Add a file to be localized
Parameters: uri - The uri of the cache to be localized

-----Original Message-----
From: Edward J. Yoon [mailto:edwardy...@apache.org]
Sent: Friday, July 19, 2013 8:03 AM
To: user@hadoop.apache.org
Subject: DistributedCache incompatibility issue between 1.0 and 2.0

Hi,

I wonder why the setLocalFiles and addLocalFiles methods have been removed, and what should I use instead of them?

--
Best Regards, Edward J. Yoon @eddieyoon
Make job output be a comma separated file
What is the best way to make the output of my Hadoop job comma-separated? Basically, how can I have the keys and values be separated by a comma? My keys are Text objects, and some of them have actual commas within the field. Will this matter?

Thanks,
Andrew
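Commas embedded in the keys will matter to anything that later splits the output on commas. One standard remedy is RFC 4180-style quoting of fields. A minimal stdlib sketch of that quoting rule (CsvQuote and csvQuote are illustrative names, not part of any Hadoop API):

```java
public class CsvQuote {
    // Quote a field if it contains a comma, quote, or newline;
    // embedded quotes are doubled, per RFC 4180.
    static String csvQuote(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        System.out.println(csvQuote("plain"));       // plain
        System.out.println(csvQuote("has,comma"));   // "has,comma"
        System.out.println(csvQuote("say \"hi\""));  // "say ""hi"""
    }
}
```

A consumer that parses the file with an RFC 4180-aware CSV reader will then recover the original fields, commas and all.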
RE: Make job output be a comma separated file
I believe that mapred.textoutputformat.separator is from the old API; in the new API the property is mapreduce.output.textoutputformat.separator. So I ran this code in my driver class, but it is making no difference:

Configuration conf = new Configuration();
conf.set("mapreduce.output.textoutputformat.separator", ",");

Am I setting the property right?

Thanks,
Andrew

From: Ravi Kiran [mailto:ravikiranmag...@gmail.com]
Sent: Thursday, July 18, 2013 1:45 PM
To: user@hadoop.apache.org
Subject: Re: Make job output be a comma separated file

Hi Andrew,

You can change the default keyValueSeparator of the output format from a \t to a , by setting the property mapred.textoutputformat.separator in the Configuration of the job. You will face difficulties if this output is an input to another job, as you wouldn't know what part of the row data is the key and what is the value.

Regards,
Ravi M.
RE: Make job output be a comma separated file
I am using the latest version of Hadoop, Hadoop 2.0.5.

From: Ravi Kiran [mailto:ravikiranmag...@gmail.com]
Sent: Thursday, July 18, 2013 2:16 PM
To: user@hadoop.apache.org
Subject: Re: Make job output be a comma separated file

Hi Andrew,

Can you please tell me which version of Hadoop you use? I noticed that in Hadoop 1.0.4, the class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat is looking for mapred.textoutputformat.separator.

Regards,
Ravi M.
RE: Make job output be a comma separated file
I am doing exactly what this website says: http://cloudfront.blogspot.com/2012/06/how-to-change-default-key-value.html

But it isn't changing anything.

Andrew
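One common cause worth checking, since the driver snippets elsewhere in these threads call Job.getInstance() with no arguments: Job.getInstance(Configuration) makes a copy of the Configuration, so the property must be set on a Configuration that is then passed in when the Job is created. A hedged driver fragment showing the ordering (not runnable here without Hadoop on the classpath; the diagnosis is an assumption, since the original posts do not show the Job creation line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Set the separator *before* creating the Job: Job.getInstance(conf)
// copies the Configuration, so later conf.set() calls are not seen
// by the job, and a Job created via Job.getInstance() never sees
// this conf at all.
conf.set("mapreduce.output.textoutputformat.separator", ",");
Job job = Job.getInstance(conf);
```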
RE: New Distributed Cache
So in my driver code, I try to store the file in the cache with this line of code:

job.addCacheFile(new URI("file location"));

Then in my Mapper code, I do this to try to access the cached file:

URI[] localPaths = context.getCacheFiles();
File f = new File(localPaths[0]);

However, I get a NullPointerException when I do that in the Mapper code. Any suggestions?

Andrew

From: Shahab Yunus [mailto:shahab.yu...@gmail.com]
Sent: Wednesday, July 10, 2013 9:43 PM
To: user@hadoop.apache.org
Subject: Re: New Distributed Cache

Also, once you have the array of URIs after calling getCacheFiles, you can iterate over them using the File class or Path (http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/Path.html#Path(java.net.URI))

Regards,
Shahab

On Wed, Jul 10, 2013 at 5:08 PM, Omkar Joshi <ojo...@hortonworks.com> wrote:

Did you try JobContext.getCacheFiles()?

Thanks,
Omkar Joshi
Hortonworks Inc. http://www.hortonworks.com
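If context.getCacheFiles() returns null (for example, because no cache files were registered on the job that actually ran), indexing the array throws exactly this kind of NullPointerException, so the array is worth null-checking first. Once a non-null file: URI is in hand, it can be opened with java.io.File and read line by line. A stdlib-only sketch of that read pattern, with a temporary file standing in for a localized cache entry (ReadCachedFile and readLines are illustrative names):

```java
import java.io.File;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ReadCachedFile {
    // Read every line of the file behind a file: URI, as a mapper
    // would after null-checking context.getCacheFiles().
    static List<String> readLines(URI uri) throws Exception {
        File f = new File(uri); // requires a file: scheme URI
        return Files.readAllLines(f.toPath());
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a distributed-cache entry.
        Path tmp = Files.createTempFile("cache", ".txt");
        Files.write(tmp, List.of("line1", "line2"));

        List<String> lines = readLines(tmp.toUri());
        System.out.println(lines); // prints [line1, line2]
    }
}
```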
RE: CompositeInputFormat
Sorry, I should've specified that I need an example of CompositeInputFormat that uses the new API. The example linked below uses old-API objects like JobConf. Any known examples of CompositeInputFormat using the new API?

Thanks in advance,
Andrew

From: Jay Vyas [mailto:jayunit...@gmail.com]
Sent: Thursday, July 11, 2013 5:10 PM
To: common-u...@hadoop.apache.org
Subject: Re: CompositeInputFormat

Map-side joins will use the CompositeInputFormat. They will only really be worth doing if one data set is small and the other is large. This is a good example: http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/

The trick is to google for CompositeInputFormat.compose() :)

On Thu, Jul 11, 2013 at 5:02 PM, Botelho, Andrew <andrew.bote...@emc.com> wrote:

Hi,

I want to perform a JOIN on two sets of data with Hadoop. I read that the class CompositeInputFormat can be used to perform joins on data, but I can't find any examples of how to do it. Could someone help me out? It would be much appreciated. :)

Thanks in advance,
Andrew

--
Jay Vyas
http://jayunit100.blogspot.com
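Independent of old versus new API, the join CompositeInputFormat performs is a sorted merge: both inputs must be sorted and identically partitioned on the key, and the framework advances a cursor through each in step. A plain-Java sketch of that merge idea, assuming unique keys on each side (MergeJoin and the record format are illustrative, not Hadoop API):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MergeJoin {
    // Inner join of two (key, value) lists, both sorted by key and
    // with unique keys per side: a two-pointer merge.
    static List<String> join(List<Map.Entry<String, String>> left,
                             List<Map.Entry<String, String>> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i).getKey().compareTo(right.get(j).getKey());
            if (cmp < 0) i++;        // left key smaller: advance left
            else if (cmp > 0) j++;   // right key smaller: advance right
            else {                   // matching keys: emit joined record
                out.add(left.get(i).getKey() + ":" +
                        left.get(i).getValue() + "," + right.get(j).getValue());
                i++; j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> a = List.of(
            new SimpleEntry<>("k1", "a1"), new SimpleEntry<>("k2", "a2"));
        List<Map.Entry<String, String>> b = List.of(
            new SimpleEntry<>("k2", "b2"), new SimpleEntry<>("k3", "b3"));
        System.out.println(join(a, b)); // prints [k2:a2,b2]
    }
}
```

This is why the sorted/partitioned precondition matters: the merge never looks backwards, so unsorted inputs would silently drop matches.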
RE: Distributed Cache
Ok, using job.addCacheFile() seems to compile correctly. However, how do I then access the cached file in my Mapper code? Is there a method that will look for any files in the cache?

Thanks,
Andrew

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, July 09, 2013 6:08 PM
To: user@hadoop.apache.org
Subject: Re: Distributed Cache

You should use Job#addCacheFile()

Cheers
New Distributed Cache
Hi,

I am trying to store a file in the Distributed Cache during my Hadoop job. In the driver class, I tell the job to store the file in the cache with this code:

Job job = Job.getInstance();
job.addCacheFile(new URI("file name"));

That all compiles fine. In the Mapper code, I try accessing the cached file with this method:

Path[] localPaths = context.getLocalCacheFiles();

However, I am getting warnings that this method is deprecated. Does anyone know the newest way to access cached files in the Mapper code? (I am using Hadoop 2.0.5)

Thanks in advance,
Andrew
RE: Distributed Cache
Ok, so JobContext.getCacheFiles() returns URI[]. Let's say I only stored one folder in the cache that has several .txt files within it. How do I use the returned URI to read each line of those .txt files? Basically, how do I read my cached file(s) after I call JobContext.getCacheFiles()?

Thanks,
Andrew

From: Omkar Joshi [mailto:ojo...@hortonworks.com]
Sent: Wednesday, July 10, 2013 5:15 PM
To: user@hadoop.apache.org
Subject: Re: Distributed Cache

Try JobContext.getCacheFiles()

Thanks,
Omkar Joshi
Hortonworks Inc. http://www.hortonworks.com
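When the single cached entry is a folder, one workable pattern is to resolve its URI to a local path and walk it, reading every .txt file found. A stdlib sketch of that pattern, with a temporary directory standing in for the localized cache folder (ReadTxtDir and readAllTxt are illustrative names):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class ReadTxtDir {
    // Collect every line of every .txt file under dir, in sorted file order.
    static List<String> readAllTxt(Path dir) throws Exception {
        List<String> lines = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.filter(p -> p.toString().endsWith(".txt"))
                 .sorted()
                 .forEach(p -> {
                     try {
                         lines.addAll(Files.readAllLines(p));
                     } catch (Exception e) {
                         throw new RuntimeException(e);
                     }
                 });
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the localized cache folder.
        Path dir = Files.createTempDirectory("cachedir");
        Files.write(dir.resolve("a.txt"), List.of("first"));
        Files.write(dir.resolve("b.txt"), List.of("second"));
        System.out.println(readAllTxt(dir)); // prints [first, second]
    }
}
```

In a mapper, the Path would come from the URI returned by getCacheFiles() (for example via java.nio.file.Paths.get(uri) when the URI has a file: scheme), rather than from a temporary directory.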
Distributed Cache
Hi,

I was wondering if I can still use the DistributedCache class in the latest release of Hadoop (version 2.0.5). In my driver class, I use this code to try to add a file to the distributed cache:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("file path in HDFS"), conf);
Job job = Job.getInstance();
...

However, I keep getting warnings that the method addCacheFile() is deprecated. Is there a more current way to add files to the distributed cache?

Thanks in advance,
Andrew