Re: Need FileName with Content

2014-03-20 Thread Stanley Shi
Change your mapper to be something like this:

public static class TokenizerMapper extends
    Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // The input split carries the source file; prepend its name to each token.
    Path pp = ((FileSplit) context.getInputSplit()).getPath();
    StringTokenizer itr = new StringTokenizer(value.toString());
    log.info("map on string: " + new String(value.getBytes()));
    while (itr.hasMoreTokens()) {
      word.set(pp.getName() + " " + itr.nextToken());
      context.write(word, one);
    }
  }
}

Note: add your filtering code here;

and then when running the command, use your input path as the param;

Regards,
*Stanley Shi,*
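
For reference, the TokenizerMapper above uses the newer
org.apache.hadoop.mapreduce API, so it needs a matching driver. A minimal
sketch follows (not from the thread: the class name FileWordCount is
illustrative, and IntSumReducer stands in for the usual summing reducer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FileWordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "file word count");
    job.setJarByClass(FileWordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // assumed summing reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));  // e.g. /user/hduser/INPUT
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}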




Re: Need FileName with Content

2014-03-20 Thread Stanley Shi
Just reviewed the code again: you are not really using map-reduce. You are
reading all the files in one map process, which is not how a normal
map-reduce job works.


Regards,
*Stanley Shi,*




Re: Need FileName with Content

2014-03-20 Thread Felix Chern
I've written two blog posts on how to get directory context in a Hadoop mapper.

http://www.idryman.org/blog/2014/01/26/capture-directory-context-in-hadoop-mapper/
http://www.idryman.org/blog/2014/01/27/capture-path-info-in-hadoop-inputformat-class/

Cheers,
Felix
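
In the spirit of the first post above, the core trick is that each input
split knows which file it came from, so the enclosing directory is one
getParent() call away. A rough sketch, assuming the new
org.apache.hadoop.mapreduce API (the posts themselves have the details;
DirContextMapper is an illustrative name):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class DirContextMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private final Text dir = new Text();

  @Override
  protected void setup(Context context) {
    // The split carries the path of the file being read.
    Path path = ((FileSplit) context.getInputSplit()).getPath();
    dir.set(path.getParent().getName());  // directory context
  }

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Key every record by the directory it was read from.
    context.write(dir, one);
  }
}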


Re: Need FileName with Content

2014-03-19 Thread Ranjini Rathinam
Hi,

If we give the below code,
===
word.set("filename"+""+tokenizer.nextToken());
output.collect(word,one);
==

The output is wrong, because it shows:

filename   word   occurrence
vinitha   java   4
vinitha oracle  3
sony   java   4
sony  oracle  3


Here vinitha does not have the word oracle. Similarly, sony does not have
the word java. The file name is getting merged across all words.

I need the output as given below:

 filename   word   occurrence

vinitha   java   4
vinitha C++3
sony   ETL 4
sony  oracle  3


 I need the fileName along with the words in that particular file only. No
merging should happen.

Please help me out with this issue.

Please help.

Thanks in advance.

Ranjini





Re: Need FileName with Content

2014-03-19 Thread Stanley Shi
You want to do a word count for each file, but the code gives you a word
count for all the files, right?

=
word.set(tokenizer.nextToken());
  output.collect(word, one);
==
change it to:
word.set("filename"+""+tokenizer.nextToken());
output.collect(word,one);




Regards,
*Stanley Shi,*
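
Note that "filename" in the snippet above is presumably a placeholder for
the actual file name. With the old mapred API that the original code uses,
the name can be read from the input split. A sketch, assuming
TextInputFormat (whose splits are FileSplits); the class name is
illustrative:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FileNameWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // The real file name comes from the split, not from a literal string.
    String fileName = ((FileSplit) reporter.getInputSplit()).getPath().getName();
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(fileName + " " + tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}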





Need FileName with Content

2014-03-19 Thread Ranjini Rathinam
Hi,

I have a folder named INPUT.

Inside INPUT there are 5 resumes.

hduser@localhost:~/Ranjini$ hadoop fs -ls /user/hduser/INPUT
Found 5 items
-rw-r--r--   1 hduser supergroup   5438 2014-03-18 15:20
/user/hduser/INPUT/Rakesh Chowdary_Microstrategy.txt
-rw-r--r--   1 hduser supergroup   6022 2014-03-18 15:22
/user/hduser/INPUT/Ramarao Devineni_Microstrategy.txt
-rw-r--r--   1 hduser supergroup   3517 2014-03-18 15:21
/user/hduser/INPUT/vinitha.txt
-rw-r--r--   1 hduser supergroup   3517 2014-03-18 15:21
/user/hduser/INPUT/sony.txt
-rw-r--r--   1 hduser supergroup   3517 2014-03-18 15:21
/user/hduser/INPUT/ravi.txt
hduser@localhost:~/Ranjini$

I have to process the folder and its content.

I need output as below:

filename   word   occurrence
vinitha   java   4
sony  oracle  3



But I am not getting the filename. As the input file contents are merged,
the file name does not come out correct.


Please help to fix this issue. I have given my code below.


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.mapred.lib.*;

 public class WordCount {
public static class Map extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
 private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value, OutputCollector<Text,
IntWritable> output, Reporter reporter) throws IOException {
   FSDataInputStream fs=null;
   FileSystem hdfs = null;
   String line = value.toString();
 int i=0,k=0;
  try{
   Configuration configuration = new Configuration();
  configuration.set("fs.default.name", "hdfs://localhost:4440/");

   Path srcPath = new Path("/user/hduser/INPUT/");

   hdfs = FileSystem.get(configuration);
   FileStatus[] status = hdfs.listStatus(srcPath);
   fs=hdfs.open(srcPath);
   BufferedReader br=new BufferedReader(new
InputStreamReader(hdfs.open(srcPath)));

String[] splited = line.split("\\s+");
for( i=0;i<splited.length;i++)
 {
 String sp[]=splited[i].split(",");
 for( k=0;k<sp.length;k++)
 {
   if(!sp[k].isEmpty()){
StringTokenizer tokenizer = new StringTokenizer(sp[k]);
if((sp[k].equalsIgnoreCase("C"))){
while (tokenizer.hasMoreTokens()) {
  word.set(tokenizer.nextToken());
  output.collect(word, one);
}
}
if((sp[k].equalsIgnoreCase("JAVA"))){
while (tokenizer.hasMoreTokens()) {
  word.set(tokenizer.nextToken());
  output.collect(word, one);
}
}
  }
}
}
 } catch (IOException e) {
e.printStackTrace();
 }
}
}
public static class Reduce extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws
IOException {
int sum = 0;
while (values.hasNext()) {
  sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
  }
}
public static void main(String[] args) throws Exception {


  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);
}
 }



Please help

Thanks in advance.

Ranjini