Hello, 

On 4 Jun 2013, at 23:49, Max Lebedev <[email protected]> wrote:

> Hi. I've been trying to use JSONObjects to identify duplicates in 
> JSONStrings. 
> The duplicate strings contain the same data, but not necessarily in the same 
> order. For example the following two lines should be identified as duplicates 
> (and filtered). 
> 
> {"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}
> {"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":false}
>  
> 
Can you not use the timestamp as the key? Then you would have 
your mapper emit the following key/value pair: 

output.collect(ts, value); 

And you would have a straightforward reducer that can dedup based on the 
timestamps. 
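Outside of Hadoop, the dedup idea can be sketched in plain Java: turn each line into a canonical form (here, by sorting the key/value pairs, so the same fields in a different order compare equal) and keep only the first line seen per canonical key. This is a minimal stdlib-only sketch, assuming flat JSON objects like the two example lines (no nesting, no commas or colons inside string values); `DedupSketch`, `canonical`, and `dedup` are illustrative names, not part of the original post:

```java
import java.util.*;

public class DedupSketch {

    // Canonicalize a flat JSON object by sorting its "key":value pairs,
    // so two objects with the same fields in different order compare equal.
    // Assumes a flat object with no commas/colons inside string values.
    static String canonical(String json) {
        String body = json.trim();
        body = body.substring(1, body.length() - 1); // strip '{' and '}'
        String[] pairs = body.split(",");
        Arrays.sort(pairs);
        return "{" + String.join(",", pairs) + "}";
    }

    // Keep the first line per canonical key; later duplicates are dropped.
    static List<String> dedup(List<String> lines) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (seen.add(canonical(line))) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "{\"ts\":1368758947.291035,\"isSecure\":true,\"version\":2,\"source\":\"sdk\",\"debug\":false}",
            "{\"ts\":1368758947.291035,\"version\":2,\"source\":\"sdk\",\"isSecure\":true,\"debug\":false}");
        System.out.println(dedup(lines).size()); // prints 1: the reordered lines collapse
    }
}
```

In a MapReduce job the same thing falls out for free: emit the canonical string (or just the timestamp) as the map output key, and each reducer call sees all duplicates grouped together.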

If the above doesn't work for you, I would look at the Jackson library for 
handling JSON in Java. Its approach of mapping JSON to Java beans is clean 
from a code point of view and comes with lots of nice features. 
http://stackoverflow.com/a/2255893

P.S. In your code you are using the older MapReduce API; I would look at 
using the newer API in the org.apache.hadoop.mapreduce package.

Mischa
> This is the code: 
> 
> class DupFilter {
> 
>     public static class Map extends MapReduceBase implements
>             Mapper<LongWritable, Text, JSONObject, Text> {
> 
>         public void map(LongWritable key, Text value,
>                 OutputCollector<JSONObject, Text> output, Reporter reporter)
>                 throws IOException {
>             JSONObject jo = null;
>             try {
>                 jo = new JSONObject(value.toString());
>             } catch (JSONException e) {
>                 e.printStackTrace();
>             }
>             output.collect(jo, value);
>         }
>     }
> 
>     public static class Reduce extends MapReduceBase implements
>             Reducer<JSONObject, Text, NullWritable, Text> {
> 
>         public void reduce(JSONObject jo, Iterator<Text> lines,
>                 OutputCollector<NullWritable, Text> output, Reporter reporter)
>                 throws IOException {
>             output.collect(null, lines.next());
>         }
>     }
> 
>     public static void main(String[] args) throws Exception {
>         JobConf conf = new JobConf(DupFilter.class);
>         conf.setOutputKeyClass(JSONObject.class);
>         conf.setOutputValueClass(Text.class);
>         conf.setMapperClass(Map.class);
>         conf.setReducerClass(Reduce.class);
>         conf.setInputFormat(TextInputFormat.class);
>         conf.setOutputFormat(TextOutputFormat.class);
>         FileInputFormat.setInputPaths(conf, new Path(args[0]));
>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>         JobClient.runJob(conf);
>     }
> }
> 
> I get the following error:
>  
> 
> java.lang.ClassCastException: class org.json.JSONObject
>         at java.lang.Class.asSubclass(Class.java:3027)
>         at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:795)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:817)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:383)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
> 
> It looks like it has something to do with conf.setOutputKeyClass(). Am I 
> doing something wrong here? 
> 
> 
> 
> Thanks, 
> 
> Max Lebedev
> 

_______________________________
Mischa Tuffield PhD
http://mmt.me.uk/
@mischat




