Hello,
On 4 Jun 2013, at 23:49, Max Lebedev <[email protected]> wrote:
> Hi. I've been trying to use JSONObjects to identify duplicates in
> JSONStrings.
> The duplicate strings contain the same data, but not necessarily in the same
> order. For example the following two lines should be identified as duplicates
> (and filtered).
>
> {"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}
> {"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":false}
>
Can you not use the timestamp as the key? Then you would have your mapper emit the following kv:
output.collect(ts, value);
and a straightforward reducer that can dedup based on the timestamps.
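To make the idea concrete outside Hadoop: the sketch below pulls the "ts" value out of a JSON line and keeps only the first line seen per timestamp, which is exactly what the reducer would do once ts is the map output key. This is only a toy, it assumes every line carries a numeric "ts" field, and `extractTs`/`dedupByTs` are hypothetical helper names of mine, not Hadoop or org.json API (in the real job you would parse with a proper JSON library):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TsDedup {

    // Naive extraction of the "ts" value by string scanning; a stand-in
    // for real JSON parsing, good enough for flat one-line records.
    static String extractTs(String json) {
        int i = json.indexOf("\"ts\":");
        if (i < 0) return null;
        int start = i + 5;
        int end = start;
        while (end < json.length() && "},".indexOf(json.charAt(end)) < 0) end++;
        return json.substring(start, end).trim();
    }

    // Keep the first line seen for each timestamp, the way the reducer
    // would after the shuffle has grouped lines by ts.
    static List<String> dedupByTs(List<String> lines) {
        Map<String, String> firstSeen = new LinkedHashMap<>();
        for (String line : lines) {
            firstSeen.putIfAbsent(extractTs(line), line);
        }
        return new ArrayList<>(firstSeen.values());
    }
}
```

In the actual job the shuffle does the grouping for you: emit the ts as the map output key (e.g. as a Text or DoubleWritable) and the reducer sees all duplicates together, so it just emits `lines.next()`.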
If the above doesn't work for you, I would look at the Jackson library for handling JSON in Java. Its approach of mapping JSON to Java beans is clean from a code point of view and comes with lots of nice features.
http://stackoverflow.com/a/2255893
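The bean style Jackson encourages can be illustrated without the library: declare a plain class for the fields and populate it from the JSON. The `Event` class and its field set are my invention from your two example lines, and the hand-rolled `fromJson` is only a toy stand-in for what Jackson does robustly with `new ObjectMapper().readValue(json, Event.class)`:

```java
public class Event {
    public double ts;
    public boolean isSecure;
    public int version;
    public String source;
    public boolean debug;

    // Toy parser for flat, one-level JSON objects. With Jackson on the
    // classpath you would delete this and call ObjectMapper.readValue.
    static Event fromJson(String json) {
        Event e = new Event();
        for (String pair : json.replaceAll("[{}\"]", "").split(",")) {
            String[] kv = pair.split(":", 2);
            String k = kv[0].trim(), v = kv[1].trim();
            switch (k) {
                case "ts":       e.ts = Double.parseDouble(v); break;
                case "isSecure": e.isSecure = Boolean.parseBoolean(v); break;
                case "version":  e.version = Integer.parseInt(v); break;
                case "source":   e.source = v; break;
                case "debug":    e.debug = Boolean.parseBoolean(v); break;
            }
        }
        return e;
    }
}
```

Once each line is a bean, "same data in a different order" stops being a problem: you compare (or key on) the fields themselves rather than the raw strings.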
P.S. In your code you are using the older MapReduce API; I would look at using the newer API in the org.apache.hadoop.mapreduce package.
Mischa
> This is the code:
>
> class DupFilter {
>
>     public static class Map extends MapReduceBase implements
>             Mapper<LongWritable, Text, JSONObject, Text> {
>
>         public void map(LongWritable key, Text value,
>                 OutputCollector<JSONObject, Text> output, Reporter reporter)
>                 throws IOException {
>             JSONObject jo = null;
>             try {
>                 jo = new JSONObject(value.toString());
>             } catch (JSONException e) {
>                 e.printStackTrace();
>             }
>             output.collect(jo, value);
>         }
>     }
>
>     public static class Reduce extends MapReduceBase implements
>             Reducer<JSONObject, Text, NullWritable, Text> {
>
>         public void reduce(JSONObject jo, Iterator<Text> lines,
>                 OutputCollector<NullWritable, Text> output, Reporter reporter)
>                 throws IOException {
>             output.collect(null, lines.next());
>         }
>     }
>
>     public static void main(String[] args) throws Exception {
>         JobConf conf = new JobConf(DupFilter.class);
>         conf.setOutputKeyClass(JSONObject.class);
>         conf.setOutputValueClass(Text.class);
>         conf.setMapperClass(Map.class);
>         conf.setReducerClass(Reduce.class);
>         conf.setInputFormat(TextInputFormat.class);
>         conf.setOutputFormat(TextOutputFormat.class);
>         FileInputFormat.setInputPaths(conf, new Path(args[0]));
>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>         JobClient.runJob(conf);
>     }
> }
>
>
> I get the following error:
>
> java.lang.ClassCastException: class org.json.JSONObject
>     at java.lang.Class.asSubclass(Class.java:3027)
>     at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:795)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:817)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:383)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
>
> It looks like it has something to do with conf.setOutputKeyClass(). Am I
> doing something wrong here?
>
> Thanks,
>
> Max Lebedev
>
_______________________________
Mischa Tuffield PhD
http://mmt.me.uk/
@mischat