Comparison of serialized objects

Max Tue, 15 Dec 2015 04:27:46 -0800

Hi All,

I’m currently trying to compare the types String (java.lang.String) and Text (org.apache.hadoop.io.Text) with regard to the comparison of serialized objects on Spark. Either type could be used as a key, so this might be relevant in the following use cases:

a) RDD.saveAsObjectFile and SparkContext.objectFile, which support saving an RDD as serialized objects and loading it again.

b) StorageLevel.MEMORY_AND_DISK_SER as the storage level.

Hadoop provides the RawComparator as an extension of Java’s Comparator. It allows comparing objects read from a stream without deserializing them into objects. WritableComparator implements the RawComparator interface for WritableComparable classes such as Text, while there seems to be no implementation for String. [1, p. 96]


package org.apache.hadoop.io;

import java.util.Comparator;

public interface RawComparator<T> extends Comparator<T> {
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
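To make concrete what comparing "without deserializing" means, here is a minimal sketch of a byte-wise comparison with the same parameter layout as RawComparator.compare. The class name RawBytesComparison and the assumption that each key is stored as plain UTF-8 bytes with no length prefix are mine, purely for illustration; this is not code from Hadoop or Spark.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Hadoop or Spark): compares two serialized
// keys directly on their bytes, in the spirit of RawComparator.compare.
// Assumption: each key is stored as plain UTF-8 bytes without a length prefix.
final class RawBytesComparison {

    // Lexicographic comparison of two byte ranges, treating bytes as unsigned.
    // b1/b2 are the buffers, s1/s2 the start offsets, l1/l2 the lengths.
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        // All shared bytes are equal: the shorter range sorts first.
        return l1 - l2;
    }

    public static void main(String[] args) {
        byte[] apple = "apple".getBytes(StandardCharsets.UTF_8);
        byte[] banana = "banana".getBytes(StandardCharsets.UTF_8);
        // No String is rebuilt from the bytes; only the buffers are inspected.
        int result = compareBytes(apple, 0, apple.length, banana, 0, banana.length);
        System.out.println(result < 0); // prints "true"
    }
}
```

Note that unsigned byte order on UTF-8 data corresponds to Unicode code point order, which can differ from String.compareTo (UTF-16 code unit order) for characters outside the Basic Multilingual Plane.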

So the question is: how does Spark deal with this, e.g. for an RDD in use cases a) and b), when reduceByKey() is called? Do both types have to be deserialized before they can be compared? Or is there a mechanism like the RawComparator interface in Hadoop?

I have already searched the documentation, the web and even the Spark sources, but haven’t been able to find the answer yet.


I would appreciate your help.

[1] White T (2012) Hadoop: The Definitive Guide. O'Reilly, Sebastopol, CA.

Thanks

Sincerely Max
