Hi All,
I'm currently trying to compare the types String (java.lang.String) and
Text (org.apache.hadoop.io.Text) with regard to how serialized objects
are compared in Spark. Either type would be used as the key type, so
this might be relevant in the following use cases:
a) RDD.saveAsObjectFile and SparkContext.objectFile, which support
saving an RDD as serialized objects and loading it again.
b) StorageLevel.MEMORY_AND_DISK_SER as the storage level.
Hadoop provides the RawComparator as an extension of Java's Comparator.
It allows objects read from a stream to be compared without
deserializing them into objects. WritableComparator implements the
RawComparator interface for WritableComparable classes, such as Text,
while there seems to be no implementation for String. [1, p. 96]
package org.apache.hadoop.io;

import java.util.Comparator;

public interface RawComparator<T> extends Comparator<T> {
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
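
To make clear what I mean by comparing without deserializing, here is a
minimal sketch in plain Java. It is not Hadoop's or Spark's code: the
class name RawCompareSketch and the method compareBytes are made up for
illustration, and I assume the keys are stored as bare UTF-8 bytes
without any length prefix. The method only mirrors the parameter layout
of RawComparator.compare (buffer, start offset, length).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Hadoop or Spark.
public class RawCompareSketch {

    // Lexicographic comparison of two byte ranges, treating each byte as
    // unsigned. Assumption: each range holds one key as bare UTF-8 bytes.
    public static int compareBytes(byte[] b1, int s1, int l1,
                                   byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        // All shared bytes are equal, so the shorter range sorts first.
        return l1 - l2;
    }

    public static void main(String[] args) {
        byte[] apple = "apple".getBytes(StandardCharsets.UTF_8);
        byte[] banana = "banana".getBytes(StandardCharsets.UTF_8);

        // Compare the serialized bytes directly; no String is rebuilt here.
        int raw = compareBytes(apple, 0, apple.length, banana, 0, banana.length);

        // Compare the deserialized objects for reference.
        int deserialized = "apple".compareTo("banana");

        // For these ASCII keys both comparisons agree in sign.
        System.out.println(Integer.signum(raw) == Integer.signum(deserialized));
    }
}
```

Note that the byte-wise order only matches String.compareTo for some
inputs (ASCII, for example). String.compareTo works on UTF-16 code
units, so the two orders can differ for characters outside the Basic
Multilingual Plane.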
So the question is: how does Spark deal with this, e.g. for an RDD in
use cases a) and b), when reduceByKey() is called?
Do both types have to be deserialized before they can be compared? Or is
there a mechanism like Hadoop's RawComparator interface?
I have already searched the documentation, the web and even the Spark
sources, but haven't been able to find the answer yet.
I would appreciate your help.
[1] White T (2012) Hadoop: The Definitive Guide. O'Reilly, Sebastopol, CA.
Thanks
Sincerely Max