Hi All,

I’m currently trying to compare the types String (java.lang.String) and Text (org.apache.hadoop.io.Text) with respect to the comparison of serialized objects in Spark. Either type would be used as the key, so this might be relevant in the following use cases:

a) RDD.saveAsObjectFile and SparkContext.objectFile, which save an RDD as serialized objects and load it again.

b) StorageLevel.MEMORY_AND_DISK_SER as the storage level.

Hadoop provides the RawComparator interface as an extension of Java’s Comparator. It allows objects read from a stream to be compared without deserializing them first. WritableComparator implements the RawComparator interface for WritableComparable classes such as Text, while there seems to be no implementation for String. [1, p. 96]

package org.apache.hadoop.io;

import java.util.Comparator;

public interface RawComparator<T> extends Comparator<T> {

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

}
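
To make the idea of comparing without deserializing concrete, here is a minimal sketch that uses only the JDK. It is not Hadoop's implementation: the class RawCompareSketch and the helpers serialize and compareRaw are hypothetical names, and the "serialized form" is assumed to be plain UTF-8 bytes with no length prefix. The method only mirrors the parameter shape of RawComparator.compare.

```java
import java.nio.charset.StandardCharsets;

public class RawCompareSketch {

    // Hypothetical stand-in for a serialized key: the string's raw UTF-8 bytes.
    public static byte[] serialize(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Same parameter shape as RawComparator.compare: compares two byte ranges
    // lexicographically, treating each byte as unsigned, without building Strings.
    public static int compareRaw(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        // All shared bytes are equal, so the shorter range sorts first.
        return l1 - l2;
    }

    public static void main(String[] args) {
        byte[] x = serialize("apple");
        byte[] y = serialize("banana");
        // Sign of the raw byte comparison.
        System.out.println(Integer.signum(compareRaw(x, 0, x.length, y, 0, y.length)));
        // Sign of the ordinary comparison on deserialized objects.
        System.out.println(Integer.signum("apple".compareTo("banana")));
    }
}
```

For ASCII keys like these, both comparisons give the same sign. For strings containing characters outside the Basic Multilingual Plane, the UTF-8 byte order and String.compareTo can disagree, so the sketch should not be read as a drop-in replacement for String ordering.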


So the question is: how does Spark deal with that, e.g. for an RDD in use cases a and b, when reduceByKey() is called? Do both types have to be deserialized before they can be compared? Or is there a mechanism like Hadoop's RawComparator interface? I have already searched the documentation, the web and even the Spark sources, but haven’t been able to find the answer yet.

I would appreciate your help.

[1] White T (2012) Hadoop: The Definitive Guide. O'Reilly, Sebastopol, CA.

Thanks

Sincerely Max
