I want to split a single big RDD into several smaller RDDs without reading too much from disk (HDFS). What is the best way to do that?
This is my current code:

subclass_pairs = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['subClassOf']).map(lambda (s, p, o): (s, o))
subproperty_pairs = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['subPropertyOf']).map(lambda (s, p, o): (s, o)).cache()
domain_pairs = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['domain']).map(lambda (s, p, o): (s, o))
range_pairs = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['range']).map(lambda (s, p, o): (s, o))
total_triples = instance_triples.union(schema_triples)
type_pairs = total_triples.filter(lambda (s, p, o): p == PROPERTIES['type']).map(lambda (s, p, o): (s, o)).distinct().cache()
triples = total_triples.filter(lambda (s, p, o): isUserDefined(p)).map(lambda (s, p, o): (s, p, o)).distinct().cache()
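One possible direction (a minimal, untested sketch, not a definitive answer): persist the parent RDDs once before the repeated filters, so each filter scans memory (spilling to local disk if needed) instead of re-reading the source from HDFS. The pairs_for helper below is hypothetical; schema_triples, instance_triples, PROPERTIES, and isUserDefined are assumed to be defined as in the code above.

from pyspark import StorageLevel

# Persist the parents once; all later filters reuse the cached data
# instead of triggering another HDFS scan per derived RDD.
schema_triples = schema_triples.persist(StorageLevel.MEMORY_AND_DISK)
total_triples = instance_triples.union(schema_triples).persist(StorageLevel.MEMORY_AND_DISK)

def pairs_for(predicate):
    # Select (subject, object) pairs for one predicate from the cached schema RDD.
    return schema_triples.filter(lambda t: t[1] == predicate).map(lambda t: (t[0], t[2]))

subclass_pairs = pairs_for(PROPERTIES['subClassOf'])
subproperty_pairs = pairs_for(PROPERTIES['subPropertyOf']).cache()
domain_pairs = pairs_for(PROPERTIES['domain'])
range_pairs = pairs_for(PROPERTIES['range'])

type_pairs = (total_triples
              .filter(lambda t: t[1] == PROPERTIES['type'])
              .map(lambda t: (t[0], t[2]))
              .distinct()
              .cache())

triples = (total_triples
           .filter(lambda t: isUserDefined(t[1]))
           .distinct()
           .cache())

The derived RDDs are still lazy; the cached parents are only materialized on the first action, after which the remaining filters run against the persisted copy.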