I want to split a single big RDD into several smaller RDDs without reading too much
from disk (HDFS). What is the best way to do that?

This is my current code:
    subclass_pairs    = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['subClassOf']).map(lambda (s, p, o): (s, o))
    subproperty_pairs = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['subPropertyOf']).map(lambda (s, p, o): (s, o)).cache()
    domain_pairs      = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['domain']).map(lambda (s, p, o): (s, o))
    range_pairs       = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['range']).map(lambda (s, p, o): (s, o))
    total_triples     = instance_triples.union(schema_triples)
    type_pairs        = total_triples.filter(lambda (s, p, o): p == PROPERTIES['type']).map(lambda (s, p, o): (s, o)).distinct().cache()
    triples           = total_triples.filter(lambda (s, p, o): isUserDefined(p)).map(lambda (s, p, o): (s, p, o)).distinct().cache()
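
Is persisting the parent RDD once the right way to avoid re-reading the input for every derived RDD? Below is a minimal sketch of what I mean; the HDFS path, the parse_triple helper, and the PROPERTIES values are placeholders for illustration, only the filter-then-map structure matches my code above:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="split-big-rdd")

    # Placeholder for the PROPERTIES dict used in my code above (RDFS URIs).
    PROPERTIES = {
        'subClassOf': 'http://www.w3.org/2000/01/rdf-schema#subClassOf',
        'domain':     'http://www.w3.org/2000/01/rdf-schema#domain',
    }

    # Placeholder parser: split a whitespace-separated line into an (s, p, o) tuple.
    def parse_triple(line):
        s, p, o = line.split(None, 2)
        return (s, p, o)

    # Persist the parent RDD once; MEMORY_AND_DISK keeps partitions in memory and
    # spills to local disk, so each derived RDD reuses these blocks instead of
    # re-scanning the HDFS source.
    schema_triples = sc.textFile("hdfs:///path/to/schema_triples") \
                       .map(parse_triple) \
                       .persist(StorageLevel.MEMORY_AND_DISK)

    # Each small RDD is then just a filter + map over the persisted parent,
    # so only the first action triggers a full scan of the input files.
    subclass_pairs = schema_triples \
        .filter(lambda t: t[1] == PROPERTIES['subClassOf']) \
        .map(lambda t: (t[0], t[2]))
    domain_pairs = schema_triples \
        .filter(lambda t: t[1] == PROPERTIES['domain']) \
        .map(lambda t: (t[0], t[2]))

If there is a better pattern than persist-plus-multiple-filters (for example a single pass that writes each subset out), I'd be happy to switch to it.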


