Hi Guys,
I have a serious problem regarding 'None' in RDDs (PySpark).
Take an example of a transformation that produces None:
leftOuterJoin(self, other, numPartitions=None)
Perform a left outer join of self and other. For RDDs of (K, V) and (K, W)
pairs, it returns a dataset of (K, (V, W)) pairs with all pairs of elements
for each key.
Because it is a left outer join, the result RDD also contains None in *(K, (V,
None))* pairs for keys with no match. That None becomes a problem in
subsequent transformations: every transformation needs to check for None,
otherwise an error will be thrown.
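As a sketch of the None problem (using a plain Python list to stand in for the joined RDD's contents, and a hypothetical default value "UNKNOWN"), one common workaround is to fill the None in with a default right after the join, e.g. via mapValues in PySpark:

```python
# Simulated output of rdd1.leftOuterJoin(rdd2): unmatched keys carry None.
joined = [("m1", (5, "Alice")), ("m2", (3, None))]

def fill_default(pair, default="UNKNOWN"):
    """Replace the None from an unmatched left-outer-join key with a default."""
    v, w = pair
    return (v, w if w is not None else default)

# In PySpark this would be applied as rdd.mapValues(fill_default);
# here we use a plain list comprehension to show the effect.
filled = [(k, fill_default(vw)) for k, vw in joined]
print(filled)  # [('m1', (5, 'Alice')), ('m2', (3, 'UNKNOWN'))]
```

After this step, downstream transformations no longer need their own None checks.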
Another example, about loading a CSV file:

MOV = sc.textFile('/movie.csv')
MOV = MOV.map(lambda strLine: strLine.split(",")) \
         .map(lambda data: {"MOVIE_ID": int(data[0]),
                            "MOVIE_NAME": str(data[1]),
                            "MOVIE_DIRECTOR": str(data[2])})
The CSV file is expected to have 3 fields separated by commas. However, some
dirty rows may have only 2 fields; then "MOVIE_DIRECTOR": str(data[2]) is
dangerous (IndexError: list index out of range).
In a common programming language it is normal to check for None or an illegal
format. For big data programming, however, it is tedious to check for None or
illegal data in every transformation, even though illegal data is expected.
Apache Pig has special handling for nulls, which looks better: no null check
is needed, and illegal data is taken care of as well.
http://pig.apache.org/docs/r0.12.1/basic.html#nulls
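A Pig-like "bad records just disappear" effect can be sketched in Spark with flatMap: return an empty list for a bad record and a one-element list for a good one, so no explicit None checks are needed downstream. (The safe_parse helper and sample rows below are my own illustration, assuming the same movie CSV as above.)

```python
def safe_parse(line):
    """Return [record] for a good row, [] for a bad one (so it is dropped)."""
    data = line.split(",")
    try:
        return [{"MOVIE_ID": int(data[0]), "MOVIE_NAME": data[1]}]
    except (IndexError, ValueError):
        return []  # bad record vanishes, like a Pig null being filtered out

# PySpark equivalent (assumed): sc.textFile('/movie.csv').flatMap(safe_parse)
rows = ["7,Brazil", "oops"]
parsed = [d for line in rows for d in safe_parse(line)]
print(parsed)  # [{'MOVIE_ID': 7, 'MOVIE_NAME': 'Brazil'}]
```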
For Spark, what is the best practice to handle None and illegal data as in
the above examples?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/None-in-RDD-tp12167.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.