Regarding only your last point, you could always split backwards to avoid
having to worry about updated indices (i.e., split the highest index column
first). But if you're additionally worried about efficiency, a combined
approach could make more sense to avoid making two full passes on the data.

Otherwise, I don't see anything particularly amiss here, but I'm no expert.


On Wed, Nov 13, 2013 at 3:00 PM, Philip Ogren <philip.og...@oracle.com>wrote:

> Hi Spark community,
>
> I learned a lot the last time I posted some elementary Spark code here.
>  So, I thought I would do it again.  Someone politely tell me offline if
> this is noise or unfair use of the list!  I acknowledge that this borders
> on asking Scala 101 questions....
>
> I have an RDD[List[String]] corresponding to columns of data and I want to
> split one of the columns using some arbitrary function and return an RDD
> updated with the new columns.  Here is the code I came up with.
>
> def splitColumn(columnsRDD: RDD[List[String]], columnIndex: Int,
> numSplits: Int, splitFx: String => List[String]): RDD[List[String]] = {
>
>     def insertColumns(columns: List[String]) : List[String] = {
>       val split = columns.splitAt(columnIndex)
>       val left = split._1
>       val splitColumn = split._2.head
>       val splitColumns = splitFx(splitColumn).padTo(numSplits,
> "").take(numSplits)
>       val right = split._2.tail
>       left ++ splitColumns ++ right
>     }
>
>     columnsRDD.map(columns => insertColumns(columns))
>   }
>
> Here is a simple test that demonstrates the behavior:
>
>       val spark = new SparkContext("local", "test spark")
>       val testStrings = List(List("1.2", "a b"), List("3.4", "c d e"),
> List("5.6", "f"))
>       var testRDD: RDD[List[String]] = spark.parallelize(testStrings)
>       testRDD = splitColumn(testRDD, 0, 2, _.split("\\.").toList)
>       testRDD = splitColumn(testRDD, 2, 2, _.split(" ").toList) //Line 5
>       val actualStrings = testRDD.collect.toList
>       assertEquals(4, actualStrings(0).length)
>       assertEquals("1, 2, a, b", actualStrings(0).mkString(", "))
>       assertEquals(4, actualStrings(1).length)
>       assertEquals("3, 4, c, d", actualStrings(1).mkString(", "))
>       assertEquals(4, actualStrings(2).length)
>       assertEquals("5, 6, f, ", actualStrings(2).mkString(", "))
>
>
> My first concern about this code is that I'm missing out on something that
> does exactly this in the API.  This seems like such a common use case that
> I would not be surprised if there's a readily available way to do this.
>
> I'm a little uncertain about the typing of splitColumn - i.e. the first
> parameter and the return value.  It seems like a general solution wouldn't
> require every column to be a String value.  I'm also annoyed that line 5 in
> the test code requires that I use an updated index to split what was
> originally the second column.  This suggests that perhaps I should split
> all the columns that need splitting in one function call - but it seems
> like doing that would require an unwieldy function signature.
>
> Any advice or insight is appreciated!
>
> Thanks,
> Philip
>

Reply via email to