Hi, I'm exposing a custom source to the Spark environment. I have a question about the best way to approach this problem.
I created a custom relation for my source, and it produces a DataFrame<Row>. My custom source knows the data types, which are dynamic, so this seemed like the appropriate return type, and it works fine.

The next step I want to take is to expose some custom mapping functions (written in Java). But when I look at the APIs, the map method on DataFrame returns an RDD, not a DataFrame. (Should I call SqlContext.createDataFrame on the result? Does that incur additional processing overhead?) The Dataset type seems closer to what I'm looking for: its map method returns a Dataset, so chaining calls together is a natural exercise. But to create a Dataset from a DataFrame, it appears I have to provide the type of each field in the Row via the DataFrame.as[...] method. I would have thought the DataFrame could do this automatically, since it already has all the types.

This leads me to wonder how I should be approaching this effort. Because all the fields and types are dynamic, I cannot use beans as my type when passing data around.

Any advice would be appreciated.

Thanks,
Martin
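For what it's worth, here is a sketch of the kind of thing I'm trying to do. In Spark 2.x, DataFrame is just an alias for Dataset<Row>, so one option I'm considering is staying in Dataset<Row> the whole way and passing an encoder built from the runtime schema via RowEncoder, rather than using a bean class. (This is only a sketch under that assumption; mapRows and the identity transform inside it are placeholders for my real mapping logic.)

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.StructType;

public class DynamicRowMap {

    // Map one Dataset<Row> to another Dataset<Row> without a bean class.
    // The output schema is supplied at runtime, which matches the case where
    // field names and types are only known dynamically.
    public static Dataset<Row> mapRows(Dataset<Row> input, StructType outputSchema) {
        return input.map(
            (MapFunction<Row, Row>) row -> {
                // Placeholder transform: copy fields through unchanged.
                // The real custom mapping function would go here.
                Object[] values = new Object[outputSchema.size()];
                for (int i = 0; i < values.length; i++) {
                    values[i] = row.get(i);
                }
                return RowFactory.create(values);
            },
            // Encoder derived from the dynamic schema, so no compile-time
            // type information (and no bean) is required.
            RowEncoder.apply(outputSchema));
    }
}
```

The result of mapRows is itself a Dataset<Row>, so calls chain naturally without round-tripping through an RDD and createDataFrame.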