Hi, I'm exposing a custom source to the Spark environment. I have a question about the best way to approach this problem.
I created a custom relation for my source, and it produces a DataFrame<Row>. My custom source knows the data types, which are dynamic, so this seemed like the appropriate return type, and it works fine.

The next step I want to take is to expose some custom mapping functions (written in Java). But when I look at the APIs, the map method on DataFrame returns an RDD, not a DataFrame. (Should I call SqlContext.createDataFrame on the result? Does that incur additional processing overhead?) The Dataset type seems closer to what I'm looking for: its map method returns a Dataset, so chaining calls together is a natural exercise. But to create a Dataset from a DataFrame, it appears I have to provide the type of each field in the Row via the DataFrame.as[...] method. I would have thought the DataFrame could do this automatically, since it already has all the types.

This leads me to wonder how I should be approaching this effort. Because all the fields and types are dynamic, I cannot use beans as my type when passing data around.

Any advice would be appreciated.

Thanks,
Martin
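For what it's worth, here is a sketch of the kind of thing I'm trying to do. In Spark 2.x, DataFrame is just an alias for Dataset<Row>, so one option I'm considering is staying in Dataset<Row> the whole way and passing an encoder built from the runtime schema via RowEncoder, rather than using a bean class. (This is only a sketch under that assumption; mapRows and the identity transform inside it are placeholders for my real mapping logic.)

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.StructType;

public class DynamicRowMap {

    // Map one Dataset<Row> to another Dataset<Row> without a bean class.
    // The output schema is supplied at runtime, which matches the case where
    // field names and types are only known dynamically.
    public static Dataset<Row> mapRows(Dataset<Row> input, StructType outputSchema) {
        return input.map(
            (MapFunction<Row, Row>) row -> {
                // Placeholder transform: copy fields through unchanged.
                // The real custom mapping function would go here.
                Object[] values = new Object[outputSchema.size()];
                for (int i = 0; i < values.length; i++) {
                    values[i] = row.get(i);
                }
                return RowFactory.create(values);
            },
            // Encoder derived from the dynamic schema, so no compile-time
            // type information (and no bean) is required.
            RowEncoder.apply(outputSchema));
    }
}
```

The result of mapRows is itself a Dataset<Row>, so calls chain naturally without round-tripping through an RDD and createDataFrame.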