Re: Using Dataframe API vs. RDD API?

Pat Ferrel Fri, 05 Jan 2018 14:03:07 -0800

Yes and I do not recommend that because the EventServer schema is not a 
developer contract. It may change at any time. Use the conversion method and go 
through the PIO API to get the RDD then convert to DF for now.
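
For reference, a minimal sketch of that route: read events through the PIO store API, flatten them into a case class, and convert to a DF. It assumes PIO 0.12.x on Spark 2.x; `EventRow`, `EventsToDf`, and `eventsAsDf` are hypothetical names made up for this illustration, and the fields you keep will depend on your engine.

```scala
import org.apache.predictionio.data.store.PEventStore
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical flattened view of an event, for illustration only.
case class EventRow(
  entityId: String,
  event: String,
  targetEntityId: Option[String],
  eventTimeMillis: Long)

object EventsToDf {
  // Reads events through the supported PIO API (not the underlying tables)
  // and converts the resulting RDD into a DataFrame.
  def eventsAsDf(appName: String, sc: SparkContext): DataFrame = {
    // Reuses the SparkContext that PIO already created for the workflow.
    val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
    import spark.implicits._

    PEventStore.find(appName = appName)(sc)
      .map(e => EventRow(e.entityId, e.event, e.targetEntityId, e.eventTime.getMillis))
      .toDF()
  }
}
```

Because only the PIO API is used, a change to the EventServer's storage schema should not break this code.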


I’m not sure what PIO uses to get an RDD from Postgres, but if it does not use 
something like the lib you mention, a PR would be nice. Also, if you have an 
interest in adding the DF APIs to the EventServer, contributions are encouraged. 
Committers will give some guidance, I’m sure: the ones that know more than me 
on the subject.

If you want to donate some DF code, create a Jira and we’ll easily find a 
mentor to make suggestions. There are many benefits to this including not 
having to support a fork of PIO through subsequent versions. Also others are 
interested in this too.

 

On Jan 5, 2018, at 7:39 AM, Daniel O' Shaughnessy <danieljamesda...@gmail.com> 
wrote:

....Should have mentioned that I used org.apache.spark.rdd.JdbcRDD to read in 
the RDD from a postgres DB initially.

This way you don't need to use an EventServer!
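
A rough sketch of what reading with `JdbcRDD` can look like. The table `my_table`, its columns, the bounds, and the `LabeledRow` type are all hypothetical; the sketch assumes a table you own with a numeric `id` column and the Postgres JDBC driver on the classpath. Note that the query must contain exactly two `?` placeholders, which Spark fills with per-partition bounds.

```scala
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{JdbcRDD, RDD}

object JdbcRddSketch {
  // Hypothetical row type for a table with a numeric id and a label column.
  case class LabeledRow(id: Long, label: Double)

  def readRows(sc: SparkContext, url: String, user: String, password: String): RDD[LabeledRow] =
    new JdbcRDD(
      sc,
      // A new connection is opened on the executor for each partition.
      () => DriverManager.getConnection(url, user, password),
      // The two placeholders receive the inclusive bounds of each partition.
      "SELECT id, label FROM my_table WHERE id >= ? AND id <= ?",
      1L,       // lowerBound: smallest id to read (assumed)
      1000000L, // upperBound: largest id to read (assumed)
      4,        // numPartitions: the id range is split into this many slices
      (rs: ResultSet) => LabeledRow(rs.getLong("id"), rs.getDouble("label")))
}
```

Nothing is read until an action runs on the returned RDD, so connection problems only show up at that point.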

On Fri, 5 Jan 2018 at 15:37 Daniel O' Shaughnessy <danieljamesda...@gmail.com> wrote:
Hi Shane, 

I've successfully used : 

import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

with pio. You can access feature importance through the RandomForestClassifier 
also.
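
In the `spark.ml` API the importances are exposed on the fitted model as `featureImportances`. A minimal sketch, assuming Spark 2.x; it uses the regression classes because that is what the original question asks about (the classification classes follow the same pattern), and the column names and toy data are made up for illustration.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
import org.apache.spark.sql.{DataFrame, SparkSession}

object RfRegressionSketch {
  // Assembles the given numeric columns into a feature vector and fits a forest.
  def train(df: DataFrame, featureCols: Array[String], labelCol: String): RandomForestRegressionModel = {
    val assembled = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
      .transform(df)

    new RandomForestRegressor()
      .setLabelCol(labelCol)
      .setFeaturesCol("features")
      .setNumTrees(20)
      .setSeed(42L)
      .fit(assembled)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rf-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy data with hypothetical column names.
    val df = Seq((1.0, 0.5, 3.0), (2.0, 0.1, 5.0), (3.0, 0.9, 7.2), (4.0, 0.3, 9.1))
      .toDF("x1", "x2", "label")

    val featureCols = Array("x1", "x2")
    val model = train(df, featureCols, "label")

    // featureImportances is a Vector with one entry per input feature.
    featureCols.zip(model.featureImportances.toArray).foreach {
      case (name, importance) => println(s"$name: $importance")
    }
    spark.stop()
  }
}
```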

Very simple to convert RDDs to DFs as Pat mentioned, something like:

val RDD_2_DF = sqlContext.createDataFrame(yourRDD).toDF("col1", "col2")
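
On Spark 2.x the same conversion can go through `SparkSession` and its implicits instead of `sqlContext`. A small self-contained sketch; the `Rating` case class and the sample values are hypothetical and only there to make it runnable.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Rating(user: String, item: String, rating: Double)

object RddToDfSketch {
  // Column names and types are derived from the case class fields.
  def ratingsToDf(spark: SparkSession, rdd: RDD[Rating]): DataFrame = {
    import spark.implicits._ // brings toDF() into scope for RDDs
    rdd.toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-to-df-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(
      Rating("u1", "i1", 4.0),
      Rating("u2", "i1", 3.5)))

    val df = ratingsToDf(spark, rdd)
    df.printSchema()

    // An RDD of tuples works as well; the column names are then passed explicitly.
    val named = rdd.map(r => (r.user, r.rating)).toDF("user", "rating")
    named.show()

    spark.stop()
  }
}
```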



On Thu, 4 Jan 2018 at 23:10 Pat Ferrel <p...@occamsmachete.com> wrote:
Actually there are libs that will read DFs from HBase: 
https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html

This is out of band with PIO and should not be used IMO because the schema of 
the EventStore is not guaranteed to remain as-is. The safest way is to 
translate or get DFs integrated into PIO. I think there is an existing Jira that 
requests Spark ML support, which assumes DFs. 


On Jan 4, 2018, at 12:25 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

Funny you should ask this. Yes, we are working on a DF based Universal 
Recommender but you have to convert the RDD into a DF since PIO does not read 
out data in the form of a DF (yet). This is a fairly simple step of maybe one 
line of code but would be better supported in PIO itself. The issue is that the 
EventStore uses libs that may not read out DFs, but RDDs. This is certainly the 
case with Elasticsearch, which provides an RDD lib. I haven’t seen one from 
them that reads out DFs, though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.

Also please add a feature request as a PIO Jira ticket to look into this. I for 
one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson <shanewaldenjohn...@gmail.com> wrote:

Hello group, Happy new year! Does anyone have a working example or template 
using the DataFrame API rather than the RDD-based APIs? We want to migrate to 
the new DataFrame APIs to take advantage of the feature importance function 
for our random forest regression models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}

Is this something that should be fairly straightforward, by adjusting parameters 
and calling the new classes within DASE, or is it a much more involved development?

Thank You!
Shane Johnson | 801.360.3350
LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook 
<https://www.facebook.com/shane.johnson.71653>
